Recognizing Protein Handshakes with Machine Learning
15 Dec 2021 - 5 min
A team of researchers and research software engineers has developed a toolkit, DeepRank, that enables researchers to easily train deep learning models to predict protein-protein complexes. What does this mean for those outside the world of biology? Well, proteins usually work in teams. At least teams of two. Together, they form a stable complex.
The creation of these protein-protein complexes is crucial for the correct functioning of cells and supports a wide array of processes such as membrane transport, cell metabolism and muscle contraction. If this handshake goes wrong, the consequences can be dire, and many diseases are related to the incorrect assembly of protein complexes. Correctly predicting these complexes leads to a better understanding of disease and consequently holds the potential for innovative cures. Here’s how the research team did it.
Determining protein-protein complexes is challenging. First, there are an estimated 650,000 or more protein-protein complexes at play in the human body. Second, the experimental determination of a single complex is a costly and labour-intensive task, which means that so far, only a very limited number of complexes have been identified in this manner.
Scientists therefore need to rely on computational tools to determine how proteins assemble. One of these tools, HADDOCK, developed by Alexandre Bonvin from Utrecht University, exploits the physical properties of the atoms constituting the proteins to simulate the formation of a complex. While this approach has been vastly successful, biology is more than physics.
To go beyond physical modelling, Alexandre and Li Xue, from Radboud Medical Center, proposed to develop a deep learning pipeline that would learn how to recognize favourable interactions between two proteins. This learning process involved not only the physical properties of the proteins but also the biological relevance of the protein building blocks. That’s where the experts from the Netherlands eScience Center came in. They helped build the pipeline. Working closely with the researchers at Utrecht and Radboud, research software engineers from the eScience Center developed a comprehensive computational pipeline that enables researchers to easily train deep learning models for the prediction of protein-protein complexes.
Inspired by the great success of deep learning models for image recognition, each protein-protein interface is encoded in a multi-channel 3D image. These images are then used as input for a 3D convolutional neural network that progressively learns the patterns characteristic of stable protein-protein complexes. This tool, called DeepRank, has recently been published in Nature Communications and can be easily accessed through GitHub and the eScience Center Research Software Directory.
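To make the idea of a "multi-channel 3D image" more concrete, here is a minimal sketch in Python of how per-atom features could be mapped onto a 3D grid, one channel per feature. This is an illustration under stated assumptions, not DeepRank's actual code or API: the function name voxelize, the Gaussian smoothing, the grid size, the box width and the toy atoms are all hypothetical choices made for this example.

```python
import numpy as np


def voxelize(coords, features, grid_size=11, box=20.0, sigma=1.0):
    """Map per-atom features onto a (channels, n, n, n) grid.

    Hypothetical helper for illustration only (not part of DeepRank).
    coords:   (n_atoms, 3) positions in angstroms, centred on the interface
    features: (n_atoms, n_channels) one value per atom per channel
    """
    coords = np.asarray(coords, dtype=float)
    features = np.asarray(features, dtype=float)

    # Grid point positions along one axis, spanning [-box/2, box/2].
    axis = np.linspace(-box / 2, box / 2, grid_size)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid_points = np.stack([gx, gy, gz], axis=-1)  # (n, n, n, 3)

    # Squared distance from every grid point to every atom: (n, n, n, n_atoms).
    d2 = ((grid_points[..., None, :] - coords) ** 2).sum(axis=-1)

    # Each atom spreads its feature values over nearby grid points
    # with a Gaussian weight (an assumed smoothing scheme).
    weights = np.exp(-d2 / (2.0 * sigma ** 2))

    # Sum contributions over atoms, then put the channel axis first.
    return np.moveaxis(weights @ features, -1, 0)


# Toy interface: two atoms, two made-up feature channels
# (for example a charge-like value and a conservation-like value).
toy_coords = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
toy_features = [[1.0, 0.2], [-1.0, 0.9]]
image = voxelize(toy_coords, toy_features)
print(image.shape)  # (2, 11, 11, 11)
```

An array shaped like this, with channels first followed by three spatial dimensions, is the kind of input a 3D convolutional neural network expects, in the same way that a 2D network takes the colour channels of a photograph.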
In machine learning, an even greater challenge than being able to model is having access to high-quality data. Building upon existing datasets, the team assembled about six million distinct protein complexes from 142 protein pairs and computed the relevant features for each complex. This large dataset of about 11 terabytes was then used to train deep learning models, employing the compute infrastructure provided by SURFsara. Following the FAIR principles, this extensive dataset has been made available through SB-Grid for other researchers to train their own models on.
Despite the relatively simple architecture of the trained neural networks, the results presented in the paper show that these models perform very well. To demonstrate their applicability to real cases, the models were used to score protein complexes previously generated in the CAPRI competition. This is an annual event where researchers compete to predict the relative positions that two proteins take when forming a protein-protein complex.
The competition was fierce, but the deep learning models performed well, taking first place in some cases, although they were outperformed in others. This illustrates the difficulty of creating a model that performs equally well across the wide variety of protein-protein interfaces. The publication in Nature Communications calls for more work on designing even better solutions.
“Potential applications for DeepRank are wide,” Xue excitedly reports. “DeepRank is currently being used to aid cancer vaccine design and extended to predict pathogenicity of human genetic variants.”
Ambitious goals like these are difficult to reach alone, and the effort of many research groups will be necessary to get there. Thanks to the adoption of software best practices during the development of DeepRank, the tool can easily be reused by other research groups, which contributes to its sustainability.
“We envision that DeepRank will stimulate community efforts in exploiting deep learning to tackle long-standing challenges in the life sciences,” says Xue.
Interested in learning more about the tool? Read the full publication in Nature Communications.
Want to access the code used for the DeepRank toolkit? Visit https://github.com/DeepRank/deeprank. The dataset is available via SB Grid: https://data.sbgrid.org/dataset/843/