AlphaFold: Machine learning for protein structure prediction
In 2018, a group of computer scientists at DeepMind revealed a new method for protein structure prediction, called AlphaFold. In that year’s CASP competition, which benchmarks the state of the art in protein structure prediction, AlphaFold swept the field, generating more accurate predictions than any other research group.
AlphaFold has received considerable attention for this achievement, and a few weeks ago they published a scientific paper with the details of their new method. Since protein structure prediction often appears in Foldit puzzles, we wanted to review the AlphaFold method with Foldit players!
This blog post is meant to summarize this exciting progress from AlphaFold, with an overview of their method, and some thoughts about the expected impact on protein research.
Machine Learning and Neural Nets
AlphaFold comes from DeepMind, a company well-known for tackling hard problems with machine learning algorithms. In 2016, a DeepMind program called AlphaGo famously beat a world-champion player of Go, a classic Chinese board game that is notoriously difficult for computer programs.
Machine learning (ML) is a branch of computer science that deals with self-improving algorithms. An ML algorithm is set up to perform a well-defined task, with a well-defined measure of performance. Over a “training” period, the algorithm is able to evaluate its own performance at the task and iteratively make changes that improve its performance.
One popular type of ML algorithm is a neural net, so called because it is inspired by the organization of neurons in the brain. Just like a web of neurons that communicate through synapses, a neural net is a web of virtual “nodes” that pass signals to one another. Typically, each node performs a simple mathematical operation on received signals (for example, testing if the sum of the signals exceeds some threshold), then passes on the new signal to downstream nodes. Training a neural net involves tuning the operations at each node so that the entire network produces the desired output from the training inputs.
A diagram of a simple neural network (from WikiMedia Commons). Signals are passed between nodes, each of which performs some simple (nonlinear) operation on the received signal and passes on the result. This network contains a single hidden layer of 4 nodes; the AlphaFold neural net contains hundreds of layers with thousands of nodes.
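To make the idea of "nodes passing signals" a little more concrete, here is a minimal sketch of a network with one hidden layer of 4 nodes, like the one in the diagram. The weights, the 3 inputs, and the choice of a sigmoid as the nonlinear operation are all assumptions made up for illustration; a real network learns its weights during training and is vastly larger.

```python
import math

def sigmoid(x):
    """Smooth nonlinear operation: a soft version of a threshold test."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias):
    """One node: weighted sum of the received signals, then a nonlinear operation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

def forward(inputs, hidden_layer, output_node):
    """Pass signals through one hidden layer, then through a single output node."""
    hidden = [node_output(inputs, w, b) for w, b in hidden_layer]
    weights, bias = output_node
    return node_output(hidden, weights, bias)

# Hypothetical, hand-picked weights: 3 inputs -> 4 hidden nodes -> 1 output.
# Training would tune these numbers; here they are arbitrary.
HIDDEN_LAYER = [
    ([0.5, -0.2, 0.1], 0.0),
    ([-0.3, 0.8, 0.4], 0.1),
    ([0.7, 0.7, -0.5], -0.2),
    ([0.0, -0.6, 0.9], 0.05),
]
OUTPUT_NODE = ([1.0, -1.0, 0.5, 0.25], 0.0)

prediction = forward([1.0, 0.0, 0.5], HIDDEN_LAYER, OUTPUT_NODE)
print(prediction)
```

Training would repeatedly nudge the numbers in HIDDEN_LAYER and OUTPUT_NODE so that the printed output moves closer to the desired answer for each training input.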
Neural nets have been very useful for abstracting information from complex inputs. A popular application of neural nets is the image recognition problem: the input is a 2D array of colored pixels, and the task is to classify the depicted object.
The AlphaFold algorithm is a neural net, very similar to the kind used for image recognition. In this case, the input is information about the protein sequence, and the task is to predict the distance between each pair of residues in the folded protein.
Predicted Contacts vs. Predicted Distances
Many Foldit players will already be familiar with the concept of predicted contacts. These are residues in a protein that are predicted to be close to one another (“in contact”) in the folded structure, even if they are not neighbors in the protein sequence.
These predictions come from covariance patterns that emerge during evolution. We can observe these patterns by comparing very similar protein sequences in different organisms. For instance, we could compare the hemoglobin sequence in humans, chimps, dogs, mice, etc., and look for positions that tend to co-vary (i.e. two residues that seem to change together, as if they depend on one another). Strong covariance between two residues usually suggests that those residues interact with one another in the folded structure, through side-chain packing, H-bonding, electrostatics, etc.
Cartoon diagram of covariance (from GREMLIN). (Left) In these two related protein structures, the red and green residues interact with one another. When one of these mutates during the course of evolution, its partner may also have to mutate to maintain the interaction. (Right) Even when we don’t know the structure of these proteins, we can see evidence of this interaction when we compare lots of related protein sequences. The two positions in the dashed boxes display strong covariance.
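One simple way to put a number on covariance is to compute the mutual information between two columns of a sequence alignment. The sketch below does this for a tiny, invented alignment. Mutual information is only a stand-in chosen for illustration; tools like GREMLIN rely on more sophisticated statistical models, and real alignments contain many more sequences.

```python
import math
from collections import Counter

def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Invented toy alignment: one row per organism, one column per position.
# Position 1 (K or E) and position 3 (E/D or K/R) swap charge together.
ALIGNMENT = [
    "AKLE",
    "AKIE",
    "AKLD",
    "AEIK",
    "AELK",
    "AEIR",
]

columns = list(zip(*ALIGNMENT))
covarying = mutual_information(columns[1], columns[3])
unrelated = mutual_information(columns[1], columns[2])
conserved = mutual_information(columns[0], columns[1])
print(covarying, unrelated, conserved)
```

In this toy example, positions 1 and 3 score much higher than positions 1 and 2, even though position 2 also varies. A position that never changes (position 0) carries no covariance signal at all.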
One of the key insights of the AlphaFold group was to take these predictions a step further: Instead of using covariance to predict whether two residues are “in contact” (a simple yes/no), AlphaFold attempts to predict the distance between the two residues (a range of values between 2 and 20 Å). These predictions are more difficult to make, but successful predictions provide much richer information about the folded protein structure.
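Here is a small sketch of why a distance prediction is richer than a contact prediction. It takes a made-up probability distribution over distance bins for one residue pair and collapses it into a single contact probability. The bin width (2 Å), the probabilities, and the 8 Å contact cutoff are assumptions chosen for illustration, not values taken from the AlphaFold method.

```python
def contact_probability(bin_edges, probabilities, cutoff=8.0):
    """Add up the probability of every distance bin that lies entirely below the cutoff."""
    total = 0.0
    for upper_edge, p in zip(bin_edges[1:], probabilities):
        if upper_edge <= cutoff:
            total += p
    return total

def expected_distance(bin_edges, probabilities):
    """Probability-weighted average of the bin centers."""
    centers = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    return sum(c * p for c, p in zip(centers, probabilities))

# Hypothetical prediction for one residue pair: nine 2 Å bins from 2 to 20 Å.
BIN_EDGES = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
PROBABILITIES = [0.02, 0.08, 0.45, 0.30, 0.08, 0.04, 0.02, 0.01, 0.00]

p_contact = contact_probability(BIN_EDGES, PROBABILITIES)
mean_distance = expected_distance(BIN_EDGES, PROBABILITIES)
print(p_contact, mean_distance)
```

The contact probability throws away the shape of the distribution: it cannot tell us whether the pair is more likely to sit at 5 Å or at 7 Å, while the full distribution can.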
We should note that, in 2018, a few other research groups were also using neural networks to predict distances—not just AlphaFold. The second insight of AlphaFold concerns their ability to generate a folded protein structure from predicted distances. They represent each distance prediction as a smooth restraint function, which allows them to employ a simple technique called gradient descent, directly folding the protein into a structure compatible with their predicted distances.
Predicted distances for residue pairs. (a) Similar to a contact map, this plot shows the predicted distance between every pair of residues in the structure. (b) For each pair of residues, the neural net produces a probability distribution of distances. For the pair of residues marked by the blue star in (a), we can see the probability distribution favors a distance of about 8 Å. (c) The probability distribution is converted to a smooth restraint function, where the lowest point of the function corresponds to the favored distance (in this case, 8 Å). A simple gradient descent algorithm allows AlphaFold to efficiently fold a protein structure that optimizes all of their distance predictions.
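The folding-by-gradient-descent idea can be sketched with a toy problem. Below, four points start at random positions and are moved downhill until every pair sits at its "predicted" distance. Several things here are assumptions for illustration only: each restraint is a simple harmonic well centered on the favored distance (rather than a function derived from a full probability distribution), the points move freely in space instead of being linked into a protein chain, and the target distances, step size, and step count are invented.

```python
import itertools
import math
import random

def restraint_energy(coords, targets):
    """Sum of squared differences between current and favored distances."""
    energy = 0.0
    for (i, j), target in targets.items():
        energy += (math.dist(coords[i], coords[j]) - target) ** 2
    return energy

def gradient_descent_fold(targets, n_points, steps=2000, step_size=0.02, seed=0):
    """Move points downhill on the restraint energy, starting from random positions."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(n_points)]
    for _ in range(steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n_points)]
        for (i, j), target in targets.items():
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction is undefined for coincident points
            # Derivative of (d - target)^2 with respect to the coordinates of point i.
            scale = 2.0 * (d - target) / d
            for k in range(3):
                g = scale * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for i in range(n_points):
            for k in range(3):
                coords[i][k] -= step_size * grads[i][k]
    return coords

# Hypothetical predictions: every pair of 4 points favors a distance of 5 Å,
# which is satisfied exactly by a regular tetrahedron.
TARGETS = {pair: 5.0 for pair in itertools.combinations(range(4), 2)}

folded = gradient_descent_fold(TARGETS, n_points=4)
final_energy = restraint_energy(folded, TARGETS)
print(final_energy)
```

Because the restraint function is smooth, its slope tells each point which way to move at every step. A yes/no contact prediction has no such slope, which is part of why smooth restraints are so convenient.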
Finally, AlphaFold combines their distance predictions with the Rosetta energy function (the same energy function used by Foldit) to refine their final folded structure.
AlphaFold Performance in CASP
The Critical Assessment of protein Structure Prediction (CASP) is an opportunity for different researchers to compare their structure prediction methods in a head-to-head competition. The CASP organizers collect unpublished protein structures and challenge researchers to predict the structures based on their protein sequence. Because the true protein structures are unpublished, all the predictions are “blind,” and all the participants can evaluate their methods on a level playing field, starting from the same information.
AlphaFold’s neural net was able to make remarkably accurate distance predictions for many of the targets of the 2018 CASP competition, and this led them toward protein models that were very similar to the true structure. The best way to visualize AlphaFold’s success is to look at their summed Z-score for all targets in the Free Modeling category.
Rankings from the 2018 CASP Free Modeling category (from CASP13). The y-axis shows the summed Z-score across all targets in the category, with all competing groups on the x-axis. The leftmost bar represents the AlphaFold group.
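For readers unfamiliar with Z-scores, here is a minimal sketch of how a summed Z-score can be computed. For each target, a group's score is compared to the average score of all groups and divided by the spread of those scores; the results are then added up over all targets. The group names and scores are invented, and the official CASP ranking procedure may include additional details that are not reproduced here.

```python
import statistics

def z_scores(scores):
    """Z-score of each group on one target: (score - mean) / standard deviation."""
    mean = statistics.mean(scores.values())
    spread = statistics.pstdev(scores.values())
    return {group: (score - mean) / spread for group, score in scores.items()}

def summed_z_scores(targets):
    """Add up each group's Z-scores over all targets."""
    totals = {}
    for scores in targets:
        for group, z in z_scores(scores).items():
            totals[group] = totals.get(group, 0.0) + z
    return totals

# Invented accuracy scores (higher is better) for three hypothetical groups
# on two hypothetical targets.
TARGETS = [
    {"group_A": 70.0, "group_B": 50.0, "group_C": 30.0},
    {"group_A": 60.0, "group_B": 55.0, "group_C": 35.0},
]

totals = summed_z_scores(TARGETS)
ranking = sorted(totals, key=totals.get, reverse=True)
print(ranking, totals)
```

A Z-score rewards a group for standing out from the pack on a given target, so a large summed Z-score means a group was consistently far above average, not just slightly better on a few targets. Note that this simple version would fail on a target where every group has exactly the same score, because the spread would be zero.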
This is an incredible achievement, and AlphaFold represents a significant step forward in protein structure prediction, but the structure prediction problem is still far from “solved.” For most natural proteins, AlphaFold relies heavily on covariance patterns, and often struggles when the target has very few related sequences (covariance is harder to detect with just a few related sequences). However, even with zero related sequences AlphaFold can still make distance predictions, albeit with lower confidence. AlphaFold showed this by correctly predicting the structure of Foldit3, a protein designed by Foldit players, with no related sequences and no covariance information!
One scientific limitation of AlphaFold is that it suffers from the “black box” problem. Neural nets like the AlphaFold algorithm are considered “black box” techniques because their inner workings are hard to interpret. It is very difficult for us to deconstruct a neural network to figure out exactly what concepts the algorithm is “learning” about proteins. In other words, AlphaFold has improved our ability to predict a protein structure from its sequence, but it hasn’t directly increased our understanding of how protein sequence relates to structure.
Impact of AlphaFold
Since AlphaFold’s debut in 2018, many other research groups have begun experimenting with machine learning for predicting residue distances. Just this month, shortly after AlphaFold published their method, researchers at the Baker Lab published trRosetta, which builds on the AlphaFold method (see PDF from the Baker Lab website).
The Baker Lab researchers realized that a neural net could be trained to predict not just the distance between two residues, but also the relative orientation of those two residues. By training an algorithm to predict both distance and orientation between residues, the Baker group was able to make protein models with even greater accuracy!
Building on AlphaFold with trRosetta. (a) The AlphaFold neural net predicts only the distance between residue pairs. We can also train the neural net to predict the orientation of residue pairs (defined by several angles and torsions). (b) These angle and torsion predictions can also be converted into smooth restraint functions, which is key for applying the predictions to a protein model. (c) The orientation predictions improve the accuracy of final protein models for a set of CASP targets.
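A torsion is simply the twist angle defined by four points in space. As a rough sketch of what an orientation measurement involves, the function below computes the signed torsion around the axis between the two middle points. Which atoms of each residue would be used as the four points is not specified here; the coordinates in the example are invented.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

def torsion(p0, p1, p2, p3):
    """Signed torsion angle in degrees around the p1-p2 axis."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central axis.
    length = math.sqrt(_dot(b1, b1))
    axis = [x / length for x in b1]
    # Remove the component along the axis, leaving the parts that "stick out".
    v = [x - _dot(b0, axis) * a for x, a in zip(b0, axis)]
    w = [x - _dot(b2, axis) * a for x, a in zip(b2, axis)]
    # Angle between the two projected vectors, with a sign from the axis direction.
    return math.degrees(math.atan2(_dot(_cross(axis, v), w), _dot(v, w)))

# Invented coordinates: the two middle points lie on the z-axis.
same_side = torsion([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
quarter_turn = torsion([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
opposite_side = torsion([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1])
print(same_side, quarter_turn, opposite_side)
```

Two residue pairs can sit at exactly the same distance and still have very different torsions, which is why orientation predictions add information that distances alone cannot provide.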
The CASP competition returns in the summer of 2020, and it will be very exciting to see how other groups have incorporated AlphaFold’s progress into their own prediction methods!
However, Foldit is unlikely to see any immediate changes as a direct result of AlphaFold’s success.
Since Foldit was launched in 2008, our focus has been gradually shifting away from protein structure prediction. The main reason for this is that we think Foldit players have more to contribute in other problems, like protein design or building models into cryoEM data. It’s likely that we can use distance predictions to help with these tasks (for example, to check if distance predictions for a designed sequence are compatible with the designed structure), but for now we are still evaluating the most effective ways to use neural nets for these kinds of problems!
Special thanks goes to Baker Lab scientist Ivan Anishchenko for contributions to this blog post!
(Posted by bkoep | Fri, 01/31/2020 - 00:25)