AlphaFold: Machine learning for protein structure prediction
In 2018, a group of computer scientists at DeepMind revealed a new method for protein structure prediction, called AlphaFold. In that year’s CASP competition, which benchmarks the state-of-the-art for protein structure prediction, AlphaFold swept the competition, generating more accurate predictions than any other research group.
AlphaFold has received considerable attention for this achievement, and a few weeks ago they published a scientific paper with the details of their new method. Since protein structure prediction often appears in Foldit puzzles, we wanted to review the AlphaFold method with Foldit players!
This blogpost is meant to summarize this exciting progress from AlphaFold, with an overview of their method, and some thoughts about the expected impact on protein research.
Machine Learning and Neural Nets
AlphaFold comes from DeepMind, a company well-known for tackling hard problems with machine learning algorithms. In 2016, a DeepMind program called AlphaGo famously beat a world-champion player of Go, a classic Chinese board game that is notoriously difficult for computer programs.
Machine learning (ML) is a branch of computer science that deals with self-improving algorithms. An ML algorithm is set up to perform a well-defined task, with a well-defined measure of performance. Over a “training” period, the algorithm is able to evaluate its own performance at the task and iteratively make changes that improve its performance.
One popular type of ML algorithm is a neural net, so called because it is inspired by the organization of neurons in the brain. Just like a web of neurons that communicate through synapses, a neural net is a web of virtual “nodes” that pass signals to one another. Typically, each node performs a simple mathematical operation on received signals (for example, testing if the sum of the signals exceeds some threshold), then passes on the new signal to downstream nodes. Training a neural net involves tuning the operations at each node so that the entire network produces the desired output from the training inputs.
A diagram of a simple neural network (from WikiMedia Commons). Signals are passed between nodes, each of which performs some simple (nonlinear) operation on the received signal and passes on the result. This network contains a single hidden layer of 4 nodes; the AlphaFold neural net contains hundreds of layers with thousands of nodes.
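To make the idea of a "node" concrete, here is a minimal sketch of a tiny network with one hidden layer, where each node sums its weighted inputs and tests the sum against a threshold. The function names and the hand-picked weights are assumptions for illustration only. A real network like AlphaFold's is vastly larger, and its weights are tuned by training rather than chosen by hand.

```python
def step(x):
    """Threshold activation: fire (1.0) if the summed signal exceeds zero."""
    return 1.0 if x > 0 else 0.0


def node(inputs, weights, bias):
    """One node: weighted sum of incoming signals, then a threshold test."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(total)


def forward(inputs, hidden_layer, output_node):
    """Pass signals through one hidden layer, then a single output node."""
    hidden = [node(inputs, w, b) for w, b in hidden_layer]
    w_out, b_out = output_node
    return node(hidden, w_out, b_out)


# Hand-picked (not trained) weights that make the network compute XOR.
HIDDEN = [
    ([1.0, 1.0], -0.5),  # fires if at least one input is on
    ([1.0, 1.0], -1.5),  # fires only if both inputs are on
]
OUTPUT = ([1.0, -1.0], -0.5)  # "at least one" but not "both"

for a in (0, 1):
    for b in (0, 1):
        print(a, b, forward([a, b], HIDDEN, OUTPUT))
```

Training would replace the hand-picked numbers in HIDDEN and OUTPUT with values found automatically, by repeatedly adjusting them until the outputs match the training examples.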
Neural nets have been very useful for abstracting information from complex inputs. A popular application of neural nets is the image recognition problem: the input is a 2D array of colored pixels, and the task is to classify the depicted object.
The AlphaFold algorithm is a neural net, very similar to the kind used for image recognition. In this case, the input is information about the protein sequence, and the task is to predict the distance between every pair of residues in the folded protein.
Predicted Contacts vs. Predicted Distances
Many Foldit players will already be familiar with the concept of predicted contacts. These are residues in a protein that are predicted to be close to one another (“in contact”) in the folded structure, even if they are not neighbors in the protein sequence.
These predictions come from covariance patterns that emerge during evolution. We can observe these patterns by comparing very similar protein sequences in different organisms. For instance, we could compare the hemoglobin sequence in humans, chimps, dogs, mice, etc., and look for positions that tend to co-vary (i.e. two residues that seem to change together, as if they depend on one another). Strong covariance between two residues usually suggests that those residues interact with one another in the folded structure, through side-chain packing, H-bonding, electrostatics, etc.
Cartoon diagram of covariance (from GREMLIN). (Left) In these two related protein structures, the red and green residues interact with one another. When one of these mutates during the course of evolution, its partner may also have to mutate to maintain the interaction. (Right) Even when we don’t know the structure of these proteins, we can see evidence of this interaction when we compare lots of related protein sequences. The two positions in the dashed boxes display strong covariance.
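As a rough illustration of how covariance between two positions can be measured, the sketch below computes the mutual information between columns of a toy sequence alignment. The alignment is made up, and mutual information is only one simple statistic, assumed here for illustration. Real contact predictors use more sophisticated statistical models.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) / (p(a) * p(b)) simplifies to c * n / (count_a * count_b)
        ratio = c * n / (count_a[a] * count_b[b])
        mi += (c / n) * math.log2(ratio)
    return mi


def column(alignment, i):
    """Extract position i from every sequence in the alignment."""
    return [seq[i] for seq in alignment]


# Toy alignment (made up): one row per organism, one column per position.
ALIGNMENT = [
    "KAEL",
    "KAEV",
    "EAKL",
    "EAKI",
    "KGEL",
    "EGKV",
]

# Positions 0 and 2 always change together (K with E, E with K).
print(mutual_information(column(ALIGNMENT, 0), column(ALIGNMENT, 2)))
# Positions 0 and 1 vary independently of one another.
print(mutual_information(column(ALIGNMENT, 0), column(ALIGNMENT, 1)))
```

A high value for a pair of positions is the kind of signal that suggests the two residues might be interacting in the folded structure.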
One of the key insights of the AlphaFold group was to take these predictions a step further: Instead of using covariance to predict whether two residues are “in contact” (a simple yes/no), AlphaFold attempts to predict the distance between the two residues (a range of values between 2 and 20 Å). These predictions are more difficult to make, but successful predictions provide much richer information about the folded protein structure.
We should note that, in 2018, a few other research groups were also using neural networks to predict distances—not just AlphaFold. The second insight of AlphaFold concerns their ability to generate a folded protein structure from predicted distances. They represent each distance prediction as a smooth restraint function, which allows them to employ a simple technique called gradient descent, directly folding the protein into a structure compatible with their predicted distances.
Predicted distances for residue pairs. (a) Similar to a contact map, this plot shows the predicted distance between every pair of residues in the structure. (b) For each pair of residues, the neural net produces a probability distribution of distances. For the pair of residues marked by the blue star in (a), we can see the probability distribution favors a distance of about 8 Å. (c) The probability distribution is converted to a smooth restraint function, where the lowest point of the function corresponds to the favored distance (in this case, 8 Å). A simple gradient descent algorithm allows AlphaFold to efficiently fold a protein structure that optimizes all of their distance predictions.
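The sketch below shows the general idea of folding by gradient descent on distance restraints. It is not AlphaFold's code. As a stand-in for their smooth restraint functions, it assumes a simple harmonic restraint, (distance - target)^2, for each pair, and the function names, step size, and toy targets are all made up for illustration.

```python
import math
import random


def restraint_energy(coords, targets):
    """Sum of harmonic restraints (distance - target)^2 over all pairs."""
    return sum((math.dist(coords[i], coords[j]) - t) ** 2
               for (i, j), t in targets.items())


def restraint_gradient(coords, targets):
    """Analytic gradient of the restraint energy for every coordinate."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for (i, j), t in targets.items():
        d = math.dist(coords[i], coords[j])
        if d == 0.0:
            continue  # direction is undefined for coincident points
        scale = 2.0 * (d - t) / d
        for k in range(3):
            g = scale * (coords[i][k] - coords[j][k])
            grad[i][k] += g
            grad[j][k] -= g
    return grad


def fold(targets, n_points, steps=5000, rate=0.02, seed=0):
    """Start from random coordinates and slide downhill on the restraints."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-5.0, 5.0) for _ in range(3)]
              for _ in range(n_points)]
    for _ in range(steps):
        grad = restraint_gradient(coords, targets)
        for point, g in zip(coords, grad):
            for k in range(3):
                point[k] -= rate * g[k]
    return coords


# Toy "predictions" (made up): four residues, every pair favored at 8 Å.
TARGETS = {(i, j): 8.0 for i in range(4) for j in range(i + 1, 4)}

coords = fold(TARGETS, n_points=4)
for (i, j) in TARGETS:
    print(i, j, round(math.dist(coords[i], coords[j]), 2))
```

Because every restraint is a smooth function of the coordinates, each step can move all of the points a little in whichever direction lowers the total energy. That is what makes this simple technique usable here.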
Finally, AlphaFold combines their distance predictions with the Rosetta energy function (the same energy function used by Foldit) to refine their final folded structure.
AlphaFold Performance in CASP
The Critical Assessment of protein Structure Prediction (CASP) is an opportunity for different researchers to compare their structure prediction methods in a head-to-head competition. The CASP organizers collect unpublished protein structures and challenge researchers to predict the structures based on their protein sequence. Because the true protein structures are unpublished, all the predictions are “blind,” and all the participants can evaluate their methods on a level playing field, starting from the same information.
AlphaFold’s neural net was able to make remarkably accurate distance predictions for many of the targets of the 2018 CASP competition, and this led them toward protein models that were very similar to the true structure. The best way to visualize AlphaFold’s success is to look at their summed Z-score for all targets in the Free Modeling category.
Rankings from the 2018 CASP Free Modeling category (from CASP13). The y-axis shows the summed Z-score across all targets in the category, with all competing groups on the x-axis. The leftmost bar represents the AlphaFold group.
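For readers unfamiliar with Z-scores, here is a minimal sketch of how a summed Z-score could be computed from per-target accuracy scores. The group names and scores are invented, and this is the plain textbook calculation. The official CASP ranking may apply further rules that are not reproduced here.

```python
from statistics import mean, pstdev


def summed_z_scores(scores_by_target):
    """For each target, convert scores to Z-scores, then sum per group."""
    totals = {}
    for scores in scores_by_target.values():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        for group, score in scores.items():
            # A Z-score counts standard deviations above or below the mean.
            z = (score - mu) / sigma if sigma > 0 else 0.0
            totals[group] = totals.get(group, 0.0) + z
    return totals


# Made-up accuracy scores for three hypothetical groups on two targets.
SCORES = {
    "target_1": {"group_A": 80.0, "group_B": 60.0, "group_C": 40.0},
    "target_2": {"group_A": 70.0, "group_B": 50.0, "group_C": 60.0},
}

totals = summed_z_scores(SCORES)
for group in sorted(totals, key=totals.get, reverse=True):
    print(group, round(totals[group], 3))
```

A group with a large summed Z-score consistently beat the average prediction across many targets, which is why the plot is a useful summary of overall performance.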
This is an incredible achievement, and AlphaFold represents a significant step forward in protein structure prediction, but the structure prediction problem is still far from “solved.” For most natural proteins, AlphaFold relies heavily on covariance patterns, and often struggles when the target has very few related sequences (covariance is harder to detect with just a few related sequences). However, even with zero related sequences AlphaFold can still make distance predictions, albeit with lower confidence. AlphaFold showed this by correctly predicting the structure of Foldit3, a protein designed by Foldit players, with no related sequences and no covariance information!
One scientific limitation of AlphaFold is that it suffers from the “black box” problem. Neural nets like the AlphaFold algorithm are considered “black box” techniques because their inner workings are hard to interpret. It is very difficult for us to deconstruct a neural network to figure out exactly what concepts the algorithm is “learning” about proteins. In other words, AlphaFold has improved our ability to predict a protein structure from its sequence, but it hasn’t directly increased our understanding of how protein sequence relates to structure.
Impact of AlphaFold
Since AlphaFold’s debut in 2018, many other research groups have begun experimenting with machine learning for predicting residue distances. Just this month, shortly after AlphaFold published their method, researchers at the Baker Lab published trRosetta, which builds on the AlphaFold method (see PDF from the Baker Lab website).
The Baker Lab researchers realized that a neural net could be trained to predict not just the distance between two residues, but also the relative orientation of those two residues. By training an algorithm to predict both distance and orientation between residues, the Baker group was able to make protein models with even greater accuracy!
Building on AlphaFold with trRosetta. (a) The AlphaFold neural net predicts only the distance between residue pairs. We can also train the neural net to predict the orientation of residue pairs (defined by several angles and torsions). (b) These angle and torsion predictions can also be converted into smooth restraint functions, which is key for applying the predictions to a protein model. (c) The orientation predictions improve the accuracy of final protein models for a set of CASP targets.
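Since orientation is described with torsions, it may help to see how a torsion (dihedral) angle is computed from four points in space. The sketch below uses the standard atan2 formula. Which atoms trRosetta actually uses to define its angles is not covered here, so the four points are hypothetical placeholders.

```python
import math


def sub(a, b):
    """Vector from point b to point a."""
    return [a[k] - b[k] for k in range(3)]


def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def torsion(p0, p1, p2, p3):
    """Torsion angle in degrees around the p1-p2 axis, in (-180, 180]."""
    b0 = sub(p1, p0)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    n1 = cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = cross(b1, b2)  # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(dot(b1, b1))
    x = dot(n1, n2)
    y = dot(b1, cross(n1, n2)) / b1_len
    return math.degrees(math.atan2(y, x))


# Hypothetical points: the last point is rotated a quarter turn around
# the central p1-p2 axis relative to the first point.
print(torsion([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1]))
```

A distance alone says how far apart two residues are. Adding angles like this one also says how the residues are turned relative to one another, which narrows down the possible structures considerably.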
The CASP competition returns in the summer of 2020, and it will be very exciting to see how other groups have incorporated AlphaFold’s progress into their own prediction methods!
However, Foldit is unlikely to see any immediate changes as a direct result of AlphaFold’s success.
Since Foldit was launched in 2008, our focus has been gradually shifting away from protein structure prediction. The main reason for this is that we think Foldit players have more to contribute in other problems, like protein design or building models into cryoEM data. It’s likely that we can use distance predictions to help with these tasks (for example, to check if distance predictions for a designed sequence are compatible with the designed structure), but for now we are still evaluating the most effective ways to use neural nets for these kinds of problems!
Special thanks go to Baker Lab scientist Ivan Anishchenko for contributions to this blog post!
( Posted by bkoep | Fri, 01/31/2020 - 00:25 )
The Poly-Proline Helix Design Series
Hi all, neilpg628 here to tell you about a new puzzle series we have planned to introduce a new secondary structure to Foldit!
The poly-proline helix
All of the proteins that we have passed to you before have been composed mainly of α-helices and β-sheets. We want to introduce you to the poly-proline helix, which is much tighter than an α-helix, but is less stable because there is no internal hydrogen bonding between residues.
α-helices and β-sheets have hydrogen bonds which keep the structure together. The poly-proline helix has no such bonds
While these helices are typically made almost-entirely out of proline, they can be made out of other amino acids, as long as the bond geometry is roughly the same as that of a regular poly-proline helix. They are found in many proteins, and we want to incorporate them into Foldit to make your contributions relevant to a wider range of proteins!
Unsatisfied polar atoms
Unlike α-helices, poly-proline helices have polar oxygens pointing out from the protein backbone (see above figure). It will be important to satisfy these polar oxygens with hydrogen bonds, to ensure that any protein incorporating a poly-proline helix stays folded!
These special helices have not been used much in the field of protein design, but they are found throughout nature! Collagen is a protein composed of 3 poly-proline helices, bundled together so that the backbone oxygens can make hydrogen bonds. Collagen has exceptional tensile strength, and is responsible for the toughness of animal connective tissue.
Collagen’s hydrogen bond network makes it extremely stable even though it is composed of these unstable poly-proline helices
New design puzzles with the poly-proline helix
In our first poly-proline helix puzzle, we’ll provide a small 38-residue protein with a frozen poly-proline helix and designable residues on either side. We want you to redesign the starting structure into a compact protein that can support the poly-proline helix. We're starting with a small protein, but we plan to introduce poly-proline helices attached to larger proteins in the future, to see if you can design more complex poly-proline helix proteins!
Promising designs will be tested in the lab for stable folding! This work could open up new opportunities to apply poly-proline helices in environments where they would not normally fold. Check out Puzzle 1763: Poly-Proline Helix Design: Round 1 now!
Happy designing!
( Posted by neilpg628 | Thu, 11/21/2019 - 19:31 )
The Foldit cryo-EM paper
The latest Foldit research paper, about Cryo-EM Density puzzles, was published today in the journal PLOS Biology! The paper is open-access, meaning that anybody can read and share it for free, from the journal website.
The paper is a formal research article, so it is written in technical language meant for other scientists, and skips over some background info. Below, we cover the main points so that everyone can appreciate this accomplishment by Foldit players!
Electron density in Foldit
The paper is about recent Foldit puzzles in the Electron Density category, where players fold the target protein into a 3D “cloud” of density that maps the shape of the folded protein. The paper reports solutions from Puzzles 1572, 1588, 1598, and 1606.
Foldit Puzzle 1598: Cryo-EM Freestyle with Density
This is not the first time Foldit players have wowed us in an electron density puzzle! Some of you may remember Puzzle 1152: Foldit vs. UMich Electron Density Challenge from back in 2015. In that contest, players built solutions into a high-resolution (1.9 Å) density map from x-ray diffraction experiments. Foldit players outperformed UMich undergraduates, expert crystallographers, and state-of-the-art computer algorithms! Those results were published in a previous paper.
This previous result gave us a clue that electron density might be a sweet spot for Foldit players, so we started to look at other kinds of density maps...
Cryo-electron microscopy (cryo-EM) is another technique for getting density maps and solving protein structures. In a cryo-EM experiment, a sample of protein in solution is spread on a thin metal wafer and quickly cooled to cryogenic temperatures to quench all molecular motion, freezing all of the protein atoms in a sheet of vitreous ice. Then we bombard the frozen sample with a beam of high-energy electrons, which scatter when they collide with the atoms of the protein. A detector measures the electron scattering, and the result is a grainy 2D “micrograph” of the wafer and any proteins on its surface.
Example cryo-EM micrograph of the S. entomophila antifeeding prophage, used to generate the maps for the puzzles in this paper. Used with permission of Ambroise Desfosses and Irina Gutsche (source).
If we collect enough of these raw micrographs (think millions), then we can align all of the individual protein molecules and average them together to get a clearer 2D picture of the protein. Finally, we combine all the 2D images to arrive at a 3D reconstruction of the protein, in the form of a density cloud—very similar to the electron density clouds that we get from x-ray diffraction experiments!
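The sketch below illustrates why averaging helps, using a made-up one-dimensional "image" instead of a real micrograph. It assumes the copies are already perfectly aligned and that the noise is random and independent in every copy. Both are simplifications, and all names and numbers here are invented for illustration.

```python
import random


def noisy_copy(signal, noise_sd, rng):
    """Return the signal with independent Gaussian noise on every pixel."""
    return [s + rng.gauss(0.0, noise_sd) for s in signal]


def average(images):
    """Average a stack of aligned images pixel by pixel."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]


def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    total = sum((e - t) ** 2 for e, t in zip(estimate, truth))
    return (total / len(truth)) ** 0.5


rng = random.Random(1)
TRUTH = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0,
         1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

single = noisy_copy(TRUTH, 2.0, rng)
stack = [noisy_copy(TRUTH, 2.0, rng) for _ in range(400)]

print("error of one noisy copy: ", round(rms_error(single, TRUTH), 3))
print("error of 400-copy average:", round(rms_error(average(stack), TRUTH), 3))
```

The random noise tends to cancel out when many copies are averaged, while the true signal is the same in every copy and survives. This is why so many micrographs are needed.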
Unlike x-ray diffraction, cryo-EM experiments are fairly easy to set up (no protein crystals needed!). But cryo-EM has been unpopular for protein structure research because it yields a lower-resolution, “blobbier” density cloud than x-ray diffraction. However, that started to change around 2012, when a technological breakthrough gave us improved electron-scattering detectors and higher resolution maps. Since then, cryo-EM has taken off, and the number of new cryo-EM protein structures has been doubling every 2 years (by contrast, new x-ray diffraction structures have plateaued since 2013).
Cryo-EM and Foldit players
Even with the recent improvements, cryo-EM maps are not quite as clear as x-ray diffraction maps. The highest resolution typically achieved by cryo-EM is about 3.0 Å. Since covalently bonded atoms are separated by < 2 Å, that means we still can’t make out the positions of individual atoms simply by looking at the map. Instead, we have to infer the positions of the atoms, using our knowledge of physics and protein structure to find a plausible model that fits the map.
Building a plausible protein structure into a low-resolution map is difficult and prone to errors. If a microscopist focuses too much on fitting the density cloud, they might end up with a strained (high energy) model that is physically unrealistic. On the other hand, a computer algorithm that optimizes energy can have a hard time fitting a model into the density map.
This is where Foldit players come in! We know from previous work that Foldit players are adept at interpreting density maps; and the Foldit score function should help guide players toward plausible, low energy models.
In Puzzles 1572, 1588, 1598, and 1606, we provided Foldit players with cryo-EM maps for four proteins that make up the S. entomophila antifeeding prophage (a complex needle-like structure used by bacteria to inject toxins into a target cell). We then compared Foldit player solutions with those of expert microscopists and a handful of automated algorithms.
Comparison of solutions from different methods. (Top) The top Foldit solution from Puzzle 1588 and the model built by the scientist. They look pretty similar when zoomed out this far, but a closer look reveals differences. (Bottom) Subtle deviations in the models can have a significant effect on fit and plausibility. In the bottom-right image, an automated algorithm (magenta) had trouble matching the density, and left some regions of the map completely empty.
Foldit players take gold!
In each of the four puzzles, Foldit player solutions had the best balance of plausibility and fit-to-density! If you’re curious, the scientists came in second, and the algorithms came in last (but there was a lot of variance between different algorithms).
Foldit players achieve plausibility and high fit-to-density for AFP7 (Puzzle 1588). (Left) Microscopists build strained models that have many clashes. (Right) Automatic algorithms like Rosetta and Phenix build models with poor fit-to-density (according to three different measures of map correlation). Foldit players build realistic models with few clashes, and still fit the density with a high map correlation.
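To give a feel for what a "map correlation" measures, here is a minimal sketch that computes a plain Pearson correlation between the density values of a map and the density expected from a model, sampled at the same points. The numbers are invented, and this is just one plausible form of such a measure. The three measures used in the paper are not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up density values sampled at six points in space.
MAP_DENSITY = [0.1, 0.9, 1.0, 0.2, 0.0, 0.8]
GOOD_MODEL = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]  # atoms where the map is dense
POOR_MODEL = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]  # atoms mostly in empty regions

print("good model:", round(pearson(MAP_DENSITY, GOOD_MODEL), 3))
print("poor model:", round(pearson(MAP_DENSITY, POOR_MODEL), 3))
```

A correlation near 1 means the model puts atoms where the map shows density and leaves empty regions empty. Lower values mean the model and the map disagree.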
We also want to point out that the Foldit rankings were incredibly accurate in these puzzles! As most players are aware, the best-scoring solution in Foldit is not necessarily the most accurate scientifically (because the Foldit score function is not a perfect reflection of reality). This is why we run our scientific analysis on all of the high-scoring solutions, to see what actually looks best against the scientific data: sometimes it’s rank #2, and sometimes it’s rank #20. However, in all four of these cryo-EM puzzles, the #1 top-scoring Foldit solution also had the best scientific evaluation!
This is important because it supports the accuracy of the Foldit score function. Foldit players can have more confidence that when their score goes up, so does the scientific value of their solution. It should also give more confidence to other scientists that might want to collaborate with Foldit players in the future. We hope this is just the beginning for Foldit cryo-EM!
Finally, we want to thank all the Foldit players that participated in these cryo-EM puzzles! Even if you didn’t work directly on the models presented in the paper, your folding helps to drive the competition that leads to high-scoring solutions. We love to see Foldit players continuing to share ideas and set high standards for each other! Some of the Foldit players who worked on the solutions in the paper have written up their folding strategies, which you can read in the paper supplement.
( Posted by bkoep | Mon, 11/11/2019 - 19:29 )
The Aflatoxin Challenge Returns!
The Aflatoxin Challenge is back! Since we left off last November, the Siegel Lab at UC Davis has been hard at work testing designs from Foldit players. Unfortunately, they ran into a major setback (all too common in scientific research), and had to go back to the drawing board to rethink their strategy. But they are back now with a new enzyme scaffold that is better suited to degrade the aflatoxin molecule, and they're asking Foldit players to redesign the enzyme so that it can bind aflatoxin more strongly!
Aflatoxin contamination in the food supply chain has resulted in health issues approaching epidemic status in developing countries, and vast food stores are deemed unsafe for consumption in regulated markets. There remain no effective means of aflatoxin removal that also maintain the food quality required for commercial products. Using modern synthetic biology tools, a UC Davis team of scientists in collaboration with the Mars Global Food Safety Center has spearheaded efforts to develop novel remediation tools.
In 2017, the Siegel Lab characterized a diverse panel of ~50 hydrolytic enzymes for expression and solubility. Then, a consortium with Mars, UC Davis, UW, Northeastern, ThermoFisher, FAO and PACA was developed around Foldit, so that citizen-scientist Foldit players might engineer new functionality into these hydrolytic enzymes and allow them to degrade the harmful aflatoxin molecule.
After the first 12 design rounds in Foldit, >500 designed proteins were tested—but not a single active enzyme was found! The UC Davis team went back and retested some fundamental assumptions that had been made when looking at the hydrolytic enzymes. They found that, at neutral pH, hydrolysis is not thermodynamically favorable for aflatoxin B1, and therefore it would have been impossible to develop a hydrolytic catalyst.
An alternative reaction
With this knowledge in hand, a new class of enzymes was targeted that catalyzes oxidative reactions, and requires nothing beyond O2. A set of ~20 diverse naturally occurring oxidative enzymes was synthesized and characterized. In initial activity screens, 2 of these were found to degrade all detectable aflatoxin. Today, we are restarting the Aflatoxin Challenge with a new Round 13 puzzle, in which players can redesign one of these active enzymes to improve hypothesized interactions with the aflatoxin molecule.
There is still a long way to go before this enzyme is efficient and specific enough for use in industrial settings. We are looking to the Foldit community to help us redesign the binding pocket. We hope Foldit players can introduce new packing interactions and hydrogen bonds with aflatoxin, to stabilize its hypothesized orientation and prime it for oxidation. We look forward to seeing what Foldit players can come up with! Play the new Aflatoxin Challenge: Round 13 puzzle now!
As in the previous aflatoxin puzzles, all Foldit player designs will be public domain. By participating in these Aflatoxin Challenge puzzles, the players agree that all player designs will be available permanently in the public domain, and the players will not seek intellectual property protection over the designs created as part of the challenge.
( Posted by bkoep | Mon, 09/30/2019 - 16:42 )
Protein Design Critique: IL-7R Binder Redesign
You’re doing great so far! I've looked at your solutions from the first 4 puzzle rounds, and I think a lot of your designs are going to work! I just wanted to remind everyone that, in addition to the Foldit score you get on each puzzle, in the end you'll also get a binding score based on our testing of these designs in the lab!
Designing a protein to fold precisely is a difficult problem! When we test your protein, we are testing whether the sequence you chose folds into the shape of your solution. In Foldit, you can change your solution into whatever shape you want, but in the lab your sequence might not fold into the shape you wanted. It took scientists decades to figure out the shape a given protein sequence folds into (they call this the Protein Folding Problem). The good news for you though is that the Protein Folding Problem has a really simple answer:
I want to emphasize a few guidelines you can use to ensure your designed fold is the most favorable state:
· Secondary structure - use lots of alpha-helices or beta sheets
· Puzzle score - try to have the best score for your chosen fold
· Short loops - you'll need to use loops, but keep them as short as possible
Next I’ll show some examples and give my thoughts on a few designs from Foldit players. Please note that all of these designs have been chosen because they showcase a single weakness in an otherwise excellent design. We don't mean to disparage anyone's designs—on the contrary, the solutions highlighted in this critique are among our favorites!
A study of two 3-helix bundles
While both of these structures emphasize secondary structure and well-packed cores, design A is more likely to fold because of its shorter loops.
The reason we prefer secondary structure to loops is that loops typically have many alternate conformations (decoys) that score the same or even better than the design model. Shorter loops mean fewer decoys and a better chance of folding as intended. For instance, one can imagine how the loop of design B could misfold so that the third helix is on the wrong side of the bundle.
Bad beta-sheet, better beta-sheet, best beta-sheet
Beta sheets are a tricky secondary structure, because they require distant parts of the protein chain to come together. The point I want to highlight here again is that shorter loops are almost always better. In design C, there are too many loop residues between the helices and sheets. These loop residues are likely to rearrange themselves in real life.
Design D has shorter loops, but I still see a few backbone H-bond pairs that are unsatisfied here. (Also, I'm not so sure about that ARG / GLU zipper there. ARG / GLU like to form helices, so I'd probably go with HIS / THR...)
Design E is an optimized Baker Lab design (not from the IL-7R series), but I wanted to include it to demonstrate my point. Look at how short those loops are! This is a difficult fold to master, but Foldit players like challenges, right?
4-helix bundles, the good, the bad, and the ugly
When it comes to 4-helical bundles (and really all designed proteins), the name of the game is compact. You want your design to resemble a ball with all portions stabilized by at least 2 other secondary structures. Design H fails just that; it's too long and unsupported. This structure will almost certainly fold into something more compact in real life.
Design G also fails this rule, as it's leaving a large portion of the structure thin and unsupported. Those two helices would have been better on top of the protein like the good example is doing here.
Yes, design F would be better if the helices were longer, but we didn't give players enough residues for that (unfortunately, we're limited to small proteins for our lab experiment). If you run out of residues for good helix packing, you can try beta-sheets, although previous experiments have shown that helices are more robust than beta-sheets. So if the choice is between an okay beta-sheet and an okay helix, I'd go for the helix.
Don't try to make additional target contacts
First let me say that these designs are very interesting in that they make additional contacts with the target. Especially in design I, I'm not even sure I could design that with all the tools I have! But, I want to remind everyone that in this design challenge, folding is more important than binding.
You've already been given two helices that are guaranteed to bind the IL-7R. If you can just fold the rest of the protein into a stable fold then you'll have a binder!
Great 3-helix bundle, but that long loop isn't going to fly
Finally, one more design to really hammer home the message of shorter loops. Design K looks great with three well packed helices, but look a little closer and you'll see that a long loop is required to stretch back and meet the third helix. I'll admit, this protein has a chance to work, but with a loop that long, who knows where the final helix will actually fold...
We have a lot more puzzles planned for this series, and we look forward to seeing more designs from Foldit players! Round 5 just closed, and we'll get started on the analysis of those solutions right away. In the meantime, check out the Round 6 puzzle, which is online now!
( Posted by bcov | Fri, 08/16/2019 - 17:57 )