First look at coronavirus solutions
Today was the final day for Puzzle 1805b: Coronavirus Spike Protein Binder Design! If you didn't get a chance to participate, or if you have more ideas to try, don't worry. There is a new and improved Round 2 puzzle where you can continue to work on your designs!
Now that our first puzzle has closed, scientists at the Institute for Protein Design at the University of Washington School of Medicine will take a close look at Foldit players' solutions. At first this will involve an intensive computational analysis of Foldit players' designs. We will try to assess both whether the designed proteins will fold correctly and whether they might bind to the coronavirus target.
Promising solutions will then advance to laboratory testing, where we will manufacture select Foldit player-designed proteins and test whether they stick to the coronavirus spike protein. (Don't worry: scientists can safely experiment with the coronavirus spike protein without exposing themselves to live virus.)
In the meantime, we've already had a chance to look at some initial solutions from Foldit players. Below, we take a brief look at some of our favorite solutions so far, what we like about them, and what can be improved.
In Foldit, orange sidechains are hydrophobic. This means that they like to be buried, away from the water that surrounds the outside of the protein. Proteins naturally tend to fold up in ways that bury these orange hydrophobics in the protein core. This is called the hydrophobic effect.
This same hydrophobic effect can also drive protein molecules to stick together! If two proteins each have small, complementary patches of orange hydrophobic sidechains on their surface (exposed to surrounding water), then the two proteins will tend to stick together in order to hide these sidechains.
For a designed protein to fold, we need to make sure it has a significant core with lots of buried orange sidechains. And for it to bind against the coronavirus target, it will also need to bury orange sidechains at the binding interface. However, if the designed protein has too many orange sidechains on its surface, it will misfold.
Below is an excellent designed protein! It has a significant core of buried orange hydrophobics, and the surface mostly consists of blue sidechains. In addition, almost all of the residues in the design form alpha-helices; this is a very stable configuration for the protein backbone. So, if we were to synthesize this protein in the lab, there's a good chance it would fold up into this desired shape!
At the binding interface (on the right), we can see that this design makes close contacts with two bulky hydrophobic sidechains (highlighted in purple) on the coronavirus target. This will likely help to bury the bulky hydrophobics away from the surrounding water, and may result in tight binding between the designed protein and coronavirus target!
The coronavirus spike protein is an especially difficult target to bind because there are not very many orange hydrophobics on its surface. In Foldit, blue sidechains are hydrophilic. These sidechains have polar oxygen and nitrogen atoms that can make very stable hydrogen bonds with the water surrounding the protein. For this reason, blue sidechains normally like to be exposed on the surface of the protein.
If we want to bind to the coronavirus target, our designed protein will probably have to bury some of these blue sidechains away from water. In other words, our designed binder will disrupt the stable hydrogen bonds that the blue sidechains normally make with water. The only way to compensate for this is by making hydrogen bonds to all of the oxygens and nitrogens that are buried at the interface.
In Foldit, you can see polar oxygen and nitrogen atoms by setting your View Settings to Hydro/Score+CPK coloring. This will color all oxygen and nitrogen atoms red and blue. Polar oxygen and nitrogens on the coronavirus target will need to be matched with polar atoms on the design to make hydrogen bonds!
Below is another excellent design from Foldit players! Again, this design has lots of orange sidechains buried in the core of the protein, with blue sidechains on the surface. This design also has lots of structure: the protein forms beta-sheets in addition to alpha-helices, which is another stable arrangement. So we think this design is likely to fold up correctly if tested in the lab!
If we look at the binding interface for this design (on the right), we see some very nice hydrogen bonding with some polar oxygens on the coronavirus target! Since these oxygens can normally make hydrogen bonds with surrounding water, it's very important that our designed binder can make these replacement hydrogen bonds.
However, if we look closer, we can see that our design introduces new polar atoms that are buried at the binding site, and not all of them make hydrogen bonds! The hydrogens numbered 1 through 4 are not making hydrogen bonds, and this could interfere with binding. This designed binder will tend to float away from the coronavirus target, so that all of these polar atoms can make hydrogen bonds with the surrounding water.
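The bookkeeping behind this check can be sketched in a few lines of Python. This is only an illustration: the atom labels and the list of hydrogen bonds below are made up, and Foldit's own detection of buried polar atoms is more involved than a simple lookup.

```python
# Toy bookkeeping of buried polar atoms at a binding interface.
# All atom labels and the bond list are made-up illustration data.

def unsatisfied_polars(buried_polar_atoms, hydrogen_bonds):
    """Return the buried polar atoms that take part in no hydrogen bond."""
    bonded = set()
    for donor, acceptor in hydrogen_bonds:
        bonded.add(donor)
        bonded.add(acceptor)
    return [atom for atom in buried_polar_atoms if atom not in bonded]

# Polar atoms that lose contact with water once the binder docks (hypothetical)
buried = ["target:O1", "target:O2", "design:N1", "design:N2", "design:O3"]

# Hydrogen bonds formed across the interface (hypothetical)
bonds = [("design:N1", "target:O1")]

leftover = unsatisfied_polars(buried, bonds)
print(leftover)
```

Every atom left in the list is a polar atom that gave up its hydrogen bonds with water and got nothing in return, which is the situation that tends to weaken binding.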
This last binder design also looks very promising at first glance. We see that the designed protein itself has lots of orange sidechains buried in the core, with blue sidechains on the surface, so it is likely to fold up correctly. We also see that there are lots of orange sidechains that are buried at the interface with the coronavirus target, so this should result in really tight binding between the design and the target!
However, this design binds to the wrong side of the coronavirus target! We see that this protein is designed next to the frozen section of the coronavirus protein, away from the flexible sidechains at the target binding site. If we overlay this design with the normal human receptor (highlighted in purple), we see that there is no overlap between the design and the human receptor!
This means that the coronavirus protein is capable of binding to both the design and the human receptor at the same time. So even if this design binds to coronavirus protein, it will probably not block the infection pathway of the virus.
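One rough way to picture the overlap test is to count how many atoms of the design sit within a few angstroms of atoms of the human receptor when both are placed on the target. The sketch below assumes made-up coordinates and an arbitrary 3 Å cutoff; it is not the analysis we actually ran.

```python
import math

def count_close_pairs(coords_a, coords_b, cutoff=3.0):
    """Count atom pairs (one from each set) closer than `cutoff` angstroms."""
    close = 0
    for a in coords_a:
        for b in coords_b:
            if math.dist(a, b) < cutoff:
                close += 1
    return close

# Hypothetical atom positions in angstroms, all in the frame of the target
receptor = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
blocking_design = [(0.5, 1.0, 0.0), (2.0, 1.0, 0.0)]    # sits where the receptor binds
offsite_design = [(30.0, 0.0, 0.0), (31.5, 0.0, 0.0)]   # sits on the far side

blocking_overlap = count_close_pairs(receptor, blocking_design)
offsite_overlap = count_close_pairs(receptor, offsite_design)
print(blocking_overlap, offsite_overlap)
```

A design with zero overlap leaves the receptor site open, so the target could bind both partners at once.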
In the new Round 2 puzzle, we amended the coronavirus target so that these off-target residues do not contribute to your Foldit score. In order to get the best score (and design an effective antiviral protein) players should focus on the flexible blue and orange sidechains at the normal binding interface.
Good luck, and happy folding!
Posted by bkoep | Thu, 03/05/2020 - 23:16
AlphaFold: Machine learning for protein structure prediction
In 2018, a group of computer scientists at DeepMind revealed a new method for protein structure prediction, called AlphaFold. In that year’s CASP competition, which benchmarks the state-of-the-art for protein structure prediction, AlphaFold swept the competition, generating more accurate predictions than any other research group.
AlphaFold has received considerable attention for this achievement, and a few weeks ago they published a scientific paper with the details of their new method. Since protein structure prediction often appears in Foldit puzzles, we wanted to review the AlphaFold method with Foldit players!
This blogpost is meant to summarize this exciting progress from AlphaFold, with an overview of their method, and some thoughts about the expected impact on protein research.
Machine Learning and Neural Nets
AlphaFold comes from DeepMind, a company well-known for tackling hard problems with machine learning algorithms. In 2016, a DeepMind program called AlphaGo famously beat a world-champion player of Go, a classic Chinese board game that is notoriously difficult for computer programs.
Machine learning (ML) is a branch of computer science that deals with self-improving algorithms. An ML algorithm is set up to perform a well-defined task, with a well-defined measure of performance. Over a “training” period, the algorithm is able to evaluate its own performance at the task and iteratively make changes that improve its performance.
One popular type of ML algorithm is a neural net, so called because it is inspired by the organization of neurons in the brain. Just like a web of neurons that communicate through synapses, a neural net is a web of virtual “nodes” that pass signals to one another. Typically, each node performs a simple mathematical operation on received signals (for example, testing if the sum of the signals exceeds some threshold), then passes on the new signal to downstream nodes. Training a neural net involves tuning the operations at each node so that the entire network produces the desired output from the training inputs.
A diagram of a simple neural network (from WikiMedia Commons). Signals are passed between nodes, each of which performs some simple (nonlinear) operation on the received signal and passes on the result. This network contains a single hidden layer of 4 nodes; the AlphaFold neural net contains hundreds of layers with thousands of nodes.
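To make the idea of nodes passing signals concrete, here is a minimal Python sketch of a network with one hidden layer of 4 nodes. The number of inputs, the weights, and the choice of a sigmoid as the nonlinear operation are all assumptions made for illustration. Training, which would tune these weights, is not shown.

```python
import math

def sigmoid(x):
    """A smooth nonlinear operation that squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each node sums its weighted inputs, adds a bias, and applies the nonlinearity."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hand-picked, hypothetical weights: 3 inputs -> 4 hidden nodes -> 1 output
hidden_weights = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.8, 0.0],
    [0.2, 0.2, -0.5],
    [0.7, -0.6, 0.4],
]
hidden_biases = [0.0, 0.1, -0.1, 0.2]
output_weights = [[1.0, -1.0, 0.5, 0.3]]
output_biases = [0.0]

def forward(inputs):
    """Pass a signal through the hidden layer, then through the output node."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)

prediction = forward([1.0, 0.0, 1.0])
print(prediction)
```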
Neural nets have been very useful for abstracting information from complex inputs. A popular application of neural nets is the image recognition problem: the input is a 2D array of colored pixels, and the task is to classify the depicted object.
The AlphaFold algorithm is a neural net, very similar to the kind used for image recognition. In this case, the input is information about the protein sequence, and the task is to predict the distance between each pair of residues in the folded protein.
Predicted Contacts vs. Predicted Distances
Many Foldit players will already be familiar with the concept of predicted contacts. These are residues in a protein that are predicted to be close to one another (“in contact”) in the folded structure, even if they are not neighbors in the protein sequence.
These predictions come from covariance patterns that emerge during evolution. We can observe these patterns by comparing very similar protein sequences in different organisms. For instance, we could compare the hemoglobin sequence in humans, chimps, dogs, mice, etc., and look for positions that tend to co-vary (i.e. two residues that seem to change together, as if they depend on one another). Strong covariance between two residues usually suggests that those residues interact with one another in the folded structure, through side-chain packing, H-bonding, electrostatics, etc.
Cartoon diagram of covariance (from GREMLIN). (Left) In these two related protein structures, the red and green residues interact with one another. When one of these mutates during the course of evolution, its partner may also have to mutate to maintain the interaction. (Right) Even when we don’t know the structure of these proteins, we can see evidence of this interaction when we compare lots of related protein sequences. The two positions in the dashed boxes display strong covariance.
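A simple way to put a number on covariance is mutual information between two columns of a sequence alignment. The sketch below uses a tiny made-up alignment. Tools like GREMLIN use more sophisticated statistical models, so treat mutual information here as a stand-in and not as the method behind the figure.

```python
import math
from collections import Counter

def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), joint in count_ab.items():
        p_ab = joint / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

# Made-up alignment: one row per organism, one column per position
alignment = [
    "AKLE",
    "AKLD",
    "GELE",
    "GELD",
    "AKLE",
    "GELD",
]
columns = list(zip(*alignment))

# Positions 1 and 2 always change together; positions 1 and 4 mostly do not
mi_coupled = mutual_information(columns[0], columns[1])
mi_uncoupled = mutual_information(columns[0], columns[3])
print(mi_coupled, mi_uncoupled)
```

The coupled pair scores much higher than the uncoupled pair, which is the kind of signal that suggests two residues interact in the folded structure.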
One of the key insights of the AlphaFold group was to take these predictions a step further: Instead of using covariance to predict whether two residues are “in contact” (a simple yes/no), AlphaFold attempts to predict the distance between the two residues (a range of values between 2 and 20 Å). These predictions are more difficult to make, but successful predictions provide much richer information about the folded protein structure.
We should note that, in 2018, a few other research groups were also using neural networks to predict distances—not just AlphaFold. The second insight of AlphaFold concerns their ability to generate a folded protein structure from predicted distances. They represent each distance prediction as a smooth restraint function, which allows them to employ a simple technique called gradient descent, directly folding the protein into a structure compatible with their predicted distances.
Predicted distances for residue pairs. (a) Similar to a contact map, this plot shows the predicted distance between every pair of residues in the structure. (b) For each pair of residues, the neural net produces a probability distribution of distances. For the pair of residues marked by the blue star in (a), we can see the probability distribution favors a distance of about 8 Å. (c) The probability distribution is converted to a smooth restraint function, where the lowest point of the function corresponds to the favored distance (in this case, 8 Å). A simple gradient descent algorithm allows AlphaFold to efficiently fold a protein structure that optimizes all of their distance predictions.
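The two ideas in this figure can be sketched in Python under some loud assumptions. The distance bins and probabilities are made up, the smooth restraint is replaced by a simple harmonic well centered on the favored distance, and the "protein" is just three points in space. AlphaFold's actual restraint functions and the variables it optimizes are not reproduced here.

```python
import math

# Step 1: turn a predicted distance distribution into a penalty.
bin_centers = [4.0, 6.0, 8.0, 10.0, 12.0]        # angstroms (made-up bins)
probabilities = [0.05, 0.20, 0.50, 0.20, 0.05]   # made-up network output
penalty = [-math.log(p) for p in probabilities]  # likely distances get a low penalty
favored = bin_centers[penalty.index(min(penalty))]

# Step 2: gradient descent on coordinates, using harmonic wells as restraints.
def restraint_energy(coords, restraints):
    """Sum of squared deviations from each restrained distance."""
    return sum(
        (math.dist(coords[i], coords[j]) - target) ** 2
        for i, j, target in restraints
    )

def gradient_step(coords, restraints, rate=0.05):
    """Move every point a small step downhill on the restraint energy."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j, target in restraints:
        dist = math.dist(coords[i], coords[j])
        if dist == 0.0:
            continue  # direction is undefined for coincident points
        scale = 2.0 * (dist - target) / dist
        for axis in range(3):
            delta = scale * (coords[i][axis] - coords[j][axis])
            grads[i][axis] += delta
            grads[j][axis] -= delta
    return [
        [c - rate * g for c, g in zip(point, grad)]
        for point, grad in zip(coords, grads)
    ]

# Three "residues" starting bunched together, with hypothetical target distances
coords = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.5, 1.0, 0.3]]
restraints = [(0, 1, favored), (1, 2, 6.0), (0, 2, 10.0)]

start_energy = restraint_energy(coords, restraints)
for _ in range(1000):
    coords = gradient_step(coords, restraints)
final_energy = restraint_energy(coords, restraints)
final_distance = math.dist(coords[0], coords[1])
print(start_energy, final_energy, final_distance)
```

Because every restraint is smooth, each step only needs the slope of the energy at the current coordinates, which is what makes plain gradient descent workable here.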
Finally, AlphaFold combines their distance predictions with the Rosetta energy function (the same energy function used by Foldit) to refine their final folded structure.
AlphaFold Performance in CASP
The Critical Assessment of protein Structure Prediction (CASP) is an opportunity for different researchers to compare their structure prediction methods in a head-to-head competition. The CASP organizers collect unpublished protein structures and challenge researchers to predict the structures based on their protein sequence. Because the true protein structures are unpublished, all the predictions are “blind,” and all the participants can evaluate their methods on a level playing field, starting from the same information.
AlphaFold’s neural net was able to make remarkably accurate distance predictions for many of the targets of the 2018 CASP competition, and this led them toward protein models that were very similar to the true structure. The best way to visualize AlphaFold’s success is to look at their summed Z-score for all targets in the Free Modeling category.
Rankings from the 2018 CASP Free Modeling category (from CASP13). The y-axis shows the summed Z-score across all targets in the category, with all competing groups on the x-axis. The leftmost bar represents the AlphaFold group.
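For readers unfamiliar with Z-scores, here is a minimal Python sketch of how a summed Z-score ranking could be computed. The group names and accuracy scores are invented, and the official CASP procedure has details (such as which accuracy measure is used and how outliers are handled) that this sketch does not attempt to reproduce.

```python
import statistics

def z_scores(values):
    """How many standard deviations each value sits above or below the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]

# Hypothetical accuracy scores (higher is better): one row per target,
# one column per competing group
groups = ["group_A", "group_B", "group_C"]
scores_by_target = [
    [70.0, 50.0, 45.0],
    [62.0, 58.0, 40.0],
    [80.0, 55.0, 60.0],
]

# Z-scores are computed per target, then summed per group across targets
summed = {group: 0.0 for group in groups}
for target_scores in scores_by_target:
    for group, z in zip(groups, z_scores(target_scores)):
        summed[group] += z

ranking = sorted(groups, key=lambda group: summed[group], reverse=True)
print(ranking, summed)
```

Scoring each target relative to the other groups means that a hard target, where everyone does poorly, counts as much as an easy one.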
This is an incredible achievement, and AlphaFold represents a significant step forward in protein structure prediction, but the structure prediction problem is still far from “solved.” For most natural proteins, AlphaFold relies heavily on covariance patterns, and often struggles when the target has very few related sequences (covariance is harder to detect with just a few related sequences). However, even with zero related sequences AlphaFold can still make distance predictions, albeit with lower confidence. AlphaFold showed this by correctly predicting the structure of Foldit3, a protein designed by Foldit players, with no related sequences and no covariance information!
One scientific limit of AlphaFold is that it suffers from the “black box” problem. Neural nets like the AlphaFold algorithm are considered “black box” techniques because their inner workings are hard to interpret. It is very difficult for us to deconstruct a neural network to figure out exactly what concepts the algorithm is “learning” about proteins. In other words, AlphaFold has improved our ability to predict a protein structure from its sequence, but it hasn’t directly increased our understanding of how protein sequence relates to structure.
Impact of AlphaFold
Since AlphaFold’s debut in 2018, many other research groups have begun experimenting with machine learning for predicting residue distances. Just this month, shortly after AlphaFold published their method, researchers at the Baker Lab published trRosetta, which builds on the AlphaFold method (see PDF from the Baker Lab website).
The Baker Lab researchers realized that a neural net could be trained to predict not just the distance between two residues, but also the relative orientation of those two residues. By training an algorithm to predict both distance and orientation between residues, the Baker group was able to make protein models with even greater accuracy!
Building on AlphaFold with trRosetta. (a) The AlphaFold neural net predicts only the distance between residue pairs. We can also train the neural net to predict the orientation of residue pairs (defined by several angles and torsions). (b) These angle and torsion predictions can also be converted into smooth restraint functions, which is key for applying the predictions to a protein model. (c) The orientation predictions improve the accuracy of final protein models for a set of CASP targets.
The CASP competition returns in the summer of 2020, and it will be very exciting to see how other groups have incorporated AlphaFold’s progress into their own prediction methods!
However, Foldit is unlikely to see any immediate changes as a direct result of AlphaFold’s success.
Since Foldit was launched in 2008, our focus has been gradually shifting away from protein structure prediction. The main reason for this is that we think Foldit players have more to contribute in other problems, like protein design or building models into cryoEM data. It’s likely that we can use distance predictions to help with these tasks (for example, to check if distance predictions for a designed sequence are compatible with the designed structure), but for now we are still evaluating the most effective ways to use neural nets for these kinds of problems!
Special thanks go to Baker Lab scientist Ivan Anishchenko for contributions to this blog post!
Posted by bkoep | Fri, 01/31/2020 - 00:25
The Poly-Proline Helix Design Series
Hi all, neilpg628 here to tell you about a new puzzle series we have planned to introduce a secondary structure to Foldit for the first time!
The poly-proline helix
All of the proteins that we have passed to you before have been composed mainly of α-helices and β-sheets. We want to introduce you to the poly-proline helix, which is much tighter than an α-helix, but is less stable because there is no internal hydrogen bonding between residues.
α-helices and β-sheets have hydrogen bonds that keep the structure together. The poly-proline helix has no such bonds.
While these helices are typically made almost-entirely out of proline, they can be made out of other amino acids, as long as the bond geometry is roughly the same as that of a regular poly-proline helix. They are found in many proteins, and we want to incorporate them into Foldit to make your contributions relevant to a wider range of proteins!
Unsatisfied polar atoms
Unlike α-helices, poly-proline helices have polar oxygens pointing out from the protein backbone (see above figure). It will be important to satisfy these polar oxygens with hydrogen bonds, to ensure that any protein incorporating a poly-proline helix stays folded!
These special helices have not been used much in the field of protein design, but they are found throughout nature! Collagen is a protein composed of 3 poly-proline helices, bundled together so that the backbone oxygens can make hydrogen bonds. Collagen has exceptional tensile strength, and is responsible for the toughness of animal connective tissue.
Collagen’s hydrogen bond network makes it extremely stable, even though it is composed of these unstable poly-proline helices.
New design puzzles with the poly-proline helix
In our first poly-proline helix puzzle, we’ll provide a small 38 residue protein with a frozen poly-proline helix and designable residues on either side. We want you to redesign the starting structure into a compact protein that can support the poly-proline helix. We're starting with a small protein, but we plan to introduce poly-proline helices attached to larger proteins in the future, to see if you can design more complex poly-proline helix proteins!
Promising designs will be tested in the lab for stable folding! This work could open up new opportunities to apply poly-proline helices in environments where they would not normally fold. Check out Puzzle 1763: Poly-Proline Helix Design: Round 1 now!
Happy designing!
Posted by neilpg628 | Thu, 11/21/2019 - 19:31
The Foldit cryo-EM paper
The latest Foldit research paper, about Cryo-EM Density puzzles, was published today in the journal PLOS Biology! The paper is open-access, meaning that anybody can read and share it for free, from the journal website.
The paper is a formal research article, so it is written in technical language meant for other scientists, and skips over some background info. Below, we cover the main points so that everyone can appreciate this accomplishment by Foldit players!
Electron density in Foldit
The paper is about recent Foldit puzzles in the Electron Density category, where players fold the target protein into a 3D “cloud” of density that maps the shape of the folded protein. The paper reports solutions from Puzzles 1572, 1588, 1598, and 1606.
Foldit Puzzle 1598: Cryo-EM Freestyle with Density
This is not the first time Foldit players have wowed us in an electron density puzzle! Some of you may remember Puzzle 1152: Foldit vs. UMich Electron Density Challenge from back in 2015. In that contest, players built solutions into a high-resolution (1.9 Å) density map from x-ray diffraction experiments. Foldit players outperformed UMich undergraduates, expert crystallographers, and state-of-the-art computer algorithms! Those results were published in a previous paper.
This previous result gave us a clue that electron density might be a sweet spot for Foldit players, so we started to look at other kinds of density maps...
Cryo-electron microscopy (cryo-EM) is another technique for getting density maps and solving protein structures. In a cryo-EM experiment, a sample of protein in solution is spread on a thin metal wafer and quickly cooled to cryogenic temperatures to quench all molecular motion, freezing all of the protein atoms in a sheet of vitreous ice. Then we bombard the frozen sample with a beam of high-energy electrons, which scatter when they collide with the atoms of the protein. A detector measures the electron scattering, and the result is a grainy 2D “micrograph” of the wafer and any proteins on its surface.
Example cryo-EM micrograph of the S. entomophila antifeeding prophage, used to generate the maps for the puzzles in this paper. Used with permission of Ambroise Desfosses and Irina Gutsche (source).
If we collect enough of these raw micrographs (think millions), then we can align all of the individual protein molecules and average them together to get a clearer 2D picture of the protein. Finally, we combine all the 2D images to arrive at a 3D reconstruction of the protein, in the form of a density cloud—very similar to the electron density clouds that we get from x-ray diffraction experiments!
Unlike x-ray diffraction, cryo-EM experiments are fairly easy to set up (no protein crystals needed!). But cryo-EM has been unpopular for protein structure research because it yields a lower-resolution, “blobbier” density cloud than x-ray diffraction. However, that started to change around 2012, when a technological breakthrough gave us improved electron-scattering detectors and higher resolution maps. Since then, cryo-EM has taken off, and the number of new cryo-EM protein structures has been doubling every 2 years (by contrast, new x-ray diffraction structures have plateaued since 2013).
Cryo-EM and Foldit players
Even with the recent improvements, cryo-EM maps are not quite as clear as x-ray diffraction maps. The highest resolution typically achieved by cryo-EM is about 3.0 Å. Since covalently bonded atoms are separated by < 2 Å, that means we still can’t make out the positions of individual atoms simply by looking at the map. Instead, we have to infer the positions of the atoms, using our knowledge of physics and protein structure to find a plausible model that fits the map.
Building a plausible protein structure into a low-resolution map is difficult and prone to errors. If a microscopist focuses too much on fitting the density cloud, they might end up with a strained (high energy) model that is physically unrealistic. On the other hand, a computer algorithm that optimizes energy can have a hard time fitting a model into the density map.
This is where Foldit players come in! We know from previous work that Foldit players are adept at interpreting density maps; and the Foldit score function should help guide players toward plausible, low energy models.
In Puzzles 1572, 1588, 1598, and 1606, we provided Foldit players with cryo-EM maps for four proteins that make up the S. entomophila antifeeding prophage (a complex needle-like structure used by bacteria to inject toxins into a target cell). We then compared Foldit player solutions with those of expert microscopists and a handful of automated algorithms.
Comparison of solutions from different methods. (Top) The top Foldit solution from Puzzle 1588 and the model built by the scientist. They look pretty similar when you look this zoomed out, but looking closer: (Bottom) Subtle deviations in the models can yield significant results. In the bottom-right image, an automated algorithm (magenta) had trouble matching the density, and left some regions of the map completely empty.
Foldit players take gold!
In each of the four puzzles, Foldit player solutions had the best balance of plausibility and fit-to-density! If you’re curious, the scientists came in second, and the algorithms came in last (but there was a lot of variance between different algorithms).
Foldit players achieve plausibility and high fit-to-density for AFP7 (Puzzle 1588). (Left) Microscopists build strained models that have many clashes. (Right) Automatic algorithms like Rosetta and Phenix build models with poor fit-to-density (according to three different measures of map correlation). Foldit players build realistic models with few clashes, and still fit the density with a high map correlation.
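As a rough picture of what a map correlation measures, the Python sketch below computes a plain Pearson correlation between the density expected from a model and the experimental density, sampled at the same grid points. The density values are made up, and this generic correlation is not necessarily any of the three measures used in the paper.

```python
import math

def map_correlation(model_density, map_density):
    """Pearson correlation between two density grids, flattened to lists."""
    n = len(model_density)
    mean_model = sum(model_density) / n
    mean_map = sum(map_density) / n
    covariance = sum(
        (m - mean_model) * (e - mean_map)
        for m, e in zip(model_density, map_density)
    )
    var_model = sum((m - mean_model) ** 2 for m in model_density)
    var_map = sum((e - mean_map) ** 2 for e in map_density)
    return covariance / math.sqrt(var_model * var_map)

# Made-up density values at six grid points
experimental_map = [0.1, 0.9, 0.8, 0.2, 0.0, 0.7]
good_fit_model = [0.0, 1.0, 0.9, 0.1, 0.1, 0.6]   # atoms placed where the map is dense
poor_fit_model = [0.9, 0.1, 0.0, 0.8, 0.7, 0.2]   # atoms placed in empty regions

good_cc = map_correlation(good_fit_model, experimental_map)
poor_cc = map_correlation(poor_fit_model, experimental_map)
print(good_cc, poor_cc)
```

A correlation near 1 means the model puts atoms where the map shows density. A low or negative value means the model leaves dense regions empty or fills empty ones.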
We also want to point out that the Foldit rankings were incredibly accurate in these puzzles! As most players are aware, the best-scoring solution in Foldit is not necessarily the most accurate scientifically (because the Foldit score function is not a perfect reflection of reality). This is why we run our scientific analysis on all of the high-scoring solutions, to see what actually looks best against the scientific data: sometimes it’s rank #2, and sometimes it’s rank #20. However, in all four of these cryo-EM puzzles, the #1 top-scoring Foldit solution also had the best scientific evaluation!
This is important because it supports the accuracy of the Foldit score function. Foldit players can have more confidence that when their score goes up, so does the scientific value of their solution. It should also give more confidence to other scientists that might want to collaborate with Foldit players in the future. We hope this is just the beginning for Foldit cryo-EM!
Finally, we want to thank all the Foldit players that participated in these cryo-EM puzzles! Even if you didn’t work directly on the models presented in the paper, your folding helps to drive the competition that leads to high-scoring solutions. We love to see Foldit players continuing to share ideas and set high standards for each other! Some of the Foldit players who worked on the solutions in the paper have written up their folding strategies, which you can read in the paper supplement.
Posted by bkoep | Mon, 11/11/2019 - 19:29
The Aflatoxin Challenge Returns!
The Aflatoxin Challenge is back! Since we left off last November, the Siegel Lab at UC Davis has been hard at work testing designs from Foldit players. Unfortunately, they ran into a major setback (all too common in scientific research), and had to go back to the drawing board to rethink their strategy. But they are back now with a new enzyme scaffold that is better suited to degrade the aflatoxin molecule, and they're asking Foldit players to redesign the enzyme so that it can bind aflatoxin more strongly!
Aflatoxin contamination in the food supply chain has resulted in health issues approaching epidemic status in developing countries, and vast food stores are deemed unsafe for consumption in regulated markets. There remain no effective means of aflatoxin removal that also maintain the food quality required for commercial products. Using modern synthetic biology tools, a UC Davis team of scientists in collaboration with the Mars Global Food Safety Center have spearheaded efforts to develop novel remediation tools.
In 2017, the Siegel Lab characterized a diverse panel of ~50 hydrolytic enzymes for expression and solubility. Then, a consortium with Mars, UC Davis, UW, Northeastern, ThermoFisher, FAO and PACA was developed around Foldit, so that citizen-scientist Foldit players might engineer new functionality into these hydrolytic enzymes and allow them to degrade the harmful aflatoxin molecule.
After the first 12 design rounds in Foldit, >500 designed proteins were tested—but not a single active enzyme was found! The UC Davis team went back and retested some fundamental assumptions that had been made when looking at the hydrolytic enzymes. They found that, at neutral pH, hydrolysis is not thermodynamically favorable for aflatoxin B1, and therefore it would have been impossible to develop a hydrolytic catalyst.
An alternative reaction
With this knowledge in hand, a new class of enzymes was targeted that catalyzes oxidative reactions, and requires nothing beyond O2. A set of ~20 diverse naturally occurring oxidative enzymes were synthesized and characterized. In initial activity screens, 2 of these were found to degrade all detectable aflatoxin. Today, we are restarting the Aflatoxin Challenge with a new Round 13 puzzle, in which players can redesign one of these active enzymes to improve hypothesized interactions with the aflatoxin molecule.
There is still a long way to go before this enzyme is efficient and specific enough for use in industrial settings. We are looking to the Foldit community to help us redesign the binding pocket. We hope Foldit players can introduce new packing interactions and hydrogen bonds with aflatoxin, to stabilize its hypothesized orientation and prime it for oxidation. We look forward to seeing what Foldit players can come up with! Play the new Aflatoxin Challenge: Round 13 puzzle now!
As in the previous aflatoxin puzzles, all Foldit player designs will be public domain. By participating in these Aflatoxin Challenge puzzles, the players agree that all player designs will be available permanently in the public domain, and the players will not seek intellectual property protection over the designs created as part of the challenge.
Posted by bkoep | Mon, 09/30/2019 - 16:42