This is part two of a three-part blog post about the problem of protein design. Part one can be found here.
In this blog post, we’re going to introduce a physics concept called a partition function, which will help us think quantitatively about the energy landscape described in the previous blog post. Partition functions come from a branch of physics called statistical mechanics, which uses probability to describe how large ensembles of molecules behave—typically in conditions of thermodynamic equilibrium. Before we get into partition functions, let's review what it means to be "at thermodynamic equilibrium."
Proteins at equilibrium
Consider a test tube containing a sample of protein in water. Let’s say the test tube contains a billion billion (10^18) identical protein molecules, which is about the amount of protein we generally work with in the lab. If we let the test tube sit long enough at room temperature, the contents of the test tube will reach thermodynamic equilibrium. At equilibrium, the temperature is the same everywhere in the test tube, and the molecules in the test tube have settled into their most balanced configuration. Or, in other words, all the thermal motion in the test tube is distributed as evenly as possible among all the molecules in the test tube. This is important because it means the behavior of all 10^18 protein molecules should be consistent throughout the test tube. And this allows us to treat all of the protein molecules together as an ensemble. Even if we can’t point to an individual protein molecule in the sample and describe exactly what that molecule is doing, we can use statistical mechanics to describe the ensemble as a whole.
Suppose our protein is stable at room temperature. In that case, we expect most of the protein molecules in our test tube will be folded. We say most of the molecules are folded, but not all of them, because in reality the ensemble will adopt a distribution of states, corresponding to all the different valleys in our protein’s energy landscape. Remember, our test tube contains 10^18 protein molecules, and at room temperature there is a considerable amount of thermal motion in the test tube jostling around all of the protein molecules. So, even though the conditions are consistent throughout the test tube, there will still be a certain amount of deviation among the molecules of the ensemble. By chance some individual protein molecules will be jostled out of the folded state and into other regions of the energy landscape.
In our test tube, suppose that the distribution includes:
- A hundred protein molecules that are completely unfolded.
- A thousand molecules that are in a partially folded state.
- A million molecules in a mostly-but-not-completely folded state.
- A billion molecules that are misfolded into a completely different structure.
Well, even a billion misfolded molecules is just a tiny fraction of the total 10^18 molecules in our test tube, so overall the ensemble is still overwhelmingly (99.9999999%) in the folded state.
As long as we’re talking about distributions, we should clarify that there are different ways to visualize distributions of proteins. Above, we just described a distribution in terms of numbers of molecules. In this way, we can visualize how a large ensemble of molecules is divided among the states at a single instant. In addition, we can also think of the same distribution in terms of time, and visualize how a single molecule will divide its time among the different states. It’s important to remember that each protein molecule is constantly being jostled around by thermal motion, and will fold and re-fold as it jumps randomly around the energy landscape, spending some amount of time in each state according to our distribution.
Even as individual molecules jump around the different states, we expect them to do so in such a way that the overall distribution remains constant, so long as we are at equilibrium. In our test tube, for example, it’s likely that one of the hundred unfolded molecules will be jostled into the folded state; but it is equally likely that one of the 10^18 folded proteins will be jostled into the unfolded state, and the two transitions cancel each other out, so the ensemble looks the same. This is the essence of thermodynamic equilibrium: at equilibrium, molecules may still exchange between different states, but the exchange is balanced so that the overall distribution does not change.
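To make the idea of balanced exchange concrete, here is a minimal sketch of a two-state system (folded and unfolded). The rate constants `k_fold` and `k_unfold`, the starting populations, and the function name are all made-up illustrative assumptions, not measurements. The sketch only shows that the populations stop changing once the number of molecules folding per unit time equals the number unfolding.

```python
def relax_to_equilibrium(n_folded, n_unfolded, k_fold, k_unfold, dt=0.01, steps=1000):
    """Step a simple two-state rate model forward in time.

    k_fold and k_unfold are hypothetical per-molecule transition rates
    (assumptions for illustration only).
    """
    for _ in range(steps):
        folding = n_unfolded * k_fold * dt     # molecules going unfolded -> folded
        unfolding = n_folded * k_unfold * dt   # molecules going folded -> unfolded
        n_folded += folding - unfolding
        n_unfolded += unfolding - folding
    return n_folded, n_unfolded


# Start far from equilibrium: every molecule unfolded.
n_f, n_u = relax_to_equilibrium(0.0, 1000.0, k_fold=9.0, k_unfold=1.0)

# Once equilibrium is reached, the two opposing flows are equal.
flow_in = n_u * 9.0   # unfolded -> folded
flow_out = n_f * 1.0  # folded -> unfolded
print(round(n_f), round(n_u))
```

Individual molecules keep switching states in this model, but the two flows match, so the overall split between folded and unfolded no longer changes.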
From here onward, we’ll talk about distributions in terms of probabilities, but you can think about these probabilities either as fractions of molecules in an ensemble, or as fractions of time—both are equally valid!
The partition function
The true distribution of states depends on the energy of each state. If we know the energies of all the possible states for a protein, then we can use a partition function to determine exactly what this distribution looks like. From statistical mechanics, we know that the probability of a state decreases exponentially with its energy, and we can write the partition function:
$$P(s) = \frac{e^{-E_s / kT}}{Z}$$
where P(s) is the probability of state s, E_s is the energy of state s, k is the Boltzmann constant, and T is the absolute temperature. The number Z is simply used to normalize the probabilities so that all of the probabilities sum to 100% (as probabilities should), so Z is equal to the total sum of the exponentials for all the states:
$$Z = \sum_{s} e^{-E_s / kT}$$
Don’t worry if these equations look like gibberish to you! You don’t need to understand them to follow along—but we wanted to include them for the sake of completeness.
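For readers who prefer code to equations, here is a minimal sketch of the same calculation. The function name `boltzmann_probabilities`, the default temperature of 298 K, and the three example energies are assumptions chosen for illustration; energies are taken to be in kcal/mol.

```python
import math

K_BOLTZMANN = 0.0019872  # Boltzmann (gas) constant in kcal/(mol*K)


def boltzmann_probabilities(energies, temperature=298.0):
    """Return P(s) for each state, given state energies in kcal/mol."""
    kT = K_BOLTZMANN * temperature
    # Shifting by the lowest energy avoids overflow and does not change P(s).
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    z = sum(weights)  # normalizing sum (Z, up to the shift factor, which cancels)
    return [w / z for w in weights]


# Three hypothetical states at 0, 1 and 3 kcal/mol.
probs = boltzmann_probabilities([0.0, 1.0, 3.0])
for energy, p in zip([0.0, 1.0, 3.0], probs):
    print(f"{energy:.1f} kcal/mol -> {p:.4f}")
```

The probabilities always sum to 1, and the lowest-energy state always gets the largest share.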
The essential thing to know about the partition function is that the relationship between probability and energy is exponential. This means that (at room temperature) a small energy difference of 1.4 kcal/mol (roughly 14 Foldit points) translates to a ten-fold change in probability. To produce the supposed distribution in our test tube above, the states must have the following energies relative to the folded state (1.4 kcal/mol for each ten-fold drop in population):
- Folded: 0 kcal/mol
- Misfolded: 12.6 kcal/mol
- Mostly folded: 16.8 kcal/mol
- Partially folded: 21.0 kcal/mol
- Unfolded: 22.4 kcal/mol
We see that states with higher energy (and lower Foldit score) have lower probabilities, and will account for smaller portions of the partition function. Going back to our energy landscape analogy, we’ve just described an energy landscape with five different valleys, with precise depths for each valley. For example, the Folded valley is “deeper” than the Misfolded valley by 12.6 kcal/mol. So, according to our partition function, if there are 10^18 molecules in the Folded state at room temperature, then we can expect to find about 1 billion molecules in the Misfolded state.
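The arithmetic behind these valley depths can be sketched in a few lines. This assumes the rule of thumb quoted above (1.4 kcal/mol per ten-fold change in probability) and the molecule counts from the test tube example; the names `KCAL_PER_TENFOLD` and `relative_energy` are introduced here for illustration.

```python
import math

KCAL_PER_TENFOLD = 1.4  # kcal/mol per ten-fold change in probability (rule of thumb from the post)

# Molecule counts from the test tube example.
populations = {
    "folded": 1e18,
    "misfolded": 1e9,
    "mostly folded": 1e6,
    "partially folded": 1e3,
    "unfolded": 1e2,
}


def relative_energy(n_state, n_reference):
    """Energy of a state above the reference state, in kcal/mol."""
    return KCAL_PER_TENFOLD * math.log10(n_reference / n_state)


energies = {
    name: relative_energy(n, populations["folded"])
    for name, n in populations.items()
}
for name, energy in energies.items():
    print(f"{name}: {energy:.1f} kcal/mol")
```

The misfolded state comes out 12.6 kcal/mol above the folded state, matching the valley depth quoted in the text.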
Partition functions for Foldit designs
The partition function can help us quantify the differences between energy landscapes. In the previous blog post, we examined two designs by fiendish_ghoul and saw how our current protein design strategy (i.e. optimizing for absolute energy) can lead to both good and bad energy landscapes. Let’s return to those two designs:
These energy landscape scatter plots were described previously, and show how Foldit players’ solutions in De-novo Freestyle puzzles can tell us about the energy landscape of a protein. Here, we’ve clustered the solutions from those puzzles to define several distinct states for each protein, which represent individual valleys in their energy landscapes. The ten lowest-energy states are highlighted as multicolored points in the energy landscape.
Now that we have a set of decoy states and their relative energies, we can calculate the partition function for each of these proteins:
The partition function is illustrated here as a multi-colored bar plot, where each bar represents the probability for one of the ten states identified in the De-novo Freestyle energy landscape. Note that the height of the bars is shown on a logarithmic scale, so a bar that reaches 10^-2 has a probability of 1 in 100, or 1%; a bar that reaches 10^-6 has a probability of 1 in a million, or 0.0001%. We see that for the protein on the left, the partition function is dominated by the folded state (in blue), with a probability of virtually 10^0, or 100%. The next most probable state (orange) has a probability of less than 10^-12 (1 in a trillion), so it doesn’t even show up on this scale. On the other hand, we see a very different picture for the protein on the right. The partition function shows that the ensemble of molecules will be distributed more evenly across a number of different states. In fact, the blue state has a probability of about 54%, the orange state 32%, the green state 12%, the red state 1.3%, and so forth.
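As a back-of-the-envelope sketch, the probabilities quoted for the protein on the right can be converted back into approximate energy gaps, again assuming the rule of thumb of 1.4 kcal/mol per ten-fold change in probability. The names `KCAL_PER_TENFOLD` and `energy_gap` are introduced here for illustration, and the results are only as precise as the rounded percentages.

```python
import math

KCAL_PER_TENFOLD = 1.4  # kcal/mol per ten-fold change in probability (rule of thumb from the post)

# Probabilities quoted for the four most probable states of the right-hand protein.
probabilities = {"blue": 0.54, "orange": 0.32, "green": 0.12, "red": 0.013}


def energy_gap(p_state, p_reference):
    """Approximate energy of a state above the reference state, in kcal/mol."""
    return KCAL_PER_TENFOLD * math.log10(p_reference / p_state)


gaps = {
    color: energy_gap(p, probabilities["blue"])
    for color, p in probabilities.items()
}
for color, gap in gaps.items():
    print(f"{color}: {gap:.2f} kcal/mol above the blue state")
```

All four of these states sit within a few kcal/mol of each other. By the same rule of thumb, a probability below 10^-12 (the left-hand protein's second state) corresponds to a gap of more than 12 × 1.4 = 16.8 kcal/mol.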
We happen to have experimental data for each of these proteins, which seem to support the energy landscapes and partition functions that Foldit players discovered in the De-novo Freestyle puzzles. Below are raw data from circular dichroism (CD) experiments, which tell us about the amount of structure in these proteins:
The CD spectrum for the left protein has a shape that is characteristic of a fully-folded protein with both α-helices and β-sheets (we also have a crystal structure of this protein, so we know for certain it is well-folded). Note the wide, flat trough between 208 and 222 nm, and the peak on the left side of the trace where the measured ellipticity is positive for wavelengths below 200 nm. The CD spectrum on the right has a slightly different shape. There is still a partial trough between 208 and 222 nm, so there is definitely some amount of secondary structure (probably helices and maybe sheets) in this protein. However, the measured ellipticity is still negative at shorter wavelengths below 200 nm (admittedly, the signal is much noisier in this spectrum, so the exact shape of the curve is difficult to discern). This indicates that a significant portion of this protein is unstructured, which we would expect from a protein with an unfunnelled landscape and many low-energy decoy states.
So, how can we use these partition functions to improve designed proteins? Check back Thursday for the last blog post in this series, where we’ll propose a possible strategy for using partition functions to improve protein designs in Foldit!
Edit: Read more in Part 3 of this blog series.
( Posted by bkoep | Mon, 08/27/2018 - 19:51 )
The problem of protein design
This is the first of a three-part blog post. In the first part, we’re going to review the concept of energy landscapes, which some of you may already be familiar with. In the second part, we’ll discuss how a concept from physics, called a partition function, can help us think about energy landscapes. In the last part, we’ll propose a way that we might use these concepts of energy landscapes and partition functions to improve protein design in Foldit.
The energy landscape
There’s a problem with the way we currently design proteins in Foldit—and not just in Foldit, but also in Rosetta. In fact, it’s a problem in any protein design strategy that optimizes the absolute energy of the design. This strategy is the premise of a Foldit design puzzle. The Foldit score measures the absolute energy of a solution (with a negative multiplier), so that when players compete to find solutions with the highest score, they are actually competing to find solutions with the lowest absolute energy.
However, the success of a protein design (i.e. whether or not the protein folds) does not depend only on the absolute energy of the design. Rather it depends on the protein’s energy landscape. The energy landscape is a concept we use to think about all the possible ways that a string of amino acids can fold. As any Foldit player knows, there are a lot of different ways to fold up a string of amino acids, and they all have different energies (or Foldit scores). We can imagine the energy landscape as a surface where every (x,y) coordinate represents a different fold, or state, and the height of the surface (the z-coordinate) represents the energy of that state. In some places there will be hills, which represent states with a high energy (low Foldit score), and in other places there will be valleys, where folds have a low energy (high Foldit score).
Conceptual illustration of a protein energy landscape, from Dill, K.A. and MacCallum, J.L. (2012)
One of the reasons we like the analogy of energy landscapes is that we intuitively understand how things tend to “prefer” low points in the landscape. If you place an object randomly on the energy landscape, it will tend to slide downhill, from a high-energy state to a low-energy state. If we consider the effect of thermal motion that is constantly jostling around the object (imagine a Mexican jumping bean that randomly jumps around the landscape), then the object will explore all the different valleys of the energy landscape. Nevertheless, the Mexican jumping bean will spend the most time in the deepest valleys of the landscape.
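The jumping bean can be imitated with a small simulation. The sketch below uses a Metropolis-style random walk, which is a standard textbook technique and not something taken from Foldit or Rosetta. The five energies (in units of thermal energy, kT), the ring layout, and the function name `simulate_bean` are made up for illustration.

```python
import math
import random

# Hypothetical energies, in units of kT, for five states arranged on a ring.
ENERGIES = [0.0, 3.0, 1.5, 4.0, 2.0]


def simulate_bean(energies, steps=200_000, seed=0):
    """Metropolis random walk; returns the fraction of time spent in each state."""
    rng = random.Random(seed)
    n = len(energies)
    state = rng.randrange(n)
    visits = [0] * n
    for _ in range(steps):
        # Propose a hop to the left or right neighbor.
        candidate = (state + rng.choice((-1, 1))) % n
        delta = energies[candidate] - energies[state]
        # Downhill hops are always accepted; uphill hops only sometimes.
        if delta <= 0 or rng.random() < math.exp(-delta):
            state = candidate
        visits[state] += 1
    return [v / steps for v in visits]


fractions = simulate_bean(ENERGIES)
for energy, fraction in zip(ENERGIES, fractions):
    print(f"energy {energy:.1f} kT: {fraction:.1%} of the time")
```

The bean visits every state, but it spends most of its time in the deepest valley (the state with energy 0.0) and only rarely sits on the highest hill.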
A protein behaves the same way in its energy landscape. At room temperature, there is a considerable amount of thermal motion that allows the protein to explore its energy landscape, although the protein will spend the most time in the states with lowest energy. Every amino acid sequence has a different energy landscape, with different valleys in different places. When you mutate amino acids in a Foldit puzzle to find higher scores for your design, what you’re really doing is looking for an energy landscape where your design is in a deeper valley. However, the Foldit score only tells you about the energy of your designed folded state—or the “depth” of your desired valley. What we’re not considering in Foldit is the rest of the landscape, and whether there might be other low-energy “decoy states”—other deep valleys for your protein to explore.
This is a difficult problem to solve because the energy landscape for a protein is vast. It’s difficult to account for the decoy states because we don’t know what they might look like. We don’t know where to search in the energy landscape for other low-energy valleys, and the landscape is too big to search exhaustively.
The search for decoys
As many of you are probably aware, a lot of the recent De-novo Freestyle prediction puzzles have targeted Foldit player-designed proteins. The purpose of these puzzles is to look for low-energy decoy states, or alternative valleys in the energy landscape. We already run Foldit designs through Rosetta@home to look for decoy states—and for the most part, Rosetta@home seems to do a pretty good job. But occasionally Foldit players find solutions that Rosetta@home misses.
In the following example we're going to pick on fiendish_ghoul, because this energy landscape problem is clearly illustrated by two of their designs, shown below:
The protein on the left is a design originally from Puzzle 1331; the protein on the right is a design from Puzzle 1239. Beneath each cartoon protein structure is a scatter plot with the results from corresponding De-novo Freestyle puzzles that we posted using the sequence of each design. Each black point represents a solution, plotted with respect to its RMSD to the folded state (x-axis) and its energy (y-axis). Together these points give us a profile of the energy landscape for each protein. We see that the design on the left has a “funnelled” landscape, such that the lowest-energy solutions are those close to the folded state (RMSD close to zero) and solutions very different from the folded state (large RMSD) all have higher energies. In the design on the right, however, Foldit players identified a number of decoy states that are very different from the folded state (large RMSD), and have energy just as low as the folded state. These decoy states (marked with colored circles in the scatter plot) appear as “valleys” in the energy landscape of the protein.
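One way to picture what "funnelled" means in such a scatter plot is as a simple filter over (RMSD, energy) points. The sketch below is hypothetical: the function `find_decoys`, the thresholds `rmsd_cutoff` and `energy_window`, and the example points are all invented for illustration, and this is not the analysis used for the puzzles.

```python
from typing import List, Tuple

Solution = Tuple[float, float]  # (RMSD to the designed state, energy)


def find_decoys(solutions: List[Solution],
                rmsd_cutoff: float = 4.0,
                energy_window: float = 2.0) -> List[Solution]:
    """Return solutions far from the design whose energy rivals the best near-design solution.

    rmsd_cutoff and energy_window are arbitrary illustrative thresholds.
    """
    near = [energy for rmsd, energy in solutions if rmsd <= rmsd_cutoff]
    if not near:
        return []
    best_near_energy = min(near)
    return [
        (rmsd, energy)
        for rmsd, energy in solutions
        if rmsd > rmsd_cutoff and energy <= best_near_energy + energy_window
    ]


# Made-up points for illustration, not data from the puzzles.
funnelled = [(0.5, -300.0), (1.2, -295.0), (6.0, -250.0), (9.0, -240.0)]
unfunnelled = [(0.5, -300.0), (1.5, -296.0), (7.0, -299.0), (10.0, -301.0)]

print(find_decoys(funnelled))    # []
print(find_decoys(unfunnelled))  # [(7.0, -299.0), (10.0, -301.0)]
```

In the funnelled example every distant solution has much higher energy, so nothing is flagged. In the unfunnelled example, two distant solutions have energy about as low as the designed state, so both are flagged as decoys.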
The cartoon structures of these decoy states are shown below using the same rainbow coloring as above, with the N-terminus of the protein colored blue, and the C-terminus of the protein colored red:
In each of the decoy structures, all of the α-helices and β-strands are there, but it appears there is some ambiguity about where the helices should go. According to the solutions from the De-novo Freestyle puzzle, the three α-helices can fold in different arrangements around the central β-sheet, and all of these arrangements have similar energies. Since all of these states have similar energy, the protein will not have a strong preference for any single one of them.
Both of fiendish_ghoul's proteins were designed by optimizing their absolute energy, but the protein on the right has a problematic energy landscape. If we made these proteins in the lab, we would expect the protein on the left to be well-folded, and to spend most of its time in the designed state, since it appears to be the only deep valley in the landscape. However, we would expect the protein on the right to be poorly-folded, and to spend its time sampling all the different decoy states discovered by Foldit players.
Check back on Monday for the next blog post, where we’ll discuss these energy landscapes in more detail!
( Posted by bkoep | Fri, 08/24/2018 - 21:17 )
WeFold paper on CASP11 is now published!
The latest WeFold paper, titled "An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12", was just published in Scientific Reports, Nature's online open-access journal.
This publication describes the results and analysis of the CASP11 and CASP12 WeFold coopetition (cooperation and competition), highlighting lessons learned and improvements over the first WeFold attempt from CASP10.
The Foldit Players consortium is listed as a co-author, and the Acknowledgements section of the paper begins with: "The authors would like to acknowledge the collaboration of hundreds of thousands of citizen scientists who contributed millions of decoys through the Foldit game."
Congratulations to all of you and keep up the great folding!
( Posted by beta_helix | Tue, 07/03/2018 - 22:12 )
WeFold paper on CASP11 has just been accepted!
The paper is titled "An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12" and, just as with the first WeFold paper about CASP10, Foldit played a very large part in CASP11!
In order to publish these results, however, we must abide by the new authorship policy that journals have implemented, which requires names and affiliations for all authors. Previously, we had used "Foldit Players" (or Players, F.) to represent all of you.
Similar to Foldit's previous publication in Nature Communications, for this paper you will all be under the group consortium: "Foldit Players", and anyone who played a CASP11 puzzle has the option to list their complete real name (we cannot use Foldit usernames).
If you played one of the CASP11 puzzles and would like your full real name to be included in the group consortium list for this paper, please follow the directions below in the comments by Saturday, June 2nd at 11:59pm GMT.
We would like to emphasize that this is completely voluntary, as we will of course also have a statement in the acknowledgements thanking all Foldit players, just not by name.
( Posted by beta_helix | Thu, 05/24/2018 - 02:35 )
Foldit's 10 Year Anniversary!
Today marks the 10-year anniversary of Foldit’s launch on May 9, 2008!
In the past decade, Foldit players have advanced protein science by accurately predicting the structure of a viral protein [1], by developing an algorithm for protein modeling [2], and by redesigning a protein enzyme with improved activity [3]. Foldit players have shown that they can refine protein models better than sophisticated computer programs [4], and that they can interpret electron density maps as well as expert crystallographers [5]. We have high hopes for the next 10 years of Foldit, and can't wait to see what Foldit players will discover next!
Protein Design in Foldit
Most recently, Foldit players have been designing brand new proteins from scratch. The ability to design proteins is a big milestone for Foldit players, and we’re excited about the new types of problems that we can start to tackle with protein design in Foldit! This achievement has been a long time in the making—below you can review previous blog posts to follow this progress over the last four years. Play the latest design puzzle now!
Nov. 1, 2013 - First batch of Foldit player-designed proteins selected for testing
Mar. 25, 2014 - Improvements in Foldit player-designed proteins
Jun. 18, 2014 - First positive testing results for a Foldit player-designed protein
Feb. 10, 2015 - First alpha/beta Foldit designs selected for testing
Feb. 28, 2017 - Better backbones yield promising alpha/beta designs
Mar. 1, 2017 - Diverse player designs fold up in the wet lab
Apr. 15, 2017 - Protein crystallography of a Foldit player design
May 30, 2017 - X-ray diffraction of a protein crystal
A high-resolution crystal structure (cyan) aligned with the design model (green) shows that this protein folds up just as it was designed by Waya, Galaxie, and Susume. The protein backbone aligns to the design with a Cα RMSD of 1.1 Å, and the sidechains in the protein core pack just as intended.
Small Molecule Design in Foldit
We’re also excited to ramp up small-molecule design in Foldit, allowing Foldit players to create new ligands that could bind to protein targets! Play the latest small-molecule design puzzle now!
New tools allow Foldit players to build small molecules that can bind to protein targets
We'd like to thank all the Foldit players that have contributed to Foldit over the last 10 years! None of this would have been possible without you! Happy folding!
1. Khatib, F. et al. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat Struct Mol Biol 18, 1175–1177 (2011).
2. Khatib, F. et al. Algorithm discovery by protein folding game players. Proc Natl Acad Sci U S A 108, 18949–18953 (2011).
3. Eiben, C. B. et al. Increased Diels-Alderase activity through backbone remodeling guided by Foldit players. Nature Biotechnology 30, 190–192 (2012).
4. Cooper, S. et al. Predicting protein structures with a multiplayer online game. Nature 466, 756–760 (2010).
5. Horowitz, S. et al. Determining crystal structures through crowdsourcing and coursework. Nat Commun 7, 12549 (2016).