Protein Design Partition Tournament
We are announcing a friendly protein design tournament for Foldit players! This tournament is something of an experiment, and we are not sure exactly how it will unfold. Participation in the tournament is completely voluntary, and we will continue to post regular Foldit puzzles during the tournament, for those who do not wish to participate. There will be no prizes (except for bragging rights and a rare new Foldit Achievement), and we cannot guarantee that any scientific results will arise from the tournament. The main purpose of the tournament is to have fun folding proteins, and to inspire Foldit players to think differently about protein design; but we do also think the tournament could lead to higher quality protein designs.
Unlike our regular design puzzles, in which players compete to design proteins with the best absolute energy, this tournament is designed to reward proteins with the best energy landscape. For more discussion about energy landscapes, see parts one and two of this blog series.
The tournament will take place in two phases, over the next 6 weeks.
Phase One: Defense
The first phase of the tournament will take the form of a Foldit design puzzle, similar to the regular Monomer Design puzzles that are posted every week. Players will have two weeks to craft their best protein design, which will have to defend itself in Phase Two. To enter the tournament, players must share their chosen design with scientists, using the Upload for Scientists button in the Save Solution menu. When you save your solution, give it the title ‘Tournament Submission’ in the Save Solution dialog box. Every player is allowed one tournament submission. If a player submits multiple solutions, only the most recent solution will be accepted. For the sake of competition logistics, team play will not be allowed in the tournament; only soloist solutions will be accepted.
The Phase One design puzzle will include two objectives:
Residue Count: Designs may contain 70-100 residues, at a cost of 32 points per residue.
Secondary Structure: Designs may be up to 10% α-helix.* Additional helices will be penalized at 10 points per residue.
*We will accept Phase One submissions regardless of their secondary structure content. However, we’d like to discourage players from submitting helical bundles and ferredoxin-like folds (a.k.a. "surfing hotdogs") that typically score well in regular design puzzles. Rather, this is a chance for players to showcase designs that aren’t normally competitive in regular design puzzles.
Phase Two: Offense
Twenty Foldit player submissions will be selected to advance to Phase Two:
Five will be the five top-scoring submissions from Phase One.
Five will be hand-selected by the Foldit team, on merits of creativity and plausibility.
Ten will be chosen at random from the remaining submissions.
For each of the 20 selections, we will create a special Partition Contest using the selected protein design. Each contest will be set up as a prediction puzzle, similar to the regular De-novo Freestyle puzzles, except that the starting structure will be the fully-folded design. All 20 Partition Contests will be open to the entire Foldit community and will remain online for four weeks, during which time the selected designs will be vulnerable to “challenge.” Any Foldit player can challenge a design by joining its Partition Contest and attempting to refold the design into another high-scoring decoy structure.
The Phase Two contests will include an RMSD Objective: All solutions must differ from the starting model with an RMSD of at least 2.5 Å.
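For readers unfamiliar with RMSD, here is a minimal sketch of the check this objective describes. It assumes the two structures have the same number of atoms and are already superimposed; a real comparison would first align them optimally, and Foldit's own calculation may differ. The function names and coordinates are hypothetical.

```python
import math

RMSD_CUTOFF = 2.5  # angstroms, taken from the Phase Two objective


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superimposed; no alignment is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def meets_rmsd_objective(start, solution, cutoff=RMSD_CUTOFF):
    """True if the solution is at least `cutoff` angstroms away from the start."""
    return rmsd(start, solution) >= cutoff


# Hypothetical two-atom example: every atom is shifted by 3 angstroms along x.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(rmsd(start, shifted), meets_rmsd_objective(start, shifted))
```

A solution identical to the starting model has an RMSD of zero and would fail the objective.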
Ultimately, each design in the tournament will be evaluated by its partition function (described in the previous blog post), based on the decoys found by challengers in the Phase Two contests.
By challenging a design and finding a high-scoring decoy, you show that your opponent's sequence does not have 100% probability of adopting the folded structure, and that its partition function must be shared with your decoy structure. You effectively stake a claim in the partition function of that design; the higher the score of your decoy, the larger your claim in the opponent's partition function.
A player may make multiple challenges against a single design; in some cases, it may be more effective to make many moderate-scoring challenges rather than a single high-scoring challenge. In order to calculate the partition function for a design, we will cluster all of the contest solutions to identify representative states. Then, we’ll use the partition function to determine the probability of each state.
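To see why several moderate challenges can matter as much as one strong challenge, here is a rough sketch of the probability calculation. It assumes each clustered state is represented by a single Foldit score, and it borrows the rule of thumb from the previous blog post that about 14 Foldit points correspond to a ten-fold change in probability. The function name and all scores are hypothetical, and the exact weighting used for the tournament may differ.

```python
def state_probabilities(scores, points_per_tenfold=14.0):
    """Turn Foldit scores for distinct states into Boltzmann-style probabilities.

    Higher Foldit score means lower energy. The default of 14 points per
    ten-fold change in probability is the rule of thumb from this blog series;
    treating it as the tournament's exact weighting is an assumption.
    """
    best = max(scores.values())
    # Subtract the best score first so the exponents stay small.
    weights = {name: 10 ** ((score - best) / points_per_tenfold)
               for name, score in scores.items()}
    z = sum(weights.values())  # normalizing constant
    return {name: w / z for name, w in weights.items()}


# Hypothetical scores: one design challenged by a single strong decoy...
one_strong = state_probabilities({"design": 1000, "decoy": 986})

# ...versus the same design challenged by ten distinct, weaker decoys.
many_moderate = {"design": 1000}
many_moderate.update({f"decoy_{i}": 972 for i in range(10)})
ten_weak = state_probabilities(many_moderate)

print(round(one_strong["design"], 3), round(ten_weak["design"], 3))
```

In both cases the design keeps about 91% of its partition function: ten distinct decoys that each score 28 points below the design claim as much as one decoy that scores 14 points below it. This only holds if the decoys survive clustering as separate states.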
Unfortunately, we cannot calculate the partition function on the fly, so players can only estimate how well a design is resisting challenge by following the Contest leaderboards. However, we will post weekly updates throughout Phase Two, with updated partition functions for all 20 Contests.
The champion of the tournament will be the protein design with the highest probability, as determined by its partition function.
There will also be Achievements for the most effective challengers, who are able to stake the greatest claims in the partition functions of their opponents.
Finally, we’d like to point out that, while players may be tempted to aim for a high-ranking design in Phase One, what really counts is how well each design can withstand challenges in Phase Two. If you design a high-scoring protein in Phase One, but its sequence is also compatible with many high-scoring decoy structures, then in Phase Two challengers will easily find high-scoring decoys and stake large claims in your design’s partition function.
The Phase One design puzzle is online now! Happy folding!
(Posted by bkoep | Thu, 08/30/2018 - 20:54)
This is part two of a three-part blog post about the problem of protein design. Part one can be found here.
In this blog post, we’re going to introduce a physics concept called a partition function, which will help us think quantitatively about the energy landscape described in the previous blog post. Partition functions come from a branch of physics called statistical mechanics, which uses probability to describe how large ensembles of molecules behave—typically in conditions of thermodynamic equilibrium. Before we get into partition functions, let's review what it means to be "at thermodynamic equilibrium."
Proteins at equilibrium
Consider a test tube containing a sample of protein in water. Let’s say the test tube contains a billion billion (10^18) identical protein molecules, which is about the amount of protein we generally work with in the lab. If we let the test tube sit long enough at room temperature, the contents of the test tube will reach thermodynamic equilibrium. At equilibrium, the temperature is the same everywhere in the test tube, and the molecules in the test tube have settled into their most balanced configuration. Or, in other words, all the thermal motion in the test tube is distributed as evenly as possible among all the molecules in the test tube. This is important because it means the behavior of all 10^18 protein molecules should be consistent throughout the test tube. And this allows us to treat all of the protein molecules together as an ensemble. Even if we can’t point to an individual protein molecule in the sample and describe exactly what that molecule is doing, we can use statistical mechanics to describe the ensemble as a whole.
Suppose our protein is stable at room temperature. In that case, we expect most of the protein molecules in our test tube will be folded. We say most of the molecules are folded, but not all of them, because in reality the ensemble will adopt a distribution of states, corresponding to all the different valleys in our protein’s energy landscape. Remember, our test tube contains 10^18 protein molecules, and at room temperature there is a considerable amount of thermal motion in the test tube jostling around all of the protein molecules. So, even though the conditions are consistent throughout the test tube, there will still be a certain amount of deviation among the molecules of the ensemble. By chance some individual protein molecules will be jostled out of the folded state and into other regions of the energy landscape.
In our test tube, suppose that the distribution includes:
- A hundred protein molecules that are completely unfolded.
- A thousand molecules that are in a partially folded state.
- A million molecules in a mostly-but-not-completely folded state.
- A billion molecules that are misfolded into a completely different structure.
Well, even a billion misfolded molecules is just a tiny fraction of the total 10^18 molecules in our test tube, so overall the ensemble is still overwhelmingly (99.9999999%) in the folded state.
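The percentage can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes the test tube holds exactly 10^18 molecules in total and that every molecule not listed above is folded.

```python
from fractions import Fraction

total = 10**18  # all protein molecules in the test tube

# Molecule counts for the non-folded states listed above.
other_states = {
    "unfolded": 100,
    "partially folded": 1_000,
    "mostly folded": 1_000_000,
    "misfolded": 1_000_000_000,
}

not_folded = sum(other_states.values())
# Exact rational arithmetic avoids rounding error until the final print.
folded_fraction = Fraction(total - not_folded, total)

print(f"{float(folded_fraction) * 100:.7f}% folded")
```

The misfolded state dominates the non-folded total, so the other three states barely change the result.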
As long as we’re talking about distributions, we should clarify that there are different ways to visualize distributions of proteins. Above, we just described a distribution in terms of numbers of molecules. In this way, we can visualize how a large ensemble of molecules is divided among the states at a single instant. In addition, we can also think of the same distribution in terms of time, and visualize how a single molecule will divide its time among the different states. It’s important to remember that each protein molecule is constantly being jostled around by thermal motion, and will fold and re-fold as it jumps randomly around the energy landscape, spending some amount of time in each state according to our distribution.
Even as individual molecules jump around the different states, we expect them to do so in such a way that the overall distribution remains constant, so long as we are at equilibrium. In our test tube, for example, it’s likely that one of the hundred unfolded molecules will be jostled into the folded state; but it is equally likely that one of the 10^18 folded proteins will be jostled into the unfolded state, and the two transitions cancel each other out, so the ensemble looks the same. This is the essence of thermodynamic equilibrium: at equilibrium, molecules may still exchange between different states, but the exchange is balanced so that the overall distribution does not change.
From here onward, we’ll talk about distributions in terms of probabilities, but you can think about these probabilities either as fractions of molecules in an ensemble, or as fractions of time—both are equally valid!
The partition function
The true distribution of states depends on the energy of each state. If we know the energies of all the possible states for a protein, then we can use a partition function to determine exactly what this distribution looks like. From statistical mechanics, we know that the probability of a state decreases exponentially with its energy, and we can write the partition function:
$$P(s) = \frac{e^{-E_s / kT}}{Z}$$
where P(s) is the probability of state s, E_s is the energy of state s, k is the Boltzmann constant, and T is the absolute temperature. The number Z is simply used to normalize the probabilities so that all of the probabilities sum to 100% (as probabilities should), so Z is equal to the total sum of the exponentials for all the states:
$$Z = \sum_{s'} e^{-E_{s'} / kT}$$
Don’t worry if these equations look like gibberish to you! You don’t need to understand them to follow along—but we wanted to include them for the sake of completeness.
The essential thing to know about the partition function is that the relationship between probability and energy is exponential. This means that (at room temperature) a small energy difference of 1.4 kcal/mol (roughly 14 Foldit points) translates to a ten-fold change in probability. To produce the supposed distribution in our test tube above, the several states must have the following energies relative to the folded state (each ten-fold drop in population costs 1.4 kcal/mol):
- Folded: 0 kcal/mol
- Misfolded: +12.6 kcal/mol
- Mostly folded: +16.8 kcal/mol
- Partially folded: +21.0 kcal/mol
- Unfolded: +22.4 kcal/mol
We see that states with higher energy (and lower Foldit score) have lower probabilities, and will account for smaller portions of the partition function. Going back to our energy landscape analogy, we’ve just described an energy landscape with five different valleys, with precise depths for each valley. For example, the Folded valley is “deeper” than the Misfolded valley by 12.6 kcal/mol. So, according to our partition function, if there are 10^18 molecules in the Folded state at room temperature, then we can expect to find about 1 billion molecules in the Misfolded state.
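As a worked example, the sketch below turns the test-tube populations into relative energies using the rounded rule of 1.4 kcal/mol per ten-fold change in probability. It also shows where that rule comes from: kT multiplied by ln(10) at room temperature. The gas constant and the temperature of 298.15 K are standard values supplied here, not taken from the post, and the variable names are illustrative.

```python
import math

R_KCAL = 0.0019872   # gas constant in kcal/(mol*K), i.e. Boltzmann constant per mole
T_ROOM = 298.15      # room temperature in kelvin (assumed)

# Energy cost of a ten-fold drop in probability: kT * ln(10), about 1.36 kcal/mol.
exact_per_tenfold = R_KCAL * T_ROOM * math.log(10)

KCAL_PER_TENFOLD = 1.4  # rounded rule of thumb quoted in the text
FOLDED = 10**18         # molecules in the folded state

populations = {
    "misfolded": 10**9,
    "mostly folded": 10**6,
    "partially folded": 10**3,
    "unfolded": 10**2,
}

energies = {}
for state, count in populations.items():
    decades = math.log10(FOLDED / count)  # how many factors of ten rarer than folded
    energies[state] = KCAL_PER_TENFOLD * decades

print(f"kT ln(10) at room temperature: {exact_per_tenfold:.2f} kcal/mol")
for state, energy in energies.items():
    print(f"{state}: +{energy:.1f} kcal/mol relative to folded")
```

With the rounded rule, the misfolded state comes out 12.6 kcal/mol above the folded state, matching the figure quoted above. Using the unrounded kT ln(10) would give slightly smaller energies.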
Partition functions for Foldit designs
The partition function can help us quantify the differences between energy landscapes. In the previous blog post, we examined two designs by fiendish_ghoul and saw how our current protein design strategy (i.e. optimizing for absolute energy) can lead to both good and bad energy landscapes. Let’s return to those two designs:
These energy landscape scatter plots were described previously, and show how Foldit players’ solutions in De-novo Freestyle puzzles can tell us about the energy landscape of a protein. Here, we’ve clustered the solutions from those puzzles to define several distinct states for each protein, which represent individual valleys in their energy landscapes. The ten lowest-energy states are highlighted as multicolored points in the energy landscape.
Now that we have a set of decoy states and their relative energies, we can calculate the partition function for each of these proteins:
The partition function is illustrated here as a multi-colored bar plot, where each bar represents the probability for one of the ten states identified in the De-novo Freestyle energy landscape. Note that the height of the bars is shown on a logarithmic scale, so a bar that reaches 10^-2 has a probability of 1 in 100, or 1%; a bar that reaches 10^-6 has a probability of 1 in a million, or 0.0001%. We see that for the protein on the left, the partition function is dominated by the folded state (in blue), with a probability of virtually 10^0, or 100%. The next most probable state (orange) has a probability of less than 10^-12 (1 in a trillion), so it doesn’t even show up on this scale. On the other hand, we see a very different picture for the protein on the right. The partition function shows that the ensemble of molecules will be distributed more evenly across a number of different states. In fact, the blue state has a probability of about 54%, the orange state 32%, the green state 12%, the red state 1.3%, and so forth.
We happen to have experimental data for each of these proteins, which seem to support the energy landscapes and partition functions that Foldit players discovered in the De-novo Freestyle puzzles. Below are raw data from circular dichroism (CD) experiments, which tell us about the amount of structure in these proteins:
The CD spectrum for the left protein has a shape that is characteristic of a fully-folded protein with both α-helices and β-sheets (we also have a crystal structure of this protein, so we know for certain it is well-folded). Note the wide, flat trough between 208 and 222 nm, and the peak on the left side of the trace where the measured ellipticity is positive for wavelengths below 200 nm. The CD spectrum on the right has a slightly different shape. There is still a partial trough between 208 and 222 nm, so there is definitely some amount of secondary structure (probably helices and maybe sheets) in this protein. However, the measured ellipticity is still negative at shorter wavelengths below 200 nm (admittedly, the signal is much noisier in this spectrum, so the exact shape of the curve is difficult to discern). This indicates that a significant portion of this protein is unstructured, which we would expect from a protein with an unfunnelled landscape and many low-energy decoy states.
So, how can we use these partition functions to improve designed proteins? Check back Thursday for the last blog post in this series, where we’ll propose a possible strategy for using partition functions to improve protein designs in Foldit!
Edit: Read more in Part 3 of this blog series.
(Posted by bkoep | Mon, 08/27/2018 - 19:51)
The problem of protein design
This is the first of a three-part blog post. In the first part, we’re going to review the concept of energy landscapes, which some of you may already be familiar with. In the second part, we’ll discuss how a concept from physics, called a partition function, can help us think about energy landscapes. In the last part, we’ll propose a way that we might use these concepts of energy landscapes and partition functions to improve protein design in Foldit.
The energy landscape
There’s a problem with the way we currently design proteins in Foldit—and not just in Foldit, but also in Rosetta. In fact, it’s a problem in any protein design strategy that optimizes the absolute energy of the design. This strategy is the premise of a Foldit design puzzle. The Foldit score measures the absolute energy of a solution (with a negative multiplier), so that when players compete to find solutions with the highest score, they are actually competing to find solutions with the lowest absolute energy.
However, the success of a protein design (i.e. whether or not the protein folds) does not depend only on the absolute energy of the design. Rather it depends on the protein’s energy landscape. The energy landscape is a concept we use to think about all the possible ways that a string of amino acids can fold. As any Foldit player knows, there are a lot of different ways to fold up a string of amino acids, and they all have different energies (or Foldit scores). We can imagine the energy landscape as a surface where every (x,y) coordinate represents a different fold, or state, and the height of the surface (the z-coordinate) represents the energy of that state. In some places there will be hills, which represent states with a high energy (low Foldit score), and in other places there will be valleys, where folds have a low energy (high Foldit score).
Conceptual illustration of a protein energy landscape, from Dill, K.A. and MacCallum, J.L. (2012)
One of the reasons we like the analogy of energy landscapes is that we intuitively understand how things tend to “prefer” low points in the landscape. If you place an object randomly on the energy landscape, it will tend to slide downhill, from a high-energy state to a low-energy state. If we consider the effect of thermal motion that is constantly jostling around the object (imagine a Mexican jumping bean that randomly jumps around the landscape), then the object will explore all the different valleys of the energy landscape. Nevertheless, the Mexican jumping bean will spend the most time in the deepest valleys of the landscape.
A protein behaves the same way in its energy landscape. At room temperature, there is a considerable amount of thermal motion that allows the protein to explore its energy landscape, although the protein will spend the most time in the states with lowest energy. Every amino acid sequence has a different energy landscape, with different valleys in different places. When you mutate amino acids in a Foldit puzzle to find higher scores for your design, what you’re really doing is looking for an energy landscape where your design is in a deeper valley. However, the Foldit score only tells you about the energy of your designed folded state—or the “depth” of your desired valley. What we’re not considering in Foldit is the rest of the landscape, and whether there might be other low-energy “decoy states”—other deep valleys for your protein to explore.
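The jumping-bean picture can be tried out with a small simulation. The sketch below uses a Metropolis random walk, a standard statistical-mechanics technique that is swapped in here purely for illustration; it is not how Foldit or Rosetta explores a landscape. The five-state landscape, its energies (in units of kT), and the function name are all hypothetical.

```python
import math
import random


def simulate(energies, steps=200_000, kT=1.0, seed=0):
    """Metropolis walk over a row of states; returns the fraction of steps spent in each."""
    rng = random.Random(seed)
    state = 0
    visits = [0] * len(energies)
    for _ in range(steps):
        # Propose a hop to the left or right neighbor with equal chance.
        proposal = state + rng.choice((-1, 1))
        if 0 <= proposal < len(energies):
            delta = energies[proposal] - energies[state]
            # Always accept downhill hops; accept uphill hops with probability exp(-delta/kT).
            if delta <= 0 or rng.random() < math.exp(-delta / kT):
                state = proposal
        # A rejected or out-of-range hop counts as another step in the current state.
        visits[state] += 1
    return [v / steps for v in visits]


# Hypothetical landscape: index 2 is the deepest valley, index 4 a shallower one.
landscape = [2.0, 4.0, 0.0, 3.0, 1.0]
fractions = simulate(landscape)

for index, fraction in enumerate(fractions):
    print(f"state {index} (energy {landscape[index]}): {fraction:.1%} of the time")
```

The bean visits every state, but it spends most of its time in the deepest valley and a smaller share in the shallower one.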
This is a difficult problem to solve because the energy landscape for a protein is vast. It’s difficult to account for the decoy states because we don’t know what they might look like. We don’t know where to search in the energy landscape for other low-energy valleys, and the landscape is too big to search exhaustively.
The search for decoys
As many of you are probably aware, a lot of the recent De-novo Freestyle prediction puzzles have targeted Foldit player-designed proteins. The purpose of these puzzles is to look for low-energy decoy states, or alternative valleys in the energy landscape. We already run Foldit designs through Rosetta@home to look for decoy states—and for the most part, Rosetta@home seems to do a pretty good job. But occasionally Foldit players find solutions that Rosetta@home misses.
In the following example we're going to pick on fiendish_ghoul, because this energy landscape problem is clearly illustrated by two of their designs, shown below:
The protein on the left is a design originally from Puzzle 1331; the protein on the right is a design from Puzzle 1239. Beneath each cartoon protein structure is a scatter plot with the results from corresponding De-novo Freestyle puzzles that we posted using the sequence of each design. Each black point represents a solution, plotted with respect to its RMSD to the folded state (x-axis) and its energy (y-axis). Together these points give us a profile of the energy landscape for each protein. We see that the design on the left has a “funnelled” landscape, such that the lowest-energy solutions are those close to the folded state (RMSD close to zero) and solutions very different from the folded state (large RMSD) all have higher energies. In the design on the right, however, Foldit players identified a number of decoy states that are very different from the folded state (large RMSD), and have energy just as low as the folded state. These decoy states (marked with colored circles in the scatter plot) appear as “valleys” in the energy landscape of the protein.
The cartoon structures of these decoy states are shown below using the same rainbow coloring as above, with the N-terminus of the protein colored blue, and the C-terminus of the protein colored red:
In each of the decoy structures, all of the α-helices and β-strands are there, but it appears there is some ambiguity about where the helices should go. According to the solutions from the De-novo Freestyle puzzle, the three α-helices can fold in different arrangements around the central β-sheet, and all of these arrangements have similar energies. Since all of these states have similar energy, the protein will not have a strong preference for any single one of them.
Both of fiendish_ghoul's proteins were designed by optimizing their absolute energy, but the protein on the right has a problematic energy landscape. If we made these proteins in the lab, we would expect the protein on the left to be well-folded, and to spend most of its time in the designed state, since it appears to be the only deep valley in the landscape. However, we would expect the protein on the right to be poorly-folded, and to spend its time sampling all the different decoy states discovered by Foldit players.
Check back on Monday for the next blog post, where we’ll discuss these energy landscapes in more detail!
(Posted by bkoep | Fri, 08/24/2018 - 21:17)
WeFold paper on CASP11 is now published!
The latest WeFold paper titled: "An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12" was just published in Nature's online open access journal: Scientific Reports.
This publication describes the results and analysis of the CASP11 and CASP12 WeFold coopetition (cooperation and competition), highlighting lessons learned and improvements over the first WeFold attempt from CASP10.
The Foldit Players consortium is listed as a co-author, and the Acknowledgements section of the paper begins with: "The authors would like to acknowledge the collaboration of hundreds of thousands of citizen scientists who contributed millions of decoys through the Foldit game."
Congratulations to all of you and keep up the great folding!
(Posted by beta_helix | Tue, 07/03/2018 - 22:12)
WeFold paper on CASP11 has just been accepted!
The paper is titled: "An analysis and evaluation of the WeFold collaborative for protein structure prediction and its pipelines in CASP11 and CASP12"
and just as with the first WeFold paper about CASP10, Foldit played a very large part in CASP11!
In order to publish these results, however, we must abide by the new authorship policy that journals have implemented, which requires names and affiliations for all authors. Previously, we had used "Foldit Players" (or Players, F.) to represent all of you.
Similar to Foldit's previous publication in Nature Communications, for this paper you will all be under the group consortium: "Foldit Players", and anyone who played a CASP11 puzzle has the option to list their complete real name (we cannot use Foldit usernames).
If you played one of the CASP11 puzzles and would like your full real name to be included in the group consortium list for this paper, please follow the directions below in the comments by Saturday, June 2nd at 11:59pm GMT.
We would like to emphasize that this is completely voluntary, as we will of course also have a statement in the acknowledgements thanking all Foldit players, just not by name.
(Posted by beta_helix | Thu, 05/24/2018 - 02:35)