Partition Tournament Update: Initial Results
Phase Two of the Protein Design Partition Tournament has now been underway for one week. Thank you to all the Foldit players who have been participating! All 20 Partition Puzzles have seen initial challenges, and we’re already seeing that some designs are more resistant than others.
To recap: the Partition Puzzles will be evaluated by their partition functions, which we explained in a previous blog post. In order to calculate the partition function for a puzzle, we cluster all of the players' solutions to identify decoy states. The relative energy of those decoy states tells us how the ensemble will be distributed over those states.
In order to rank the tournament designs, we’re using a “Partition Score” to summarize how well each design resists challenges from other players. This score measures how much of the partition function has been claimed by challengers, on a logarithmic scale, where a higher score indicates a more stable protein. For example, a Partition Score of 12.0 means that challengers have claimed 10^-12 (one trillionth) of the partition function; likewise, we could say the design is about 10^12 times more likely to fold into the designed structure than all of the decoys combined.
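As a rough illustration of the logarithmic scale, the sketch below converts between a Partition Score and the fraction of the partition function claimed by challengers. The function names are hypothetical, and the exact formula behind the rankings is not given in this post (for example, how a score bottoms out at 0.00), so the definition used here, score = -log10(claimed fraction), is an assumption.

```python
import math

def partition_score(fraction):
    """Assumed definition: negative base-10 log of the fraction of the
    partition function claimed by challengers."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    return -math.log10(fraction)

def fraction_from_score(score):
    """Inverse of partition_score: fraction claimed for a given score."""
    return 10.0 ** (-score)

# One trillionth of the partition function claimed by challengers.
print(partition_score(1e-12))
```

Under this assumed definition, each additional point of Partition Score means challengers hold a ten-fold smaller share of the partition function.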
As of 18:00 GMT September 24, the rankings are as follows:
Three puzzles have a Partition Score of 0.00, meaning that challengers have found decoys that score even better than the original design. We are closing these puzzles early, so that challengers can focus on the other Partition Puzzles. Below, we'll take a closer look at the solutions from these puzzles, and what features might have led to their demise. If you want to skip ahead, the summary updates for all Partition Puzzles are at the bottom of this blog post.
Protein design pitfalls
First, in Partition Puzzle: MicElephant, we see that challengers have found a decoy with an energy that is lower than the design by about 6 kcal/mol (+60 Foldit points).
Taking a closer look at the decoy structure, we see that challengers have refolded one of the β-strands into a helix. In the original design, this strand was designed with all polar residues. Without any hydrophobic residues to be buried, this section of the protein will happily flop around in solution. You might think that it would be bad to break the ladder-like hydrogen bonds that were designed between this strand and its partner—however, these surface residues can compensate by making new hydrogen bonds with the surrounding water.
Next, in Partition Puzzle: ManVsYard, challengers have found a decoy that is lower in energy by about 4 kcal/mol (+40 Foldit points), as well as a second decoy with an energy that is about the same as the designed structure.
The low-scoring decoy exposes the short helix that was packed at the top of the β-strands. Again, we see that this section was designed with entirely polar residues, which are happy to interact with the surrounding water instead of the protein. In the second decoy, challengers have completely refolded a β-strand into an α-helix. This is a little surprising, since the designed residues in this section form an alternating pattern of polar and non-polar residues (which normally favors β-strands). However, this strand may not have been packed tightly enough in the protein core; it is able to make equally favorable interactions in the protein core when refolded as a helix.
Finally, in Partition Puzzle: manu8170, challengers found two decoys that are lower in energy by about 5 kcal/mol (+50 Foldit points).
The decoys differ from the design structure in a couple of regions. First, we see that the terminal helix can be unwound, and is "fraying" at the end. In the original design, these end residues (again, all polar residues) were floating off in solvent, and did not make very strong interactions with the rest of the protein; these residues can be unfolded relatively easily. We also see that this design has long loops that are made mostly of polar residues, and do not form regular secondary structure. Such polar, unstructured loops are expected to be flexible and disordered in solution; challengers were able to refold the loops while keeping the rest of the structure intact.
The clear message from these early results is that long stretches of polar residues can be easily unfolded. The burial of non-polar residues is the main driving force of protein folding, and it's important that every region of a protein contributes to its core!
Partition Puzzles Summary (September 24)
Below are the energy landscapes and partition functions for the remaining 17 Partition Puzzles, as of 18:00 GMT September 24. For an explanation of these plots, see this previous blog post.
Some puzzles have seen very little activity, and their rankings may be artificially boosted due to a lack of challengers. There is still plenty of opportunity to find high-scoring decoys in your opponent's Partition Puzzles!
( Posted by bkoep 78 544 | Mon, 09/24/2018 - 21:18 | 6 comments )
Partition Tournament: Phase Two
Thanks to all the players that submitted designs for the Protein Design Partition Tournament! At the bottom of this post are the submissions that were selected for Phase Two of the tournament.
We’ve created 20 new Partition Puzzles for these selections, where the starting structure for each puzzle is the original design. Each partition puzzle has an RMSD objective, which requires all solutions to have RMSD > 2.5 Å relative to the starting structure. This means that players will have to significantly change the starting structure in order for scores to register. The starting solution is marked with secondary structure predictions from PSIPRED. In some cases the PSIPRED predictions do not agree with the designed structure, and might suggest ways that challengers can refold the starting structure. Remember, the tournament submissions will ultimately be evaluated by how well they withstand challenges, as measured by their partition functions.
These puzzles are available in the regular puzzle menu and anyone can play them! They are listed as “Expert” difficulty; the only reason for this is so they are sorted below other regular puzzles (since Foldit sorts the Puzzle Menu by difficulty), to reduce clutter for players who do not wish to participate. The puzzles will remain online for four weeks, and expire at 23:00 GMT on October 15.
We do not expect everyone to play all 20 partition puzzles! These puzzles are all “voluntary”, for now, and each puzzle is worth zero points. However, as the tournament progresses, at least a couple of these puzzles will be re-released as regular puzzles (for points), to spur more challenges from the greater Foldit community. Sharing will be enabled in the re-released puzzles, so anyone who participates early in a partition puzzle will have a leg up if it is re-released later as a regular puzzle. We will also present new Achievements, at the end of the tournament, to those players that are most successful in challenging others’ designs.
We encourage all players to check out a couple of partition puzzles below, and at least poke around with some of the designs you find most interesting. By playing the partition puzzles you reveal the energy landscapes of these proteins, which helps us determine whether the designs are likely to fold as intended. We also hope that, by playing these puzzles, players might be able to learn from the designs of others. What features of a protein design make it easy or difficult to challenge? This is something of an open question in protein design research, and we think Foldit players could provide some insight about this aspect of protein design.
Check out the partition puzzles, which are linked below!
Randomly-selected designs
( Posted by bkoep 78 544 | Mon, 09/17/2018 - 22:07 | 0 comments )
Protein Design Partition Tournament
We are announcing a friendly protein design tournament for Foldit players! This tournament is something of an experiment, and we are not sure exactly how it will unfold. Participation in the tournament is completely voluntary, and we will continue to post regular Foldit puzzles during the tournament, for those who do not wish to participate. There will be no prizes (except for bragging rights and a rare new Foldit Achievement), and we cannot guarantee that any scientific results will arise from the tournament. The main purpose of the tournament is to have fun folding proteins, and to inspire Foldit players to think differently about protein design; but we do also think the tournament could lead to higher quality protein designs.
Unlike our regular design puzzles, in which players compete to design proteins with the best absolute energy, this tournament is designed to reward proteins with the best energy landscape. For more discussion about energy landscapes, see parts one and two of this blog series.
The tournament will take place in two phases, over the next 6 weeks.
Phase One: Defense
The first phase of the tournament will take the form of a Foldit design puzzle, similar to the regular Monomer Design puzzles that are posted every week. Players will have two weeks to craft their best protein design, which will have to defend itself in Phase Two. To enter the tournament, players must share their chosen design with scientists, using the Upload for Scientists button in the Save Solution menu. When you save your solution, give it the title ‘Tournament Submission’ in the Save Solution dialog box. Every player is allowed one tournament submission. If a player submits multiple solutions, only the most recent solution will be accepted. For the sake of competition logistics, team play will not be allowed in the tournament; only soloist solutions will be accepted.
The Phase One design puzzle will include two objectives:
Residue Count: Designs may contain 70-100 residues, at a cost of 32 points per residue.
Secondary Structure: Designs may be up to 10% α-helix.* Additional helices will be penalized at 10 points per residue.
*We will accept Phase One submissions regardless of their secondary structure content. However, we’d like to discourage players from submitting helical bundles and ferredoxin-like folds (a.k.a. "surfing hotdogs") that typically score well in regular design puzzles. Rather, this is a chance for players to showcase designs that aren’t normally competitive in regular design puzzles.
Phase Two: Offense
Twenty Foldit player submissions will be selected to advance to Phase Two:
Five will be the five top-scoring submissions from Phase One.
Five will be hand-selected by the Foldit team, on merits of creativity and plausibility.
Ten will be chosen at random from the remaining submissions.
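The selection rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the dictionary of scores, and the idea that the hand-picked designs never overlap with the top five are all hypothetical, and the sketch is not the procedure the Foldit team actually ran.

```python
import random

def select_for_phase_two(submissions, hand_picked, seed=None):
    """Pick 20 designs: top five by score, five hand-picked, ten at random.

    submissions: dict mapping player name -> Phase One score (hypothetical).
    hand_picked: five player names chosen by the team, assumed not to be
                 among the five top-scoring players.
    """
    ranked = sorted(submissions, key=submissions.get, reverse=True)
    top_five = ranked[:5]
    already_chosen = set(top_five) | set(hand_picked)
    remaining = [name for name in ranked if name not in already_chosen]
    rng = random.Random(seed)
    random_ten = rng.sample(remaining, 10)  # draw without replacement
    return top_five, list(hand_picked), random_ten
```

Drawing the last ten at random from whatever remains is what gives every submission a chance of advancing, whatever its Phase One score.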
For each of the 20 selections, we will create a special Partition Contest using the selected protein design. Each contest will be set up as a prediction puzzle, similar to the regular De-novo Freestyle puzzles, except that the starting structure will be the fully-folded design. All 20 Partition Contests will be open to the entire Foldit community and will remain online for four weeks, during which time the selected designs will be vulnerable to “challenge.” Any Foldit player can challenge a design by joining its Partition Contest and attempting to refold the design into another high-scoring decoy structure.
The Phase Two contests will include an RMSD Objective: All solutions must differ from the starting model with an RMSD of at least 2.5 Å.
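For readers unfamiliar with RMSD (root-mean-square deviation), here is a minimal sketch of the check. It assumes the two structures are given as equal-length lists of (x, y, z) coordinates in Å that have already been superimposed; how Foldit aligns structures and which atoms it compares are not described here, and the function names are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

def meets_rmsd_objective(solution, start, threshold=2.5):
    """True if the solution differs from the start by at least `threshold` Å."""
    return rmsd(solution, start) >= threshold
```

A solution that barely moves from the starting model fails the check, which is what forces challengers to look for genuinely different folds.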
Ultimately, each design in the tournament will be evaluated by its partition function (described in the previous blog post), based on the decoys found by challengers in the Phase Two contests.
By challenging a design and finding a high-scoring decoy, you show that your opponent's sequence does not have 100% probability of adopting the folded structure, and that its partition function must be shared with your decoy structure. You effectively stake a claim in the partition function of that design; the higher the score of your decoy, the larger your claim in the opponent's partition function.
A player may make multiple challenges against a single design; in some cases, it may be more effective to make many moderate-scoring challenges rather than a single high-scoring challenge. In order to calculate the partition function for a design, we will cluster all of the contest solutions to identify representative states. Then, we’ll use the partition function to determine the probability of each state.
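To see why many moderate challenges can outweigh a single strong one, consider the sketch below. It assumes the rule of thumb from this blog series that, at room temperature, 1.4 kcal/mol corresponds to a ten-fold change in probability, and it assumes every decoy listed ends up in its own cluster. The function names and the energy values are made up for illustration.

```python
KCAL_PER_DECADE = 1.4  # rule of thumb: 1.4 kcal/mol per ten-fold change

def boltzmann_weight(delta_e):
    """Weight of a state delta_e kcal/mol above the design (design = 1.0)."""
    return 10.0 ** (-delta_e / KCAL_PER_DECADE)

def claimed_fraction(decoy_delta_es):
    """Share of the partition function held by all decoy states combined."""
    decoy_weight = sum(boltzmann_weight(d) for d in decoy_delta_es)
    return decoy_weight / (1.0 + decoy_weight)

# One strong decoy, 4.2 kcal/mol above the design ...
single = claimed_fraction([4.2])
# ... versus twenty distinct weaker decoys, each 5.6 kcal/mol above it.
many = claimed_fraction([5.6] * 20)
print(single, many)
```

In this made-up case the twenty weaker decoys together claim roughly twice as much as the single stronger one, but only because each counts as a separate state after clustering.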
Unfortunately, we cannot calculate the partition function on the fly, so players can only estimate how well a design is resisting challenge by following the Contest leaderboards. However, we will post weekly updates throughout Phase Two, with updated partition functions for all 20 Contests.
The champion of the tournament will be the protein design with the highest probability, as determined by its partition function.
There will also be Achievements for the most effective challengers, who are able to stake the greatest claims in the partition functions of their opponents.
Finally, we’d like to point out that, while players may be tempted to aim for a high-ranking design in Phase One, what really counts is how well each design can withstand challenges in Phase Two. If you design a high-scoring protein in Phase One, but its sequence is also compatible with many high-scoring decoy structures, then in Phase Two challengers will easily find high-scoring decoys and stake large claims in your design’s partition function.
The Phase One design puzzle is online now! Happy folding!
( Posted by bkoep 78 544 | Thu, 08/30/2018 - 20:54 | 27 comments )
This is part two of a three-part blog post about the problem of protein design. Part one can be found here.
In this blog post, we’re going to introduce a physics concept called a partition function, which will help us think quantitatively about the energy landscape described in the previous blog post. Partition functions come from a branch of physics called statistical mechanics, which uses probability to describe how large ensembles of molecules behave—typically in conditions of thermodynamic equilibrium. Before we get into partition functions, let's review what it means to be "at thermodynamic equilibrium."
Proteins at equilibrium
Consider a test tube containing a sample of protein in water. Let’s say the test tube contains a billion billion (10^18) identical protein molecules, which is about the amount of protein we generally work with in the lab. If we let the test tube sit long enough at room temperature, the contents of the test tube will reach thermodynamic equilibrium. At equilibrium, the temperature is the same everywhere in the test tube, and the molecules in the test tube have settled into their most balanced configuration. Or, in other words, all the thermal motion in the test tube is distributed as evenly as possible among all the molecules in the test tube. This is important because it means the behavior of all 10^18 protein molecules should be consistent throughout the test tube. And this allows us to treat all of the protein molecules together as an ensemble. Even if we can’t point to an individual protein molecule in the sample and describe exactly what that molecule is doing, we can use statistical mechanics to describe the ensemble as a whole.
Suppose our protein is stable at room temperature. In that case, we expect most of the protein molecules in our test tube will be folded. We say most of the molecules are folded, but not all of them, because in reality the ensemble will adopt a distribution of states, corresponding to all the different valleys in our protein’s energy landscape. Remember, our test tube contains 10^18 protein molecules, and at room temperature there is a considerable amount of thermal motion in the test tube jostling around all of the protein molecules. So, even though the conditions are consistent throughout the test tube, there will still be a certain amount of deviation among the molecules of the ensemble. By chance some individual protein molecules will be jostled out of the folded state and into other regions of the energy landscape.
In our test tube, suppose that the distribution includes:
- A hundred protein molecules that are completely unfolded.
- A thousand molecules that are in a partially folded state.
- A million molecules in a mostly-but-not-completely folded state.
- A billion molecules that are misfolded into a completely different structure.
Well, even a billion misfolded molecules is just a tiny fraction of the total 10^18 molecules in our test tube, so overall the ensemble is still overwhelmingly (99.9999999%) in the folded state.
As long as we’re talking about distributions, we should clarify that there are different ways to visualize distributions of proteins. Above, we just described a distribution in terms of numbers of molecules. In this way, we can visualize how a large ensemble of molecules is divided among the states at a single instant. In addition, we can also think of the same distribution in terms of time, and visualize how a single molecule will divide its time among the different states. It’s important to remember that each protein molecule is constantly being jostled around by thermal motion, and will fold and re-fold as it jumps randomly around the energy landscape, spending some amount of time in each state according to our distribution.
Even as individual molecules jump around the different states, we expect them to do so in such a way that the overall distribution remains constant, so long as we are at equilibrium. In our test tube, for example, it’s likely that one of the hundred unfolded molecules will be jostled into the folded state; but it is equally likely that one of the 10^18 folded proteins will be jostled into the unfolded state, and the two transitions cancel each other out, so the ensemble looks the same. This is the essence of thermodynamic equilibrium: at equilibrium, molecules may still exchange between different states, but the exchange is balanced so that the overall distribution does not change.
From here onward, we’ll talk about distributions in terms of probabilities, but you can think about these probabilities either as fractions of molecules in an ensemble, or as fractions of time—both are equally valid!
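The "fraction of time" picture can be tried out with a toy simulation of a single molecule hopping between states. This is only a sketch under assumptions that do not come from this post: it uses the standard Metropolis acceptance rule from statistical mechanics, three invented state energies, and a thermal energy chosen so that 1.4 kcal/mol corresponds to a ten-fold change in probability. The function name is hypothetical.

```python
import math
import random

# Thermal energy in kcal/mol, chosen so that an energy difference of
# 1.4 kcal/mol corresponds to a ten-fold change in probability (assumption).
KT = 1.4 / math.log(10)

def simulate_time_fractions(energies, steps=200_000, seed=0):
    """Fraction of steps one molecule spends in each state (Metropolis rule)."""
    rng = random.Random(seed)
    state = 0
    visits = [0] * len(energies)
    for _ in range(steps):
        proposal = rng.randrange(len(energies))  # symmetric random proposal
        delta = energies[proposal] - energies[state]
        # Downhill moves are always accepted; uphill moves only sometimes.
        if delta <= 0 or rng.random() < math.exp(-delta / KT):
            state = proposal
        visits[state] += 1
    return [v / steps for v in visits]

# Invented energies (kcal/mol) for a folded state and two decoy states.
print(simulate_time_fractions([0.0, 1.4, 2.8]))
```

Over a long enough run, the share of time spent in each state settles toward fixed values even though the molecule never stops moving, which is the same balance described above for the ensemble.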
The partition function
The true distribution of states depends on the energy of each state. If we know the energies of all the possible states for a protein, then we can use a partition function to determine exactly what this distribution looks like. From statistical mechanics, we know that the probability of a state decreases exponentially with its energy, and we can write the partition function:
$$P(s) = \frac{e^{-E_s / kT}}{Z}$$
where P(s) is the probability of state s, E_s is the energy of state s, k is the Boltzmann constant, and T is the absolute temperature. The number Z is simply used to normalize the probabilities so that all of the probabilities sum to 100% (as probabilities should), so Z is equal to the total sum of the exponentials for all the states:
$$Z = \sum_{s} e^{-E_s / kT}$$
Don’t worry if these equations look like gibberish to you! You don’t need to understand them to follow along—but we wanted to include them for the sake of completeness.
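For those who prefer code to equations, here is a minimal sketch of the same calculation: each state gets a weight of exp(-E/kT), and dividing by the sum of all the weights (Z) turns the weights into probabilities. The function name and the default temperature of 298 K are our own choices for illustration; energies are assumed to be in kcal/mol.

```python
import math

BOLTZMANN_KCAL = 0.0019872  # Boltzmann constant in kcal/(mol*K)

def state_probabilities(energies, temperature=298.0):
    """Boltzmann probabilities for a list of state energies in kcal/mol."""
    kt = BOLTZMANN_KCAL * temperature
    # Shifting by the lowest energy avoids overflow for very negative
    # energies; the shift cancels out when the weights are normalized.
    lowest = min(energies)
    weights = [math.exp(-(e - lowest) / kt) for e in energies]
    z = sum(weights)  # the normalizing sum Z
    return [w / z for w in weights]

print(state_probabilities([0.0, 1.4, 2.8]))
```

Only energy differences matter here: adding the same constant to every energy leaves the probabilities unchanged.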
The essential thing to know about the partition function is that the relationship between probability and energy is exponential. This means that (at room temperature) a small energy difference of 1.4 kcal/mol (roughly 14 Foldit points) translates to a ten-fold change in probability. To produce the supposed distribution in our test tube above, the several states must have the following energies relative to the folded state:
We see that states with higher energy (and lower Foldit score) have lower probabilities, and will account for smaller portions of the partition function. Going back to our energy landscape analogy, we’ve just described an energy landscape with five different valleys, with precise depths for each valley. For example, the Folded valley is “deeper” than the Misfolded valley by 12.6 kcal/mol. So, according to our partition function, if there are 10^18 molecules in the Folded state at room temperature, then we can expect to find about 1 billion molecules in the Misfolded state.
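The relative energies can be worked out from the molecule counts in the test tube example with one line of arithmetic per state. The sketch below uses the rule of thumb that 1.4 kcal/mol corresponds to a ten-fold change in probability at room temperature; the function name and the state labels are ours.

```python
import math

KCAL_PER_DECADE = 1.4  # rule of thumb: 1.4 kcal/mol per ten-fold change

def relative_energy(count, folded_count):
    """Energy (kcal/mol) above the folded state implied by molecule counts."""
    return KCAL_PER_DECADE * math.log10(folded_count / count)

# Molecule counts from the test tube example (about 10^18 folded molecules).
distribution = {
    "Misfolded": 1e9,
    "Mostly folded": 1e6,
    "Partially folded": 1e3,
    "Unfolded": 1e2,
}
for name, count in distribution.items():
    print(f"{name}: {relative_energy(count, 1e18):.1f} kcal/mol")
```

With this rule of thumb, the misfolded state comes out 12.6 kcal/mol above the folded state, matching the figure quoted above, and the rarer states sit higher still (16.8, 21.0 and 22.4 kcal/mol).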
Partition functions for Foldit designs
The partition function can help us quantify the differences between energy landscapes. In the previous blog post, we examined two designs by fiendish_ghoul and saw how our current protein design strategy (i.e. optimizing for absolute energy) can lead to both good and bad energy landscapes. Let’s return to those two designs:
These energy landscape scatter plots were described previously, and show how Foldit players’ solutions in De-novo Freestyle puzzles can tell us about the energy landscape of a protein. Here, we’ve clustered the solutions from those puzzles to define several distinct states for each protein, which represent individual valleys in their energy landscapes. The ten lowest-energy states are highlighted as multicolored points in the energy landscape.
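The clustering step can be pictured with a simple greedy scheme, sketched below. This is an assumption for illustration only, not necessarily the method used for these plots: solutions are visited from lowest to highest energy, and a solution starts a new state only if it is farther than a cutoff from every state found so far. The function names, the 2.5 Å cutoff, and the assumption of pre-aligned coordinates are all hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two pre-aligned, equal-length lists of (x, y, z) points."""
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

def cluster_states(solutions, cutoff=2.5):
    """Greedy clustering of (energy, coords) solutions into distinct states.

    Returns one representative per state (its lowest-energy member),
    ordered from lowest to highest energy.
    """
    states = []
    for energy, coords in sorted(solutions, key=lambda s: s[0]):
        # Keep this solution only if it is far from every existing state.
        if all(rmsd(coords, rep) > cutoff for _, rep in states):
            states.append((energy, coords))
    return states
```

Because solutions are visited in order of energy, each state is represented by its lowest-energy member, and that is the energy that would enter the partition function.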
Now that we have a set of decoy states and their relative energies, we can calculate the partition function for each of these proteins:
The partition function is illustrated here as a multi-colored bar plot, where each bar represents the probability for one of the ten states identified in the De-novo Freestyle energy landscape. Note that the height of the bars is shown on a logarithmic scale, so a bar that reaches 10^-2 has a probability of 1 in 100, or 1%; a bar that reaches 10^-6 has a probability of 1 in a million, or 0.0001%. We see that for the protein on the left, the partition function is dominated by the folded state (in blue), with a probability of virtually 10^0, or 100%. The next most probable state (orange) has a probability of less than 10^-12 (1 in a trillion), so it doesn’t even show up on this scale. On the other hand, we see a very different picture for the protein on the right. The partition function shows that the ensemble of molecules will be distributed more evenly across a number of different states. In fact, the blue state has a probability of about 54%, the orange state 32%, the green state 12%, the red state 1.3%, and so forth.
We happen to have experimental data for each of these proteins, which seem to support the energy landscapes and partition functions that Foldit players discovered in the De-novo Freestyle puzzles. Below are raw data from circular dichroism (CD) experiments, which tell us about the amount of structure in these proteins:
The CD spectrum for the left protein has a shape that is characteristic of a fully-folded protein with both α-helices and β-sheets (we also have a crystal structure of this protein, so we know for certain it is well-folded). Note the wide, flat trough between 208 and 222 nm, and the peak on the left side of the trace where the measured ellipticity is positive for wavelengths below 200 nm. The CD spectrum on the right has a slightly different shape. There is still a partial trough between 208 and 222 nm, so there is definitely some amount of secondary structure (probably helices and maybe sheets) in this protein. However, the measured ellipticity is still negative at shorter wavelengths below 200 nm (admittedly, the signal is much noisier in this spectrum, so the exact shape of the curve is difficult to discern). This indicates that a significant portion of this protein is unstructured, which we would expect from a protein with an unfunnelled landscape and many low-energy decoy states.
So, how can we use these partition functions to improve designed proteins? Check back Thursday for the last blog post in this series, where we’ll propose a possible strategy for using partition functions to improve protein designs in Foldit!
( Posted by bkoep 78 544 | Mon, 08/27/2018 - 19:51 | 6 comments )
The problem of protein design
This is the first of a three-part blog post. In the first part, we’re going to review the concept of energy landscapes, which some of you may already be familiar with. In the second part, we’ll discuss how a concept from physics, called a partition function, can help us think about energy landscapes. In the last part, we’ll propose a way that we might use these concepts of energy landscapes and partition functions to improve protein design in Foldit.
The energy landscape
There’s a problem with the way we currently design proteins in Foldit—and not just in Foldit, but also in Rosetta. In fact, it’s a problem in any protein design strategy that optimizes the absolute energy of the design. This strategy is the premise of a Foldit design puzzle. The Foldit score measures the absolute energy of a solution (with a negative multiplier), so that when players compete to find solutions with the highest score, they are actually competing to find solutions with the lowest absolute energy.
However, the success of a protein design (i.e. whether or not the protein folds) does not depend only on the absolute energy of the design. Rather it depends on the protein’s energy landscape. The energy landscape is a concept we use to think about all the possible ways that a string of amino acids can fold. As any Foldit player knows, there are a lot of different ways to fold up a string of amino acids, and they all have different energies (or Foldit scores). We can imagine the energy landscape as a surface where every (x,y) coordinate represents a different fold, or state, and the height of the surface (the z-coordinate) represents the energy of that state. In some places there will be hills, which represent states with a high energy (low Foldit score), and in other places there will be valleys, where folds have a low energy (high Foldit score).
Conceptual illustration of a protein energy landscape, from Dill, K.A. and MacCallum, J.L. (2012)
One of the reasons we like the analogy of energy landscapes is that we intuitively understand how things tend to “prefer” low points in the landscape. If you place an object randomly on the energy landscape, it will tend to slide downhill, from a high-energy state to a low-energy state. If we consider the effect of thermal motion that is constantly jostling around the object (imagine a Mexican jumping bean that randomly jumps around the landscape), then the object will explore all the different valleys of the energy landscape. Nevertheless, the Mexican jumping bean will spend the most time in the deepest valleys of the landscape.
A protein behaves the same way in its energy landscape. At room temperature, there is a considerable amount of thermal motion that allows the protein to explore its energy landscape, although the protein will spend the most time in the states with lowest energy. Every amino acid sequence has a different energy landscape, with different valleys in different places. When you mutate amino acids in a Foldit puzzle to find higher scores for your design, what you’re really doing is looking for an energy landscape where your design is in a deeper valley. However, the Foldit score only tells you about the energy of your designed folded state—or the “depth” of your desired valley. What we’re not considering in Foldit is the rest of the landscape, and whether there might be other low-energy “decoy states”—other deep valleys for your protein to explore.
This is a difficult problem to solve because the energy landscape for a protein is vast. It’s difficult to account for the decoy states because we don’t know what they might look like. We don’t know where to search in the energy landscape for other low-energy valleys, and the landscape is too big to search exhaustively.
The search for decoys
As many of you are probably aware, a lot of the recent De-novo Freestyle prediction puzzles have targeted Foldit player-designed proteins. The purpose of these puzzles is to look for low-energy decoy states, or alternative valleys in the energy landscape. We already run Foldit designs through Rosetta@home to look for decoy states—and for the most part, Rosetta@home seems to do a pretty good job. But occasionally Foldit players find solutions that Rosetta@home misses.
In the following example we're going to pick on fiendish_ghoul, because this energy landscape problem is clearly illustrated by two of their designs, shown below:
The protein on the left is a design originally from Puzzle 1331; the protein on the right is a design from Puzzle 1239. Beneath each cartoon protein structure is a scatter plot with the results from corresponding De-novo Freestyle puzzles that we posted using the sequence of each design. Each black point represents a solution, plotted with respect to its RMSD to the folded state (x-axis) and its energy (y-axis). Together these points give us a profile of the energy landscape for each protein. We see that the design on the left has a “funnelled” landscape, such that the lowest-energy solutions are those close to the folded state (RMSD close to zero) and solutions very different from the folded state (large RMSD) all have higher energies. In the design on the right, however, Foldit players identified a number of decoy states that are very different from the folded state (large RMSD), and have energy just as low as the folded state. These decoy states (marked with colored circles in the scatter plot) appear as “valleys” in the energy landscape of the protein.
The cartoon structures of these decoy states are shown below using the same rainbow coloring as above, with the N-terminus of the protein colored blue, and the C-terminus of the protein colored red:
In each of the decoy structures, all of the α-helices and β-strands are there, but it appears there is some ambiguity about where the helices should go. According to the solutions from the De-novo Freestyle puzzle, the three α-helices can fold in different arrangements around the central β-sheet, and all of these arrangements have similar energies. Since all of these states have similar energy, the protein will not have a strong preference for any single one of them.
Both of fiendish_ghoul's proteins were designed by optimizing their absolute energy, but the protein on the right has a problematic energy landscape. If we made these proteins in the lab, we would expect the protein on the left to be well-folded, and to spend most of its time in the designed state, since it appears to be the only deep valley in the landscape. However, we would expect the protein on the right to be poorly-folded, and to spend its time sampling all the different decoy states discovered by Foldit players.
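One crude way to put a number on "funnelled" is to compare the best energy found near the designed structure with the best energy found far from it. The sketch below is a hypothetical measure, not the analysis used for these plots; the function name, the 2.5 Å cutoff, and the example points are all made up.

```python
def energy_gap_to_decoys(points, near_cutoff=2.5):
    """Lowest energy far from the design minus lowest energy near it.

    points: list of (rmsd_to_design, energy) pairs, one per solution.
    A large positive gap suggests a funnelled landscape; a gap near zero
    or below zero means decoys are as good as (or better than) the design.
    """
    near = [energy for rmsd, energy in points if rmsd <= near_cutoff]
    far = [energy for rmsd, energy in points if rmsd > near_cutoff]
    if not near or not far:
        raise ValueError("need solutions both near and far from the design")
    return min(far) - min(near)

# Made-up example data: (RMSD in Å, energy).
funnelled = [(0.5, -120.0), (1.0, -118.0), (6.0, -95.0), (9.0, -90.0)]
flat = [(0.5, -120.0), (7.0, -121.0), (10.0, -119.0)]
print(energy_gap_to_decoys(funnelled), energy_gap_to_decoys(flat))
```

A single number like this throws away most of the landscape, so it is best read alongside the scatter plot rather than in place of it.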
Check back on Monday for the next blog post, where we’ll discuss these energy landscapes in more detail!
( Posted by bkoep 78 544 | Fri, 08/24/2018 - 21:17 | 3 comments )