The energy landscape optimization paper
This blog post is a walk-through for an upcoming paper, showing how researchers at UW and Harvard developed a new method for protein design. This research relied heavily on the work of Foldit players, who will be listed as authors on the paper. (If you have played a Monomer Design puzzle in Foldit, you can opt in to the author list here.)
The paper has not yet gone through peer review, but a pre-print draft of the paper is already available online. (Edit: See final publication by PNAS here.) The paper is written for specialists with a scientific background, but we think its content can be appreciated by anyone with an interest in protein folding and design.
Below we discuss some of the background for this research, take a look at the figures, and review the main points of the paper.
What is an energy landscape?
The title of the paper is Protein sequence design by explicit energy landscape optimization. Before we jump in, we will need to make sure we understand the idea of an energy landscape. We’ve discussed energy landscapes previously, so let’s recap:
There are a lot of possible ways that an unfolded protein might fold up (think of all the different knots you can tie with a shoestring). Each of these possible folds has some amount of free energy, which depends on the amount of clashing, voids, H-bonding, etc. The lower the free energy, the more stable the fold.
In an energy landscape, we like to imagine all of these possibilities laid out on a grid, like a map of possible folds. Then we imagine that the depth at each point of the map corresponds to the energy of each fold. There will be deep valleys and wells where we have stable, low energy folds; and there will be hills and peaks where we have unstable, high energy folds. This map is our energy landscape.
An unfolded protein will naturally fold into its most stable structure. This is the structure with lowest free energy (the deepest point in the landscape).
Every protein sequence has a different energy landscape. Most random protein sequences have a featureless landscape with many, many shallow wells of similar depth. These sequences will not have a strong preference for any particular fold, and they will be poorly folded in real life.
On the other hand, a well-designed protein sequence will have an energy landscape with a single deep well. This sequence has an overwhelming preference for the low energy fold at this well, and the sequence will be well-folded in real life.
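To make the contrast concrete, here is a minimal sketch that turns a handful of fold energies into fold probabilities using standard Boltzmann weighting. The energies, the units, and the function name `fold_probabilities` are all made up for illustration; a real landscape has vastly more than five folds.

```python
import math

def fold_probabilities(energies, kT=1.0):
    """Boltzmann weights: p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT)."""
    lowest = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical energies (arbitrary units) for five candidate folds.
flat_landscape = [-1.0, -1.1, -0.9, -1.0, -1.05]    # many shallow wells
funnel_landscape = [-8.0, -1.1, -0.9, -1.0, -1.05]  # one deep well

flat = fold_probabilities(flat_landscape)
funnel = fold_probabilities(funnel_landscape)

print("most likely fold, flat landscape:  ", round(max(flat), 3))
print("most likely fold, funnel landscape:", round(max(funnel), 3))
```

In the flat landscape no fold gets much more than a fifth of the probability, while in the funnel landscape almost all of it lands on the single deep well.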
Normally, we try to approximate the energy landscape of a sequence by folding the sequence into thousands and thousands of different structures, and calculating the energy of each one (details here). Even though this only gives us a partial view of the energy landscape, it is computationally intensive, taking tens of thousands of CPU hours to compute. (A big thanks goes to Rosetta@home volunteers for providing this CPU power!)
Because energy landscapes are expensive to compute, most protein design methods focus on just the design structure, and ignore the rest of the landscape. We only try to reduce the free energy of our design, and we cross our fingers that the energy landscape has no other energy wells. This is sometimes effective, but it can lead to an energy landscape that has multiple low-energy wells (which means the protein could fold into an unintended structure).
Ideally, we would like a design method that considers the entire energy landscape, but without requiring thousands and thousands of energy calculations.
Figure 1. Energy landscapes and trRosetta. (A) An energy landscape visualizes the energy (depth) across all different folds, or "conformations." Suppose that we want to design a protein with fold P. Most design methods optimize the free energy of fold P and arrive at sequence B (green). Since these methods are blind to the rest of the energy landscape, sequence B might have a landscape with alternative energy wells. A better design method would consider the entire energy landscape to produce sequence A (blue), which has a single low-energy well. (B) The trRosetta neural network takes an input sequence and makes predictions about how the residues will be oriented in the folded structure. This new work shows that trRosetta predictions serve as a good proxy for the energy landscape. The neural network can optimize the sequence to improve the match between the predictions and the desired structure, molding the landscape to favor our desired structure.
Neural networks and sequence likelihood
trRosetta (transform-restrained Rosetta) is a machine learning program developed after the breakthrough AlphaFold program (details in this blog post). The input for trRosetta is a 1D protein sequence, and the output is the predicted distance and orientation between every pair of residues in the 3D folded protein structure.
Previously, researchers at the Baker Lab showed that these distance and orientation predictions are good for protein structure prediction problems. The orientations help us generate a complete 3D model of the folded protein, which accurately shows how the protein will actually fold.
In the new paper, researchers turn trRosetta on its head to evaluate and design proteins. Rather than use the predictions to generate a structure for the input sequence, they compare the distance and orientation predictions to the intended structure, and calculate the sequence likelihood for that structure.
This sequence likelihood score tells us whether trRosetta thinks the design sequence is a good match for the design structure. If the intended distances and orientations of the design structure are a close match to the trRosetta predictions, then the sequence likelihood for that structure will be high. If the design structure is a poor match to the predictions, the sequence likelihood will be low.
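As a rough sketch of the idea, suppose the network outputs a probability distribution over distance bins for each residue pair, and we look up the bin that the design structure actually occupies. The function name, the three-bin layout, and the numbers below are assumptions for illustration; the paper's exact scoring (which also includes orientations) may differ.

```python
import math

def sequence_log_likelihood(predicted, observed_bins):
    """Mean log-probability the predictions assign to the design's own bins.

    predicted:     {(i, j): [p_bin0, p_bin1, ...]} for each residue pair
    observed_bins: {(i, j): bin index measured from the design structure}
    """
    total = 0.0
    for pair, bin_index in observed_bins.items():
        p = max(predicted[pair][bin_index], 1e-9)  # guard against log(0)
        total += math.log(p)
    return total / len(observed_bins)

# Hypothetical 3-bin distance predictions for two residue pairs.
predicted = {(1, 5): [0.8, 0.1, 0.1], (2, 9): [0.1, 0.7, 0.2]}

matching_design = {(1, 5): 0, (2, 9): 1}    # agrees with the predictions
mismatched_design = {(1, 5): 2, (2, 9): 0}  # disagrees with the predictions

good = sequence_log_likelihood(predicted, matching_design)
bad = sequence_log_likelihood(predicted, mismatched_design)
print(good, bad)
```

The structure that sits in the high-probability bins gets the higher (less negative) score, which is the behavior described above.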
Predicting energy landscapes
The researchers used sequence likelihood to show that trRosetta can predict useful information about the entire energy landscape of a protein sequence -- not just information about the preferred structure.
To show this, they used a dataset of energy landscapes for >4000 Foldit designs, which have been accumulated from several years of Foldit design puzzles. This dataset represents about 100 million CPU hours of energy landscape calculations! They divided this dataset into favorable and unfavorable energy landscapes.
First, they calculated the sequence likelihood just for the design structure. They found that the sequence likelihood of a design is a good predictor of whether a design has a favorable or unfavorable energy landscape. Importantly, trRosetta sequence likelihood was a much better predictor than just the Rosetta energy (or Foldit score) of the design. Since trRosetta takes just a couple of minutes to run, this could cut down the need to run expensive landscape calculations!
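One common way to compare two predictors like this is the area under the ROC curve (AUC): the chance that a randomly chosen favorable design scores better than a randomly chosen unfavorable one. This post does not say which metric the paper uses, so treat the sketch below as one reasonable option. All of the numbers are invented to show the calculation and are not results from the paper.

```python
def roc_auc(scores, labels):
    """Chance that a favorable design outscores an unfavorable one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented toy data: 1 = favorable landscape, 0 = unfavorable landscape.
labels = [1, 1, 1, 0, 0, 0]
likelihood = [-0.3, -0.5, -0.9, -1.2, -0.8, -1.5]  # higher is better
energy = [-310, -295, -300, -305, -298, -290]      # lower is better

auc_likelihood = roc_auc(likelihood, labels)
auc_energy = roc_auc([-e for e in energy], labels)  # negate so higher is better
print(auc_likelihood, auc_energy)
```

An AUC of 1.0 means the two groups are perfectly separated, and 0.5 means the score is no better than a coin flip.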
Next, the researchers calculated sequence likelihoods for many different structures across the energy landscape of each design. They found that these likelihoods accurately reflect the shape of the landscapes.
For example, when they looked at a favorable energy landscape with a single energy well, they saw that models within the well had a high sequence likelihood, and models outside the well had low likelihood.
They also looked closely at a few special cases, where an energy landscape shows two competing energy wells. One of these wells represents the intended design fold, and the other well represents a decoy fold that is equally stable. We expect that a protein sequence with this kind of energy landscape is equally likely to fold into the design fold or the decoy fold. This is correctly reflected in the sequence likelihood scores, which are reduced for the design fold, and are comparable between design and decoy folds.
Figure 2. trRosetta predicts information about energy landscapes. (A) Histogram of sequence likelihood (left) and Rosetta energy (right) for 4200 Foldit designs. The distribution of favorable landscapes is shown in blue, and unfavorable landscapes in gray. There is significant overlap in the distributions of Rosetta energy, showing that Rosetta energy is a poor predictor of the whole energy landscape. Sequence likelihood is a better predictor, with less overlap between blue and gray distributions. (B) Energy landscape plots for Foldit designs, with color gradient showing the trRosetta sequence likelihood of models across the landscape. At the top, a landscape with a single well has very high sequence likelihood within the well. Below, landscapes with multiple wells have weaker, more dispersed likelihood. Cartoon illustrations show the design and decoy folds X and Y. On the right, example bimodal distributions show the “ambivalency” of trRosetta distance predictions when a landscape has two energy wells.
This is all well and good. We’ve seen that trRosetta is really useful for predicting theoretical energy landscapes, and can help us cut down on computational work. But does it actually reflect physical reality? A more stringent challenge would compare trRosetta against real experimental data from lab testing.
Last year we published the experimental testing results for 145 Foldit player designs. When the researchers checked this data, they found that trRosetta sequence likelihood was a good predictor of success in the lab!
Figure 3A-B. trRosetta predicts experimental testing results. (A) When we look at the testing results for 30,000 IPD-designed proteins, we see that trRosetta sequence likelihood correlates well with folding stability (as approximated by protease resistance). By contrast, Rosetta energy of the design is poorly correlated with this stability measure. (B) Histogram of sequence likelihood (left) and Rosetta energy (right) for 145 experimentally-tested Foldit designs. Successful designs are in blue, and failures in gray. Sequence likelihood is a better predictor than energy alone, with less overlap between the success and failure distributions.
Optimizing the energy landscape
Finally, the researchers put trRosetta to the test, to see if it could actually redesign proteins to have favorable energy landscapes.
From the 4000 Foldit designs, they selected a representative set of 200 models and used trRosetta to redesign their sequences. Remember that, in Foldit, the original designs were made to optimize the energy (the Foldit score) of just the target fold. Now, trRosetta is trying to optimize the entire energy landscape, which encompasses the energies of all possible folds.
The results were surprising: although trRosetta was good for eliminating decoys and coarsely sculpting the energy landscape, the resulting landscapes lacked a sharp, deep energy well that we like to see for a stable, well-folded protein design. Instead, a combination of trRosetta (optimizing the landscape) and traditional design (optimizing the design energy) yielded the best energy landscapes, with a single deep energy well.
Figure 3C-D. Redesigning proteins with trRosetta energy landscape optimization. (C) Example energy landscapes for two redesigned Foldit proteins. Redesign with trRosetta alone produces a landscape with a single shallow well, and Rosetta lowers the energy without favoring a single energy well. Combining both approaches gives a favorable energy landscape with a single deep energy well. (D) The quality of energy landscapes across all 200 redesigned proteins. The colored lines show how many redesigns (y-axis) meet a threshold for energy landscape quality (x-axis; increasingly stringent threshold). Traditional Rosetta redesign (green) is susceptible to low energy decoys, and fewer than 50% of redesigns pass the lowest threshold; however, Rosetta redesigns that do pass have very deep energy wells and also tend to pass higher thresholds. trRosetta (purple) improves landscapes that fail the low-quality threshold, but cannot achieve deep energy wells that meet a high-quality threshold. A hybrid approach, in magenta, achieves the best of both worlds.
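The curves in panel D can be read as "what fraction of redesigns clears each quality bar?" Here is a small sketch of that calculation. The quality scores, the thresholds, and the function name are hypothetical placeholders, since this post does not define the paper's landscape quality measure; the toy values are chosen only to mimic the shapes described in the caption.

```python
def fraction_passing(qualities, thresholds):
    """For each threshold, the fraction of designs whose quality meets or exceeds it."""
    return [
        sum(q >= t for q in qualities) / len(qualities)
        for t in thresholds
    ]

# Hypothetical landscape quality scores (higher is better) for five redesigns.
thresholds = [0.2, 0.5, 0.8]                 # increasingly stringent
rosetta_only = [0.1, 0.1, 0.1, 0.9, 0.9]     # many failures, a few deep wells
trrosetta_only = [0.4, 0.4, 0.5, 0.3, 0.6]   # few failures, no deep wells

rosetta_curve = fraction_passing(rosetta_only, thresholds)
trrosetta_curve = fraction_passing(trrosetta_only, thresholds)
print(rosetta_curve)
print(trrosetta_curve)
```

With these toy numbers, the Rosetta-only curve starts low but stays flat as the bar rises, while the trRosetta-only curve starts high and drops to zero at the strictest threshold.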
What does this mean for Foldit?
In all Foldit design puzzles so far, we’ve seen that players are very good at optimizing the score of their designs. But the real challenge of protein design is how to account for the rest of the energy landscape, and we still haven’t found a good way to do this in Foldit.
Some players probably remember the 2018 Foldit Partition Tournament, which challenged players to explore the energy landscape of each other’s designs. That showed some promise, but it was still time-consuming and low-throughput (we generated only 20 landscapes in 6 weeks).
trRosetta offers a fast alternative for predicting energy landscapes, and we may be able to combine it with normal Foldit scoring. trRosetta might be able to report the sequence likelihood of a Foldit solution, and even suggest mutations to improve its energy landscape.
One disadvantage with machine learning programs like trRosetta is that they are “opaque” and sometimes difficult to make sense of. We can’t really say why trRosetta makes certain suggestions, or ask which design features are causing problems. That could make it difficult to reconcile trRosetta suggestions with Foldit score components like clashing and H-bonding.
Another shortcoming of trRosetta is that it cannot suggest how to refold the protein backbone to improve an energy landscape. Some protein backbones are inherently more difficult to design than others (or even impossible). Finding designable backbones is an important aspect of protein design, and we think that’s a particular strength for Foldit players.
Still, trRosetta is clearly a useful tool for protein design, and we’ll be looking at ways to incorporate trRosetta into Foldit. Maybe players could find new and unexpected ways to use feedback from neural networks!
Posted by bkoep | Fri, 07/31/2020 - 21:00