## Partition functions

This is part two of a three-part blog post about the problem of protein design. Part one can be found here.

In this blog post, we’re going to introduce a physics concept called a **partition function**, which will help us think quantitatively about the energy landscape described in the previous blog post. Partition functions come from a branch of physics called statistical mechanics, which uses probability to describe how large ensembles of molecules behave—typically in conditions of thermodynamic equilibrium. Before we get into partition functions, let's review what it means to be "at thermodynamic equilibrium."

### Proteins at equilibrium

Consider a test tube containing a sample of protein in water. Let’s say the test tube contains a billion billion (10^{18}) identical protein molecules, which is about the amount of protein we generally work with in the lab. If we let the test tube sit long enough at room temperature, the contents of the test tube will reach **thermodynamic equilibrium**. At equilibrium, the temperature is the same everywhere in the test tube, and the molecules in the test tube have settled into their most balanced configuration. Or, in other words, all the thermal motion in the test tube is distributed as evenly as possible among all the molecules in the test tube. This is important because it means the behavior of all 10^{18} protein molecules should be consistent throughout the test tube. And this allows us to treat all of the protein molecules together as an ensemble. Even if we can’t point to an individual protein molecule in the sample and describe exactly what that molecule is doing, we can use statistical mechanics to describe the ensemble as a whole.

Suppose our protein is stable at room temperature. In that case, we expect most of the protein molecules in our test tube will be folded. We say *most* of the molecules are folded, but not *all* of them—because in reality the ensemble will adopt a distribution of states, corresponding to all the different valleys in our protein’s energy landscape. Remember, our test tube contains 10^{18} protein molecules, and at room temperature there is a considerable amount of thermal motion in the test tube jostling around all of the protein molecules. So, even though the conditions are consistent throughout the test tube, there will still be a certain amount of deviation among the molecules of the ensemble. By chance some individual protein molecules will be jostled out of the folded state and into other regions of the energy landscape.

In our test tube, suppose that the distribution includes:

- A hundred protein molecules that are completely unfolded.
- A thousand molecules that are in a partially folded state.
- A million molecules in a mostly-but-not-completely folded state.
- A billion molecules that are misfolded into a completely different structure.

Well, even a billion misfolded molecules is just a tiny fraction of the total 10^{18} molecules in our test tube, so overall the ensemble is still overwhelmingly (99.9999999%) in the folded state.
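The arithmetic behind that percentage can be checked in a few lines. This is only a sketch using the example counts listed above; the names `TOTAL_MOLECULES`, `non_folded_counts` and `folded_fraction` are ours, chosen for illustration.

```python
# Counts taken from the test-tube example in the post.
TOTAL_MOLECULES = 10**18

non_folded_counts = {
    "unfolded": 100,
    "partially folded": 1_000,
    "mostly folded": 1_000_000,
    "misfolded": 1_000_000_000,
}


def folded_fraction(total, non_folded):
    """Fraction of the ensemble that remains in the folded state."""
    folded = total - sum(non_folded.values())  # exact integer arithmetic
    return folded / total


fraction = folded_fraction(TOTAL_MOLECULES, non_folded_counts)
print(f"folded: {fraction * 100:.7f}%")
for state, count in non_folded_counts.items():
    print(f"{state}: {count / TOTAL_MOLECULES:.0e} of the ensemble")
```

The misfolded state, at one part in a billion, accounts for almost all of the non-folded molecules; the other three states are smaller by several more orders of magnitude.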

As long as we’re talking about distributions, we should clarify that there are different ways to visualize distributions of proteins. Above, we just described a distribution in terms of *numbers of molecules*. In this way, we can visualize how a large ensemble of molecules is divided among the states at a single instant. In addition, we can also think of the same distribution in terms of *time*, and visualize how a single molecule will divide its time among the different states. It’s important to remember that each protein molecule is constantly being jostled around by thermal motion, and will fold and re-fold as it jumps randomly around the energy landscape, spending some amount of time in each state according to our distribution.

Even as individual molecules jump around the different states, we expect them to do so in such a way that the overall distribution remains constant, so long as we are at equilibrium. In our test tube, for example, it’s likely that one of the hundred unfolded molecules will be jostled into the folded state; but it is equally likely that one of the 10^{18} folded proteins will be jostled into the unfolded state, and the two transitions cancel each other out—the ensemble looks the same. This is the essence of thermodynamic equilibrium: at equilibrium, molecules may still exchange between different states, but the exchange is balanced so that the overall distribution does not change.
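If it helps, this balance can be written as a short equation. The notation here is our own and not part of the original derivation: *N_U* and *N_F* are the numbers of unfolded and folded molecules, and *p* is the chance that any one molecule makes the indicated jump in a given instant.

$$N_U \, p_{U \to F} = N_F \, p_{F \to U} \quad\Longrightarrow\quad \frac{p_{F \to U}}{p_{U \to F}} = \frac{N_U}{N_F} = \frac{10^{2}}{10^{18}} = 10^{-16}$$

In other words, for the two populations to stay constant in this example, any single folded molecule must be about 10^{16} times less likely to unfold than a single unfolded molecule is to fold. The sheer number of folded molecules makes up the difference.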

From here onward, we’ll talk about distributions in terms of probabilities, but you can think about these probabilities either as fractions of molecules in an ensemble, or as fractions of time—both are equally valid!

### The partition function

The true distribution of states depends on the energy of each state. If we know the energies of all the possible states for a protein, then we can use a **partition function** to determine exactly what this distribution looks like. From statistical mechanics, we know that the probability of a state decreases exponentially with its energy, and we can write the partition function:

$$P(s) = \frac{e^{-E_s / kT}}{Z}$$

where *P*(*s*) is the probability of state *s*, *E_s* is the energy of state *s*, *k* is the Boltzmann constant, and *T* is the absolute temperature. The number *Z* is simply used to normalize the probabilities so that all of the probabilities sum to 100% (as probabilities should), so *Z* is equal to the total sum of the exponentials for all the states:

$$Z = \sum_{s} e^{-E_s / kT}$$

Don’t worry if these equations look like gibberish to you! You don’t need to understand them to follow along—but we wanted to include them for the sake of completeness.
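For readers who prefer code to equations, here is a minimal sketch of the same calculation. It assumes energies are given in kcal/mol (so *k* is expressed per mole, which is numerically the gas constant) and that "room temperature" means 298 K. The function name `boltzmann_probabilities` and the three example energies are hypothetical, chosen only for illustration.

```python
import math

# Assumption: energies are in kcal/mol, so k is expressed per mole
# (numerically the gas constant R), in kcal/(mol*K).
K_KCAL_PER_MOL_K = 0.0019872
ROOM_TEMPERATURE_K = 298.0  # assumption: "room temperature"


def boltzmann_probabilities(energies, temperature=ROOM_TEMPERATURE_K):
    """Return P(s) = exp(-E_s / kT) / Z for each energy in `energies`."""
    kt = K_KCAL_PER_MOL_K * temperature
    # Shifting every energy by the lowest one leaves the probabilities
    # unchanged (the shift cancels between numerator and Z) but keeps
    # the exponentials from overflowing.
    lowest = min(energies)
    weights = [math.exp(-(e - lowest) / kt) for e in energies]
    z = sum(weights)  # the normalizing sum Z
    return [w / z for w in weights]


# Hypothetical three-state landscape (kcal/mol), for illustration only.
example_energies = [0.0, 1.0, 3.0]
example_probabilities = boltzmann_probabilities(example_energies)
for energy, p in zip(example_energies, example_probabilities):
    print(f"E = {energy:.1f} kcal/mol -> P = {p:.4f}")
```

Only energy *differences* matter here: adding the same constant to every state changes *Z* but not the probabilities, which is why the code is free to shift everything relative to the lowest-energy state.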

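The essential thing to know about the partition function is that the relationship between probability and energy is exponential. This means that (at room temperature) a small energy difference of 1.4 kcal/mol (roughly 14 Foldit points) translates to a ten-fold change in probability. To produce the supposed distribution in our test tube above, the states must have the following energies relative to the folded state:

| State | Molecules | Energy relative to Folded (kcal/mol) |
|---|---|---|
| Folded | ~10^{18} | 0 |
| Misfolded | 10^{9} | 12.6 |
| Mostly folded | 10^{6} | 16.8 |
| Partially folded | 10^{3} | 21.0 |
| Unfolded | 10^{2} | 22.4 |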

We see that states with higher energy (and lower Foldit score) have lower probabilities, and will account for smaller portions of the partition function. Going back to our energy landscape analogy, we’ve just described an energy landscape with five different valleys, with precise depths for each valley. For example, the Folded valley is “deeper” than the Misfolded valley by 12.6 kcal/mol. So, according to our partition function, if there are 10^{18} molecules in the Folded state at room temperature, then we can expect to find about 1 billion molecules in the Misfolded state.
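As a quick check on these numbers, the ten-fold rule can be run forward: start from a relative energy and ask how many molecules we should expect in that state. The sketch below uses the rounded 1.4 kcal/mol figure. Only the Misfolded energy (12.6 kcal/mol) is stated in the text; the other three energies are our own back-calculation from the example counts, and the function name `expected_molecules` is hypothetical.

```python
# Rule of thumb from the post: every 1.4 kcal/mol above the folded state
# costs a factor of ten in probability (at room temperature).
KCAL_PER_TENFOLD = 1.4
FOLDED_MOLECULES = 1e18

# Relative energies in kcal/mol. Only the Misfolded value (12.6) is quoted
# in the post; the others are assumptions derived from the example counts.
relative_energies = {
    "misfolded": 12.6,
    "mostly folded": 16.8,
    "partially folded": 21.0,
    "unfolded": 22.4,
}


def expected_molecules(delta_e, reference_count=FOLDED_MOLECULES):
    """Molecules expected in a state `delta_e` kcal/mol above the reference."""
    return reference_count * 10 ** (-delta_e / KCAL_PER_TENFOLD)


for state, delta_e in relative_energies.items():
    print(f"{state}: {expected_molecules(delta_e):.3g} molecules")
```

The 1.4 kcal/mol figure is itself a rounding: a ten-fold change corresponds to an energy difference of *kT* ln(10), which comes to about 1.36 kcal/mol at 298 K.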

### Partition functions for Foldit designs

The partition function can help us quantify the differences between energy landscapes. In the previous blog post, we examined two designs by fiendish_ghoul and saw how our current protein design strategy (i.e. optimizing for absolute energy) can lead to both good and bad energy landscapes. Let’s return to those two designs:

These energy landscape scatter plots were described previously, and show how Foldit players’ solutions in De-novo Freestyle puzzles can tell us about the energy landscape of a protein. Here, we’ve clustered the solutions from those puzzles to define several distinct states for each protein, which represent individual valleys in their energy landscapes. The ten lowest-energy states are highlighted as multicolored points in the energy landscape.

Now that we have a set of decoy states and their relative energies, we can calculate the partition function for each of these proteins:

The partition function is illustrated here as a multi-colored bar plot, where each bar represents the probability for one of the ten states identified in the De-novo Freestyle energy landscape. Note that the height of the bars is shown on a logarithmic scale, so a bar that reaches 10^{-2} has a probability of 1 in 100, or 1%; a bar that reaches 10^{-6} has a probability of 1 in a million, or 0.0001%. We see that for the protein on the left, the partition function is dominated by the folded state (in blue), with a probability of virtually 10^{0}, or 100%. The next most probable state (orange) has a probability of less than 10^{-12} (1 in a trillion), so it doesn’t even show up on this scale. On the other hand, we see a very different picture for the protein on the right. The partition function shows that the ensemble of molecules will be distributed more evenly across a number of different states. In fact, the blue state has a probability of about 54%, the orange state 32%, the green state 12%, the red state 1.3%, and so forth.
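To get a feel for how shallow the right-hand landscape is, we can run the exponential relationship backwards and ask what energy gaps would produce the quoted probabilities. This is a rough sketch, assuming per-mole units for *k* and a temperature of 298 K; the function name `energy_gap` is ours, and the resulting gaps are estimates from the rounded percentages above, not values taken from the puzzle data.

```python
import math

K_KCAL_PER_MOL_K = 0.0019872  # assumption: k per mole, in kcal/(mol*K)
ROOM_TEMPERATURE_K = 298.0    # assumption: "room temperature"

# Probabilities quoted in the post for the protein on the right.
quoted_probabilities = {"blue": 0.54, "orange": 0.32, "green": 0.12, "red": 0.013}


def energy_gap(p_state, p_reference, temperature=ROOM_TEMPERATURE_K):
    """Energy of a state above the reference, from P_s / P_ref = exp(-dE / kT)."""
    kt = K_KCAL_PER_MOL_K * temperature
    return kt * math.log(p_reference / p_state)


reference = quoted_probabilities["blue"]
gaps = {name: energy_gap(p, reference) for name, p in quoted_probabilities.items()}
for name, gap in gaps.items():
    p = quoted_probabilities[name]
    print(f"{name}: log10(P) = {math.log10(p):.2f}, gap = {gap:.2f} kcal/mol")
```

Under these assumptions, all four states sit within roughly 2 kcal/mol of each other. For the protein on the left, by contrast, a next-best probability below 10^{-12} implies a gap of more than about 16 kcal/mol.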

We happen to have experimental data for each of these proteins, which seem to support the energy landscapes and partition functions that Foldit players discovered in the De-novo Freestyle puzzles. Below are raw data from circular dichroism (CD) experiments, which tell us about the amount of structure in these proteins:

The CD spectrum for the left protein has a shape that is characteristic of a fully-folded protein with both α-helices and β-sheets (we also have a crystal structure of this protein, so we know for certain it is well-folded). Note the wide, flat trough between 208 and 222 nm, and the peak on the left side of the trace where the measured ellipticity is positive for wavelengths below 200 nm. The CD spectrum on the right has a slightly different shape. There is still a partial trough between 208 and 222 nm, so there is definitely some amount of secondary structure (probably helices and maybe sheets) in this protein. However, the measured ellipticity is still negative at shorter wavelengths below 200 nm (admittedly, the signal is much noisier in this spectrum, so the exact shape of the curve is difficult to discern). This indicates that a significant portion of this protein is unstructured, which we would expect from a protein with an unfunnelled landscape and many low-energy decoy states.

So, how can we use these partition functions to improve designed proteins? Check back Thursday for the last blog post in this series, where we’ll propose a possible strategy for using partition functions to improve protein designs in Foldit!

( Posted by bkoep | Mon, 08/27/2018 - 19:51 | 6 comments )

All of these principles also apply to RNA molecules, which (like proteins) can fold up in different ways with different energies.

If you're familiar with 'ensemble diversity', we could say the protein on the left has much lower ensemble diversity than the protein on the right.

In the original blog post, there was an error in the two equations given for the partition function and the number *Z*. Originally, these equations were posted with positive *E_s* in the exponential, which would yield higher probabilities for states with higher energy; this is incorrect.

Instead, the energy *E_s* in the exponential should be multiplied by -1, so that the probability of a state *decreases* with higher energies. The two equations have been corrected above; the text of the blog post is unchanged.

```
I made the following chart based on the above discussion:
Foldit pts gained   probability factor    %gain
-------------------------------------------------------------
224                 10^(224/14)=10^16     (10^16-1)*100%~10^18%
210                 10^(210/14)=10^15     (10^15-1)*100%~10^17%
168                 10^(168/14)=10^12     (10^12-1)*100%~10^14%
126                 10^(126/14)=10^9      (10^9-1)*100%~10^11%
-------------------------------------------------------------
14                  10^(14/14)=10         (10-1)*100%=900%
10                  10^(10/14)=5.179      (5.179-1)*100%=418%
7                   10^(7/14)=3.162       (3.162-1)*100%=216%
5                   10^(5/14)=2.276       (2.276-1)*100%=128%
-------------------------------------------------------------
2.8                 10^(1/5)=1.585        (1.585-1)*100%=58.5%
2                   10^(2/14)=1.389       (1.389-1)*100%=38.9%
1.4                 10^(1/10)=1.259       (1.259-1)*100%=25.9%
1                   10^(1/14)=1.179       (1.179-1)*100%=17.9%
-------------------------------------------------------------
0.7                 10^(1/20)=1.122       (1.122-1)*100%=12.2%
0.5                 10^(1/28)=1.086       (1.086-1)*100%=8.6%
0.28                10^(1/50)=1.047       (1.047-1)*100%=4.7%
0.2                 10^(1/70)=1.033       (1.033-1)*100%=3.3%
0.14                10^(1/100)=1.023      (1.023-1)*100%=2.3%
-------------------------------------------------------------
0.1                 10^(1/140)=1.017      (1.017-1)*100%=1.7%
0.07                10^(1/200)=1.012      (1.012-1)*100%=1.2%
0.05                10^(1/280)=1.008      (1.008-1)*100%=0.8%
0.028               10^(1/500)=1.005      (1.005-1)*100%=0.5%
0                   10^(0/14)=1           (1-1)*100%=0%
From the above, gaining 1 Foldit point gives a %gain of 17.9%.
If you think of the %gains above as interest rates at a bank,
17.9% would be quite generous!
```

```
If one defines y as the Foldit points gained,
the probability factors above all obey 10^(y/14),
and the %gains above all obey (10^(y/14)-1)*100%.
If x=y/14, one can write the probability factor as
10^(y/14)=10^x=(e^ln(10))^x=e^(ln(10)x)=e^(ln(10)y/14).
Next, if x<<1 or y<<14, one can approximate the
probability factor as 1+ln(10)x = 1+ln(10)y/14
or 1+2.302585x ~ 1+0.1644704y, which lets the
%gain be ln(10)100x% = ln(10)100y/14%
or 230.2585x% ~ 16.44704y%.
In practice, x<=1/50 or y<=14/50=0.28
let the above approximations work well
while x>=1/28 or y>=14/28=0.5 do not.
```
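The chart and the small-gain approximation above can be reproduced with a short script. This is just a sketch of the commenter's formulas, taking the blog post's rough figure of 14 Foldit points per ten-fold change at face value; the function names are hypothetical.

```python
import math

# From the post: roughly 14 Foldit points per ten-fold change in probability.
POINTS_PER_TENFOLD = 14.0


def probability_factor(points):
    """Exact factor 10^(y/14) for a gain of `points` Foldit points."""
    return 10 ** (points / POINTS_PER_TENFOLD)


def percent_gain(points):
    """Exact percentage gain, (10^(y/14) - 1) * 100."""
    return (probability_factor(points) - 1) * 100


def percent_gain_linear(points):
    """First-order approximation, using 10^x ~ 1 + ln(10) * x for small x."""
    return math.log(10) * points / POINTS_PER_TENFOLD * 100


for points in (14, 5, 1, 0.5, 0.28, 0.1):
    exact = percent_gain(points)
    approx = percent_gain_linear(points)
    print(f"{points:>5} pts: exact {exact:8.2f}%  linear {approx:8.2f}%")
```

The two columns agree closely for small gains and drift apart as the gain grows, since the linear formula drops the higher-order terms of the exponential.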

Is this the same concept as ensemble diversity in Eterna?