Protein folding -- a step closer?

The Holy Grail of biochemistry is what’s known as the “protein folding problem”. We’ve long known that most proteins’ functions are based on their three-dimensional structures (i.e. their folds), and that these structures are dependent on their linear amino acid sequences. What we haven’t known is how exactly the amino acid sequences determine the structures. This would be an extremely useful thing to know, because currently the only way to determine a protein’s 3D structure is through X-ray crystallography or NMR, both of which are expensive and time-consuming, and neither of which is guaranteed to work. If the protein folding problem were solved, the thousands and thousands of proteins that we’ve sequenced could have their structures deciphered almost overnight, revolutionizing our understanding of how proteins function. Of course it probably won’t happen that way; the protein folding problem will probably be slowly chipped away at rather than solved in one fell swoop, and crystallographers and NMR spectroscopists will remain gainfully employed for the foreseeable future.

But two papers published in a recent issue of Nature from the lab of Rama Ranganathan knocked a chip out of the protein folding problem. Their results show us that protein folding might be simpler than we think.

In the first paper (Socolich et al), they attempt to figure out just how many and what kind of interactions are responsible for a protein adopting its fold. As proteins collapse into a globular state, various amino acids interact with each other in order to hold the protein in a certain shape. But how many interactions might this be, and how deep do they go? Socolich et al used an evolution-based method called statistical coupling analysis (SCA). This technique works by aligning a large number of homologous protein sequences and looking for those amino acids that are “statistically coupled” – that is, those which do not vary randomly with respect to each other, but rather tend to coevolve with each other. These amino acids, they reason, must interact with each other in some important fashion, because you rarely find one changed without its partner being changed accordingly. What’s really interesting is that SCA studies seem to indicate that a relatively small number of amino acid interactions are responsible for a protein’s fold, and that it takes only a small amount of sequence information to specify such a fold.
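To make the idea of “statistically coupled” positions concrete, here is a minimal Python sketch. It does not implement SCA itself, which scores coupling with a perturbation-based statistical energy. As a simpler stand-in it computes the mutual information between two alignment columns, which is high when two positions change together and zero when they vary independently. The toy alignment and the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    This is a simple covariation measure used here as a stand-in;
    it is NOT the statistical coupling energy used in SCA.
    """
    n = len(alignment)
    counts_i = Counter(column(alignment, i))
    counts_j = Counter(column(alignment, j))
    counts_ij = Counter(zip(column(alignment, i), column(alignment, j)))
    mi = 0.0
    for (a, b), c in counts_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi


# Hypothetical toy alignment: positions 0 and 2 always change together,
# position 1 is perfectly conserved, position 3 varies independently.
toy_alignment = ["AGKL", "AGKV", "SGEL", "SGEV"]

for i, j in combinations(range(4), 2):
    print(i, j, round(mutual_information(toy_alignment, i, j), 3))
```

In this toy case only the pair (0, 2) gets a nonzero score. A real alignment would need hundreds of sequences and some correction for shared ancestry before scores like these mean much.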

Socolich et al decided to put this idea to the test by generating artificial proteins using only information available from the SCA. If the SCA results were accurately representing protein folds as “low sequence information” entities, then it should be possible to make artificial proteins that fold properly using nothing but that information. The researchers used a short protein domain known as the WW domain, which works well because it has a large number of sequenced members. (In natural proteins, WW domains typically bind proline-rich sequences.) To generate the artificial sequences, they wrote two computer algorithms. The first (called IC, for site-independent conservation) takes only the pattern of amino acid conservation into account and shuffles the amino acids at each position according to how often they occur. For this algorithm, there is no statistical coupling: each position is allowed to vary without regard to other positions. The second (called CC, for coupled conservation) keeps the conservation pattern but also takes statistical coupling into account.
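The IC idea is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the authors’ code: each position is sampled independently from the residue frequencies seen in that column of the natural alignment. The toy alignment, the function names, and the fixed random seed are all hypothetical. The CC algorithm is not sketched here, since it also has to reproduce the coupling pattern.

```python
import random
from collections import Counter


def column_frequencies(alignment):
    """For each alignment position, count how often each residue occurs."""
    length = len(alignment[0])
    return [Counter(seq[i] for seq in alignment) for i in range(length)]


def sample_ic_sequence(freqs, rng):
    """Draw one residue per position, independently, weighted by observed counts."""
    residues = []
    for counts in freqs:
        letters = list(counts.keys())
        weights = list(counts.values())
        residues.append(rng.choices(letters, weights=weights, k=1)[0])
    return "".join(residues)


# Hypothetical toy alignment: in the "natural" sequences, A at position 0
# always pairs with K at position 2, and S always pairs with E.
toy_alignment = ["AGKL", "AGKV", "SGEL", "SGEV"]

rng = random.Random(0)  # fixed seed so the run is repeatable
freqs = column_frequencies(toy_alignment)
ic_library = [sample_ic_sequence(freqs, rng) for _ in range(1000)]

# Count sequences where the natural pairing between positions 0 and 2 is broken.
mismatched = sum(1 for s in ic_library if (s[0], s[2]) in {("A", "E"), ("S", "K")})
print("fraction with broken pairing:", mismatched / len(ic_library))
```

Each column of the sampled library keeps roughly the same residue frequencies as the natural one, but about half the sequences carry a pairing that never occurs in the natural alignment. That loss of coupling is what separates the IC sequences from the CC sequences.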

FIGURE 1. SCA-based protein design. a, Structure of a representative WW domain (Nedd4.3, Protein Data Bank 1I5H) in complex with a target peptide (in stick representation). The two canonical tryptophans are shown as space-filling side chains. The figure was prepared using PyMol. b, SCA conservation scores for each position in the WW alignment in arbitrary units of statistical energy. Position numbers (x axis) and the secondary structure diagram at the top coincide with matrix columns in c−e. c, A matrix representation of statistical coupling values from perturbation analysis of five positions (rows) in the WW domain MSA. d, The matrix for an alignment of IC sequences, built by randomly selecting amino acids at each site from the observed frequency distributions in the natural alignment. e, The matrix for an alignment of CC sequences, derived from a design algorithm where both the conservation pattern and the pattern of statistical couplings in the natural alignment are preserved. Scale bar shows the SCA coevolution score, ranging from 0 (blue) to 2 (red).

The researchers then generated protein libraries based on these two sets of possible proteins, along with a library of natural WW domains and a library of randomly generated proteins as positive and negative controls, respectively. They found that most of the natural proteins folded properly. (There was no guarantee that this would be the case, since the proteins were being expressed in bacteria.) Not surprisingly, none of the random proteins were observed to fold properly. Also, none of the proteins that preserved only the amino acid conservation pattern, with no statistical coupling (the IC group), were able to fold either. What is remarkable is that a large percentage (12 out of 43) of the statistically coupled proteins (the CC group) were found to be folded, and were within the range of thermal stability for native WW domains. This is in spite of the fact that they had low average sequence identity (35%) to the native proteins. These results show that protein folds can be specified by a fairly small amount of information: the amino acid conservation pattern along with a small number of statistically coupled residues. High sequence identity is apparently not necessary.
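For readers unfamiliar with the term, sequence identity is just the percentage of aligned positions at which two sequences carry the same residue. The Python sketch below shows one straightforward way to compute it and to average it over a set of natural sequences. How the authors arrived at their 35% figure (gap handling, which natural sequences were compared) is not described here, so the averaging scheme, the toy sequences, and the function names are assumptions.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions at which two equal-length sequences match."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def mean_identity_to_library(candidate, natural_seqs):
    """Average identity of one candidate to every sequence in a natural set.

    Assumption: a plain mean over all natural sequences; the papers may
    define their average differently.
    """
    total = sum(percent_identity(candidate, s) for s in natural_seqs)
    return total / len(natural_seqs)


# Hypothetical toy data, already aligned and gap-free.
natural = ["AGKL", "AGKV", "SGEL", "SGEV"]
artificial = "AGEL"

print(percent_identity(artificial, natural[0]))
print(mean_identity_to_library(artificial, natural))
```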

But they didn’t stop there. In the second paper (Russ et al), the researchers decided to test their artificial proteins for function. That the proteins folded properly is no guarantee that they’re functional. And it’s functionality, after all, that is selected for. Amazingly, they found that the binding function of the artificial proteins was indistinguishable from that of the natural proteins: they bound proline-rich target sequences just as the native proteins did, and could even be assigned to the same functional classes. So we see that not only is proper folding specified by the conservation pattern and a small number of coevolving residues, but that function is as well.

These results are interesting because by providing insight into what it takes for a protein to fold properly, they bring us a step closer to solving the protein folding problem. It had long been thought that conserved residues were crucial for determining a protein’s structure. And while this is certainly true, it was never by itself adequate to explain just what it took for a protein to fold. For one thing, there is the odd fact that even in large proteins, only a small handful of residues are strictly conserved. Most conserved residues will occasionally vary. If you see a residue that’s conserved 80% of the time, that’s a good reason to think it’s important. But why is it allowed to vary the other 20% of the time? It has long been presumed that interactions between amino acids will allow some residues to change while requiring others to remain the same, and that this is very much context dependent, such that even closely related proteins will differ in which mutations they will or won’t tolerate. But this raises the question: how much additional sequence information, aside from conserved residues, is necessary to determine a protein’s structure? If the results here are any indication, not much. It may be that a small number of coupled, coevolving residues contain all the additional sequence information that we need. If this is the case, the whole business might be far simpler than previously thought.

I will add one caveat though: just because these results work well with WW domains, there is no guarantee that they’ll work with other types of proteins. Far too often have people thought that the solution to the protein folding problem was within reach, only to see their carefully crafted program work well with the particular protein family that it was fine-tuned to work with, yet fail miserably with everything else. We’ll need plenty of additional research to see if this one pans out. And I should point out that de novo structural prediction wasn’t what the researchers were going for here. It’s just that their research provides the type of knowledge that we’ll need if such predictions are ever going to be reliable.

Finally, to tie this in with the whole cre/ID business, the seeming difficulty with which proteins fold has at times been used to argue that new folds could never evolve, or that protein folds could never have arisen de novo. These claims are similar to the type that William Dembski has made about protein function, which don’t stand up well to scrutiny. But such arguments are always easier to sell when applied to things we know little about, like the protein folding problem, since one can always appeal to “high information” content and whatnot to say that protein folds could never evolve. Unless, of course, it turns out that they don’t require high information content.


Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, Ranganathan R. “Evolutionary information for specifying a protein fold.” Nature. 2005 Sep 22;437(7058):512-8.

Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan R. “Natural-like function in artificial WW domains.” Nature. 2005 Sep 22;437(7058):579-83.

See also the Nature News and Views commentary:

Kelly JW. “Structural biology: Form and function instructions.” Nature. 2005 Sep 22;437(7058):486-7.