De novo origination of a gene encoding a functional protein

A recurrent theme amongst ID proponents is the supposed difficulty of protein evolution, especially as it relates to the origination of new protein-coding genes. This is, I suspect, a key reason why ID proponents such as Paul Nelson are so enamoured of ORFans, and it is a foundational principle for the application of ID theory to evolution (the idea being that protein-coding genes possess Complex Specified Information, and thus cannot arise by natural processes). Thus, studies that pertain to the origins of new protein-coding genes are going to figure prominently in the scientific aspect of the ID debate, especially since ID proponents insist that new protein-coding genes cannot arise “by chance”.

It is in this context that a recent study by Jing Cai and colleagues is of interest. The title of the article suffices to explain the study – “De novo Origination of a New Protein-Coding Gene in Saccharomyces cerevisiae”. What these authors describe is a series of studies of a yeast gene, BSC4. This gene was originally identified as a candidate containing a so-called read-through translation termination (or stop) codon. This gene was studied in more depth, whereupon Cai et al. found that the protein encoded by this gene was novel in genome databases, not resembling any other protein in any organism. Importantly, this includes the genomes of related Saccharomyces species; this indicates that this protein in S. cerevisiae arose relatively recently, after this species diverged from its close relatives.

Of course, all of these results are the outcomes of “sequence-gazing” – compilation and analysis of genome sequence datasets. Cai et al. set out to test whether BSC4 was a true protein-coding gene, or merely an unexpressed sequence that happens to contain an open reading frame. To this end, Cai et al. performed a number of studies:

  • They analyzed the BSC4 coding region in a number of S. cerevisiae isolates, and found evidence for purifying selection amongst these genes. (In other words, they found that non-synonymous, or amino acid-changing, mutations were fixed much less frequently than synonymous mutations.) This result is expected if the gene is expressed as protein (and, of course, if the protein has a function that is acted upon by natural selection).
  • To buttress this conclusion, these authors then searched databases of peptides (1) obtained by tandem mass spectrometry (MS/MS) analysis of yeast proteins; this effort yielded 29 peptides that corresponded to the BSC4 gene. This is strong evidence for the proposition that the BSC4 gene is expressed as protein, as this is the only way that these peptides could be found in MS/MS studies.
  • That the BSC4 gene is expressed as RNA was confirmed by reverse transcription/polymerase chain reaction (RT/PCR) studies.
  • Finally, these authors noted that a systematic screen for synthetic lethality indicated genetic interactions between BSC4 and both DUN1 and RPN4, two other yeast genes. As with the MS/MS and RT/PCR studies, synthetic lethality is best explained if the two partners are both expressed as proteins.
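The purifying-selection argument in the first bullet can be illustrated with a rough Python sketch. It classifies each differing codon between two aligned, in-frame sequences as synonymous or non-synonymous using the standard genetic code. The function name and the toy sequences are my own inventions for illustration, and this crude count is not the analysis performed by Cai et al. (proper estimates normalize by the number of synonymous and non-synonymous sites).

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Assumes two gap-free, in-frame sequences of equal length that
    contain only A, C, G and T (hypothetical helper for illustration).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # same amino acid: silent change
        else:
            nonsyn += 1   # amino acid (or stop) differs
    return syn, nonsyn


# Toy example (made-up sequences, not BSC4):
# GCT -> GCC keeps alanine; TTT -> TTA changes Phe to Leu.
print(count_codon_changes("ATGGCTAAATTT", "ATGGCCAAATTA"))  # (1, 1)
```

Under purifying selection, one expects non-synonymous changes to be under-represented relative to synonymous ones once the number of available sites of each kind is taken into account.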

As a package, these results make a very compelling case that the BSC4 gene is expressed as protein, AND that it has some biological function.

What then of the matter of the origins of the BSC4 gene? The following bullets summarize some pertinent items.

  • Fig. 3A from the paper (reproduced below) shows the arrangement of the BSC4 gene in S. cerevisiae, and the corresponding genetic locus from six close relatives. It also shows that this region is a recent development, as it is absent from more distantly-related fungi (Ylip, Ncra, and Spom at the bottom of the figure). This figure describes the history of the locus – it arose via some sort of rearrangement sometime before the seven yeast species diverged from one another, but after they diverged from other fungal lineages. (Importantly, the BSC4 gene did not “come from” another pre-existing protein-coding gene, either in yeast or any other organism. This argues against horizontal gene flow and/or gene duplication as a source of the BSC4 protein-coding region.)
  • In those different species (S. bayanus, S. paradoxus, etc.) that possess the same locus that contains BSC4 in S. cerevisiae, there is considerable nucleotide sequence divergence, more than is usually seen in protein-coding genes. However, there is a degree of conservation consistent with a function for this region in these yeast species as encoding so-called non-coding RNAs.
  • There is a ca. 100 bp portion of the BSC4 gene that is about 50% identical amongst four different yeast species (S. bayanus, S. mikatae, S. paradoxus, and S. cerevisiae). This extended region of identity in four species strongly suggests that the region of interest existed in the common ancestor of the four species. This is important, as it means that the BSC4 gene did not arise by some other insertion, duplication, or rearrangement; rather, it is the product of the accumulation of point mutations.
  • This region is transcribed (as determined by RT/PCR analysis) in these four species. This supports the proposition that these regions encode non-coding RNAs.
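For readers who want to see what a statement like "about 50% identical" means operationally, here is a minimal Python sketch that computes percent identity over the ungapped columns of a pairwise alignment. The function name and the toy sequences are assumptions of mine; the paper's own alignments and software are not reproduced here.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Expects two rows of a pairwise alignment of equal length
    (hypothetical helper for illustration).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(aln_a.upper(), aln_b.upper()):
        if x == gap or y == gap:
            continue              # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignments (made-up sequences, not the BSC4 locus):
print(percent_identity("ACGTACGT", "ACGTTTTT"))  # 62.5
print(percent_identity("AC-TAG", "ACGTCG"))      # 80.0
```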

Putting these observations together, it is apparent that the LYP1-ALP1 region (including the intergenic region of interest) in yeasts arose by a rearrangement that produced these two genes and an intergenic region that is transcribed but (except for S. cerevisiae) has no protein-coding potential. At some point after this founding rearrangement, and after the S. cerevisiae lineage diverged from the other yeast lineages, part of the intergenic region picked up translation initiation and termination codons (one of which is leaky, the property that allowed the identification of BSC4 in the first place). The result was a new protein-coding region (BSC4), the product of which either possessed from the outset or gained, via additional mutation, the functions inferred by the synthetic lethality results.
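The idea that point mutations can turn a transcribed but non-coding stretch into an open reading frame is easy to demonstrate with a toy example. The sketch below scans the three forward frames of a sequence for an ATG followed by an in-frame stop codon. Everything here (the function name, the `min_codons` threshold, and both sequences) is hypothetical and chosen only for illustration; it ignores the reverse strand, stop-codon read-through, and everything else that made the real BSC4 analysis interesting.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon; `end` is
    exclusive and includes the stop. `min_codons` counts the codons
    before the stop (hypothetical helper for illustration).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ATG
    return sorted(orfs)


# Made-up "ancestral" sequence with no start codon, and a "derived"
# version that differs by a single C -> T change creating an ATG.
ancestral = "CCACGGCTAAAGAATTCTAACC"
derived = "CCATGGCTAAAGAATTCTAACC"

print(find_orfs(ancestral))  # []
print(find_orfs(derived))    # [(2, 20)]
```

A single substitution is enough, in this contrived case, to convert a sequence with no ORF into one with a short ORF; the real locus presumably required more than one change.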

(While this issue is not discussed at much length, I suspect that the authors would argue that de novo origination is a better explanation than gene loss in this case, since the latter would invoke gene loss in at least six yeast species. This rather high rate of loss is not consistent with the observation that different geographic isolates of S. cerevisiae, many of which have been isolated for millennia, all retain the BSC4 gene.)

Finally, it is of interest to note that the antecedent of the BSC4 gene in other yeast species seems to encode so-called non-coding RNAs. This raises the interesting possibility that non-coding RNAs may be sources of novel protein-coding genes, and suggests that one step of a hypothetical pathway for the de novo origination of new proteins - the “creation” of transcription regulatory sequences - may not be an issue (or at least not a conceptual hurdle of any import). Also, it helps to point out that the BSC4 region does not seem to be RRP6-dependent (see this essay), and thus BSC4 is not derived from these so-called Cryptic Unstable Transcripts.


(1) A fairly new approach to studying genomes and their encoded products is to isolate collections of proteins by various means, digest them (or chop them up) with proteases such as trypsin, and then analyze the peptides by tandem mass spectrometry (MS/MS). MS/MS provides, among other things, highly accurate masses for each peptide and for the fragments it breaks into; by comparing these with the masses predicted from sequence databases, it is possible to assign an amino acid sequence to most such peptides. Thanks to amazing advances in the hardware, it’s possible to perform this with complex mixtures (such as crude lysates) and get an idea of the scope of the expressed proteins in a compartment or cell. Cai et al. refer to such collections in their paper. An overview of the approach is here and the database is here.
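As a rough illustration of the first steps of this approach, the Python sketch below performs an in silico tryptic digest (cutting after K or R, but not before P) and computes the monoisotopic mass of each resulting peptide from standard residue masses. The function names and the example protein are made up, and this is only a sketch: real search engines also match fragment-ion spectra, which is what distinguishes peptides of equal mass.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(protein):
    """Cut after K or R unless the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Made-up protein fragment, not the BSC4 sequence.
for pep in tryptic_peptides("MKRPLAKGG"):
    print(pep, round(peptide_mass(pep), 4))
```

Note that leucine and isoleucine have identical masses, so a peptide mass on its own cannot tell them apart.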

Reference for Cai et al:

Cai J, Zhao R, Jiang H, Wang W. 2008. De Novo Origination of a New Protein-Coding Gene in _Saccharomyces cerevisiae_. Genetics 179, 487-496.

A slightly different version may also be found at The RNA Underworld.