An argument is ORFaned

Paul Nelson has a “new” argument against common descent. It revolves around the discovery of ORFans, “orphan Open Reading Frames”, i.e. stretches of DNA that appear to code for a protein (an Open Reading Frame, ORF), but where we currently have no idea what the protein is or does, or what other proteins it is related to (hence ORFan). A powerpoint presentation from one of Dr. Nelson’s talks that mentions ORFans is here. ORFans also loom large in Dr. Nelson’s rather forceful commentary on a post by Sahotra Sarkar describing a debate between them.
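To make the definition concrete, here is a minimal Python sketch of how an ORF can be spotted in a DNA string. It assumes the standard genetic code (ATG as the start codon; TAA, TAG and TGA as stops) and scans only the forward strand. The function name `find_orfs` and the `min_codons` cutoff are made up for illustration, and real gene finders do a great deal more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    `start` and `end` are 0-based, end-exclusive, and the stop codon is included.
    `min_codons` counts the codons before the stop (including the ATG).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG after this stop
    return orfs
```

For example, `find_orfs("CCATGAAATTTGGGTAACC")` returns `[(2, 17, 2)]`, i.e. the stretch `ATGAAATTTGGGTAA`. An ORFan is simply an ORF like this whose predicted protein has no detectable match in the databases.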

Are ORFans a significant problem for evolution? No, not in the least. The ORFan story, while still not completely understood, represents a good example of how science works, and why it’s a good idea to actually understand evolutionary biology before you criticise it (and why it’s a good idea to not stop reading in 2003).

Paul Nelson thinks that ORFans are a problem for common descent as they represent “discontinuities” in common descent.

Orphan genes – open reading frames with no detectable similarity to any other known sequence – constitute a surprisingly high percentage of the genomes of fully-sequenced organisms.

According to the Theory of Common Descent, all proteins are derived from other proteins, and ultimately from the minimal set present in the LUCA [Last Universal Common Ancestor], by descent-with-modification relationships (e.g., gene duplication).

There are two claims here: 1) ORFans have no similarity to other sequences and 2) common descent assumes that all (or a very high proportion) of current proteins originated with the LUCA.

Claim 1 is deeply misleading and claim 2 is wrong. We fully expect a reasonable proportion of new genes to be generated de novo during evolution. We even have examples of proteins that are so generated. The most famous of these is the nylonase gene, which allows bacteria to metabolise the artificial polymer nylon. This was produced by a mutation in a piece of non-coding “Junk DNA” which generated a new protein-coding sequence (Okada et al., 1983). The sperm-specific dynein intermediate chain gene was generated by a fusion mutation between two genes (so strictly speaking it falls under the gene duplication rubric), but the coding region of the new Sdic gene is generated from the non-coding intronic regions, so protein homology studies would have a hard time identifying it (Nurminsky et al., 1998). Formation of new genes poses no problem for evolutionary biology or common descent.

Let’s look at claim 1 in more detail. Nelson gives the impression that these are all single genes with no relation to any other genes. In fact many ORFans come in families, and many genes with no apparent relatives at the time of discovery have had relatives found later. In my own signal transduction field, a coding sequence originally thought to be an ORFan was finally identified as being opioid-receptor-like, and its ligand was found (called, ironically, orphanin); it is now a drug target for analgesics. There are many instances where ORFans have been found to be related to extant genes. When H. influenzae was first sequenced, 64% of its ORFs were ORFans; now, only 5.2% are.

There are, of course, things called singleton ORFans: unique genes that do not currently seem to be related to any existing genes. In prokaryotes, something like 14% of all bacterial genes are currently singleton ORFans, but this can be expected to decrease as we sequence more genomes (as with H. influenzae). We may also be missing some related proteins, as ORFans may have diverged so much during evolution that we can’t currently identify their nearest relatives. Improved detection algorithms, against a background of improved gene databases, will reduce the number of ORFans.

I’ll remind you again that as well as these singleton ORFans, there are ORFans that are limited to closely related organisms, and ORFans that are found in families of organisms (just as if they were related by *gasp* common descent).

Let’s look at Nelson’s treatment of this in a little more detail. He quotes Siew N, Fischer D (Twenty thousand ORFan microbial protein families for the biologist? Structure. 2003 Jan;11(1):7-9), a three-page mini-review.

“The Total Number of ORFans in Microbial Fully Sequenced Genomes Continues to Grow (Fig. 1, Siew and Fischer 2003, p. 8)”

Unfortunately, he ignores or overlooks the full paper from this group (Siew N, Fischer D. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins. 2003 Nov 1;53(2):241-51).

From Siew and Fischer, Proteins: “We have shown that the number of ORFans is currently growing, whereas their fraction among ORFs is slowly diminishing.”

That is, as you sequence more genomes, you find more relatives. This is what you would expect on the basis of evolutionary theory, and it does not represent a problem for common descent. Remember H. influenzae going from 64% of its ORFs being ORFans to only 5.2% now as more genomes were added; this is the universal pattern for all organisms.
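How can the number of ORFans grow while their fraction shrinks? A toy calculation shows there is nothing paradoxical about it. The sketch below is not from Siew and Fischer; it assumes a purely hypothetical gene pool in which each gene family turns up in any given genome independently with some fixed probability, and it treats a family seen in exactly one genome as a singleton ORFan. The family counts and probabilities in `TOY_FAMILIES` are invented to make the point, not measured values.

```python
def expected_singletons(n_genomes, families):
    """Expected number of families present in exactly one of n_genomes genomes.

    `families` is a list of (count, p) pairs: `count` families, each present
    in any given genome with independent probability `p` (a toy assumption).
    """
    return sum(
        count * n_genomes * p * (1 - p) ** (n_genomes - 1)
        for count, p in families
    )


def expected_total_genes(n_genomes, families):
    """Expected total number of genes across all sequenced genomes."""
    return sum(count * n_genomes * p for count, p in families)


def singleton_fraction(n_genomes, families):
    """Expected singletons as a fraction of all sequenced genes."""
    return expected_singletons(n_genomes, families) / expected_total_genes(
        n_genomes, families
    )


# Hypothetical pool: a few widespread families and many rare ones.
TOY_FAMILIES = [(1_000, 0.5), (100_000, 0.001)]

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        count = expected_singletons(n, TOY_FAMILIES)
        frac = singleton_fraction(n, TOY_FAMILIES)
        print(f"{n:5d} genomes: {count:9.0f} singletons, fraction {frac:.3f}")
```

In this toy pool the absolute number of singletons keeps climbing as genomes are added, because each new genome samples a few more rare families, while the fraction falls, because previously unmatched genes keep finding relatives. That is the same qualitative pattern Siew and Fischer report.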

Again he quotes the Siew and Fischer mini-review:

“If proteins in different organisms have descended from common ancestral proteins by duplication and adaptive variation, why is it that so many today show no similarity to each other?”

But they also provide answers to their own question. They say: “There are two noteworthy observations about our ORFan database. The first is that over half of the ORFans are shorter than 150 residues. Possible explanations for this bias could be that some of the shorter ORFans may not correspond to expressed proteins [27], or that their abundance is a result of a limitation of computational sequence comparison; it is harder for current tools to detect sequence similarity for short sequences.”

In the larger paper, they also say:

“It is probable that some of the short ORFs are the result of random distributions of nucleotides, or of sequencing errors that lead to frame shifts and to wrong stop codons”

and again:

“Another possible reason for the abundance of short ORFans could be technical: It may be more difficult for sequence comparison programs such as BLAST [22] to find significant matches for shorter sequences (see the work of Mackiewiez et al. [26] for yet another possible explanation).” And there is evidence that a significant fraction of ORFans represent unrecognized divergent proteins (see below). They also say “ORFans may correspond to highly divergent sequences that actually belong to known families (but are beyond recognition capabilities of current tools), [2] or to sequences that correspond to new, unique, single-member families.”
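Why short sequences are harder for BLAST-like tools can be seen from the standard Karlin-Altschul expectation formula, E = K·m·n·exp(-λS), where S is the alignment score, m the query length and n the database size. The sketch below is a back-of-envelope illustration only: the values of `LAMBDA` and `K`, the database size and the per-residue score are assumed for the example, and it pretends the whole query aligns at a constant average score, which real alignments do not.

```python
import math

# Illustrative statistical parameters (assumed; real values depend on the
# scoring matrix and gap penalties in use).
LAMBDA = 0.318
K = 0.134


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Karlin-Altschul E-value: chance hits expected with score >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def evalue_for_divergent_match(query_len, score_per_residue, db_len):
    """E-value if the full query aligns to a distant relative.

    Simplification: total score = score_per_residue * query_len.
    """
    raw_score = score_per_residue * query_len
    return expect_value(raw_score, query_len, db_len)


if __name__ == "__main__":
    DB_LEN = 1_000_000_000      # hypothetical database size in residues
    WEAK_SIMILARITY = 0.3       # hypothetical average score per residue
    for length in (100, 200, 400):
        e = evalue_for_divergent_match(length, WEAK_SIMILARITY, DB_LEN)
        print(f"query of {length} residues: E = {e:.2g}")
```

With these assumed numbers, the same degree of divergence gives a hopelessly insignificant hit for a 100-residue protein and a clearly significant one for a 400-residue protein. A highly diverged short protein can therefore look like an ORFan even when its relatives are sitting in the database.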

Nelson also cites Siew N, Fischer D (Unravelling the ORFan puzzle, Comparative and Functional Genomics 2003, 4, 432-441) in his blog entry as evidence of “A world class puzzle” (Siew and Fischer actually say it “entails interesting evolutionary puzzles”). But he ignores their analysis.

The authors say this:

“We propose the following model to explain the origin and abundance of ORFans and PCOs, which is somewhat consistent to the models discussed above. Many ORFans may have been generated as the result of a number of possible evolutionary events, which may include horizontal transfer, rapid evolution and gene-loss. ORFans (and other ORFs) without selection pressure have been deleted throughout microbial deletion mechanisms, and thus, microbial genomes are kept at ‘reasonable sizes’ [43]. ORFans that have retained or acquired an important function are kept, thus creating new sequence families with a seed of a single ORFan.”

Again, while we have no definitive answer to ORFans, they represent no threat to common descent, and we have several entirely reasonable explanations (that are proposed in the very publications Dr. Nelson cites).

Let’s summarise the main explanations: 1) Some ORFans may be artefacts. 2) Some ORFans may have relatives, but we haven’t sampled enough genomes yet. 3) Some ORFans may have relatives, but our tools aren’t good enough to detect these relatives yet. 4) Some ORFans may be de novo generated proteins.

Now that was the state of play in 2003, and these explanations were not pulled out of thin air: there was supporting evidence for them even then. Unsurprisingly, the field has moved on a bit and the explanations have been tested. Dr. Nelson doesn’t mention them in the powerpoint slides. Let’s look at the explanations and some recent evidence.

1) Some ORFans may be artefacts. As noted above, many ORFans are very short, 100-150 codons long. It is likely that many of these represent database or annotation errors. Also, in any genome, one would expect some ORFs to form purely by chance. Fukuchi S and Nishikawa K (Estimation of the number of authentic orphan genes in bacterial genomes. DNA Res. 2004 Aug 31;11(4):219-31, 311-313) closely examined sequences and estimated that about half of all short ORFans are sequencing or other errors.

2) Some ORFans may have relatives, but we haven’t sampled enough genomes yet. While we have something like 150 complete bacterial genomes sequenced, there are many, many more bacteria that are not yet sequenced, and these will have genomes quite divergent from the human pathogens that form the majority of current sequences. This is especially important because a gene acquired by horizontal transfer from a distantly related bacterium that has not been sequenced will look like an ORFan (until that distantly related bacterium is sequenced). A recent paper shows that many E. coli ORFans are the result of horizontal gene transfer from bacteriophages (Daubin and Ochman, 2004; bacteriophages are viruses, which is why they don’t turn up in bacterial database comparisons).

3) Some ORFans may have relatives, but our tools aren’t good enough to detect these relatives yet. Siew and Fischer, not content to rest on their laurels having posed an interesting puzzle, have tried to solve one aspect of it. Using improved fold recognition software, and a larger database of fold family structures, they have found that in Bacillus sp., some related ORFans are members of the alpha/beta hydrolase superfamily, and most likely derive from the haloperoxidases (Siew et al., 2005).
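Going back to explanation 1 for a moment, the claim that some short ORFs will arise by chance is easy to check on the back of an envelope. The sketch below assumes a completely random sequence with equal base frequencies, so that 3 of the 64 codons are stops; real genomes are not random and their base composition varies, so treat the output as an order-of-magnitude illustration. The function names and the 2 Mb genome size are hypothetical.

```python
STOP_FRACTION = 3 / 64  # TAA, TAG, TGA out of 64 codons, equal base frequencies


def prob_no_stop(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - STOP_FRACTION) ** n_codons


def expected_chance_orfs(genome_bp, min_codons):
    """Rough expected count of ATG-initiated chance ORFs of >= min_codons codons.

    Counts every possible start position on both strands, so nested starts
    inside the same ORF are counted separately (an overestimate).
    """
    start_positions = 2 * genome_bp  # both strands, all reading frames
    p_start = 1 / 64                 # chance that a given codon is ATG
    # After the ATG, the next (min_codons - 1) codons must all be non-stop.
    return start_positions * p_start * prob_no_stop(min_codons - 1)


if __name__ == "__main__":
    GENOME_BP = 2_000_000  # hypothetical bacterial genome size
    for codons in (50, 100, 150, 300):
        n = expected_chance_orfs(GENOME_BP, codons)
        print(f">= {codons} codons: about {n:.3g} chance ORFs expected")
```

Under these assumptions a random 100-codon stretch escapes a stop codon a little under 1% of the time, which across a whole genome still adds up to a fair number of spurious short ORFs, while long chance ORFs become vanishingly rare. That fits the observation that ORFans are heavily biased towards short sequences.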

So evolutionary biologists have proposed a puzzle, suggested solutions to that puzzle, tested these solutions and largely confirmed them. Testing is by no means over yet, but all the evidence so far confirms that ORFans pose no threat to evolutionary biology. Indeed, if a large proportion of non-artefactual ORFans are due to horizontal transfer from bacteriophages, as recent experiments suggest (Daubin and Ochman, 2004), then they may prove to be a valuable tool in understanding the phylogeny of bacteria, in the same way that families of LINEs, SINEs and pseudogenes have been. Far from being a threat to common descent, the nested hierarchies of singleton, lineage-specific and family-specific ORFans are exactly the pattern you would expect from common descent. A very small number of ORFans are also going to be de novo generated proteins. Biology is quite happy with the generation of new genes; it’s a process we have seen, and we don’t demand that all proteins come from the LUCA.

In summary: Dr. Nelson has relied on some short review papers from 2003 to claim that ORFan genes are a threat to common descent. In fact, the data from these review papers, let alone the other research papers from this time, are fully compatible with common descent. To claim otherwise is disingenuous in the extreme. Papers published since these 2003 reviews have confirmed the major explanations for the origin of these ORFans, and supported the common descent model. Dr. Nelson would do well to examine recent literature, rather than selectively rely on old reviews. Even a cursory glance at the literature of the past two years would show that evolutionary explanations suffice.

References:
Daubin V, Ochman H. Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res. 2004 Jun;14(6):1036-42.
Fukuchi S, Nishikawa K. Estimation of the number of authentic orphan genes in bacterial genomes. DNA Res. 2004 Aug 31;11(4):219-31, 311-313.
Nurminsky DI, Nurminskaya MV, De Aguiar D, Hartl DL. Selective sweep of a newly evolved sperm-specific gene in Drosophila. Nature. 1998 Dec 10;396(6711):572-5.
Okada H, Negoro S, Kimura H, Nakamura S. Evolutionary adaptation of plasmid-encoded enzymes for degrading nylon oligomers. Nature. 1983 Nov 10-16;306(5939):203-6.
Siew N, Fischer D. Twenty thousand ORFan microbial protein families for the biologist? Structure. 2003 Jan;11(1):7-9.
Siew N, Fischer D. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins. 2003 Nov 1;53(2):241-51.
Siew N, Azaria Y, Fischer D. The ORFanage: an ORFan database. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D281-3.
Siew N, Fischer D. Structural biology sheds light on the puzzle of genomic ORFans. J Mol Biol. 2004 Sep 10;342(2):369-73.
Siew N, Saini HK, Fischer D. A putative novel alpha/beta hydrolase ORFan family in Bacillus. FEBS Lett. 2005 Jun 6;579(14):3175-82.