Inordinately Fond of Viruses: ORFans and Intelligent Design


J.B.S. Haldane, when asked “What has the study of biology taught you about the Creator, Dr. Haldane?”, replied

“I’m not sure, but He seems to be inordinately fond of beetles.”

Discovery Institute Fellow Dr. Paul Nelson is inordinately fond of ORFans, genes unique to one species that appear to have no relatives in other species. He feels that these unique genes represent a significant challenge to evolutionary biology. However, he has not noticed that the distribution of ORFans implies that the designer is more enamoured of viruses than humans.

A very tiny mystery:
What are ORFans, and why should we care about them? ORFans take their name from Open Reading Frames (ORF’s). ORF’s are stretches of DNA that apparently code for proteins. ORFans are ORF’s that appear to occur in only one species. Note that I say “appear to”. The computer programs that are used to identify genes during whole genome assembly can falsely identify segments of DNA as ORF’s, and this can be a significant issue in some genomes. Also, our computer programs for identifying related genes can miss genes that have undergone rapid evolution.

An example is the rotating image below left (ORFan_3D_conservation_score_White_BG-2.gif). This is a 3D model of the protein Xc5848 from Xanthomonas campestris (it is also the static molecule above). Originally designated as an ORFan, it was identified as part of a large class of proteins by sophisticated structure analysis. The model is coloured by amino acid conservation, with red being highly conserved and blue being poorly conserved. The model is mostly red (i.e. it’s part of a highly conserved protein family, not an ORFan at all).

ORFans come in two classes, short (often less than 100 amino acids long), which are unlikely to represent real genes as they are usually much shorter than most real genes, and long (usually over 150 amino acids long), which are likely to be real genes. There are far more short ORFans than long ORFans.
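Why short ORF’s are suspect can be illustrated with a toy example. The sketch below is not how real gene-finding programs work; it is a deliberately naive scanner (the names `find_orfs` and `random_dna` are made up for this illustration) that looks for a start codon followed in-frame by a stop codon on one strand of purely random DNA. It assumes the standard genetic code's three stop codons and ignores everything a real annotation pipeline would consider.

```python
import random

# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) pairs for ATG...stop stretches on the forward strand.

    min_codons is the minimum length in codons, not counting the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next candidate in this frame
    return orfs


def random_dna(length, seed=0):
    """Random sequence with equal base frequencies (an assumption)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    dna = random_dna(100_000)
    # Compare how many chance "ORFs" clear a short and a long length cut-off.
    print("at least 30 codons: ", len(find_orfs(dna, min_codons=30)))
    print("at least 150 codons:", len(find_orfs(dna, min_codons=150)))
```

Running it on random sequence lets you compare how many chance "ORFs" pass a short cut-off with how many pass a long one. The cut-offs of 30 and 150 codons are only illustrative choices.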

Paul Nelson thinks that ORFans represent a major blow to evolutionary theory. To him they break attempts to determine phylogenies and throw doubt on the idea that all organisms descended from a common ancestor. I’ve dealt with that aspect before (see also here), and I won’t go into detail here, but I would like to simply re-iterate a few points. We have a number of explanations, based on evidence, for the existence of ORFans.

1) Some represent artifacts
2) Some represent rapidly evolving genes whose origin is obscured by the pace of evolution
3) Some represent genes horizontally transferred from organisms that have not been sequenced yet.
4) Some represent genuine, de novo genes.

Now, as I said, we have evidence for all of these explanations, and ORFans will represent a combination of all factors. For example, it has been estimated that about half of all short ORFans represent artifacts, but some do represent genuine protein coding genes. In the Firmicutes (a phylum of bacteria that includes many well known gut bacteria), a large percentage of the genuine short ORFans represent bacteriophage genes (although the confirmed proportion of viral genes in prokaryotes generally is somewhat smaller), and as we add more genomes we discover relatives for things we previously thought were unique.

A tiny mystery gets tinier:
In 2003, a fair percentage of the ORF’s found in fully sequenced prokaryotic genomes were ORFans. However, even back in 2003, it was apparent that as we sequenced more genomes we found more relatives for ORFans and fewer new ORFans.


Figure 1c of Siew & Fischer 2003, Proteins, 53:241-251, showing that the percent of the genome that is ORFans is decreasing, while the number of ORFans is flattening out.

A relentless fall of ORFans:
We have a lot more data now, and the extent of the fall in ORFans can be found by looking at the ORFan mine, a database of ORFans. As we add more genomes, we identify more relatives of things we thought were unique, and identify and purge more artifacts.

Consider the Escherichia coli genome. In 2003 the total ORFans (including the short ones, which are likely to be artifacts) in the E. coli genome constituted 5.5% of the genome, and long ORFans (things likely to be genes) represented 2.4% of the genome. By 2008, total ORFans and long ORFans represented 0.4% and 0.1% of the genome respectively. Consider also the Helicobacter pylori genome, going from 17% and 9% total and long ORFans in 2003 to 2.3% and 0.6% total and long ORFans in 2008.
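To put those falls on a common scale, here is a small sketch of the arithmetic. It uses only the percentages quoted above; the helper names `relative_drop` and `fold_reduction` are invented for this illustration.

```python
# ORFan percentages quoted in the text: (2003 value, 2008 value).
orfan_percent = {
    ("E. coli", "total"): (5.5, 0.4),
    ("E. coli", "long"): (2.4, 0.1),
    ("H. pylori", "total"): (17.0, 2.3),
    ("H. pylori", "long"): (9.0, 0.6),
}


def relative_drop(before, after):
    """Fraction of the earlier value that has disappeared."""
    return (before - after) / before


def fold_reduction(before, after):
    """How many times smaller the later value is."""
    return before / after


for (genome, kind), (y2003, y2008) in orfan_percent.items():
    print(
        f"{genome:10s} {kind:5s} ORFans: {y2003}% -> {y2008}% "
        f"({relative_drop(y2003, y2008):.0%} drop, "
        f"{fold_reduction(y2003, y2008):.1f}-fold)"
    )
```

On these figures the relative drop is roughly 93% (total) and 96% (long) for E. coli, and roughly 86% (total) and 93% (long) for H. pylori.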

If you look at all 60 of the genomes reported by Siew and Fischer in 2003, the total ORFans averaged 14%; by 2008 this was down to 6%. If you look at the genomes added after those 60 (i.e. all the latecomers, not those that were already characterised), their ORFan percentage is 7%. In 2003, the last 10 organisms to be added to the database had an average of 12% ORFans when first sequenced; in 2008, the last 10 organisms had 6% ORFans when first sequenced.

Even those figures may overestimate the number of ORFans. Of the 19 ORFans in the E. coli database, 10 are annotated to viral or conserved proteins. Of the ones I’ve investigated, there is significant sequence similarity to other proteins (e.g. the alleged ORFan NC_000913orf2361 is annotated to be a CPZ-55 prophage protein, forms a high significance phylogeny with other proteins, and even has a PFAM domain in it!).

[Figure: ORFan_tree.jpg] Some ORFans are not. The supposed ORFan NC_000913orf2361 is related to a whole range of conserved proteins.

So as we sequence new genomes, we are finding fewer and fewer ORFans. This is entirely consistent with the position that ORFans represent rapidly evolving proteins, horizontally transferred proteins and annotation artifacts rather than unique proteins inserted by an unknown designer by unknown mechanisms. Paul Nelson likes to emphasize the number of ORFans, as this is increasing. However, the pattern of increase is very instructive. Below I’ve plotted the total number of ORF’s and ORFans against the increasing number of full genomes sequenced, and the fold increase of ORF’s and ORFans with respect to the numbers of ORF’s and ORFans when only 15 genomes were sequenced (why do ID advocates never do this type of thing?).

You can clearly see that the rate of increase in ORFan number is dramatically slowing. When we reached 60 sequenced genomes, this resulted in a 4.5 fold increase of ORF’s over the numbers present at 15 genomes, but just over a doubling of ORFans. By the time we got to 330 genomes, ORF’s had increased 25 fold from the numbers at 15 genomes, but ORFans had increased less than eight fold. This is entirely consistent with the fact that as we add genomes, we find more relatives of these genes.

[Figure: Total vs Fold ORF's.png] ORFan numbers increase as we sequence more genomes, but ORF’s (real genes with known relatives) increase much, much faster. This is consistent with the majority of ORFans representing under-sampling of phylogenies. (Data taken from Siew and Fischer, 2003 and the ORFan mine.)
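The fold increases quoted above can be turned into a single number: how the ratio of ORFans to ORF’s has changed relative to the 15-genome baseline. Dividing the ORFan fold increase by the ORF fold increase gives exactly that ratio, whatever the raw counts were. The sketch below rounds “just over a doubling” to 2.0 and “less than eight fold” to 8.0, so its outputs are rough approximations rather than measured values; `relative_orfan_share` is a name made up for this illustration.

```python
# Fold increases relative to the 15-genome baseline, as quoted in the text.
# 2.0 ("just over a doubling") and 8.0 ("less than eight fold") are rounded
# stand-ins, so the results are approximations only.
fold_increase = {
    60: {"orfs": 4.5, "orfans": 2.0},
    330: {"orfs": 25.0, "orfans": 8.0},
}


def relative_orfan_share(orf_fold, orfan_fold):
    """ORFan-to-ORF ratio, expressed relative to the ratio at 15 genomes.

    (ORFans_n / ORFs_n) / (ORFans_15 / ORFs_15) = orfan_fold / orf_fold
    """
    return orfan_fold / orf_fold


for genomes, folds in fold_increase.items():
    share = relative_orfan_share(folds["orfs"], folds["orfans"])
    print(f"{genomes} genomes: ORFan-to-ORF ratio is about {share:.2f} "
          f"of its value at 15 genomes")
```

With these rounded inputs the ratio comes out at about 0.44 of the baseline at 60 genomes and about 0.32 at 330 genomes, which is the same downward trend as the percentages discussed earlier.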

Enter the virus:
Paul Nelson is now particularly taken with a paper from Fischer’s group that showed that around 38% of the ORF’s in complete virus genomes are ORFans. This figure seems to impress Paul. However, the same issues that applied to prokaryotic genomes apply to viral genomes.


As shown in figure 4 of Yin and Fischer (above), as the number of viral genomes sequenced increases, the percentage of ORFans drops as relatives are found (just like prokaryotic ORFans). The phage groups with the most “ORFans” are those that have the fewest sequences (just like prokaryotes, which suggests that sampling of genomes is the main issue).

Furthermore, 18% of alleged “ORFans” turn out to be horizontally transferred prokaryotic genes (just as a fair proportion of prokaryotic “ORFans” turn out to be horizontally transferred bacteriophage genes). Looking at the authors’ conclusions, we find them saying:

Because the current sampling of phages (and of bacterial genomes in general), is limited and biased towards particular groups, the percentage of ORFans in different phage groups varies significantly. This low sampling may be a factor contributing to the abundance of phage ORFans, but is not likely to be the only one. That is, even after many more genomes are sequenced, we expect to find a significant number of ORFans and near-ORFans, awaiting interpretation. There are also other possibilities to account for the ORFans’ origin, like rapid divergence after horizontal transfer (from hosts or from other viruses, from existent genomes or yet extinct genomes) or duplication.

Rapid divergence obscuring ancestry in rapidly evolving viruses is by no means unusual, and more careful sequence comparison will undoubtedly turn up more relatives (just as happened with prokaryotes).

Summary: So, the solutions to the ORFan “puzzle”, as outlined by Yin and Fischer (poor sampling, horizontal transfer, rapid evolution) follow the same lines as my previous Panda’s Thumb posts (I also included annotation errors, known to produce a proportion of alleged prokaryotic “ORFans”. These annotation errors are likely to be substantial in small genomes as well).

It is instructive to compare the number of ORFans in various genomes (as they currently stand). The human genome has 0% ORFans [see note], prokaryotes an average of around 7% and viruses around 30%. Now, it may be that ORFans represent artifacts, poor sampling and rapidly evolving genes (which would explain why rapidly evolving, under-sampled and exceedingly diverse groups like viruses have more ORFans than prokaryotes or humans).

Or the Designer really has an inordinate fondness for viruses.

Note: Paul Nelson objects to the paper that eliminated the last of the ORFans from the human genome (Clamp et al., 2007), as he claims that they did this on purely evolutionary reasoning. He is wrong; they also looked at whether these sequences were significantly different to random sequences, and whether they were expressed as protein. They weren’t and they aren’t. This is good evidence that they are artifacts.

Larry Moran has a good discussion of ORFans at the Sandwalk.