Junk DNA is still junk

The ENCODE project made a big splash a couple of years ago. It is a huge project not only to ask what the sequence of a strand of human DNA is, but to analyze and annotate it and try to figure out what it is doing. One of the very surprising results was that in the sections of DNA analyzed, almost all of the DNA was transcribed into RNA, which sent the creationists and the popular press into unwarranted flutters of excitement that maybe all that junk DNA wasn't junk at all, if enzymes were busy copying it into RNA. This was an erroneous assumption; as John Timmer pointed out, the genome is a noisy place, and coupled with the observation that the transcripts were not evolutionarily conserved, that suggested these were non-functional transcripts.

Personally, I fall into the "it's all junk" end of the spectrum. If almost all of these sequences are not conserved by evolution, and we haven't found a function for any of them yet, it's hard to see how the "none of it's junk" view can be maintained. There's also an absence of support for the intervening view, again because of a lack of evidence for actual utility. The genomes of closely related species have revealed very few genes added from non-coding DNA, and all of the structural RNA we've found has very specific sequence requirements. The all-junk view, in contrast, is consistent with current data.

Larry Moran was dubious, too: the transcripts could easily be artifactual.

The most widely publicized result is that most of the human genome is transcribed. It might be more correct to say that the ENCODE Project detected RNAs that are either complementary to much of the human genome or lead to the inference that much of it is transcribed.

This is not news. We've known about this kind of data for 15 years and it's one of the reasons why many scientists over-estimated the number of human genes in the decade leading up to the publication of the human genome sequence. The importance of the ENCODE project is that a significant fraction of the human genome has been analyzed in detail (1%) and that the group made some serious attempts to find out whether the transcripts really represent functional RNAs.

My initial impression is that they have failed to demonstrate that the rare transcripts of junk DNA are anything other than artifacts or accidents. It's still an open question as far as I'm concerned.

I felt the same way. ENCODE was spitting up an anomalous result, one that didn't fit with any of the other data about junk DNA. I suspected a technical artifact, or an inability of the methods used to properly categorize low frequency accidental transcription in the genome.

Creationists thought it was wonderful. They detest the idea of junk DNA. That the gods would scatter wasteful garbage throughout our precious genome by intent was unthinkable, so any hint that it might actually do something useful is enthusiastically seized upon as evidence of purposeful design.

Well, score one for the more cautious scientists, and give the creationists another big fat zero (I think the score is somewhere in the neighborhood of a big number requiring scientific notation to be expressed for the scientists, against a nice, clean, simple zero for the creationists). A new paper has come out that analyzes transcripts from the human genome using a new technique, and, uh-oh, it looks like most of the early reports of ubiquitous transcription were wrong.

Here's the authors' summary:

The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from "tiling" microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.

So, basically, they directly compared the technique used in the ENCODE analysis (the "tiling" microarray analysis) to more modern deep sequencing methods, and found that the old results were mostly artifacts of the protocol. They also directly examined the pool of transcripts produced in specific tissues, and asked what proportion of them came from known genes, and what part came from what has been called the "dark matter" of the genome, or what has usually been called junk DNA. The cell's machinery to transcribe genes turns out to be reasonably precise!

To assess the proportion of unique sequence-mapping reads accounted for by dark matter transcripts in RNA-Seq data, we compared the mapped sequencing data to the combined set of known gene annotations from the three major genome databases (UCSC, NCBI, and ENSEMBL, together referred to here as "annotated" or "known" genes). When considering uniquely mapped reads in all human and mouse samples, the vast majority of reads (88%) originate from exonic regions of known genes. These figures are consistent with previously reported fractions of exonic reads of between 75% and 96% for unique reads, including those of the original studies from which some of the RNA-Seq data in this study were derived. When including introns, as much as 92%-93% of all reads can be accounted for by annotated gene regions. A further 4%-5% of reads map to unannotated genomic regions that can be aligned to spliced ESTs and mRNAs from high-throughput cDNA sequencing efforts, and only 2.2%-2.5% of reads cannot be explained by any of the aforementioned categories.
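To make the bookkeeping in that passage concrete, here is a minimal sketch of how you might tally reads by category and turn the tallies into percentages. The category names and the counts are my own illustrative assumptions, picked only so the results fall inside the ranges quoted above; they are not the paper's data or its actual pipeline.

```python
def read_fractions(counts):
    """Convert per-category read counts into percentages of all reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to summarize")
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical tallies per 1,000 uniquely mapped reads (illustrative only,
# chosen to land inside the ranges quoted from the paper).
example_counts = {
    "exon": 880,           # exonic regions of known genes
    "intron": 45,          # intronic regions of known genes
    "est_supported": 50,   # unannotated, but aligned to spliced ESTs/mRNAs
    "unexplained": 25,     # none of the above: the "dark matter" remainder
}

fractions = read_fractions(example_counts)

# Reads accounted for by annotated gene regions = exons plus introns.
annotated = fractions["exon"] + fractions["intron"]

for category, pct in fractions.items():
    print(f"{category}: {pct:.1f}%")
print(f"annotated gene regions (exon + intron): {annotated:.1f}%")
```

The point of laying it out this way is just that the categories are exclusive and sum to the whole, so whatever is left after known genes and EST-supported regions is all that remains for truly unexplained transcription.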

Furthermore, when they looked at where the mysterious transcripts come from, they found that most arise from regions of DNA near known genes, not out of deep intergenic regions. This also suggests that they're artifacts, like an extended transcription of a gene, or products of other possibly regulatory bits, like pasRNA (promoter-associated small RNAs). There's a growing cloud of xxxRNA acronyms out there, but while these RNAs may be extremely useful, like siRNA, they're still tiny as a fraction of the total genome. Don't look for demolition of the concept of junk DNA here.

There clearly are still mysteries in there (they do identify a few novel transcripts that come up out of the intergenic regions), but these transcripts are small and rare, and the fact of their existence does not imply a functional role, since they could simply be byproducts of other processes. Demonstrating that they actually do something will require experiments in genetic perturbation.

The bottom line, though, is the genome is mostly dead, transcriptionally. The junk is still junk.

van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most "Dark Matter" Transcripts Are Associated With Known Genes. PLoS Biology 8(5):1-21.