The ENCODE delusion

I can take it no more. I wanted to dig deeper into the good stuff done by the ENCODE consortium, and have been working my way through some of the papers (not an easy thing, either: I have a very high workload this term), but then I saw this declaration from the Electronic Frontier Foundation.

On September 19, the Ninth Circuit is set to hear new arguments in Haskell v. Harris, a case challenging California's warrantless DNA collection program. Today EFF asked the court to consider ground-breaking new research that confirms for the first time that over 80% of our DNA that was once thought to have no function, actually plays a critical role in controlling how our cells, tissue and organs behave.

I am sympathetic to the cause the EFF is fighting for: they are opposing casual DNA sampling from arrestees as a violation of privacy, and it is. The forensic DNA tests done by police forces, however, do not involve sequencing the DNA; they only measure the lengths of fragments cut by site-specific enzymes from known variable stretches of repetitive DNA. Those lengths can indicate familial and even, to some degree, ethnic relationships, but not, as the EFF further claims, "behavioral tendencies and sexual orientation". Furthermore, the claim that 80% of our genome has critical functional roles is outrageously bad science.

This hurts because I support the legal right to genetic privacy, and the EFF is trying to support it in court with hype and noise; their opposition should be able to easily find swarms of scientists who will demolish that argument, and any scientifically knowledgeable judge should be able to see right through the exaggerations (maybe they're hoping for an ignorant judge?). That conclusion, that 80% of the genome is critical to function, is simply false, and it's the notoriously dishonest heart of ENCODE's conclusions.

And then there is this lovely little commercial for ENCODE, narrated by Tim Minchin, and portraying ENCODE as a giant cancer-fighting robot.

Oh, jebus…that was terrible and cringeworthy. It's not just the ridiculous exaggerations (the Human Genome Project also claimed that it would provide the answers to all of human disease, as has, to a lesser degree, almost every biomedical grant proposal, it seems), but that they invested in some top-notch voice talent and professional animation to promote some fundamentally esoteric science to the general public as a magic bullet…I mean, robot.

Scientists, don't do this. Do make the effort to communicate your work to the public, but don't do it by talking down to them and by portraying your work in a way that is fundamentally dishonest and misleading. If you watch that video, ask yourself afterward: if I hadn't read any of the background on that project, would I have the slightest idea what ENCODE was about from that cartoon? There was no usable information in there at all.

So what is ENCODE, actually? The name stands for Encyclopedia of DNA Elements, and it's the next step beyond the Human Genome Project. The HGP assembled a raw map of the genome, a stream of As and Gs and Cs and Ts, and dumped it in our lap and told us that now we have to figure out what it means. ENCODE attempts to break down that stream, reading it bit by bit, and identifying what each piece does; this part binds to a histone, for instance, or this chunk is acetylated in kidney cells, or this bit is a switch to turn expression of Gene X off or on. It tries to identify which genes are active or inactive in various cell types. It goes beyond the canonical sequence to look at variation between individuals and cell types. It identifies particular genetic sequences associated with Crohn's Disease or Multiple Sclerosis or that are modified in specific kinds of cancers.

ENCODE also looks at other species and does evolutionary comparisons. We can identify sequences that show signs of selection within the mammals, for instance, and ENCODE then maps those sequences onto proposed functions.

You know what? This is really cool and important stuff, and I'm genuinely glad it's being done. It's going to be incredibly useful information. But there are some unfortunate realities that have to be dealt with.

It's also drop-dead boring stuff.

I remember my father showing me a pile of maintenance manuals for some specific aircraft at a Boeing plant when I was a kid; these were terrifyingly detailed, massive books that broke down, bit by bit, exactly what parts were present in each sub-assembly, how to inspect, remove, replace, repair, and maintain a tire on the landing gear, for instance. It's all important and essential, but…you wouldn't read it for fun. When you had a chore to do, you'd pull up the relevant reference and be grateful for it.

That's ENCODE. It's a gigantic project to build a reference manual for the genome, and the papers describing it are godawful tedious exercises in straining to reduce a massive data set to a digestible message using statistics and arrays of multicolored data visualization techniques that will give you massive headaches just looking at them. That is the nature of the beast. It is, by necessity and definition, a huge reference work, not a story. It is the antithesis of that animated cartoon.

I'm uncomfortable with the inappropriate PR. The data density of the results makes reading the work a hard slog…but that's the price you have to pay for the volume of information delivered. But then…disaster: a misstep so severe, it makes me mistrust the entire data set. Not only are the papers dense, but I have no confidence in the interpretations of the authors (which, I know, is terribly unfair, because there are hundreds of investigators behind this project, and it's the bizarre interpretations of the lead that taint the whole).

I refer to the third sentence of the abstract of the initial overview paper published in Nature; the first big razzle-dazzle piece of information the leaders of the project want us to take home from the work. That 80%:

These data enabled us to assign biochemical functions for 80% of the genome.

Bullshit.

Read on into the text and you discover how they came to this startling conclusion:

The vast majority (80.4%) of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type.

That isn't function. That isn't even close. And it's a million light years away from "a critical role in controlling how our cells, tissue and organs behave". All that says is that any one bit of DNA is going to have something bound to it at some point in some cell in the human body, or may even be transcribed. This isn't just a loose and liberal definition of "function", it's an utterly useless one.
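To see why "at least one event in at least one cell type" is such a low bar, here is a minimal sketch in Python. Everything in it is hypothetical: the 1,000 bp toy genome, the assay names, and the intervals are made up, and I picked the numbers so the union lands on 80.4% purely to echo the paper's figure. It is not ENCODE's actual pipeline or data. The only point is that taking the union across assays and cell types can cover most of a genome even when no single assay covers more than a third of it.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in base pairs


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(intervals: List[Interval], genome_length: int) -> float:
    """Fraction of positions hit by at least one interval."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical toy genome and made-up assay hits (not real ENCODE data).
GENOME_LENGTH = 1000
hits_by_assay: Dict[str, List[Interval]] = {
    "rna_cell_type_a": [(0, 300)],
    "rna_cell_type_b": [(250, 500)],
    "broad_histone_cell_type_a": [(450, 700)],
    "tf_chip_cell_type_c": [(690, 804)],
}

# Coverage of each assay on its own.
per_assay = {
    name: covered_fraction(ivs, GENOME_LENGTH)
    for name, ivs in hits_by_assay.items()
}

# Coverage by "at least one event in at least one cell type".
all_hits = [iv for ivs in hits_by_assay.values() for iv in ivs]
union_fraction = covered_fraction(all_hits, GENOME_LENGTH)

for name, fraction in per_assay.items():
    print(f"{name}: {fraction:.1%}")
print(f"union of all assays: {union_fraction:.1%}")
```

Nothing in that union says whether removing any covered base would change anything about the cell or the organism; it only records that something was detected there, somewhere, once.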

Now this is all anyone talks about when describing this research: that it has found a 'function' for nearly all of human DNA (not true, and not supported by their data at all) and that it spells the demise of junk DNA, also not true. We know, for example, that over 50% of the human genome has a known origin as transposable elements, and that those sequences are basically parasitic and have no recognizable effect on the phenotype of the individual.

I don't understand at all what was going through the head of the author of that paper. Here's this awesome body of work he's trying to summarize, he's representing a massive consortium of people, and instead of focusing on the useful, if rather dry, data the work generated, he decides to hang it all on the sensationalist cross of opposing the junk DNA concept and making an extravagant and unwarranted claim of 80 going on 100% functionality for the entire genome.

Well, we can at least get a glimpse of what's going on in that head: Ewan Birney has a blog. It ended up confusing me more than the paper did.

For instance, he has a Q&A in which he discusses some of the controversy.

Q. Hmmm. Let's move onto the science. I don't buy that 80% of the genome is functional.
A. It's clear that 80% of the genome has a specific biochemical activity - whatever that might be. This question hinges on the word "functional" so let's try to tackle this first. Like many English language words, "functional" is a very useful but context-dependent word. Does a "functional element" in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of "functional" works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as "specific biochemical activity" - for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like "having a phosphodiester bond" would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, "broad" histone modifications, "narrow" histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.

Oh, jeez, straining over definitions—ultimately, what he ends up doing is redefining "functional" to not mean functional at all, but to mean simply anything that their set of biochemical assays can measure. It would have been far more sensible to use a less semantically over-loaded word or phrase (like "specific biochemical activity") than to court confusion by charging into a scientific debate about functionality that he barely seems to comprehend. It would have also conformed to the goals he claims to have wanted to achieve with public education.

ENCODE also had the chance of making our results comprehensible to the general public: those who fund the work (the taxpayers) and those who may benefit from these discoveries in the future. To do this we needed to reach out to journalists and help them create engaging stories for their readers and viewers, not for the readers of Nature or Science. For me, the driving concern was to avoid over-hyping the medical applications, and to emphasize that ENCODE is providing a foundational resource akin to the human genome.

Uh, "giant cancer-fighting robot", anyone? Ewan Birney's name is right there in the credits to that monument to over-hyping the medical applications.

I'll be blunt. I don't think Birney has a clue about the biology. So much of what he has said about this project sounds human-centered and biased towards gross misconceptions about our place in biology. "We are the most complex things we know about," he says, and seems to think that there is a hierarchy of complexity that correlates with the phylogenetic series leading to humans, where, for instance, fugu are irrelevant to the argument because they're not mammals. This is all nonsense. I would not be at all surprised to learn that the complexity of the teleost genome is significantly greater than that of the tetrapod genome; and there's nothing more complex about our genetics than that of a mouse. I get the impression of an extremely skilled technologist with almost certainly some excellent organizational skills, who is completely out of his depth on the broader conceptual issues of modern biology. And also, someone who is a total media disaster.

But I'm just a guy with a blog.

There is a mountain of material on ENCODE on the web right now — I've come late to the table. Here are a few reading recommendations:

Larry Moran has been on top of it all from day one, and has been cataloging not just the scientific arguments against ENCODE's over-interpretation, but some of the ridiculous enthusiasm for bad science by creationists.

T. Ryan Gregory has also been regularly commenting on the controversy, and has been confronting those who claim junk DNA is dead with the evidence: if organisms use 100% of their genome, why do salamanders have 40 times as much DNA as we do, and fugu only an eighth as much?

Read Sean Eddy for one of the best summaries of junk DNA and how ENCODE hasn't put a dent in it. Telling point: a random DNA sequence inserted into the human genome would meet ENCODE's definition of "functional".

Seth Mnookin has a pithy but thoughtful summary, and John Timmer, as usual, marshals the key evidence and makes a comprehensible overview.

Mike White summarizes the ENCODE project's abject media failure. If one of Birney's goals was to make ENCODE "comprehensible to the general public", I can't imagine a better example of a colossal catastrophe. Not only do the public and media fail to understand what ENCODE was about, but they've instead grasped only the completely erroneous misinterpretation that Birney put front and center in his summary.

You'll be hearing much more about ENCODE in the future, and unfortunately it will be less about the power of the work and more about the sensationalistic and misleading interpretation. The creationists are overjoyed, and regard Birney's bogus claims about the data as a vindication of their belief that every scrap of the genome is flawlessly designed.