A new announcement from the Human Genome Project

The Human Genome Project has reached another landmark: the effective completion of the euchromatic sequence. It's still not 100% done, but the remaining small bits are going to require some new tricks to ferret out. You may recall announcements all over the place back in 2001 that the genome had been sequenced, but that was the draft sequence; 90% of the euchromatic genome was done, but there were still about 150,000 gaps scattered through it. You have to think of this project as something like assembling a colossal jigsaw puzzle: when the draft was done, we had a pretty good idea of the structure of the picture, and maybe had the borders done, but there were still these broad patches of solid color that hadn't been pieced together yet. At this point, though, most of those have been filled in and the gaps are smaller and sparser.

Some numbers: the completed sequence so far consists of 2,851,330,913 nucleotides. There are only 341 gaps left in the sequence, and 33 of those are in the heterochromatin (the mildly boring, repetitive chunks of the genome, which correspond to those regions of solid color in a jigsaw puzzle), representing 198 megabases (Mb) of stuff that still has to be sequenced. In the euchromatin (the more interesting and complex stuff) there are more gaps, 308, but they are much smaller, so only 28 Mb of mystery remains. The total length of the genome is 3.08 Gb, with 2.88 Gb of it in the form of euchromatin.
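If you want to check that those figures hang together, here is a quick back-of-the-envelope tally in Python. It is only a sketch: it reads Mb and Gb as megabases and gigabases, treats the rounded gap sizes as exact, and assumes (my assumption, not something stated above) that all of the 2.85 billion finished nucleotides are euchromatic. The variable names are mine.

```python
# Figures quoted in the paragraph above.
total_gaps = 341
heterochromatin_gaps = 33
sequenced_bp = 2_851_330_913

# Rounded gap sizes, treated as exact for this rough check.
euchromatin_gap_bp = 28_000_000       # "28 Mb"
heterochromatin_gap_bp = 198_000_000  # "198 Mb"

# Gaps that are not heterochromatic must be euchromatic.
euchromatin_gaps = total_gaps - heterochromatin_gaps

# Assumption: every finished base counted above is euchromatic.
euchromatin_bp = sequenced_bp + euchromatin_gap_bp
genome_bp = euchromatin_bp + heterochromatin_gap_bp

# Fraction of the euchromatin that is finished under that assumption.
coverage = sequenced_bp / euchromatin_bp

print(f"euchromatic gaps: {euchromatin_gaps}")
print(f"euchromatin: {euchromatin_bp / 1e9:.2f} Gb")
print(f"whole genome: {genome_bp / 1e9:.2f} Gb")
print(f"euchromatin finished: {coverage:.1%}")
```

Under those assumptions the tally reproduces the 308 euchromatic gaps and the 2.88 Gb and 3.08 Gb totals, and it puts the finished fraction of the euchromatin at about 99%.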

The new, better-defined sequence allows for a more accurate count of total gene number, and that number has dropped once again. We're down to 20,000 to 25,000 protein-coding genes. Some may think that knocks us off our pedestal a bit more, but that sounds like plenty to me.

One thing that leaps out at anyone reading the announcement is the importance of evolution in analyzing and understanding the genome. They used alignment with the chimpanzee draft sequence, for instance, to search for deletions. They are identifying recent duplications by their degree of divergence from neighboring genes, and have found 1,183 new genes that have arisen since the human and rodent lineages split. They're tracking the death of genes by identifying sequences with small numbers of disabling mutations (we seem to be losing olfactory genes at a rapid clip, relative to rodents).

The bottom line is that the HGP has provided us with a better tool for all kinds of research. In the consortium's own words:

Nonetheless, the euchromatic human genome can now be regarded as effectively known. The accuracy and completeness of the current near-complete human genome sequence has important consequences for biomedical research. It allows systematic searches for the causes of disease—for example, to find all key heritable factors predisposing to diabetes or somatic mutations underlying breast cancer—with confidence that little can escape detection. It facilitates experimental tools to recognize cellular components—for example, detectors for mRNAs based on specific oligonucleotide probes or mass-spectrometric identification of proteins based on specific peptide sequences—with confidence that these features provide a unique signature. It allows sophisticated computational analyses—for example, to study genome structure and evolution—with confidence that subtle results will not be swamped or swayed by noisy data. At a practical level, it eliminates tedious confirmatory work by researchers, who can now rely on highly accurate information. At a conceptual level, the near-complete picture makes it reasonable for the first time to contemplate systems approaches to cellular circuitry, without fear that major components are missing.

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945.