Improving the Accuracy of Genomic Data

Imagine that you want to analyze the 3.2 billion bases of the human genome. If you recruited every undergraduate student at ASU, all 70,000 of us, to type those data into a spreadsheet, it would still take about 13 hours (3.2 billion bases split among 70,000 typists is roughly 46,000 bases each; at about one base per second, that works out to nearly 13 hours). So you develop a computer program that analyzes the data for you. But then you discover that your huge data set amplified small errors in your algorithm and gave you the wrong answer. This is the challenge facing evolutionary biologists as genome-scale data become the standard for constructing reliable phylogenies (see our previous posts about the new bird and insect phylogenies). Our lab, working under Dr. Reed Cartwright, has developed a novel method that quickly analyzes genomic data and produces an accurate phylogeny, improving upon previous techniques.

The giant panda genome was assembled using de novo techniques in 2010, but better methods of phylogeny construction are in development. Image: Wikipedia

Historically, scientists have compensated for potential inaccuracies in genome-scale data in two ways: by using better statistical tools to analyze the data after they have been acquired, or by acquiring fewer, more informative data in the first place.

In the first method, you start with sequenced genomes in the form of short fragments (about 100 base pairs) and develop computational algorithms to compare those sequences to a reference genome for reassembly, as Liu et al. did in their 2003 analysis of primate genomes. A reference genome is one that we know with a high level of confidence; the human genome, for example, is reliably known and often used as a reference. If a reference is unavailable or unreliable, you can instead have a computer program assemble the sequences from scratch, a process known as de novo assembly, which Li et al. used to construct the giant panda genome in 2010. These programs, called assemblers, use graph techniques (for example, de Bruijn graphs) to correct sequencing errors and to resolve repeated regions, which are harder to place with short sequences than with long ones (see the sketch below). Algorithms like these can greatly improve the accuracy of conclusions drawn from genomic data, but de novo assembly without a reference genome requires high-quality annotation of the sequences and, once the genome is reconstructed, time-consuming alignments of similar sequences to produce a phylogenetic tree.
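To make the de Bruijn idea concrete, here is a minimal Python sketch (our own toy example, not code from any of the assemblers mentioned above). Each read is chopped into overlapping k-mers, and each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix; an assembler then looks for paths through this graph that spell out the underlying sequence. The reads and the value of k are made up for illustration.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix -> suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy "reads" sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
for prefix, suffixes in sorted(de_bruijn_graph(reads, 4).items()):
    print(prefix, "->", ", ".join(suffixes))
```

A real assembler layers much more on top of this skeleton, in particular pruning low-coverage edges caused by sequencing errors and untangling the repeated regions mentioned above, which is where most of the difficulty lies.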

Alternatively, you could acquire fewer data in the first place. You would determine which markers in a genome are informative and necessary to answer your question, and then obtain only those data. By shrinking the data set and eliminating unnecessary information, you improve accuracy without having to implement sophisticated analytical techniques. McCormack et al. used this principle in 2012 to determine the tree of placental mammals from a targeted set of markers. However, the major drawback of this method is that markers chosen for a particular project or species most likely cannot be reused for other projects, and since recyclable genomic data would reduce the cost and time of phylogenomic studies, that lack of reusability is a significant limitation.
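As a toy illustration of what "informative" can mean, here is a minimal Python sketch (our own example, not the marker-selection method McCormack et al. actually used) that scans an alignment for parsimony-informative sites: columns where at least two different bases each appear in at least two taxa, so that the column can favor one tree topology over another. The alignment is invented for illustration.

```python
from collections import Counter

def parsimony_informative_columns(alignment):
    """Return indices of columns with >= 2 states, each present in >= 2 taxa."""
    informative = []
    for i in range(len(alignment[0])):
        counts = Counter(seq[i] for seq in alignment)
        # A site is parsimony-informative when at least two bases
        # are each shared by at least two sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            informative.append(i)
    return informative

# Toy alignment: four taxa, eight sites.
alignment = [
    "ACGTACGT",
    "ACGTACGA",
    "ACCTACGA",
    "ACCTTCGT",
]
print(parsimony_informative_columns(alignment))  # -> [2, 7]
```

Filtering a data set down to sites like these is one simple way to keep only the positions that can actually distinguish between candidate trees.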

Our lab is working on a program that constructs phylogenetic trees more quickly and easily than either of these methods. The program, called SISRS, combines genome assembly with the identification of homologous genes to rapidly reconstruct phylogenies without the need for a reference genome or annotation. In the next post, we'll go into detail about how SISRS works and what makes it a better way to analyze genomic data.

This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.