Recently in Evolution Category

Analyzing the Genome with Statistics

| 19 Comments

This is the third in a series of articles for the general public focused on understanding how species are related and how genomic data is used in research. Today, we talk about the challenges of using statistics to analyze phylogenomic data.

Suppose you were a door manufacturer trying to figure out the average height of a population living in a certain country. You might conduct an experiment where you ask a group of people to report their height. You would then assemble those measurements in a data set. But in order to study this data set and draw conclusions you would need to analyze it using statistics. For example, how tall should your door be in order to fit 95% of people in the country? How many people do you need to survey to accurately represent the total population? These questions can be answered with statistical analysis.
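To make the door question concrete, here is a minimal sketch in Python. It assumes we already have a list of reported heights; the simulated numbers (a mean of 170 cm and a spread of 10 cm) and all function names are made up for illustration, not real survey data. It finds a door height that fits 95% of the sample and shows how the uncertainty in the average shrinks as more people are surveyed.

```python
import math
import random
import statistics


def simulate_heights(n, seed=0, mean_cm=170.0, spread_cm=10.0):
    """Draw n hypothetical heights (cm) from a bell curve; mean and spread are invented."""
    rng = random.Random(seed)
    return [rng.gauss(mean_cm, spread_cm) for _ in range(n)]


def door_height_for_coverage(heights_cm, coverage=0.95):
    """Return the smallest sampled height that at least `coverage` of the sample fits under."""
    ordered = sorted(heights_cm)
    # Position of the person who is taller than or equal to `coverage` of the sample.
    index = max(0, math.ceil(coverage * len(ordered)) - 1)
    return ordered[index]


def standard_error(heights_cm):
    """Estimate how far the sample average is likely to sit from the population average."""
    return statistics.stdev(heights_cm) / math.sqrt(len(heights_cm))


if __name__ == "__main__":
    for n in (100, 4000):
        sample = simulate_heights(n)
        print(n, round(door_height_for_coverage(sample), 1), round(standard_error(sample), 2))
```

The standard error falls roughly with the square root of the number of people surveyed, which is one way to decide how many people are "enough" for a given level of precision.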

Because acquiring data from experiments can be costly and time-consuming, we often use small data sets to represent a larger population of interest. In our height experiment, we would not be able to ask every single person in the country his or her height. We would choose a group of people under the assumption that they accurately reflect the population as a whole. However, when we are trying to map out the evolutionary history of organisms using data from sequenced genomes (phylogenomics, which we talked about last time), we need to change our method of analysis.

Let’s look at the treeshrew, for instance. It looks like a rodent but actually shares some internal similarities with primates (studied by Sir Wilfrid Le Gros Clark in the 1920s), like brain anatomy and reproductive traits. To figure out if the treeshrew is more similar to rodents or primates, we could sequence its genome and, using statistics, compare its genes to those of rodents and primates. But typical statistical models are based on subsets of populations, while by definition, genomic sequencing gives us a complete data set - all of the treeshrew’s genes. These typical models may not be suitable for interpreting genomic data.

The treeshrew. Source: Wikipedia

Before reaching a conclusion about the treeshrew, or any set of data, scientists must consider precision and accuracy. Multiple measurements of the same quantity are precise if they are similar to each other. Another way of saying this is that their variance is small. On the other hand, measurements are accurate if they are close to the true value of what they are trying to measure. For genomic data, we need better statistical tools to ensure that the accuracy of our conclusions matches the precision characteristic of these huge data sets.

Larger data sets provide more precise conclusions than smaller ones. For example, when we ask more people to report their height, we are more confident that our sample represents the variability of the actual population. Similarly, we analyze more genes in the treeshrew’s genome to increase our confidence that our conclusion is precise. However, our results might not necessarily be accurate; big data sets may lead us to draw incorrect conclusions with high confidence. The treeshrew’s genome contains some genes that are more similar to rodents’ genes and some that are more similar to primates’ genes (Fan et al., Nie et al., and Xu et al.), and with so much data we could find that the treeshrew is most similar to either group with high confidence. We need analysis tools that will tell us which genes give the correct answer.

Why are conclusions from data sometimes inaccurate? Statistical biases are external factors that produce consistent error in our measurements. Biases have many sources, including faulty experimental design, violation of assumptions made in analyzing the data, and errors in the data collection process. Bias in our height experiment might arise if we unintentionally ask the height of more women than men, causing our estimate of the average height to be lower. But in the case of phylogenomics, we are likely to have biases because of our relative lack of knowledge about the genome: we don’t always know which genes to analyze or the correct way to model the data. For example, some models assume that evolution followed the same pattern throughout all time, but this most likely was not the case.

Furthermore, the process of genome sequencing and analysis itself may create error, especially in the reconstruction of the genome and the alignment of genes for comparison. If we are comparing the genome of the treeshrew to the genomes of primates and rodents, it is difficult for us to know which genes are correlated between species when we are looking at a data set of billions of points. We might use a probability model to determine correlated genes, but all models are at least somewhat incorrect and introduce bias. In smaller data sets, biases are offset by a low precision and relatively small confidence in reaching conclusions. However, in genomic-size data sets, even small biases can be amplified and lead to high confidence in the wrong answer and incorrect phylogenetic trees.
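The way a small bias turns into high confidence in the wrong answer can be seen in a toy version of the height survey. The sketch below is only an illustration under made-up assumptions: the average heights for men and women, the spread, and the 70% share of women in the survey are all hypothetical. It computes a rough 95% confidence interval for the average height from a small and a very large biased survey.

```python
import math
import random
import statistics

# Hypothetical population values in cm, chosen only for illustration.
MEN_MEAN, WOMEN_MEAN, SPREAD = 176.0, 163.0, 7.0
TRUE_MEAN = 0.5 * MEN_MEAN + 0.5 * WOMEN_MEAN  # population is assumed half men, half women


def biased_survey(n, share_women=0.7, seed=0):
    """Survey n people, but accidentally ask women more often than men."""
    rng = random.Random(seed)
    heights = []
    for _ in range(n):
        group_mean = WOMEN_MEAN if rng.random() < share_women else MEN_MEAN
        heights.append(rng.gauss(group_mean, SPREAD))
    return heights


def confidence_interval(heights, z=1.96):
    """Approximate 95% confidence interval for the average height."""
    centre = statistics.mean(heights)
    half_width = z * statistics.stdev(heights) / math.sqrt(len(heights))
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    print("true average:", TRUE_MEAN)
    for n in (20, 100_000):
        low, high = confidence_interval(biased_survey(n))
        print(n, round(low, 2), round(high, 2))
```

With only a few people, the interval is wide, so the bias is hidden inside our uncertainty. With a very large survey the interval becomes very narrow around the biased value, so we end up confidently excluding the true average.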

When analyzing phylogenomic datasets, we need to use analyses that are appropriate for large data sets. This will unlock the potential of phylogenomic research to draw unbiased conclusions, like figuring out the correct phylogenetic classification of the treeshrew (still a topic of controversy among evolutionary biologists). However, phylogenomics is such a young field that these tools do not yet exist. When they are developed, we can increase our chances of correctly classifying species’ relationships and discovering the true history of evolution.

For more detail, check out: “Statistics and Truth in Phylogenomics”, Kumar, Sudhir et al. Molecular Biology and Evolution (2011).

References:

Fan, Yu, et al. “Genome of the Chinese tree shrew.” Nature communications 4 (2013): 1426.

Nie, Wenhui, et al. “Flying lemurs - The ‘flying tree shrews’? Molecular cytogenetic evidence for a Scandentia-Dermoptera sister clade.” BMC biology 6.1 (2008): 18.

Xu, Ling, et al. “Evaluating the Phylogenetic Position of Chinese Tree Shrew (Tupaia belangeri chinensis) Based on Complete Mitochondrial Genome: Implication for Using Tree Shrew as an Alternative Experimental Animal to Primates in Biomedical Research.” Journal of Genetics and Genomics 39.3 (2012): 131-137.

Our next installment will cover some misused terminology in phylogenomics. This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

Lenticular clouds

| 11 Comments
IMG_1154Cloud_600.JPG

Interesting cloud formation, Boulder, Colorado. The camera is facing south, and the wind is coming from the west, or right.

One hour later, in Golden,

Philae craft lands on comet

| 70 Comments

Rosetta headquarters announced a few moments ago that the Philae lander is now sitting on the surface of the comet and transmitting data. Unfortunately, the European Space Agency is not exactly releasing a trove of pictures. I know this is not biology, but where did you think those hydrocarbons came from in the first place?

Phylogenomics: Deciphering a Billion-Piece Puzzle

| 146 Comments

This is the second in a series of articles for the general public focused on understanding how species are related and how genomic data is used in research. Today, we talk about phylogenomics, the application of whole genome sequencing to understand evolutionary relationships among species.

DNA Chemical Structure. Source: Madeleine Price Ball

The haploid human genome is 3.2 billion DNA bases long, and each base can be one of four nucleotides: A, T, C, and G. Uncoiled, the DNA in a single human cell would be 2 meters long, and the DNA in a human body would stretch from the sun to Pluto multiple times. With 3.2 billion bases, each person’s genome is unique, and this plays an essential role in shaping our physical and mental individuality. However, despite being unique, each human genome is very similar to every other, due to our shared ancestral heritage. Similarly, species that share a recent ancestral heritage also have similar genomes. Species that are distantly related are likely to demonstrate significant differences in their genomes. This is why, as we discussed last week, evolutionary biologists compare traits and genes to determine the relationships of different species.

Unfortunately, some genes give us the wrong answer about how species are related. A section of a gene can be identical for two species due to independent mutations. After all, any given base can only mutate into one of three other bases. Chances are the same mutation could happen twice, or multiple mutations can produce the same sequence. Consider two species that are distantly related; one contains an AGA fragment, while the corresponding fragment in the other species is TGT, i.e. they differ in 2 out of 3 positions. As these species evolve, by chance the first species may experience a change in the first position such that AGA → TGA, and the second species may experience a change in the third position such that TGT → TGA. Now, these two sequences look the same, so you might think the species share a recent common ancestor; however, it is only an accident of biology that they appear closely related. Because some fragments may be identical due to independent mutations and not shared ancestry, estimating species relationships using whole genomes is better than using just a few genes. The more information we have, the more likely we are to figure out species’ relationships correctly.
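The AGA and TGT example can be checked with a few lines of Python. This is just a sketch of the counting involved; the function names are made up for illustration, and real comparisons involve far longer sequences that must first be aligned.

```python
def count_differences(seq_a, seq_b):
    """Count the positions at which two equal-length DNA fragments differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("fragments must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def mutate(seq, position, new_base):
    """Return a copy of seq with the base at `position` (counting from 0) replaced."""
    return seq[:position] + new_base + seq[position + 1:]


if __name__ == "__main__":
    species_1, species_2 = "AGA", "TGT"
    print(count_differences(species_1, species_2))  # fragments before the mutations

    # Two independent mutations, one in each lineage.
    species_1 = mutate(species_1, 0, "T")  # first position:  AGA becomes TGA
    species_2 = mutate(species_2, 2, "A")  # third position:  TGT becomes TGA
    print(species_1, species_2, count_differences(species_1, species_2))
```

The fragments start out differing at two of three positions and end up identical, even though no mutation was inherited from a shared ancestor.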

The cost to sequence whole genomes has fallen from $100 million to $1000 in just the past twelve years. It now takes days to sequence a genome compared to the 13 years it took for the first human genome. The challenge now is not to obtain the data but to compare all the billions of base pairs in one genome to those in another. Current sequencing methods, while fast, can only read the genome by dividing it into millions of short fragments, which must be reassembled like an enormous puzzle. Researchers then have to figure out which genes correspond to one another in different species’ genomes. These comparisons are challenging because genes in one genome might be in a different order, on different chromosomes, or missing completely in another species’ genome.

Biologists are beginning to use genomic information to understand how species are related and measure how fast or slowly different genes evolve. This in turn allows us to understand how evolution happens. For example, using genomic information we can figure out how genes mutate, characterize and diagnose genetic diseases, and track harmful pathogens. But before that can happen, we need to address the difficulties of analyzing these large genomic datasets. You might think that more data is always better, but having a lot of data can lead us to have very high confidence in the wrong answer. In a pool of thousands of genes, we need to find the ones that tell us the right answer.

Next week, we’ll discuss statistical challenges associated with big data analysis, especially as it relates to phylogenomics. This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

I started this post thinking I’d write a review of Andreas Wagner’s recent book “Arrival of the Fittest: Solving Evolution’s Greatest Puzzle” (links below), an engrossing book about how biological innovation arises from the structure of metabolic, genotype, and protein networks, and how robustness–the stability of phenotypes in the face of underlying genetic variability–is critical in evolutionary innovations. But there are several excellent reviews already out there, so another would be redundant. I’ll mention only a couple of points I think worth emphasizing below the fold.

Phelsuma laticauda

| 5 Comments

Photograph by Tony Gamble.

Photography contest, Honorable Mention.

Gamble.Phelsuma_laticauda_dorsum.jpg

Phelsuma laticauda – gold dust day gecko.

The Family Tree of Life

| 92 Comments

In the next few weeks, we’ll be posting a series of articles for the general public focused on understanding how species are related and how genomic data is used in research. We start with a background on phylogenetic trees.

Imagine you could go back in time and meet your great grandmother or even your great-great-great-great-great grandmother, when they were your age. Would they look like you? Or would they look more like your siblings or cousins? Maybe you would all look a little different. Scientists try to figure out how the distant ancestors of apes, other animals, plants, and all organisms living today looked and behaved, much in the same way that people use a family tree to trace their ancestry.

primate-family-tree-780x520_0.gif

The common ancestor of great apes lived about 18 million years ago. Source: Smithsonian National Museum of Natural History http://humanorigins.si.edu/evidence/genetics

In evolutionary biology, scientists use a type of tree called a “phylogenetic tree” to organize the history of how species descended from common ancestors. The more recently two species shared a common ancestor on the phylogenetic tree, the more closely the two are related.

Take the phylogenetic tree of primates, for example. The common ancestor of apes lived about 18 million years ago. But over time, this one group branched off to form many different species, including humans, which have their own separate branch on this tree.

How did so many unique species develop from one ancestor? New branches formed by a process known as divergence. When groups of ancient organisms became geographically isolated from one another, either through migration or geologic events like earthquakes, each group began to develop its own unique set of physical attributes. Sometimes, by chance, a change in a characteristic enabled an individual to survive better in its environment and produce more offspring.

Perhaps individuals in one group with larger arms were better able to break open the hard-shelled fruits that were common in one region, while some individuals in another group had the ability to travel more easily through tall trees that offered protection from predators. Whatever the reason may have been, selection favored genetic differences that improved survival. Over time, this gradual process of isolation and selection produced distinct species, which in turn branched into more species.

The end result of divergence is many species, related in a tree-like fashion, and we display these relationships using phylogenetic trees. Scientists now use increasingly sophisticated methods to determine how species are related and build phylogenetic trees. In the past, scientists built these trees simply by comparing physical traits, like how many limbs an organism has or whether it has a tail. But with the recent surge in fast and affordable gene sequencing technologies, researchers today can directly compare species’ DNA to determine how they are related.

But analyzing entire genomes, with billions of DNA base pairs, presents its own unique set of challenges, and researchers often struggle to determine if the DNA differences they find between species are truly significant or are simply due to common variability. As computer software and statistical analysis become more adept at handling these challenges, our understanding of species’ relationships could change — providing exciting new insights into our family tree of life.

Check back next week when we discuss the differences between studying small and large datasets, and the challenges associated with big data analysis. This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

IMG_4248Eclipse_600.JPG

Pinhole-camera images of solar eclipse formed by spaces between leaves in canopy. According to Jon Grepstad, this phenomenon was explained by Aristotle. The eclipse is just ending; the picture was as close to total as it got here (Boulder, Colorado).

Aeshna cyanea

| 3 Comments

Photograph by Marilyn Susek.

Photography contest, Honorable Mention.

Susek.Dragon_Fly.jpg

Aeshna cyanea – southern hawker.

Beginning this week, we will run photographs every other Monday, so no picture next week; we no longer have enough honorable mentions and other miscellaneous photographs to continue posting a photograph every week. But polish your lenses (very carefully) and keep an eye out for the contest in the summer.

Cupido comyntas

| 1 Comment

Photograph by Robin Lee-Thorp.

Photography contest, Honorable Mention.

Lee-Thorp.Eastern Blue.JPG

Cupido comyntas – eastern tailed-blue butterfly.

Larus delawarensis

| 8 Comments
IMG_1104Gull_600.JPG

Larus delawarensis – ring-billed gull, Boulder, Colorado. There is right now a fairly large flock at Walden Ponds east of Boulder. They are too far away to get a picture, unless you like snapshots of an array of gray-and-white ellipses. But this one very kindly landed in a parking lot and posed long enough to enable this portrait.

On August 14, William Dembski spoke at the Computations in Science Seminar at the University of Chicago. Was this a sign that Dembski’s arguments for intelligent design were being taken seriously by computational scientists? Did he present new evidence? There was no new evidence, and the invitation seems to have come from Dembski’s Ph.D. advisor Leo Kadanoff. I wasn’t present, and you probably weren’t either, but fortunately we can all view the seminar, as a video of it has been posted here on YouTube.

It turns out that Dembski’s current argument is based on two of his previous papers with Robert Marks (available here and here) so the arguments are not new. They involve considering a simple model of evolution in which we have all possible genotypes, each of which has a fitness. It’s a simple model of evolution moving uphill on a fitness surface. Dembski and Marks argue that substantial evolutionary progress can only be made if the fitness surface is smooth enough, and that setting up a smooth enough fitness surface requires a Designer.

Briefly, here’s why I find their argument unconvincing:

  1. They consider all possible ways that the set of fitnesses can be assigned to the set of genotypes. Almost all of these look like random assignments of fitnesses to genotypes.
  2. Given that there is a random association of genotypes and fitnesses, Dembski is right to assert that it is very hard to make much progress in evolution. The fitness surface is a “white noise” surface that has a vast number of very sharp peaks. Evolution will make progress only until it climbs the nearest peak, and then it will stall. But …
  3. That is a very bad model for real biology, because in that case one mutation is as bad for you as changing all sites in your genome at the same time!
  4. Also, in such a model all parts of the genome interact extremely strongly, much more than they do in real organisms.
  5. Dembski and Marks acknowledge that if the fitness surface is smoother than that, progress can be made.
  6. They then argue that choosing a smooth enough fitness surface out of all possible ways of associating the fitnesses with the genotypes requires a Designer.
  7. But I argue that the ordinary laws of physics actually imply a surface a lot smoother than a random map of sequences to fitnesses. In particular if gene expression is separated in time and space, the genes are much less likely to interact strongly, and the fitness surface will be much smoother than the “white noise” surface.
  8. Dembski and Marks implicitly acknowledge, though perhaps just for the sake of argument, that natural selection can create adaptation. Their argument does not require design to occur once the fitness surface is chosen. It is thus a Theistic Evolution argument rather than one that argues for Design Intervention.
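The contrast between a “white noise” surface and a smoother one can be illustrated with a toy simulation. The sketch below is not the model from the Dembski and Marks papers; it assumes a hypothetical genome of 10 sites with two states each, and uses a simple additive surface (each site contributes independently to fitness) as one example of a smooth landscape. Hill climbing here means repeatedly moving to the fittest single-mutation neighbor until no neighbor is fitter.

```python
import itertools
import random

GENOME_LENGTH = 10  # hypothetical toy genome: 10 sites, each 0 or 1


def all_genotypes(length=GENOME_LENGTH):
    """List every possible genotype as a tuple of 0s and 1s."""
    return list(itertools.product((0, 1), repeat=length))


def white_noise_landscape(seed=0, length=GENOME_LENGTH):
    """Assign each genotype an independent random fitness."""
    rng = random.Random(seed)
    return {g: rng.random() for g in all_genotypes(length)}


def additive_landscape(length=GENOME_LENGTH):
    """Smooth surface: every site contributes to fitness independently of the others."""
    return {g: sum(g) / length for g in all_genotypes(length)}


def neighbours(genotype):
    """Yield every genotype that differs from this one at exactly one site."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]


def hill_climb(landscape, start):
    """Move to the fittest neighbour until none is fitter; return (peak, steps taken)."""
    current, steps = start, 0
    while True:
        best = max(neighbours(current), key=landscape.__getitem__)
        if landscape[best] <= landscape[current]:
            return current, steps
        current, steps = best, steps + 1


def count_peaks(landscape):
    """Count genotypes that are fitter than all of their single-mutation neighbours."""
    return sum(
        all(landscape[n] < landscape[g] for n in neighbours(g))
        for g in landscape
    )


if __name__ == "__main__":
    start = (0,) * GENOME_LENGTH
    for name, surface in (("white noise", white_noise_landscape()),
                          ("additive", additive_landscape())):
        peak, steps = hill_climb(surface, start)
        print(name, "peaks:", count_peaks(surface), "steps:", steps,
              "final fitness:", round(surface[peak], 3))
```

On the additive surface there is a single peak and the climb reaches it from anywhere. On the white noise surface there are many local peaks, so a climb typically stalls on one nearby after only a few steps.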

That’s a lot of argument to bite off in one chew. Let’s go into more detail below the fold …

Apis mellifera

| 8 Comments
IMG_4085_A_Mellifera_600.jpg

Apis mellifera – western or European honeybee, dining along with others on a milkweed flower. Apparently a melanic form, because Bugguide assures me that it is “just a dark one.”

Noctilucent clouds

| 5 Comments

Photograph by Kari Tikkanen.

Photography contest, Honorable Mention.

Tikkanen.Noctilucent_Clouds.jpg

Noctilucent clouds. Mr Tikkanen writes that these “are bluish clouds located in the mesosphere at altitudes of around 80 kilometers. Relative recent appearance and their gradual increase may be linked to climate change.”

Brachystola magna

| 20 Comments

Photograph by Ralph Arvesen.

Photography contest, Honorable Mention.

Ralph.Arvesen - Plains Lubber (Brachystola magna Girard).jpg

Brachystola magna – plains lubber, or western lubber.

Alluvial fan

| 11 Comments
IMG_4151AlluvialFan_600.JPG

Alluvial fan created by the torrential rainfall 1 year ago, as seen from the Visitor Center, Trail Ridge Road, Rocky Mountain National Park, Colorado, September, 2014. The meander at the bottom of the screen passes through the bed of Fan Lake, which was formed in 1982 when the Lawn Lake Dam burst and inundated the City of Estes Park.

Canis lupus baileyi

| 12 Comments

Photograph by Dan Stodola.

Photography contest, Honorable Mention.

Stodola.MexicanWolf.jpg

Canis lupus baileyi – Mexican wolf, Brookfield Zoo, Illinois.

Tradescantia occidentalis

| 6 Comments

Photograph by Rob Dullien.

DullienTradescantia occidentalis_600.jpg

Tradescantia occidentalis – prairie or western spiderwort, near Coyote Buttes, Arizona, May, 2014.

Lonicera X bella

| 2 Comments

Honeysuckle, by Richard Meiss.

Photography contest, Finalist.

Meiss-Honeysuckle_Second_Flowering.jpg

Lonicera X bella – Asian bush honeysuckle. Mr. Meiss writes, “This photo shows the coexisting ripe berries and new flowers of the Asian bush honeysuckle, an invasive species in the American midwest. This ‘second flowering’ in mid-September was induced by the very hot and dry summer of 2012. The phenomenon, an adaptation to environmental stress, was also widely noted in the British Isles; its prevalence is likely related to global warming. In this case, it may give a ‘leg up’ to an already-troublesome invasive species.”

Eclipse

| 7 Comments

Photograph by Keith Barkley.

Photography contest, Finalist.

Barkley.Eclipse.jpg

Solar eclipse, May 20, 2012. Mr. Barkley writes, “I lucked out that the eclipse was still going on during local sunset. One of the few eclipse images you will see that was taken without a sun-viewing filter on the lens.”

About this Archive

This page is an archive of recent entries in the Evolution category.

