Recently in Phylogenetics Category

One thing I’ve loved about living in Australia this past year is how much more generally pro-science the culture seems to be (PT blogmeister Reed Cartwright was just in Canberra to visit collaborators, but sadly he forgot Prof. Steve Steve). We have the annual Australian National Science Week coming up next month – can you even imagine having a National Science Week in the United States?

Another thing I’ve loved is how there seem to be many independent media outlets interested in science. I got to write a short popular article on the Evolution of Antievolutionism paper, which ended up on the cover of Australasian Science, for instance, and participate in several other talks or radio shows.

The most recent radio show was:

Update on the Tree of Birds


The tree of birds just got a bit more accurate with a study published last October. We first covered the ever-evolving tree in 2014, when we posted about a study in Science magazine that used phylogenomics and thousands of genes sequenced from 48 bird species to produce what was thought to be the most accurate phylogenetic tree of birds to date (Jarvis et al., 2014; see their tree here). Since then, a different team of scientists has published a new phylogeny of birds that it claims is the most comprehensive (Prum et al., 2015; see their tree below). So what is the difference between these two trees and how they were constructed, and which is more accurate?

The biggest difference between the methods of the two studies is the amount of data used. In the Jarvis et al. study, the authors sequenced the whole genomes of 48 bird species and aligned thousands of genes. But Prum et al. criticize this methodology as too “sparse” a sampling; instead, they used 198 bird species and two crocodile species. Because sequencing that many whole genomes would be costly and time-consuming, Prum et al. developed genetic markers that targeted highly conserved “anchor” regions of vertebrate genomes – regions that did not change much over many years. Using this technology, the new tree of birds could be developed with only about 400 genetic regions instead of the thousands of genes in the previous study.

If there is a tradeoff between analyzing more genes or more species, is it more accurate to compare fewer genes between more species, or more genes between fewer species? One is not inherently better than the other, but rather, the way in which each is used relative to common issues in constructing a phylogenetic tree determines accuracy of the tree.

One such issue is distinguishing important genetic signals from noise. Genomic data contains a certain amount of “phylogenetic signal,” the informative genes that determine lineage. This signal must be differentiated from non-phylogenetic signal: genes that falsely suggest certain relationships. For example, non-phylogenetic signal can arise because species divergence events that happened close together in time are difficult to distinguish, or when species that diverged from a common ancestor a long time ago independently develop similar traits (called homoplasy). A 2011 article in PLOS Biology analyzed published phylogenetic trees and noted that merely adding more genes did not improve their accuracy because adding genes amplifies all signal (non-phylogenetic and phylogenetic alike).


Additionally, a phylogeny will only be accurate if the orthologous genes (those genes shared between species that were inherited from a common ancestor) are correctly identified. That depends on the ability of software to distinguish orthologous genes from similar sequences in different species that are not orthologous but rather xenologous (transferred via horizontal gene transfer instead of inherited from the common ancestor) or paralogous (resulting from duplication of a gene). (Read more about orthologous, xenologous, and paralogous genes here.)

The model of evolution that researchers choose to use in their analysis can also greatly influence phylogeny accuracy. The PLOS article authors analyzed models and found some have difficulty detecting nucleotide substitutions, resulting in trees that are dominated by non-phylogenetic signal.

While analyzing a larger set of species won’t help when a model of evolution is inadequate or software has issues identifying orthologous genes, it can help with the issue of non-phylogenetic signal. Increasing the number of species in a study generally increases the phylogenetic signal-to-noise ratio, making it easier to detect substitutions that can lead to homoplasy, and also can improve accuracy by breaking up long branches. But the PLOS article states that it is not enough to just add more species; researchers should analyze more species that evolve slowly and comprise outgroups closely related to the group of interest.

Thus, including more species to construct a phylogenetic tree may be beneficial for tree accuracy, but only as long as methods for determining orthologous genes and modeling evolution are sufficient, and the chosen species are appropriate. Because Prum et al. looked at more species while keeping these important factors in mind, and developed quality genetic markers that analyze enough genetic regions to determine phylogenetic relationships, they argue their tree is the most accurate yet. It is a convincing argument for the moment, but phylogenetic analysis can always be improved with better software and models, and the tree of birds (and the tree of life in general) will be constantly revised in the future as these methods improve.

The latest tree of birds presents a few differences from the Jarvis et al. tree, some of which Prum et al. suggest resulted from their larger sample size. One of the most striking differences is in the classification of the major bird groups. Jarvis et al. propose that the initial divergence of a highly debated branch of birds, called Neoaves, resulted in two main groups: Columbea, containing birds like doves and flamingoes, and Passerea, containing a wide variety of species (parrots, falcons, penguins, and eagles, to name a few). But Prum et al.’s tree instead splits Neoaves into five groups: Strisores (nightjars, hummingbirds, and frogmouths), Columbaves (cuckoos, pigeons, and sandgrouse), Gruiformes (cranes, coots, and rails), Aequorlitornithes (grebes, flamingoes, and shorebirds), and finally the very diverse Inopinaves (owls, vultures, and parrots). Also, Jarvis et al. place pigeons, mesites, and sandgrouse in their own branch (Columbea) apart from the rest of Neoaves, while Prum et al. rejected that for their five-group system. The Prum et al. classification of Neoaves is likely the most accurate because they included more species that diverged close to speciation events (called nodes), which is especially important when the time between multiple nodes is short.

These findings bring up some new ideas about bird evolution and also support some old ones. For one, the new tree of birds developed by Prum et al. supports a previous theory that swifts and hummingbirds, neither of which is nocturnal, evolved from a group of birds that had been nocturnal for 10 million years (Jarvis et al. find a similar relationship). Also, the new finding of the group consisting of waterbirds and shorebirds (Aequorlitornithes) suggests that the divergence of birds into different environments occurred with some level of restriction, known as evolutionary constraint. But as interesting and exciting as the new tree and its implications for bird evolution are, it is unlikely to be the final word on bird evolution. Other studies have also been published examining parts of the bird tree (like Rocha et al. on the bird genus containing woodcreepers and Bell et al. on an extinct group of Cretaceous birds). A new, more accurate complete tree of birds that supports or rejects these theories may be only another year away. Such is the nature of scientific research.

This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.

Remember the two studies published at the end of last year that produced groundbreaking evolutionary trees of birds and insects? The researchers in these studies used data from whole sequenced genomes to construct these more reliable trees. This is a practice that is somewhat novel but gaining importance in phylogenomics. But we’ve talked about how large data sets, like genomes, can lead to incorrect conclusions if analyzed improperly. How did the researchers avoid this problem?

In the last post, we discussed the two major methods of improving genomic analysis. First, scientists can determine the informative subset of a genome and only obtain and analyze that set. Alternatively, they can develop algorithms to compare whole genomes to a well-established reference genome. But these methods have their drawbacks; subsets of genomes often are not reusable in other experiments and reference genomes, if unavailable, can take a lot of time and work to develop.

Genomes can consist of several billions of nucleotides, so we need different methods of analyzing such a large dataset.

Image source: Boise State University

That’s why our lab is developing SISRS (pronounced “scissors,” Site Identification from Short Read Sequences), a new software program that can analyze genomic data in a matter of days. This NSF- and ASU-funded software eliminates the need for a reference genome and does not require genetic markers, which can take months to determine. Thus, SISRS greatly reduces the time, effort, and cost required to construct a phylogenetic tree from genomic data.

So how does SISRS achieve all of this? From data sequenced via next-generation sequencing, SISRS randomly constructs a subset of data using reservoir sampling. The software then uses de novo assemblers (for example, a program called Velvet) to construct a composite genome from this subset to act as a reference. Because sequences shared among species occur frequently in the collected data, they are more likely to be chosen during the random sampling process than sequences unique to one species. Thus, the composite genome contains genetic information from each species and is a suitable reference genome.
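For readers who have not met reservoir sampling before, here is a minimal sketch of the textbook version (often called Algorithm R). It picks a fixed-size random subset from a stream without knowing the stream's length in advance, which is what makes it attractive for very large read files. The function name, the stand-in "reads," and the subset size are assumptions made for illustration; this is not SISRS's own code.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both endpoints
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stand-in for a stream of sequencing reads.
reads = (f"read_{n}" for n in range(10_000))
subset = reservoir_sample(reads, k=100, rng=random.Random(42))
print(len(subset))
```

Because every read has the same chance of being kept, sequences that occur many times in the pooled data (for example, those shared among species) are more likely to be represented in the subset than sequences that occur only once.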

Once the composite reference genome is assembled, SISRS aligns the raw data to the reference. Some species may be missing data in sites, which could be due to several reasons: a gene may not be present in all genomes, there could be variable regions of the genome to which the reference does not align well, or there could have been error in the genome sequencing process. SISRS removes these sites that are missing too much information and filters out other sites that may produce errors (like sites with paralogous, or duplicate, genes). Finally, SISRS outputs the phylogenetically informative sites for phylogeny construction.
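The filtering step can be pictured with a small sketch. It assumes an alignment stored as one string per species, treats N, - and ? as missing data, and uses the common "parsimony-informative" definition (at least two different bases, each seen in at least two species). The function name, the threshold, and the toy alignment are made up for illustration; SISRS's actual filters, including its handling of paralogs, may well differ.

```python
from collections import Counter
from typing import Dict, List

MISSING = {"N", "-", "?"}  # assumed symbols for missing data


def informative_sites(alignment: Dict[str, str], max_missing: int) -> List[int]:
    """Return positions with at most max_missing gaps that are parsimony-informative."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    keep = []
    for pos in range(lengths.pop()):
        column = [seq[pos].upper() for seq in alignment.values()]
        # Drop sites where too many species lack data.
        if sum(base in MISSING for base in column) > max_missing:
            continue
        counts = Counter(base for base in column if base not in MISSING)
        # Keep sites where at least two bases each occur in two or more species.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            keep.append(pos)
    return keep


# Toy alignment, invented purely to exercise the function.
toy = {
    "human":   "ACGTA",
    "chimp":   "ACGTA",
    "gorilla": "ATGNC",
    "orang":   "ATNNC",
    "macaque": "ATGNG",
}
print(informative_sites(toy, max_missing=1))  # [1, 4]
```

In the toy alignment, position 3 is dropped for missing too much data, positions 0 and 2 are dropped because they do not vary, and positions 1 and 4 are kept.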

To verify SISRS’ effectiveness, our team tested it with the genomic data of primates, whose phylogenetic tree is well-established. SISRS reconstructed the tree with 100% accuracy. Along with genomes, SISRS worked with transcriptomes (the complete set of RNA), estimated the mammal phylogeny very well, and showed promising preliminary results of estimating species divergence times.

SISRS is still under development, and future improvements will enable the program to analyze larger data sets more rapidly. SISRS makes it possible to analyze genomic data quickly, efficiently, and accurately with minimal work. As we continue to improve the software, we welcome feedback from anyone working in the field of phylogenetics; SISRS is available open-source here. We expect this software will have a major impact on phylogenetic analysis.

For more detail about SISRS, click here.


A big story in the press today. Scientists – mechanical engineers and physicists, one working for Boeing with his office only a few miles from my home – show that the evolution of airplanes works the same way as the evolution of organisms:

The evolution of airplanes

A. Bejan, J. D. Charles and S. Lorente

J. Appl. Phys. 116, 044901 (2014);

(fortunately this paper can be downloaded for free).

They make allometric plots of features of new airplane models, log-log plots over many orders of magnitude. The airplanes show allometry: did you know that a 20-foot-long airplane won’t have 100-foot-long wings? That you need more fuel to carry a bigger load?
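For readers unfamiliar with allometric plots: a straight line on a log-log plot corresponds to a power law y = a·x^b, and the slope of the line is the exponent b. A minimal sketch of such a fit follows, using ordinary least squares on the logarithms. The function name and the numbers are assumptions; the data are generated from an exact power law purely to show that the fit recovers the exponent, and are not measurements from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y); return (a, b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx                 # slope on the log-log plot = exponent
    a = math.exp(my - b * mx)     # intercept on the log scale = log(a)
    return a, b


# Synthetic points spanning several orders of magnitude (not real data).
sizes = [10.0 ** k for k in range(6)]
loads = [0.5 * s ** 0.75 for s in sizes]
a, b = fit_power_law(sizes, loads)
print(round(a, 6), round(b, 6))
```

Ordinary least squares of this kind treats every point as an independent sample, which is the assumption that matters when the points are related by descent or by borrowing.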

But permit me a curmudgeonly point: This paper would have been rejected in any evolutionary biology journal. Most of its central citations to biological allometry are to 1980s papers on allometry that failed to take the phylogeny of the organisms into account. The points plotted in those old papers are thus not independently sampled, a requirement of the statistics used. (More precisely, their error residuals are correlated.) Furthermore, cultural artifacts such as airplanes do not necessarily have a phylogeny, as they can borrow features from each other in massive “horizontal meme transfer”. In either case, phylogeny or genealogical network, statistical analysis requires us to understand whether the points plotted are independent.

The paper has impressive graphs that seem to show trends. But looking more closely we notice that neither axis is actually time. If I interpreted the graphs as trends, I would conclude that birds are getting bigger and bigger, and that nobody is introducing new models of small airplanes.

At least we may rejoice that the authors are not overly shy. They make dramatic statements on the implications for biology:

The engine mass is proportional to the body size: this scaling is analogous to animal design, where the mass of the motive organs (muscle, heart, lung) is proportional to the body size. Large or small, airplanes exhibit a proportionality between wing span and fuselage length, and between fuel load and body size. The animal-design counterparts of these features are evident. The view that emerges is that the evolution phenomenon is broader than biological evolution. The evolution of technology, river basins, and animal design is one phenomenon, and it belongs in physics.


Evolution means a flow organization (design) that changes over time.

Thanks, now I finally know what evolution is. And that biologists should go home and leave its study to the physicists and engineers.

[Note: I will pa-troll the comments as aggressively as I can and send trolling and troll-chasing to the Bathroom Wall.]

Musings from the mind of a mouse


Casey Luskin is such a great gift to the scientific community. The public spokesman for the Discovery Institute has a law degree and a Masters degree (in Science! Earth Science, that is) and thinks he is qualified to analyze papers in genetics and molecular biology, fields in which he hasn't the slightest smattering of background, and he keeps falling flat on his face. It's hilarious! The Discovery Institute is so hard up for competent talent, though, that they keep letting him make a spectacle of his ignorance.

I really, really hope Luskin lives a long time and keeps his job as a frontman for Intelligent Design creationism. He just makes me so happy.

His latest tirade is inspired by the New York Times, which ran an article on highlights from the coelacanth genome. Luskin doesn't think very deeply, so he keeps making these arguments that he thinks are terribly damaging to evolution because he doesn't comprehend the significance of what he's saying. For instance, he sneers at the fact that we keep finding conserved elements in the genome, because as we all know, there are lots of conserved elements.

The coelacanth genome has been sequenced, which is good news all around…except that I found a few of the comments in the article announcing it disconcerting. They keep calling it a "living fossil" (and you know what I think of that term) and they keep referring to it as evolving slowly:

The slowly evolving coelacanth

The morphological resemblance of the modern coelacanth to its fossil ancestors has resulted in it being nicknamed 'the living fossil'. This invites the question of whether the genome of the coelacanth is as slowly evolving as its outward appearance suggests. Earlier work showed that a few gene families, such as Hox and protocadherins, have comparatively slower protein-coding evolution in coelacanth than in other vertebrate lineages.

Honestly, that's just weird. How can you say its outward appearance suggests it is slowly evolving? The two modern species are remnants of a diverse group — it looks different than forms found in the fossil record.

Phylogenetics and population genetics, that is. Larry Moran calls attention to the confusion of Ann Gauger, ID-pushing BioLogic Institute “researcher.” My favorite comment in the thread is from (PT crew member) Joe Felsenstein:

I must be totally confused. I wrote a book on reconstructing evolutionary trees – and it’s the standard textbook in that area. But it does not mention many basic population genetics concepts. I have another book (a free downloadable e-book) that is a textbook of theoretical population genetics. And it does not mention homoplasy at all.

So I must misunderstand what “population genetics” is. And here I’ve been giving courses on it for the last 44 years. At the university where Ann Gauger got her Ph.D. degree, for that matter.

Silly me.

My second favorite is from Piotr Gasiorowski:

Cargo cult science

Precisely. The cult members gather in mock laboratories full of imitation equipment, where they mimic the way scientists speak and behave.

It’s time for the annual birthday greeting to Jean Baptiste Pierre Antoine de Monet, Chevalier de Lamarck, born 1 August 1744. Born into the impoverished nobility, he distinguished himself in the army, then had to leave military life because of a peacetime injury. In Paris, he started writing books on plants and ended up as Professor in the Natural History Museum. He was the great pioneer of invertebrate biology (he coined the terms “invertebrate” and “biology”). But of course he is best known as the first major evolutionary biologist, who propounded a theory of evolution which had an explanation for adaptation. (A wrong explanation, but nevertheless an explanation).

This time let’s use an image of the tree of animals, from his Philosophie Zoologique (1809):


This is not entirely a tree of history: it is also paths up which evolution proceeds (actually, on this diagram, down which evolution proceeds). So it is not quite the same as the trees we use now. Note that not all animals are connected on this tree.

Of course, it goes without saying that Lamarck was not responsible for inventing or popularizing “Lamarckian inheritance”. He invoked it but everyone already believed it. And to add one last jibe: epigenetics is not in any way an example of the use-and-disuse mechanisms that Lamarck invoked.

It has been announced that Robert Sokal died on April 9. I wrote a brief obituary here last autumn for his co-worker Peter Sneath. Together they pioneered the use of clustering algorithms in taxonomy, and argued for the adoption of phenetic methods based on clustering there. While they were ultimately unsuccessful in this, they became founding fathers of work on mathematical clustering, and their book Principles of Numerical Taxonomy was widely noticed and greatly stimulated the development of phylogeny algorithms. A paper by Michener and Sokal (1957) is, as far as I can tell, the first one publishing a numerical phylogeny. Sokal’s publication of the 1965 paper by Camin and Sokal in Evolution, and a visit he made to the University of Chicago that year, inspired me to start working on phylogeny algorithms.

[Photos: Robert Sokal in 1964 at the International Entomological Congress in London; Bob Sokal, more recently]

Bob’s Stony Brook colleague Michael Bell has written a fine obituary, which I reprint below with his permission.

Good news! The gorilla genome sequence was published in Nature last week, and adds to our body of knowledge about primate evolution. Here's the abstract:

Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.

I've highlighted one phrase in that abstract because, surprise surprise, creationists read the paper and that was the only thing they saw, and in either dumb incomprehension or malicious distortion, took an article titled "Insights into hominid evolution from the gorilla genome sequence" and twisted it into a bumbling mess of lies titled "Gorilla Genome Is Bad News for Evolution". They treat a phenomenon called Incomplete Lineage Sorting (ILS) as an obstacle to evolution rather than an expected outcome.
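To see why ILS is an expected outcome rather than an obstacle, here is a minimal sketch of the textbook three-species coalescent result: the probability that a gene tree disagrees with the species tree is (2/3)·e^(-T), where T is the length of the internal branch (the time between the two speciation events) in coalescent units, i.e. generations divided by twice the effective population size. The sketch assumes a constant ancestral population size and no selection, and the function names are mine, not the paper's.

```python
import math


def discordance_probability(t_coalescent: float) -> float:
    """P(gene tree differs from species tree) for three species.

    t_coalescent is the internal branch length in coalescent units
    (generations / (2 * effective population size)).
    """
    return (2.0 / 3.0) * math.exp(-t_coalescent)


def branch_length_for(discordance: float) -> float:
    """Invert the formula: internal branch length implied by a discordance fraction."""
    return -math.log(1.5 * discordance)


# Branch length implied by the 30% figure quoted in the abstract.
t = branch_length_for(0.30)
print(round(t, 2))
```

Under these simplifying assumptions, 30% discordance corresponds to an internal branch of roughly 0.8 coalescent units, which is short enough that substantial ILS is just what the model predicts. The abstract's remark that discordance is rarer around coding genes is a reminder that selection, which this sketch ignores, also plays a part.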

Peter H. A. Sneath (1923 - 2011)


Word has reached me that Peter Sneath died last Friday at his home in Leicestershire. He was 87 years old. For a more complete and entertaining autobiographical account see page 77 of this Bulletin of Bergey’s International Society for Microbial Systematics.

Peter was a medical microbiologist who, in the late 1950s, began to work on numerical methods for classifying bacteria and developed numerical clustering methods. He soon came into contact with Robert Sokal, who was doing the same. Together they wrote Principles of Numerical Taxonomy, a widely noticed textbook advocating a phenetic approach to classification, basing it on measures of overall similarity rather than any inference of phylogeny. The smartest thing Sokal and Sneath did was to not fight over who invented numerical taxonomy, but to join together to promote it (Sneath was first author on the 1973 revision Numerical Taxonomy).


Peter Sneath with his children, about 1960. Photo by Joan Sneath, courtesy of the late Peter Sneath.

Numerical taxonomy rattled the systematic establishment, then dominated by followers of Ernst Mayr and George Gaylord Simpson’s school of “evolutionary systematics”. It encouraged and stimulated many younger people to look into numerical approaches. By about 1980 phenetic approaches had been pushed aside by phylogenetic systematics, but Sneath and Sokal’s work is still regarded by mathematical clusterers as the most important founding work in their field. The most widely-used of Sneath’s methods is the UPGMA clustering method (independently also invented by F. J. Rohlf). [See comment of September 30 below for correction of this statement].
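For readers who have never seen it, UPGMA groups taxa purely by overall similarity: it repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. Below is a minimal sketch under stated assumptions: the function name is mine, the output is a bare nested grouping without branch lengths, and the toy distance matrix is invented for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable


def upgma(names: Iterable[str], dist: Dict[FrozenSet[str], float]) -> str:
    """Cluster taxa by UPGMA; return the grouping as a Newick-style string."""
    size = {name: 1 for name in names}   # cluster label -> number of taxa in it
    d = dict(dist)                       # pairwise distances between clusters
    while len(size) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sorted(size), 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Distance to the new cluster is the size-weighted average.
        for c in size:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size)) + ";"


# Toy distances, made up to exercise the function.
toy = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
print(upgma("ABCD", toy))  # (((A,B),C),D);
```

Nothing in the procedure refers to ancestry or to the direction of character change, which is what made it a phenetic rather than a phylogenetic method.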

I always enjoyed meeting Peter and Joan Sneath. Peter was intrigued by any and all uses of numerical and computer methods in science, and was even willing on occasion to violate his own precepts and come up with methods for analyzing phylogenies.

He wrote a pioneering 1975 paper (with Sackin and Ambler) on detecting recombination between lineages, for example. I remember Peter telling me that as he traveled around he collected soil samples to study their bacteria. He carried no sterile vials for that – he simply went out and bought a ream of typing paper, as it was sterile, then used some to scoop up the sample and fold it into an envelope. It was a brilliant common-sense improvisation typical of the best of his generation of English scientists.

One of the goals of the intelligent design (ID) movement is to show that evolution cannot be random and/or unguided, and one way to demonstrate this is to show that an evolutionary transition is impossibly unlikely without guidance or intervention. Michael Behe has attempted to do this, without success. And Doug Axe, the director of Biologic Institute, is working on a similar problem. Axe’s work (most recently with a colleague, Ann Gauger) aims (in part, at least) to show that evolutionary transitions at the level of protein structure and function are so fantastically improbable that they could not have occurred "randomly."

Recently, Axe has been writing on this issue. First, he and Gauger just published some experimental results in the ID journal BIO-Complexity. Second, Axe wrote a blog post at the Biologic site in which he defends his approach against critics like Art Hunt and me. Here are some comments on both.

Read the rest at Quintessence of Dust.


I've been giving talks at scientific meetings on educational outreach — I've been telling the attendees that they ought to start blogs or in other ways make more of an effort to educate the public. I mentioned one successful result the other day, but we need more.

I give multiple reasons for scientists to do this. One is just general goodness: we need to educate a scientifically illiterate public. Of course, like all altruism, this isn't really recommended out of simple kindness, but because the public ultimately holds the pursestrings, and science needs their understanding and support. Another reason, though, is personal. Scientific results get mangled in press releases and news accounts, so having the ability to directly correct misconceptions about your work ought to be powerfully attractive. Even worse, though, I tell them that creationists are actively distorting their work. This goes beyond simple ignorance and incomprehension into the malign world of actively lying about the science, and it happens more often than most people realize.

I have another painful example of the deviousness of creationists. There's a paper I've been meaning to write up for a little while, a Nature paper by David and Alm that reveals an ancient period of rapid gene expansion in the Archaean, approximately 3 billion years ago. Last night I thought I'd just take a quick look to see if anybody had already written it up, so I googled "Archaean genetic expansion," and there it was: a couple of references to the paper itself, a news summary, one nice science summary, and…two creationist distortions of the paper, right there on the first page of Google results. I told you! This happens all the time: if there's a paper in one of the big journals that discusses more evidence for evolution, there is a creationist hack somewhere who'll quickly write it up and lie about it. It's a heck of a lot easier to summarize a paper if you don't understand it, you see, so they've got an edge on us.

Over the past few years there have been increasing numbers of calls for governments to properly fund systematics and taxonomy (and a number of largely molecular-focused biologists insisting they can do the requisite tasks with magic molecule detectors, so don't fund old-school, fund new-fangled-tech). But I think that there is considerable confusion about what systematics and taxonomy are.

Now the usual way a philosopher resolves such questions, apart from interrogating their intuitions relying upon what they learned in grade school, is to go find a textbook or some other authoritative source and quote that. If it is someone they already know, all the better, like Mayr or Dawkins. This is problematic, so I thought I'd do a slightly better job at reviewing what people think. And then I will of course give my own view.

Teaching Tree-Thinking to Undergraduate Biology Students


Phylogenetic trees are essential tools for representing evolutionary relationships. Unfortunately, they are also a major conceptual stumbling block for budding biologists. Anyone who has taught basic evolutionary concepts to college undergrads (and probably high school students as well) has most likely dealt with students struggling to properly read and draw phylogenies.

Lucky for us, there is also a growing body of literature on the most effective ways to teach what has been dubbed “tree-thinking”. I have summarized this literature in a review due to be published in the journal Evolution: Education and Outreach (doi:10.1007/s12052-010-0254-9). The full text of the article is available at that link, and I have reproduced the abstract below.

Evolution is the unifying principle of all biology, and understanding how evolutionary relationships are represented is critical for a complete understanding of evolution. Phylogenetic trees are the most conventional tool for displaying evolutionary relationships, and “tree-thinking” has been coined as a term to describe the ability to conceptualize evolutionary relationships. Students often lack tree-thinking skills, and developing those skills should be a priority of biology curricula. Many common student misconceptions have been described, and a successful instructor needs a suite of tools for correcting those misconceptions. I review the literature on teaching tree-thinking to undergraduate students and suggest how this material can be presented within an inquiry-based framework.
