99.9% Wrong

by Joe Felsenstein,
http://evolution.gs.washington.edu/felsenstein.html

Over at Uncommon Descent, “niwrad” is back with more calculations showing that conventional figures for comparing sequences of genomes are all wrong. Last time “niwrad” showed that human and chimp genomes match only about 62% of the time. The usual figure given is 98.77%. Niwrad did this by taking 30-base chunks of one genome, finding the best match in the other genome, and then asking what fraction of the time there was a perfect match of all 30 bases. That’s where the 62% figure comes from. I immediately pointed out here at PT that this was expected and did not represent some insightful new way of calculating these figures.

Now Niwrad has turned to comparing two human genomes. The figure for 30-base perfect matches is about 96%. The conventional figure is about 99.9%. Let’s see what is expected. If a single base position has a 0.999 probability of matching, and positions match independently of one another, two bases have a 0.999 × 0.999 probability, three bases a 0.999 × 0.999 × 0.999 probability. 30 bases then have a probability that is 0.999 raised to the 30th power. Which turns out to be (ta-da!) 0.97. Not a bad fit.
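
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function name `perfect_match_probability` is made up for illustration, and the calculation assumes each base matches independently with the same probability, which is a simplification.

```python
def perfect_match_probability(per_base_identity: float, chunk_length: int) -> float:
    """Chance that every base in a chunk matches.

    Assumes (as a simplification) that each position matches
    independently with the same probability.
    """
    return per_base_identity ** chunk_length


# Conventional human/human per-base figure quoted above, 30-base chunks.
human_human_30 = perfect_match_probability(0.999, 30)
print(f"{human_human_30:.2f}")  # prints 0.97
```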

Niwrad proudly notes that in the previous discussion

“it seemed to me that the general feeling at the end was that my statistical method for performing genome-wide comparisons might have some merit, after all.”

(Niwrad must have missed the discussion over here.)

It does have merit: It’s a way of taking a close match and making it sound much less close – without changing anything. I have a suggestion: why not try 100-base chunks? That way the human/chimp match will drop to only about 29%, while human/human will drop to 90%. Or how about 1000-base chunks? (Human/chimp would be only about 0.00042 of a percent, and human/human would be down to about 37%.) Where will this all end?
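
The chunk-size figures can be reproduced with a short Python sketch. It takes the two conventional per-base figures quoted in this post (98.77% and 99.9%) and, as an assumption, treats every base as matching independently; the names `PER_BASE_IDENTITY` and `chunk_match_percent` are invented for this example.

```python
# Conventional per-base match figures quoted in the post.
PER_BASE_IDENTITY = {"human/chimp": 0.9877, "human/human": 0.999}
CHUNK_LENGTHS = [100, 1000]


def chunk_match_percent(per_base_identity: float, chunk_length: int) -> float:
    """Percent of chunks expected to match perfectly, assuming independent sites."""
    return 100.0 * per_base_identity ** chunk_length


# Expected perfect-match percentage for each comparison and chunk length.
table = {
    (pair, length): chunk_match_percent(identity, length)
    for pair, identity in PER_BASE_IDENTITY.items()
    for length in CHUNK_LENGTHS
}

for (pair, length), percent in table.items():
    # .2g keeps two significant digits, so tiny values stay readable.
    print(f"{pair:12s} {length:5d}-base chunks: {percent:.2g}%")
```

Nothing about the genomes changes from row to row; only the chunk length does.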