Meyer 2004 and Deja Vu All Over Again

Whilst reading Stephen C. Meyer’s latest publication, I had the strangest feeling that I had seen it before somewhere. Well, given that Meyer is an “intelligent design” advocate, the feeling that one is not getting a pure feed of innovative, novel prose is perhaps simply to be expected. It didn’t take long to find the source of a substantial chunk of the “new” paper:

S.C. Meyer, M. Ross, P. Nelson, & P. Chien. 2003. The Cambrian explosion: biology’s big bang. Pp. 323–402 in J. A. Campbell & S. C. Meyer, eds., Darwinism, design and public education. Michigan State University Press, Lansing.

That reference comes right from the citations in Meyer 2004. It is cited in the paper in support of a couple of specific claims. It is not noted as a substantial source of the text of the Meyer 2004 article.

But Meyer et al. 2003 is a source of a substantial proportion of the Meyer 2004 text. Read on for the details…

Lord Kelvin wrote:

In physical science the first essential step in the direction of learning any subject is to find principles of numerical reckoning and practicable methods for measuring some quality connected with it. I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be.

It’s one thing to say that it looks like there are similarities between two texts, and another to produce a measure of how much similarity there is. So I wasn’t satisfied with just noting a few close correspondences. I wanted to find out just how much cutting-and-pasting occurred between the two texts.

Nick Matzke looked around for some plagiarism-detection software solutions, but the stuff available appeared to either lack features or require the payment of cold, hard cash. So I decided to write my own Perl script to do the job.

It has taken a few days of debugging, but I have a version that I am pretty satisfied with now. It will take two file names from the command line and read both in. The first is treated as the text which may be copied in part from the second file. Words of some length i or longer are matched across the two texts. At every point of a match, the script tries to find the longest run of words possible with some m words perhaps deleted or inserted at a time. If the number of words in a run meets or exceeds some number n, the matching text is saved for a report. Since every potential match is examined, false negatives should be absent. (I won’t guarantee that the script never has a false negative; after all, I’m not a perfect coder.) A simple test to see if a potentially matching word is already part of an identified run helps keep down redundant processing.
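For readers who want to experiment, here is a minimal Python sketch of the approach just described. It is not the Perl script itself: the function names (`tokenize`, `resync`, `extend_run`, `find_runs`), the parameter names, and the greedy resynchronization rule are all assumptions made for illustration, and the real script may count runs and gaps differently.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into words, dropping punctuation (case is kept)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def resync(a, b, i, j, max_gap):
    """After a mismatch at a[i] / b[j], look for the smallest skip of up to
    max_gap words in each text that lines the two texts up again."""
    for total in range(1, 2 * max_gap + 1):
        for da in range(max(0, total - max_gap), min(max_gap, total) + 1):
            db = total - da
            if i + da < len(a) and j + db < len(b) and a[i + da] == b[j + db]:
                return da, db
    return None


def extend_run(a, b, i, j, max_gap):
    """Greedily extend a run that starts at a[i] == b[j].

    Returns (end_a, end_b, matched): exclusive end indices just past the
    last matching word, and the number of words that matched."""
    matched = 0
    end_a, end_b = i, j
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i, j, matched = i + 1, j + 1, matched + 1
            end_a, end_b = i, j
            continue
        skip = resync(a, b, i, j, max_gap)
        if skip is None:
            break
        i, j = i + skip[0], j + skip[1]
    return end_a, end_b, matched


def find_runs(reference, subject, min_word_len=4, max_gap=2, min_run=10):
    """Report runs of at least min_run matching words.

    reference is the text that may be copied; subject is the possible
    source. Only words of min_word_len or more letters start a run."""
    a, b = tokenize(reference), tokenize(subject)
    positions = defaultdict(list)  # word -> indices where it occurs in subject
    for j, word in enumerate(b):
        if len(word) >= min_word_len:
            positions[word].append(j)

    covered = [False] * len(a)  # words already inside a reported run
    runs = []
    for i, word in enumerate(a):
        if covered[i] or len(word) < min_word_len:
            continue
        best = None
        for j in positions.get(word, ()):
            end_a, end_b, matched = extend_run(a, b, i, j, max_gap)
            if best is None or matched > best[4]:
                best = (i, end_a, j, end_b, matched)
        if best is not None and best[4] >= min_run:
            runs.append(best)
            for k in range(best[0], best[1]):
                covered[k] = True
    return runs
```

Each reported run is a tuple `(ref_start, ref_end, subj_start, subj_end, matched_words)`. Summing `matched_words` over all runs and dividing by the number of words in the reference text gives a percentage of the kind tabulated below. Because the resynchronization step is greedy, this sketch will also produce some false positives.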

Now, about false positives… the script certainly does have those. The important thing is to characterize the rate at which false positives appear, so that the findings can be adjusted in light of that rate. How many false positives are seen depends upon the lengths of the texts that are compared and the vocabulary used in them.

Having said all that, here are the numbers for various comparisons.

Meyer 2004
(13,534 words) vs.   | Total words | Run length | Matches | Words |  %  |
---------------------+-------------+------------+---------+-------+-----+
Moby Dick            |  218,623    |    n>=5    |      37 |   187 |  1  |
Origin of Species    |  209,176    |    n>=5    |     170 |   873 |  6  |
Budd and Jensen 2000 |   32,184    |    n>=5    |      75 |   402 |  3  |
Meyer et al. 2003    |   28,481    |    n>=5    |     455 | 6,280 | 46  |
---------------------+-------------+------------+---------+-------+-----+
Moby Dick            |  218,623    |    n>=10   |       0 |     0 |  0  |
Origin of Species    |  209,176    |    n>=10   |       4 |    49 |  0  |
Budd and Jensen 2000 |   32,184    |    n>=10   |       2 |    47 |  0  |
Meyer et al. 2003    |   28,481    |    n>=10   |     163 | 5,192 | 38  |


OoS (209,176 words) vs. | Total words | Run length | Matches | Words |  %  |
------------------------+-------------+------------+---------+-------+-----+
Moby Dick               |  218,623    |   n>=10    |       6 |    72 |  0  |

It can readily be seen from the results above that the script finds that over a third of Meyer 2004 has been lifted from Meyer et al. 2003. Even at the stricter n>=10 run length, where matches against the other comparison texts fall to essentially nothing, 38% of Meyer 2004 still matches, so the true proportion is over a third.
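The percentage column is just matched words divided by the 13,534 words of Meyer 2004. A short Python check of the two Meyer et al. 2003 rows, using only figures from the tables (the `percent` helper is a name made up for this illustration):

```python
TOTAL_WORDS = 13_534  # length of Meyer 2004, from the tables above


def percent(matched_words, total_words=TOTAL_WORDS):
    """Share of the reference text covered by matched words, in percent."""
    return 100.0 * matched_words / total_words


for label, words in [("n>=5", 6_280), ("n>=10", 5_192)]:
    print(f"{label}: {percent(words):.1f}%")
# prints:
# n>=5: 46.4%
# n>=10: 38.4%
```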

Here are some examples of matches found between Meyer 2004 and Meyer et al. 2003:

Match 2 (1): Reference (000556 .. 000590, of 13533): Subject (008921 .. 008956, of 28480):
the situation starting in the 1970s many biologists began questioning its neo Darwinism s adequacy in explaining evolution Genetics might be adequate for explaining microevolution but microevolutionary changes in gene frequency were not seen as Synthesis is a remarkable achievement However starting in the 1970s many biologists began questioning its adequacy in explaining evolution Genetics might be adequate for explaining microevolution but microevolutionary changes in gene frequency were not seen as
Match 10 (1): Reference (001911 .. 002032, of 13533): Subject (003150 .. 003274, of 28480):
bases Since each of the four bases has a roughly equal chance of occurring at each site along the spine of the DNA molecule biologists can calculate the probability and thus the information carrying capacity of any particular sequence n bases long The ease with which information theory applies to molecular biology has created confusion about the type of information that DNA and proteins possess Sequences of nucleotide bases in DNA or amino acids in a protein are highly improbable and thus have large information carrying capacities But like meaningful sentences or lines of computer code genes and proteins are also specified with respect to function Just as the meaning of a sentence depends upon the specific arrangement of the letters in a linear array Since each of the four bases has a roughly equiprobable chance of occurring at each site along the spine of the DNA molecule biologists can calculate the probability and thus the information carrying capacity of any particular sequence n bases long The ease with which information theory applies to molecular biology has created confusion about the type of information that DNA and proteins possess Sequences of nucleotide bases in DNA or amino acids in a protein are highly improbable and thus have a large information carrying capacity But like meaningful sentences or lines of computer code genes and proteins are also specified with respect to function Just as the meaning of a sentence depends upon the specific arrangement of the letters in
Match 39 (1): Reference (003591 .. 003777, of 13533): Subject (015807 .. 015994, of 28480):
natural selection acting to preserve small advantageous variations in genetic sequences and their corresponding protein products Dawkins 1996 for example likens an organism to a high mountain peak He compares climbing the sheer precipice up the front side of the mountain to building a new organism by chance He acknowledges that his approach up Mount Improbable will not succeed Nevertheless he suggests that there is a gradual slope up the backside of the mountain that could be climbed in small incremental steps In his analogy the backside climb up Mount Improbable corresponds to the process of natural selection acting on random changes in the genetic text What chance alone cannot accomplish blindly or in one leap selection acting on mutations can accomplish through the cumulative effect of many slight successive steps Yet the extreme specificity and complexity of proteins presents a difficulty not only for the chance origin of specified biological information i e for random mutations acting alone but also for selection and mutation acting in concert Indeed mutagenesis experiments cast doubt on each of the two scenarios by which neo Darwinists envisioned new information arising natural selection acting to preserve small advantageous variations in genetic sequences and their corresponding protein products Richard Dawkins for example likens an organism to a high mountain peak 95 He compares climbing the sheer precipice up the front side of the mountain to building a new organism by chance He acknowledges that this approach up Mount Improbable will not succeed Nevertheless he suggests that there is a gradual slope up the backside of the mountain that could be climbed in small incremental steps In his analogy the backside up Mount Improbable corresponds to the process of natural selection acting on random changes in the genetic text What chance alone cannot accomplish blindly or in one leap selection acting on mutations can accomplish through the cumulative effect of 
many slight successive steps Yet the extreme specificity and complexity of proteins present a diffi culty not only for the chance origin of specified biological information that is for random mutations acting alone but also for selection and mutation acting in concert Indeed mutagenesis experiments cast doubt on each of the two scenarios by which neo Darwinists envision new information arising
Match 41 (1): Reference (003794 .. 003889, of 13533): Subject (016004 .. 016101, of 28480):
either arise from non coding sections in the genome or from preexisting genes Both scenarios are problematic In the first scenario neo Darwinists envision new genetic information arising from those sections of the genetic text that can presumably vary freely without consequence to the organism According to this scenario non coding sections of the genome or duplicated sections of coding regions can experience a protracted period of neutral evolution Kimura 1983 during which alterations in nucleotide sequences have no discernible effect on the function of the organism Eventually however a new gene sequence will arise that either new functional genes arise from noncoding sections in the genome or functional genes arise from preexisting genes Both scenarios are problematic In the first scenario neo Darwinists envision new genetic information arising from those sections of the genetic text that can presumably vary freely without consequence to the organism According to this scenario noncoding sections of the genome or duplicated sections of coding regions can experience a protracted period of neutral evolution in which alterations in nucleotide sequences have no discernible effect on the function of the organism Eventually however a new gene sequence will arise that
Match 140 (1): Reference (011258 .. 011352, of 13533): Subject (020800 .. 020892, of 28480):
At every level of the biological hierarchy organisms require specified and highly improbable arrangements of lower level constituents in order to maintain their form and function Genes require specified arrangements of nucleotide bases proteins require specified arrangements of amino acids new cell types require specified arrangements of systems of proteins body plans require specialized arrangements of cell types and organs Organisms not only contain information rich components such as proteins and genes but they comprise information rich arrangements of those components and the systems that comprise them Yet we know based on our present experience At every level of the biological hierarchy organisms require specified and highly improbable arrangements of lower level constituents in order to maintain their form and function Genes require specified arrangements of nucleotide bases proteins require specified arrangements of amino acids new cell types require specified arrangements of proteins and systems of proteins new body plans require specialized arrangements of cell types and organs Organisms not only contain information rich components such as proteins and genes but they comprise information rich arrangements of those components and the subsystems that comprise them Based on experience

(For the complete set of matches found at word runs of length 10 or more, see http://www.antievolution.org/people/meyer_sc/meyer2004_bio_info/mnr_bbb_comp.html)

As a check, I also ran the Perl Algorithm::Diff package’s “traverse_sequences” routine, and found the following result for comparing Meyer 2004 to Meyer et al. 2003:

5,484 words out of 13,534 match, 40.5 %
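A similar cross-check can be sketched in Python with the standard library's `difflib.SequenceMatcher`. This is an analogy rather than a port: `SequenceMatcher` does not use the same algorithm as Algorithm::Diff, so the counts it gives will not necessarily equal the figure above. The function name `matched_word_fraction` is hypothetical.

```python
from difflib import SequenceMatcher


def matched_word_fraction(reference_words, subject_words):
    """Count reference words that fall in matching blocks shared with the
    subject, and return (matched, matched / len(reference_words))."""
    matcher = SequenceMatcher(None, reference_words, subject_words, autojunk=False)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    fraction = matched / len(reference_words) if reference_words else 0.0
    return matched, fraction


matched, fraction = matched_word_fraction(
    "the quick brown fox jumps".split(),
    "a quick brown dog jumps high".split(),
)
print(matched, fraction)  # 3 of the 5 reference words match
```

Unlike the run-based script, this kind of sequence comparison counts every aligned word, including isolated single-word matches, so its total should be read as a rough check and not as an identical measurement.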

As yet another check of my script’s sensitivity, I compared two review articles on the same subject by the same first author published three years apart. In looking for runs of 10 words or more, there were two matches comprising some 28 words, or 1% of the total length of the more recent paper. (Nonaka, Masaru, and Yoshizaki, Fumiko. 2004. Evolution of the complement system. Molecular Immunology Volume 40, Issue 12, February 2004, Pages 897-902 compared to Nonaka, Masaru. 2001. Evolution of the complement system. Current Opinion in Immunology Volume 13, Issue 1, 1 February, Pages 69-73.)

As a philosopher of science, Meyer should know a thing or two about scientific custom when it comes to journal articles. One custom is that one does not submit to a peer-reviewed publication work that has already been substantially published in another peer-reviewed publication. Now, I don’t think that this custom applies in this case, but Stephen Meyer ought to think that it does. After all, the Discovery Institute bills the book that Meyer et al. 2003 appears in as “peer reviewed”.

SEATTLE Darwinism, Design and Public Education (DDPE) is a peer-reviewed book published by Michigan State University Press that presents a multi-faceted scientific case for the theory of intelligent design while also examining the legal and pedagogical arguments for teaching students about the scientific controversies that surround the issue of biological origins. Contributors to the book include both leading scientific proponents of intelligent design and neo-Darwinism.

(Source: http://www.discovery.org/scripts/viewDB/index.php?program=News-CSC&command=view&id=1694)

ID fans are fond of saying that you don’t get any more information in two copies of a newspaper than in one. This is of course false, as Elsberry and Shallit 2003 note. But the germ of truth in it is that you don’t get any more misinformation in two ID papers than in one: it’s the same old same old.