Junk DNA, Linguistics and the scientific vacuity of Intelligent Design

On Pandasthumb, our dear friend Salvador Cordova (YEC) presents us with the following “argument”:

Pellionisz FractoGene has demonstrated at least one layer of linguistic architecture for the junk DNA. A linguistic structure suggest function even if the structure is not fully understood (like seeing an undecoded communication, the communication has function, but it is not understood). Furthermore, Fracis Collins called it hubris to say any part of the genome is junk.

Salvador may not be familiar with the term ‘non-coding DNA’, which captures scientific thinking far better than the somewhat unfortunate term “junk DNA”, especially since the latter lends itself to cherry-picking rhetoric. In this posting I will explore the term junk DNA, address research findings that DNA and junk DNA show “linguistic features”, and show why ID remains fully vacuous, since it cannot predict, let alone explain, “junk DNA”.

Junk DNA and its confusions

According to Wikipedia, Junk DNA “is a collective label for the portions of the DNA sequence of a chromosome or a genome for which no function has yet been identified.”

This, however, does not mean that all “Junk DNA” has a “function”, although such DNA can serve, as in the case of the anti-freeze example, as a source for novel functions. In fact, science has detected examples of pseudogenes:

These chromosomal regions could be composed of the now-defunct remains of ancient genes, known as pseudogenes, which were once functional copies of genes but have since lost their protein-coding ability (and, presumably, their biological function). After non-functionalization, pseudogenes are free to acquire genetic noise in the form of random mutations.

Evolutionary science can explain the existence of such pseudogenes.

It’s safe to say that “Junk DNA” contains many areas where we lack sufficient data or knowledge to understand their origins or function, and yet science is slowly unraveling the details surrounding non-coding DNA. One may thus compare science’s progress in understanding non-coding DNA with that of Intelligent Design, and one quickly comes to realize that ID remains, as usual, scientifically vacuous, since it is based on our ignorance, not our knowledge.

DNA and linguistic features

Researchers have uncovered some interesting features in non-coding DNA, features which are also found in languages. One of these is known as Zipf’s law, named after the linguist George Kingsley Zipf.

Zipf’s law states that

in a corpus of natural language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table. So, the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. The term has come to be used to refer to any of a family of related power law probability distributions.
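To make the rank-frequency statement concrete, here is a minimal sketch of my own (not code from any of the papers discussed here). The function name and the toy corpus are assumptions chosen for illustration; the corpus is deliberately built so that frequency times rank is constant, which is what an ideal Zipf distribution would look like.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(words)
    # Sort by descending count; break ties alphabetically so the order is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

# Hypothetical toy corpus constructed to follow an ideal 1/rank pattern.
toy = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
table = rank_frequency(toy)
for rank, word, count in table:
    # Under an ideal Zipf distribution, rank * count stays constant.
    print(rank, word, count, rank * count)
```

Real texts only follow this pattern approximately, which is why Zipf analyses usually fit a straight line to a log-log plot of count against rank.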

In R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley, Linguistic Features of Noncoding DNA Sequences, Phys. Rev. Lett. 73, 3169 (1994), the authors showed how non-coding DNA abides by Zipf’s law. As I will show, however, their findings come with a lot of warnings.

What ID proponents seem to suggest is that Zipf’s law provides for a specification and given our ignorance of non-coding DNA (which causes it to be complex in ID speak), they conclude that non-coding DNA is complex specified information (CSI) and thus designed.

As I will show, they are correct that non-coding DNA’s linguistic features are designed, but the designer is a fully natural process.

The first warning to ID activists should have been that non-coding DNA follows Zipf’s law distribution more closely than coding DNA. The second warning should have come from the work by researchers showing how many features in the genome match power scaling laws (Zipf’s law is a subset of such laws).

In “True reason for Zipf’s law in language”, published in Physica A: Statistical Mechanics and its Applications, Volume 358, Issues 2-4, 15 December 2005, Pages 545-550, researchers Wang Dahui, Li Menghui and Di Zengru describe the ‘true reason’ why Zipf’s law arises in languages:

Analysis of word frequency have historically used data that included English, French, or other language, data typically described by Zipf’s law. Using data on traditional and modern Chinese literatures, we show here that Chinese character frequency stroked Zipf’s law based on literature before Qin dynasty; however, it departed from Zipf’s law based on literature after Qin dynasty. Combined with data about English dictionaries and Chinese dictionaries, we show that the true reason for Zipf’s Law in language is that growth and preferential selection mechanism of word or character in given language.

Growth and preferential selection… But wait a minute: imagine a process of gene duplication and preferential attachment, and one has recovered one of the processes thought to be behind the scale-free nature of so many features of the genome.
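How little is needed to get such a skewed distribution can be seen in a small simulation. This is a sketch of a generic Simon-style growth model, which I am substituting for illustration; it is not the specific model of Wang et al., and the function name, the parameter values and the seed are all my own assumptions. At each step either a brand-new item is added (growth) or an existing item is copied with probability proportional to how often it already occurs (preferential selection, or duplication if you think of genes).

```python
import random
from collections import Counter

def simulate_growth(steps, new_item_prob, seed=0):
    """Grow a sequence by adding new items or copying existing ones."""
    rng = random.Random(seed)
    sequence = [0]   # the "text" (or genome) so far, stored as item ids
    next_id = 1
    for _ in range(steps):
        if rng.random() < new_item_prob:
            # Growth: introduce an item that has never been seen before.
            sequence.append(next_id)
            next_id += 1
        else:
            # Preferential selection: picking a uniformly random earlier
            # position chooses an item in proportion to its current count.
            sequence.append(rng.choice(sequence))
    # Return the counts per item, largest first (a rank-frequency list).
    return sorted(Counter(sequence).values(), reverse=True)

counts = simulate_growth(steps=20000, new_item_prob=0.1)
print("top ranks:", counts[:10])
print("median count:", counts[len(counts) // 2])
```

A handful of items end up with very large counts while most items occur only once or twice, and no intelligence is involved anywhere in the loop.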

The cause for the deviation of Chinese from Zipf’s law is simple:

What causes this difference between Chinese and other languages? Let us pay attention to the some features of Chinese characters and English words. Before the Qin dynasty, Chinese characters were in infancy and different in various areas of China. After Emperor Qin Shihuang unified the characters, the Chinese language became mature. It is difficult to create new characters because Chinese characters are pictographs, and the number of Chinese characters has grown very slowly, from 10 000 to 50 000 over last 2000 years. So, the available number of Chinese characters for any author is almost fixed. On the other hand, the words of other language, such as English, new words are introduced constantly and the number of words grows very fast compared with Chinese character. The available words for authors are unlimited.

It is difficult to add new Chinese characters, while adding new words in most other languages is trivial. In other words, when the set of ‘words’ is fixed, the distribution will tend to deviate from Zipf’s law, which helps us understand why Zipf’s law applies better to non-coding DNA than to coding DNA, since the latter is constrained by (strong) selection.

In 1996, researchers already pointed out the problems with Zipf’s law:

This also showed highly similar Zipf behavior to noncoding DNA and language. Thus, to detect language Zipf analysis should be applied with caution, since it cannot distinguish language from power-law noise

N. E. Israeloff, M. Kagalenko, and K. Chan, Can Zipf Distinguish Language From Noise in Noncoding DNA?, Phys. Rev. Lett. 76, Issue 11, March 1996.

In fact, in No Signs of Hidden Language in Noncoding DNA, Phys. Rev. Lett. 76 (1996), authors Bonhoeffer et al. add:

We have thus shown that most of the observations in [1] may be simple consequences of unequal nucleotide frequencies. Our explanation does not exclude the existence of an undeciphered language in noncoding DNA, but it does undercut speculative arguments based on Zipf’s Law or Shannon redundancy [4]. There remains, however, the very interesting question implicit in [1]: Why are there differences in nucleotide frequencies between coding and noncoding DNA?
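The point about unequal nucleotide frequencies can be illustrated with a short simulation. This is my own sketch and not the analysis performed by Bonhoeffer et al.; the word length of 3, the particular base frequencies, the sequence length and the seed are assumptions picked for illustration. Both sequences below are pure noise with independently drawn bases, and the only difference is whether the four bases are equally likely.

```python
import random
from collections import Counter

def ranked_word_counts(weights, length=60000, word_len=3, seed=2):
    """Count overlapping words in a random DNA sequence, largest count first."""
    rng = random.Random(seed)
    # Draw each base independently; weights are given in the order A, C, G, T.
    seq = "".join(rng.choices("ACGT", weights=weights, k=length))
    words = Counter(seq[i:i + word_len] for i in range(length - word_len + 1))
    return sorted(words.values(), reverse=True)

def top_to_median(counts):
    """Crude measure of how steep the rank-frequency curve is."""
    return counts[0] / counts[len(counts) // 2]

uniform = ranked_word_counts([0.25, 0.25, 0.25, 0.25])
skewed = ranked_word_counts([0.4, 0.1, 0.1, 0.4])  # A/T-rich noise

print("equal base frequencies:  ", round(top_to_median(uniform), 2))
print("unequal base frequencies:", round(top_to_median(skewed), 2))
```

With equal frequencies the rank-frequency curve is nearly flat; with unequal frequencies it falls off steeply, even though neither sequence contains any message at all.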

The original authors reply in R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley, Mantegna et al. Reply:, Phys. Rev. Lett. 76, 1979 (1996), pointing out that until the distributions of nucleotides in coding and noncoding DNA can be established, the question of Zipf’s law in DNA remains unresolved.

What is interesting is that in 1992, Wentian Li published an article titled Random Texts Exhibit Zipf’s-Law-Like Word Frequency Distribution in IEEE Transactions on Information Theory, Vol. 38, No. 6, November 1992:

Abstract: It is shown that the distribution of word frequencies for randomly generated texts is very similar to Zipf’s law observed in natural languages such as English. The facts that the frequency of occurrence of a word is almost an inverse power law function of its rank and the exponent of this inverse power law is very close to 1 are largely due to the transformation from the word’s length to its rank, which stretches an exponential function to a power law function.
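Li’s observation is easy to reproduce in spirit. The following is a minimal sketch under my own assumptions (a four-letter alphabet plus a space, a fixed seed, and a function name I made up), not Li’s actual procedure: characters are typed uniformly at random, the result is split on spaces, and the resulting ‘words’ are ranked by frequency.

```python
import random
from collections import Counter

def random_text_ranking(length, letters="abcd", seed=1):
    """Type random characters, split on spaces, rank the resulting words."""
    rng = random.Random(seed)
    symbols = letters + " "  # every letter and the space are equally likely
    text = "".join(rng.choice(symbols) for _ in range(length))
    counts = Counter(text.split())  # split() drops the empty words
    # Most frequent word first; ties broken alphabetically for stability.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = random_text_ranking(200000)
for rank, (word, count) in enumerate(ranked[:8], start=1):
    print(rank, word, count)
```

Short words dominate the top ranks simply because there are few of them and each is likely to be typed by chance, while longer words are individually rarer but far more numerous. That is the length-to-rank transformation the abstract refers to, and it produces a Zipf-like curve from meaningless text.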

Salvador Cordova (YEC) also reports on the work by Pellionisz based on the concept of FractoGene. Salvador makes much of the unfortunate use of the term design by Pellionisz, but as Sal already points out, such ‘fractal or recursive design’ already takes place in many plants (Fibonacci series). In other words, design once again does not mean ‘intelligent design’ but rather algorithmic design. Since algorithmic design is based on regularities, we notice once again how Intelligent Design is unable to distinguish between actual and apparent complex specified information (CSI).

Much is made of the work by Isidore Rigoutsos and others titled Short blocks from the non-coding parts of the human genome have instances within nearly all known genes and relate to biological processes, PNAS, April 25, 2006, vol. 103, no. 17, 6605-6610.

However, as Miklos Csuros et al. point out in “Reconsidering the significance of genomic word frequency”:

Determining what constitutes unusually frequent and rare in a genome is a fundamental and ongoing issue in genomics[6]. Sequence motifs may be frequent because they appear in mobile, structural or regulatory elements. It has been suggested that some recurrent sequence motifs indicate hitherto unknown or poorly understood biological phenomena[17]. We propose that the distribution of DNA words in genomic sequences can be primarily characterized by a double Pareto-log normal distribution, which explains lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random sequence evolution by duplication processes.

[17] references the paper by Rigoutsos et al.

In 2006, in a paper titled “Picking Pyknons out of the Human Genome”, published in Cell, Volume 125, Issue 5, 2 June 2006, Pages 836-838, Meynert et al. argue that there may be an unknown role for pyknons, warning however that:

A frustration for computer scientists is that although DNA sequences are easy to analyze, interpreting why a sequence pattern in a genome is nonrandom is much harder to pin down. For example, patterns that appear many times in a genome might not be functionally important. Many dispersed repeats and retrotransposed pseudogenes also generate considerable numbers of related patterns in the genome. The authors point that although nearly all pyknons (99.9%) show some overlap with repeat elements, there are at least 50,000 instances of pyknons that show no overlap with repeat elements as defined by RepeatMasker (Smit et al., 1996). However, most pyknons (90%) are found at least half of the time in repeat regions, meaning that the vast majority of pyknon instances are in classical repeats.

So if these observations can in principle be explained by random sequence evolution by duplication processes, then how do we determine, once again, if there is true intelligent design to be found? Intelligent Design does not give us any answers here. Remember, all it can do is detect ‘design’ where design can include apparent or actual design, without providing ANY tools to differentiate between the two.

While ID is forced to remain hiding in the shadows of our ignorance, science is pushing forward, trying to unravel the ‘mystery’ of Junk DNA.

As a side note, the usefulness of “Junk DNA” has been known since as early as 1978:

The usefulness of noncoding DNA for mapping human disease genes has been known for at least 25 years. In 1978, Y. W. Kan and Andres Dozy published a paper in The Lancet in which they used a variation in the flanking DNA of the beta-globin gene in the first successful prenatal genetic diagnosis of sickle cell anemia.

See Pubmed

As is explained by W. Maxwell Cowan et al. in a review paper titled The Emergence of Modern Neuroscience: Some Implications for Neurology and Psychiatry, published in Annual Review of Neuroscience, Vol. 23: 343-391 (March 2000):

A major advance in the study of human genetic disorders occurred in the early 1980s with the development of restriction fragment length polymorphism analysis. Until that time, the genetic markers used to track genes and their mutations in human chromosomes were based solely on variations in coding regions of DNA, expressed ultimately as proteins. The common markers were blood group antigens, certain enzymes, and the antigens of the histocompatibility complex. However, DNA encoding gene products probably accounts for less than 10% of the human genome; more than 90% of the genome contains noncoding sequences, previously referred to as junk DNA. In 1980, Botstein et al (1980) realized that polymorphic sites could be recognized using restriction endonucleases. They pointed out that in the limit, single–base pair changes, which are genome specific and are tied to how closely individuals are related, can be detected by changes in restriction digest patterns. And it is important that these single–base pair changes are diagnostic even when the nucleotide changes occur in noncoding regions. These restriction fragment length polymorphisms (RFLPs) allowed saturation of the human genome with markers in noncoding as well as coding DNA regions, and this broad coverage made it easier to pinpoint the chromosomal loci of inherited diseases. Indeed, even before the report by Botstein and his colleagues, Kan and colleagues (Kan & Dozy 1978, Kan et al 1980) were able to show how RFLP analysis could be used for the prenatal diagnosis of a clinical disorder (in this case, sickle-cell anemia).
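The idea behind RFLP analysis described above can be shown with a toy example. This is a sketch of my own: the two short sequences are invented for illustration and are not real globin sequences, and the function name is an assumption. The only borrowed fact is that the restriction enzyme EcoRI recognizes GAATTC and cuts after the first G. A single base change that destroys one recognition site changes the pattern of fragment lengths, whether or not the change lies in a coding region.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting at every occurrence of site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        # The enzyme cuts cut_offset bases into the recognition site.
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# Hypothetical allele with two recognition sites.
allele_a = "ACGT" * 3 + "GAATTC" + "TTGCA" * 2 + "GAATTC" + "CCGTA"
# Same sequence with one base changed (A to G) inside the second site.
allele_b = allele_a[:30] + "G" + allele_a[31:]

print("allele A fragments:", digest(allele_a))
print("allele B fragments:", digest(allele_b))
```

The two alleles differ by one base yet give different fragment patterns, which is what makes such sites usable as genetic markers.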