Letter Serial Correlation points to the evolution of languages


This essay was prompted by Steve Reuland’s post to Panda’s Thumb titled “What good is half an underlying language structure?” (www.pandasthumb.org/pt-archives/000853.html), which refers to Carl Zimmer’s posts to Loom (http://www.corante.com/loom/archive[…]part_one.php and http://www.corante.com/loom/archive[…]part_two.php). One point touched upon in passing by Zimmer and by some of the commenters was the question of whether or not natural languages have all evolved from the same proto-language.

Such an idea was, in particular, strongly pushed by Academician Nikolay Marr in the USSR. For some 30 years Marr had been acclaimed in the USSR as the greatest linguist of all time, whose teachings were supposedly in full agreement with Marxism-Leninism. Then, suddenly, in 1950, Stalin changed his mind, and millions of copies of a thin booklet were published whose author was claimed to be Stalin himself. It explained what the “genuine Marxist linguistics” is. In this booklet Marr was declared a pseudo-scientist and his theory denounced as anti-Marxist.

As I understand it, the notion of a single proto-language is shared by many linguists.

I would like to briefly report on some data which, I believe, provide strong empirical support for the notion of the intrinsic unity of all natural languages, specifically evident in their written form. The experimental data in question were obtained in work conducted a few years ago by myself and Brendan McKay of the Australian National University (Canberra).

We developed a new method for the statistical analysis of texts, dubbed Letter Serial Correlation (LSC). Although we have conducted hundreds of measurements on many texts in 12 languages as well as on a number of gibberish strings created in various ways (and also on the famous Voynich manuscript, often referred to as the “most mysterious manuscript in the world”), so far our results have only been reported in a series of articles on my personal website (http://members.cox.net/marperak/Texts). Twice in recent years we had a paper prepared for an international journal on computational linguistics with a concise presentation of our method and the results obtained, but in both cases we opted to postpone the planned publication because each time some new modifications improving the method came to mind. Besides, both Brendan and I have been busy with other projects and did not devote as much time to LSC as it, perhaps, deserved.

The data obtained by the LSC method demonstrated the intrinsic unity of the structure of all studied languages (Hebrew, Aramaic, Greek, Latin, English, German, Italian, Spanish, Russian, Czech, Finnish, and Yiddish). Most of the biblical texts (both in Hebrew and in translations) as well as such diverse texts as Moby Dick, The Song of Hiawatha, Macbeth, UN convention on Sea Trade, Tolstoy’s War and Peace (in the Russian original and in translations), the full text of a Russian newspaper, and many others, were studied.

I believe our data vividly show that all meaningful texts, regardless of language, authorship, etc, have the same intrinsic structure, in particular reflected in the existence in all meaningful texts of what we called the Average Domain of Minimal Letter Variability (ADMLV). Gibberish strings, both highly ordered and highly disordered, do not possess this feature.

Briefly, the method of LSC is as follows. (I’ll describe the latest version, which differs slightly from that reported in the articles posted to my site.) Our computer program performs several actions on a text stored on a disk, namely: (1) It counts the total number Mi of occurrences of each letter in the entire text. (2) It chooses a “window” in the text which is n letters long, where n is an even number varying from 2 to L/2 if L is an even number, or to (L-1)/2 if L is an odd number (L is the total length of the text expressed in the number of letters). Each “window” is divided into two equal “panes,” 1 and 2, each of length m = n/2. For each value of n the program counts the numbers Xi1 and Xi2 of occurrences of each letter i in panes 1 and 2. The window is moved along the text, and for each position of the window the program calculates the expression (Xi1 - Xi2)^2 for each letter. Then the program calculates the sum Sm of all such expressions over all positions of the window and over all letters of the alphabet.
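The steps just described can be sketched in a few lines of Python. This is only an illustration, not the program Brendan and I used: the function names are made up for the example, the stripping of non-letters and lowercasing is an assumed preprocessing step, and the window is assumed to advance one letter at a time, since the step size is not stated above.

```python
from collections import Counter

def letters_only(raw):
    """Keep alphabetic characters only, lowercased (assumed preprocessing)."""
    return [ch.lower() for ch in raw if ch.isalpha()]

def lsc_sum(letters, n, step=1):
    """Measured sum Sm for one even window length n.

    Assumption: the window advances `step` letters at a time; the essay
    does not specify the step size.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number >= 2")
    m = n // 2  # length of each pane
    total = 0
    for start in range(0, len(letters) - n + 1, step):
        pane1 = Counter(letters[start:start + m])
        pane2 = Counter(letters[start + m:start + n])
        # Sum (Xi1 - Xi2)^2 over every letter seen in either pane;
        # letters absent from both panes contribute zero.
        for letter in pane1.keys() | pane2.keys():
            total += (pane1[letter] - pane2[letter]) ** 2
    return total

def lsc_curve(letters, step=1):
    """Sm for every even n from 2 up to L/2 (or (L-1)/2 when L is odd)."""
    max_n = len(letters) // 2
    return {n: lsc_sum(letters, n, step) for n in range(2, max_n + 1, 2)}
```

Calling lsc_curve(letters_only(text)) gives the values from which an “Sm vs. n” graph could be plotted. The direct recounting of both panes at every position is slow for long texts; a real implementation would update the counts incrementally as the window moves.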

The program generates a table in which the values of Sm (the Measured Serial Correlation Sum) are listed for all values of n. Finally the program plots the graph of Sm vs. n. Simultaneously, the program computes the Expected Serial Correlation Sum (Se) as a function of n, using the theoretical formula we derived based on a random distribution of letters.
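The formula for Se is not reproduced in this essay, so the following is only a stand-in under a stated assumption: that “random distribution” means the letters of the text are randomly permuted while the counts Mi stay fixed, and that the window moves one letter at a time. Under that assumption the expected value of (Xi1 - Xi2)^2 for one window works out to 2m * pi * (1 - pi) * L/(L-1), with pi = Mi/L, and summing over letters and window positions gives the closed form below. The actual formula we derived may differ. All function names are hypothetical, and brute_force_mean is included only as a check on tiny strings.

```python
from collections import Counter
from itertools import permutations

def measured_sum(letters, n):
    """Sm for window length n, window moved one letter at a time (assumed)."""
    m = n // 2
    total = 0
    for start in range(len(letters) - n + 1):
        p1 = Counter(letters[start:start + m])
        p2 = Counter(letters[start + m:start + n])
        total += sum((p1[c] - p2[c]) ** 2 for c in p1.keys() | p2.keys())
    return total

def expected_sum(letters, n):
    """Expected sum if the letters were a random permutation of the text.

    This is a derivation under the permutation assumption, not necessarily
    the formula used in the original LSC program.
    """
    L = len(letters)
    m = n // 2
    windows = L - n + 1
    # Sum over letters of p_i * (1 - p_i), with p_i = M_i / L.
    spread = sum((c / L) * (1 - c / L) for c in Counter(letters).values())
    return windows * 2 * m * spread * L / (L - 1)

def brute_force_mean(letters, n):
    """Average measured_sum over every ordering; feasible only for tiny inputs."""
    sums = [measured_sum(order, n) for order in permutations(letters)]
    return sum(sums) / len(sums)
```

For a short string such as "aabbc", expected_sum and brute_force_mean agree, which is a quick sanity check on the closed form. The brute-force average is useless for real texts, since the number of orderings grows factorially.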

Although the results shown on my site were obtained by an earlier version of the method (where the window was not moved along the text; instead the program divided the text into k equal “chunks” and measured the sums for each pair of adjacent chunks), the results obtained by both versions differ only in secondary details; the newer version removes a certain inconvenience in the original method and generates a smoother curve, but does not generate fundamentally different “Sm vs. n” curves.
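For comparison, the earlier chunk-based variant can be sketched as follows. Again this is an illustration only: the function name is invented, the chunk length is passed in directly rather than derived from k, and letters left over after the last full chunk are assumed to be ignored, a detail not stated above.

```python
from collections import Counter

def chunk_sum(letters, chunk_len):
    """Sum of squared letter-count differences over adjacent equal chunks.

    Assumption: letters left over after the last full chunk are ignored.
    """
    if chunk_len < 1:
        raise ValueError("chunk_len must be at least 1")
    k = len(letters) // chunk_len  # number of full chunks
    counts = [
        Counter(letters[i * chunk_len:(i + 1) * chunk_len]) for i in range(k)
    ]
    total = 0
    # Compare each chunk with its right-hand neighbour.
    for left, right in zip(counts, counts[1:]):
        for letter in left.keys() | right.keys():
            total += (left[letter] - right[letter]) ** 2
    return total
```

Because the chunk boundaries are fixed, the result depends on where the text happens to be cut, whereas the moving window averages over all positions; that is consistent with the newer version producing the smoother curve.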

The “Sm vs. n” curves for all meaningful texts in all studied languages had quite a distinctive shape, with a number of characteristic points which were absent in the graphs for gibberish texts. Many such graphs can be seen on my site at http://members.cox.net/marperak/Texts.

One of the characteristic points seems to be of special interest. It is a distinctive deep minimum on the “Sm vs. n” graph which is present on all such curves for meaningful texts regardless of language, authorship, etc., but does not exist on the curves for gibberish texts (and, as expected, does not exist on “Se vs. n” curves).

This minimum testifies to the existence in meaningful texts of a distinctive Average Domain of Minimal Letter Variability (ADMLV). This is a length of text within which the distribution of letter frequencies is characterized by a maximal frequency of occurrence of the same subset of letters. Within a length of text that is either shorter or longer than the ADMLV, the variability of letter occurrences is larger than within the ADMLV. Details of the measurements, calculation, and interpretation of data can be seen at my site.

The length of the ADMLV differs depending on language but varies only in a narrow range for different texts in the same language. For example, for all Hebrew and Aramaic texts, both biblical and secular, the length of the ADMLV is invariably between 42 and 46 letters. In English texts the length of the ADMLV varies between 60 and 140 letters, which corresponds to a certain extent to the difference between these two writing systems: in Hebrew there are no letters for vowels, so a portion of Hebrew text carrying a certain amount of a message necessarily comprises fewer letters than a corresponding segment in English.

The natural interpretation of the ADMLV is that it represents the average length of texts wherein a specific topic or notion is the subject of the narrative and this predetermines a relatively high frequency of repeated occurrences of the same letters.

The existence of ADMLV, which finds its empirical reflection in the minimum on the LSC curves, seems to be an ineliminable feature of all meaningful texts, regardless of language. It testifies to the deep unity of various languages and supports the notion of all languages’ evolution from the same proto-language via descent with modification.

There seems to be an analogy between biological evolution and that of languages. The evolution of languages is a fact - for example, today’s English is so different from that of Chaucer that nobody in his right mind could deny such an evolution. I guess the creationists would say this is “microevolution,” as Chaucer’s English and today’s English both are still English. And what about, say, Latin and its descendants - Italian, French, Spanish, Portuguese, Romanian, etc.?

While the fossil record, for obvious reasons, necessarily is incomplete and has many gaps, the evolution of a language is often well recorded in all of its stages because of the preservation of written texts.

There is no difference in principle between the evolution of a language from Chaucer’s stage to today’s stage and evolution resulting in the emergence of a new language - Italian from Latin, or Russian and Ukrainian from Old Slavic (two different languages stemming from the same “progenitor,” the separation of which occurred around the 11th - 12th centuries), or Czech, Polish, Bulgarian, Serbian-Croatian, and Macedonian from an even earlier proto-Slavic. The difference is in degree, so that evolution of a language can naturally graduate into evolution to a new language, no longer understandable to the speakers of the original language, provided the two groups of speakers are geographically separated. Likewise, there is no reason why evolution within a species cannot extend to the loss of interbreeding ability between two geographically separated subspecies, thus resulting in the appearance of a new species, i.e. in “macroevolution.”


Mark, I’m neither a linguist nor a competent statistical analyst, but, if my memory is correct, you’ve restricted your analysis to nine Indo-European languages and three non-Indo-European languages (Finnish, Aramaic and Hebrew). And unless I missed something, you’ve also analyzed the letter strings and frequencies but not phonemes (I do recognize that you’ve probably mapped the equivalent, meaning “same” sound, characters of different alphabets to each other). That may be a suitable first step, to refine your technique, but shouldn’t you be looking at languages outside the Indo-European group like Indo-Chinese, Malay-Polynesian, the languages of sub-Saharan Africa, Japanese, Dravidian, the various American Indian tongues, and other rarer languages? And shouldn’t you be looking at the phoneme patterns and frequencies and syntactical structures, which more accurately reflect the substance of a language than the characters found in the written form?

Having done nothing but ask questions, I should add that I’ve often thought that a careful analysis of existing languages, especially before some of the rarer ones go extinct, offers the potential for shedding as much or more light on the radiation of human culture and evolution as the analysis of DNA. And that being my perception, I’ve been surprised not to read in the popular press articles and essays about how languages mirror human development.

At the risk of humoring the troll, let me add my two cents’ worth in response to DaveScot. I don’t recall and will not bother to search out what Reed may or may not have said, but in considering the origin of language at least two basic questions come to mind: 1) When and how did the capacity for language appear and evolve? 2) How did language itself, its development and use, arise and evolve? The two are related but different questions that may have related or totally different and unrelated answers. And the origins of ability and actual usage could well have been many thousands or millions of years apart.

I’ve often thought that a reasonable hypothesis would be that the capacity for language arose once, but that language itself arose multiple times in multiple places among isolated groups of early hominids or pre-hominids (or the first early modern humans), giving rise to the multiplicity of known language groups. The real challenge will be determining if basic syntax is a function of a hardwired brain or embedded in the nature of language itself.

Keanus: Thanks for your comment. I agree with you that my and Brendan’s study was very limited in scope (although it took a lot of time and effort) and that more languages should have been studied, including all those you’ve listed. I am too old to expect that I’ll be able to conduct such an extensive study, and Brendan is too busy with his main interests, which are in combinatorics, so the continuation of the LSC study could conceivably happen only if some younger people take it up. So far, besides our original study, I know of only one other case of the use of our technique by other people. Two prominent Voynich manuscript scholars applied our method and have essentially confirmed the data I reported on my site for the Voynich manuscript. They are interested only in decoding Voynich, so they did not conduct any measurements beyond that limited goal. I am not even sure that Brendan and I will finally submit a paper to a printed journal, because after I posted the data on my site we made a number of improvements in technique, but I never got to put it all in orderly writing - it all sits in emails between me and Brendan. Cheers, Mark

Two questions.

What were your results for the Voynich ms?

Have you ever tried your method on the Enochian Calls (google)? They are supposed to have the same source as Voynich (John Dee’s slimy associate).

Re: comment by Dick Thompson, # 19232. Dick asked:

“What were your results for the Voynich ms?”

The results of the study of LSC in Voynich can be seen at http://members.cox.net/marperak/Tex[…]voynich1.htm and http://members.cox.net/marperak/Tex[…]voynich2.htm. Regarding Dick’s second question, the answer is No. Mark

Although Mark Perakh’s work is very interesting, I don’t see how it demonstrates common descent of languages. What he has found is IMO more likely a side effect of how our minds/brains work.

There are several other phenomena of language that may reasonably be explained in such a fashion, without resorting to the hypothesis of common descent. Like assimilation of neighboring sounds, where one sound is altered to make it flow more smoothly with a neighboring sound. For example, a prefix ending with -n, like in-, con-, or syn-, has its /n/ sound changed to /m/ or /ng/ or /r/ or /l/ depending on what sound comes next, simply because it is easier to pronounce that way. Also, the English indefinite article was originally always “an”, but nowadays loses the n before a consonant for that reason.

To demonstrate common descent, one has to find features of language that are (1) difficult to explain with such hypotheses and (2) rarely borrowed, borrowing being the linguistic version of lateral gene transfer. Morphology and basic vocabulary are good places to look, though they are not absolutely unborrowable. This conclusion is reached by studying languages with long written records; though they may change over time, what I’ve described generally holds true. Compare present-day English with Old English: though present-day English has a much simpler morphology than Old English and an enormous quantity of borrowings, most of the more basic vocabulary is well-preserved, and there is even a fair amount of continuity in grammar, like formation of verb past tenses and participles, especially irregular formation of these.

One can do the same with Latin and the Romance languages; despite various changes, much continuity is still recognizable in basic vocabulary and grammar.

This means that we can extrapolate beyond where the paper trails end and infer the existence of long-gone languages. This is easy for the Germanic languages, like English and German; they have an abundance of shared vocabulary and grammatical features. All but modern English have definite vs. indefinite adjective declensions, and all have two types of verb conjugation: the “strong” (vowel shifts, like English sing, sang, sung), and the “weak” (English -ed and cognates).

One can do likewise for other language families, like Celtic, Baltic, Slavic, Indic, Semitic, etc.; one can even find bigger families like Indo-European. However, the farther and farther one looks back in time, the more details get obscured by language change; for that reason, such proposed families as Nostratic and Sino-Caucasian are not widely accepted.

Here is a simple table:

Indo-European:
  Germanic:
    English: me, one, two, three, ten, name, sun, star
    Old English: me-, an, twa, thri, tien, nama, sunne, steorra
    German: mi-, eins, zwei, drei, zehn, Name, Sonne, Stern
    Swedish: mi-, en, tva, tre, tio, namn, sol, stjarna
    Gothic: mi-, ains, twai, threis, taihun, namo, sunna, stairno
  Slavic:
    Russian: me-, odin, dva, tri, desyat’, imya, solntse, zvezda
    Serbo-Croatian: mi, jedan, dva, tri, deset, ime, sunce, svjezda
    Bulgarian: me-, edin, dva, tri, deset, ime, sluntse, trugvam
  Celtic:
    Irish Gaelic: me-, aon, do, tri, deich, ainm, grian, ralta
    Breton: me, unan, daou, tri, dek, anv, heol, sterenn
  Latin-Romance:
    Latin: me-, unus, duo, tres, decem, nomen, sol, stella
    Italian: me, uno, due, tre, dieci, nome, sole, stella
    Spanish: me, uno, dos, tres, diez, nombre, sol, estrella
    French: me, un, deux, trois, dix, nom, soleil, etoile
  Hellenic:
    Classical Greek: eme, heis, duo, treis, deka, onoma, helios, aster
  Indic:
    Sanskrit: ma-, eka, dvaa, trayas, dasha, naama, surya, taara
    Hindi: mai, ek, do, tin, das, nam, surya, tara
    Bengali: ami, aek, dui, tin, dash, nam, surya, tara
    Sinhalese: ma-, eka, deka, tuna, dahaya, nama, ira, tharuwa
  Ancestral IE: *me-, *oinos, *dwo, *treyes, *dekm, *nomn, *sawel, *ster (reconstructed)

Uralic:
  Finnish: mi-, yksi, kaksi, kolme, kymmenen, nimi, aurinko, tähti
  Hungarian: ?, egy, kettö, három, tiz, név, nap, csillag

Semitic:
  Hebrew: -i, ahat, shtayim, shalosh, eser, shem, shemesh, kokhab
  Arabic: -i, waahid, ithnaan, thalaatha, ‘ashara, ism, shams, kaukab

Sumerian: ?, desh, min, pesh, hu, mu, utu, kilib

Basque: ?, bat, bi, hiru, hamar, ?, ?, ?

Notice the varying amounts of resemblance. There is a little bit of resemblance between Indo-European and Uralic; Nostratic includes these two. And there is even less between these two and Semitic.

This treelike pattern of resemblance resembles what one finds from biological evolution, and has a similar explanation; it is unexplained by mythologies like the Tower of Babel story.

Thanks, Loren, for your interesting comment (# 19267). You wrote:

“Although Mark Perakh’s work is very interesting, I don’t see how it demonstrates common descent of languages. What he has found is IMO more likely a side effect of how our minds/brains work.”

First, thanks for your kind words regarding the LSC work (one correction: Brendan McKay is my co-author, so referring to it as just Mark Perakh’s work is imprecise). I agree with your notion that the features unearthed by the LSC method in the studied languages reflect how our minds/brains work. The same can, perhaps, be said about any features of a language. The human ability for language is a function of the human brain, isn’t it, so whatever features of a language there are, they all somehow are effects stemming from how our brain works. Whether you call them “side effects” or view them as some of the principal features of the brain’s work is unclear until we understand in detail how the brain works, which is still a goal that will, hopefully, be reached some day. Being displays of how the brain works does not prevent these features from pointing to the common descent of all languages - human brains are presumably all working the same way regardless of race, ethnicity, etc., right? The common descent of languages may be a natural result of that similarity in how the brains of various ethnic groups work.

I don’t know, Loren, whether you base your fine comment only on this post of mine to PT or on having perused the eight articles on my site. In the latter case you could have seen all those curves of the LSC sums, which have the same basic shape for all studied meaningful texts but not for gibberish, regardless of the gibberish texts’ structure. This is IMHO an impressive manifestation of the intrinsic unity of all studied languages. Does it “prove” the common descent of these languages? No, it does not. But it does, IMHO, jibe with such a hypothesis. It may be just one more piece of circumstantial evidence in favor of common descent. I have no comments on the rest of your interesting remarks, and thank you again for taking the time to write such a detailed and enlightening post. Best wishes, Mark

Yes, Mark Perakh, I had read all those articles you’d written; that statistical regularity seems interesting. And I think that this work ought to be expanded to different genres of text, like conversation transcripts vs. expository writing vs. creative writing vs. poetry. And also to “plain” vs. “flowery” and “serious” vs. “funny”.

But I still think that that unity is a side effect of how we process language; how much of a short-term capacity our brains have.

And descent from a shared ancestor has no direct connection with brain mechanisms; several early human populations may have invented ancestral languages separately. However, I believe that to be unlikely, for these reasons:

(1) Our brains have adaptations for interpreting and generating language.

(2) Language is a human universal; no full-scale human society has ever been found without it.

So our species would always have had language, and the same could have been true of some ancestral species.

And if our species had originated from some relatively small offshoot population (the Punctuated Equilibrium picture), that population would likely have had a single language. Meaning that all present-day languages are descended from a single one.

But reconstructing it is something that most mainstream linguists refuse to think about, because it seems next to impossible. Ancestral Indo-European was spoken about 5000-6000 years ago; this ancestral language was spoken about 100,000 years ago.

And finally, I’m not sure what would be a good online introduction to historical/comparative linguistics. Shall I search for one?

Loren, I would be happy if the LSC study were expanded. Unfortunately there is hardly a chance I’ll be able to do so (the same relates to Brendan, although for a different reason). I’d welcome any young folks taking it up and would be happy to answer any questions they might have in relation to the experiments or measurements. Best!

