Information content of DNA


The information content of DNA is much harder to determine than merely multiplying the number of base pairs by 2 to get the size in bits (remember that each site can hold any of 4 different nucleotides, i.e. up to 2 bits). Still, this approach provides a zeroth-order estimate of the maximum possible information that can be stored in such a sequence, which for the human genome with 3 billion base pairs would amount to 6 billion bits, or 750 Mbytes.

However, information theory shows that random sequences have the lowest information content and that well-conserved sequences contain the maximum. In other words, the actual information content ranges from zero bits per site for totally random sequences to 2 bits per site for fully conserved sequences.

Another way to look at this is to compress the DNA sequence using a regular archive utility. If the sequence is random, the compression will be minimal; if the sequence is highly regular, the compression will be much greater.
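As a rough sketch of this compression heuristic (with Python's zlib standing in for 'a regular archive utility' and synthetic sequences standing in for real DNA):

```python
import random
import zlib

random.seed(0)

# A "random" sequence: every site drawn uniformly from the four nucleotides.
random_seq = "".join(random.choice("ACGT") for _ in range(100_000))

# A highly regular sequence of the same length: one short motif repeated.
regular_seq = "ACGT" * 25_000

for label, seq in [("random", random_seq), ("regular", regular_seq)]:
    compressed = zlib.compress(seq.encode(), 9)
    print(f"{label:8s}: {len(seq):,} bases -> {len(compressed):,} compressed bytes")

# The random sequence barely compresses (it stays at roughly 2 bits per base);
# the repetitive one shrinks to a tiny fraction of its original size.
```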

So how does one obtain a better estimate of the information content of DNA? By estimating the entropy per triplet (3 base pairs), which has a maximum of 6 bits; the measured value is about 5.6 bits for coding regions and about 5.82 bits for non-coding regions. This means that the information content is 0.4 bits per triplet for coding regions and 0.18 bits per triplet for non-coding regions. For 3 billion base pairs, or 1 billion triplets, this gives us an actual information content of 0.4 billion bits (50 Mbytes) in the best-case scenario that all DNA is coding, or 0.18 billion bits (roughly 22 Mbytes) if all the DNA is non-coding.
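As an illustration, here is a minimal sketch of the triplet-entropy calculation on synthetic sequences (the 5.6 and 5.82 figures above come from real genomic data, which this toy does not attempt to reproduce):

```python
import random
from collections import Counter
from math import log2

def information_per_triplet(seq: str) -> float:
    """Shannon information per non-overlapping triplet, I = Hmax - H,
    with Hmax = 6 bits (64 equally likely triplets)."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(triplets)
    total = sum(counts.values())
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return 6.0 - entropy

random.seed(1)
random_seq = "".join(random.choice("ACGT") for _ in range(300_000))
repeat_seq = "ATGGCA" * 50_000  # only two distinct triplets: ATG and GCA

print("random :", round(information_per_triplet(random_seq), 3), "bits per triplet")
print("repeat :", round(information_per_triplet(repeat_seq), 3), "bits per triplet")
# The random sequence comes out near 0 bits per triplet, the repetitive one near 5.
```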

Now how does this compare with evolutionary theory? In a 1961 paper, “Natural selection as the process of accumulating genetic information in adaptive evolution”, Kimura calculated that the amount of information added per generation is around 0.29 bits, which, accumulated since the Cambrian explosion some 500 million years ago, comes to on the order of 10^8 bits, or 12.5 Mbytes, assuming that the geometric mean of the duration of one generation is about 1 year.

As a side note, Kimura reasoned that about 10^7 or 10^8 bits of information would be necessary to specify human anatomy. (Source: Adaptation and Natural Selection by George Christopher Williams)

So is this a reliable way to determine the information content of DNA? Perhaps not; a better way is to take a large sample of DNA from different people and determine, for each base pair, how variable it is. A fully conserved site will carry the maximum of 2 bits of information, while a totally random site will carry zero bits.
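As a minimal sketch of that per-site measure (with a made-up toy 'alignment' standing in for real genomes, and no small-sample correction): each aligned column is scored as 2 − H bits, so conserved sites come out near 2 bits and highly variable sites near 0.

```python
from collections import Counter
from math import log2

def per_site_information(aligned_seqs):
    """For each column of an alignment, return I = 2 - H(column) in bits:
    2 bits for a fully conserved site, close to 0 for a fully variable one."""
    info = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column)
        h = -sum((c / total) * log2(c / total) for c in counts.values())
        info.append(2.0 - h)
    return info

# Hypothetical mini-alignment of the same stretch in five individuals:
sample = [
    "ACGTAC",
    "ACGTTC",
    "ACGAGC",
    "ACGCCC",
    "ACGGAC",
]
print([round(x, 2) for x in per_site_information(sample)])
# The first three sites and the last are conserved (2 bits); the fourth and fifth vary.
```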

The problem is to understand how much information these ‘bits’ really represent. For instance, the total number of electrons in the universe is about 10^79, and specifying one ‘preferred’ electron among them translates to only about 260 bits (log2 of 10^79). At Kimura’s 0.29 bits per generation, natural selection accumulates roughly 290 bits in 1000 generations, so it can achieve something far more improbable than this.



Update Oct 26: I have to take responsibility for not clarifying that my usage of information is based on Shannon’s theory of information, according to which

I(Y) = Hmax - H(Y)

where I(Y) is the amount of information, H(Y) is the entropy of the received sequence, and Hmax is the maximum entropy (basically the entropy of a uniformly distributed sequence).

See Shannon entropy applied, where I described how Shannon entropy is applied in biology, with references to the work of Chris Adami and Tom Schneider.

94 Comments

I don’t know why you posted this. I can’t imagine anyone being interested. :)
“As a side note, Kimura reasoned that about 10^7 or 10^8 bits of information would be necessary to specify human anatomy. (Source: Adaptation and Natural Selection by George Christopher Williams)”
I have no idea why I cut and pasted that quote.

I happened to read Kimura’s paper while researching why Dembski seems to be unfamiliar with the history of the concept of information in biology and found Kimura’s 1961 comments to be quite relevant.

Now you’ve set yourself up for getting a lot of criticism. Speaking as an expert on information and evolution, I can say that *everyone* who posts here is sure that they are an expert on information and evolution (which is how I know I am too). And some of them will no doubt argue vehemently that random sequences have the *most* information, not the least. Enjoy.

I’m going to throw my $.02 in, but I’m not sure I’m able to express this in a sufficiently coherent manner.

I submit that describing DNA in terms of information is rather like describing electrons and such in terms of waves and/or particles. An electron is what it is, and describing it as wave-like or particle-like is a human analogy that helps us understand it and does not mean that the electron is actually a wave or a particle. Similarly, DNA is what it is, and describing it in terms of information content doesn’t mean that DNA consists of information that is used in the way that a computer uses information.

Shannon or Kolmogorov sense?

The real question is: does Bobby know the difference :-)

Joe Felsenstein said:

Now you’ve set yourself up for getting a lot of criticism. Speaking as an expert on information and evolution, I can say that *everyone* who posts here is sure that they are an expert on information and evolution (which is how I know I am too). And some of them will no doubt argue vehemently that random sequences have the *most* information, not the least. Enjoy.

Last sentence of essay seems to have a word missing.

The complexity of DNA proves evolution. It must be easy for 6 billion bits to evolve over 4 billion years.

To my “friends” - what do you guys think of NOMA? Dawkins or Gould?

Interesting discussion PvM.

I wouldn’t be surprised if we see an appearance by creationist Kirk Durston at this point, so here’s a bit of background for discussion. He’s kind of looking at things from the Douglas Axe point of view (evolution not being able to cross sequence space); he’s doing a bioinformatics kind of PhD and claims to be usefully applying information to evolution. Here’s one of his papers as part of this research:

A Functional Entropy Model for Biological Sequences.

http://www.newscholars.com/papers/D[…]%20paper.pdf

Here’s the kind of argument he uses in relation to evolution:

Darwinian theory also requires another prediction:

P2- Since an average, 300 amino acid protein requires approximately 500 bits of functional information to encode, and even the simplest organism requires a few hundred protein-coding genes, variation and natural selection should be able to consistently generate the functional information required to encode a completely novel protein.

Functional information is information that performs a function. When applied to organisms, functional information is information encoded within their genomes that performs some biological function. Typically, the amount of functional information required to encode an average, 300 amino-acid protein is in the neighborhood of 500 bits and most organisms contain thousands of protein-coding genes in their genome. Most combinations of amino acids will not produce a stable, three dimensional folded protein structure. Furthermore, the sequence space that encodes a stable folding protein tends to be surrounded by non-folding sequence space. Thus, to generate a novel protein with a stable fold, an evolutionary pathway must cross non-folding sequence space via a random walk, where natural selection will be inoperative. Thus, it requires functional information to properly specify a biological protein with a stable secondary structure. Recent computer simulations have failed to generate 32 bits of functional information in 2 x 10^7 trials, unless the distance between selection points is kept to 2, 4, and 8-bit steps. Such small gaps between selection points are highly unrealistic for biological proteins, which tend to be separated by non-folding regions of sequence space too large for the evolution of a novel protein to proceed by selection. Organic life requires thousands of different proteins, each requiring an average of 500 bits to encode. 32 bits is far too small to encode even one, average protein. An approximate and optimistic upper limit can be computed for the distance between selection points that could be bridged over the history of organic life if we postulate 10^30 bacteria, replicating every 30 minutes for 4 billion years, with a mutation rate of 10^-6 mutations per 1000 base pairs per replication. The upper limit falls between 60 and 100 bits of functional information, not sufficient to locate a single, average folding protein in protein sequence space. The Darwinian prediction P2, therefore, appears to be falsified. Variation and natural selection simply does not appear to have the capacity to generate the amount of functional information required for organic life.

A recent appearance at Larry Moran’s blog provided the following discussion of information and evolution:

In response to Mike Haubrich’s proposed challenge:”Explain to me what sort of ‘information’ you are referring to. You can do it in a five page report, and give references, please.”

I would suggest that functional information is what Haubrich is looking for. As long as it doesn’t matter if the information is gibberish or not, either Shannon information or Kolmogorov Complexity will do. But Szostak pointed out that for biological life, it does matter a great deal whether the information encoded in the genomes of life is functional or not, so he proposed that it was time for biologists to start analyzing biopolymers in terms of ‘functional information’ (see Szostak JW, ‘Functional information: Molecular messages’, Nature 2003, 423:689.) Four years later, Szostak et al. published a paper laying out the concepts of functional information, with application to biological life (see Hazen, R.M., Griffin, P.L., Carothers, J.M. and Szostak, J.W., ‘Functional information and the emergence of biocomplexity’, PNAS 2007, 104: 8574-8581.) Going over their paper, I could see that they made some simplifying assumptions that they did not state in their paper, including a) amino acids functional at a particular site occur with equal probability and b) all functional sequences occur with equal probability. They also do not consider the time variable in their equation so that one can measure the change in functional information as the set of functional sequences evolve. Nevertheless, their method does give an approximation of the functional information for a given biopolymer, although there are more sophisticated methods out there. I wrote some software that would calculate the functional information required to encode a given protein family that does take into consideration variable probabilities of amino acids at each site as computed from existing aligned sequence data. For example, I ran 1,001 sequences for EPSP Synthase through and obtained a value of 688 bits of functional information required to fall within the set of functional sequences. Of course, there are likely to be alignment errors in the Pfam database where I obtained my alignment. The effect of any alignment errors will give an artificially low result. I’ve also looked at what functional information means in terms of the structure and function of a given protein family and have found some very interesting results.

http://sandwalk.blogspot.com/2008/1[…]ion-can.html

Also see a previous discussion at Jeffrey Shallit’s blog, with, amongst others, Panda’s Thumb’s very own Art Hunt:

http://recursed.blogspot.com/2008/0[…]ientist.html

To some degree, of course, this is all a red herring. DNA alone does not “specify human anatomy”; a lot of anatomy is in fact epigenetic. This means, strictly speaking, that it is inherited but not encoded in DNA; one of the best-studied mechanisms for this is the interactions between cells during development. Both cells could become the same thing, but one cell’s signalling molecule tells an adjacent cell to become something else, and in turn that cell may change which signalling molecules it uses. Depending on the pattern of signalling molecules, both spatially and temporally, the results of development can differ significantly. (What starts it all? one might ask. There are a number of mechanisms known for this as well, many of which are external to the embryo, often being set up by the mother.) The signals themselves are often highly evolutionarily conserved, such that the Pax6 gene homologue from a fruit fly, which (among other things) specifies eyes, can make eyes grow in places where it is injected into a developing frog. This is not to say that DNA is unimportant, of course, just that it is not the only part of the story (at least with eukaryotes).

That said, this is a nice article, and an important investigation into one of ID’s primary claims.

There are questions in here folded into questions. First question: is there some “magic ratio” of the number of bits in a “program” to the complexity it produces? “Yes, the value of the ratio is … 0.42!”

I don’t think anybody’s figured out any such ratio. And even if they had, nobody’s figured out how to mathematically determine the complexity of an organism to permit such a calculation.

Even comparing the same program written in different computer languages is tricky. Some languages may be able to do particular tasks in much less code than others. And even when it comes to the binary executable program, it’s hard to make comparisons. For a large program, the executable for an interpreted system is much smaller than the executable for a compiled system (if much slower as well). And among compiled systems the size of the executable depends on the compiler and the processor. A specialized processor will probably need much less code for a task tailored to it than a general-purpose processor.

And that’s only comparing the SAME program. Comparing different programs? Writing a little toy demo program to draw even a simple picture is a pain; a toy demo program to draw an elaborate fractal pattern is shorter, and it can produce as much fractal detail as one likes just by changing the count of the number of iterations. Incidentally, the growth of organisms seems to have fractal features, and fractal algorithms are noted for ability to generate lots of elaboration for a small amount of code.
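As a concrete illustration of that last point, here is a toy sketch (not anyone’s actual demo program): a Sierpinski-triangle pattern takes only a few lines, and a single parameter controls how much detail it produces.

```python
def sierpinski(rows: int) -> str:
    """Rows of Pascal's triangle mod 2: a tiny 'program' whose output detail
    is controlled by a single parameter."""
    lines, row = [], [1]
    for _ in range(rows):
        lines.append("".join("*" if x % 2 else " " for x in row).center(rows))
        row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
    return "\n".join(lines)

print(sierpinski(32))  # change 32 to 64, 128, ... for ever more detail from the same code
```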

Then … comparing “programs” between two different systems that don’t have any real resemblance to each other and don’t perform the same functions is out in hyperspace.

Are there not enough bits in the human genome to encode the human body? For all we know I could insist that there’s FOUR TIMES as many bits as required, and dare anyone to prove me wrong: “You see, because of the Binary Coding Efficiency [it’s nice to make up impressive-sounding phrases here] of the human system its Binary Coding Ratio is vastly better than that of a personal computer … “

But at least I would be being silly on purpose.

White Rabbit (Greg Goebel) http://www.vectorsite.net/gblog.html

Actually, in the Kolmogorov theory, random sequences are highly likely to have maximum or near-maximum information content. Furthermore, compression experiments with DNA suggest that it is quite difficult to achieve significant compression, suggesting they are close to random and have very high Kolmogorov information content.

iml8 said: Writing a little toy demo program to draw even a simple picture is a pain; a toy demo program to draw an elaborate fractal pattern is shorter, and it can produce as much fractal detail as one likes just by changing the count of the number of iterations. Incidentally, the growth of organisms seems to have fractal features, and fractal algorithms are noted for ability to generate lots of elaboration for a small amount of code.

Yes. Another example would be a series representation of Pi - a finite, relatively compact formula producing an infinitely long string of digits. I don’t know why our creationist friends can’t see the possible analogy to a relatively compact DNA string producing a huge amount of complexity via iteration.

And this complexity is changed by the pre-existing environment (as Opisthokont pointed out), so unlike a mathematical formula, the information content of a biological structure is not equivalent to the content of the instruction set that produced it. It’s more.

I think questions about DNA information content bring us back to the phenotype/genotype error again. Creationists are confusing the information content of a molecule-by-molecule description of the cake with the information content of the recipe, and on top of that they make the error of forgetting we are making souffle - the environment in which the instructions are carried out makes a difference to the end product. :)

Your post seems to be lacking a conclusion, but I find this subject interesting, because I recently had a look at a presentation of a supposedly new ID theory, in which the supposed scientist presenting it believes that ‘functional information’ is an indicator of intelligence.

Problem is, he defines it as ‘the negative log to base 2 of the number of ways to perform a function acceptably well divided by the total number of ways it can be performed’, or I = -log2[M/N], which is a kind of equation you’ll be familiar with, as it’s basically the same as Dembski’s.

Problem is, this doesn’t even try to give an estimate of information content - rather, it is saying ‘Given a list of all the ways to do something, this will give you the minimum information required to pick one from that list’.
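To make that concrete, here is a toy calculation with invented numbers (neither M nor N comes from the presentation): given M ways to perform the function acceptably well out of N possible ways, I = -log2(M/N) is just the number of bits needed to single out the functional subset.

```python
from math import log2

def functional_information(m_functional: int, n_total: int) -> float:
    """I = -log2(M / N): the bits needed to single out the functional subset,
    assuming every sequence in the space is equally likely."""
    return -log2(m_functional / n_total)

# Invented numbers, purely for illustration:
n_total = 20 ** 100        # all sequences of a 100-residue protein
m_functional = 10 ** 110   # an imagined count of sequences that work acceptably well
print(round(functional_information(m_functional, n_total), 1), "bits")
```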

Post is here. I’m glad to see, at least, that scientists have been doing real science on this matter long before the ID people.

eric said:

I don’t know why our creationist friends can’t see the possible analogy to a relatively compact DNA string producing a huge amount of complexity via iteration.

The odd thing is that as such this isn’t really a creationist argument, or if it is, it’s a stretch even by those standards: “I maintain that the genome isn’t big enough to encode all the complexity of the organism.”

“Well, OK, but we don’t know of any other mechanism for encoding the blueprint for an organism – so if you say it can’t, feel free to engage in a research project to show what else can. Sorry, don’t know where you’ll get a research grant. Send me a report when you’re done – nah, on second thought, put it up on your website and I’ll look it over if I get the time.”

I suppose this MIGHT be a creationist argument if the development of an individual organism was supposedly only explainable by Supernatural Intervention, but I don’t think even most Darwin-bashers would try to make such a claim. Otherwise, this argument simply invokes unknown sources of developmental information and says nothing about Darwin one way or another.

The “information theory” argument takes the approach of claiming that Darwin can’t account for the information actually contained in the genome. It’s hard to see any real linkage between that and the notion that the genome isn’t big enough to do the job. That’s just an exercise in muddying the waters.

White Rabbit (Greg Goebel) http://www.vectorsite.net/gblog.html

Yes, I was surprised that the entropy of coding and non-coding sequences was quite similar. As I pointed out, a more useful measure in the Shannon sense is to look at which sites are strongly conserved across the population rather than at the compressibility of a single genome. Are you aware of any ways to reconcile the Kolmogorov and Shannon approaches?

Jeffrey Shallit said: Actually, in the Kolmogorov theory, random sequences are highly likely to have maximum or near-maximum information content. Furthermore, compression experiments with DNA suggest that it is quite difficult to achieve significant compression, suggesting they are close to random and have very high Kolmogorov information content.

Of course it is, and this is something I have been trying to explain to Bobby who argued that the information content of the genome was somehow too low to be able to explain how an embryo forms. Since Bobby lacked any solid data, I have attempted to show how to more reliably estimate ‘information’ in the genome and how to relate it to the information in the human body.


Joe Felsenstein said: I can say that *everyone* who posts here is sure that they are an expert on information and evolution

I feel out of place, because I am sure that I am not.

I don’t understand whether information is an extensive or intensive property of a physical object. Do two identical DNA molecules have twice the information, or the same information? Is the information in a DNA molecule greater than, less than, or equal to the sum of the information in each of its atoms? … in each of its constituent quarks and electrons?

SteveF said:

Interesting discussion PvM.

I wouldn’t be surprised if we see an appearance by creationist Kirk Durston at this point…

That’s what I get for not reading… Kirk Dunston is the chap who I was also referring to. If you follow the link you can see a video of him giving a talk about his functional information.

I also apparently can’t tell the difference between an r and an n, so I’ve misspelt his name several times. Silly me.

PvM said:

Yes, I was surprised that the entropy of coding and non-coding sequences was quite similar.

A maybe simpler way of looking at this is to think of bitmap image files. Take a set of bitmap files with the same resolution – say 300 x 300 pixels – and the same color depth – 24-bit full color. In an uncompressed image file format (like .BMP) every such image file is exactly the same size in kilobytes.

Now convert the files to a compressed format (like .PNG – a lossless format, no information is thrown out like in .JPG). The actual information in each of those image files is more or less reflected in the size of the compressed file. If the image is simple, say a matrix of colored squares, the compressed file is small – there’s not much information in the file, it’s mostly “air”, so it squeezes down a lot.

If the image is elaborate, say of a flower garden, the compressed file is big, there’s more information in the file. It has nothing to do with the subject matter of the image, only that the image is “busy”. Get an image consisting of nothing but a random scattering of lots of colored dots and the compression is slight. There’s no “air” in it to squeeze out.

The trick is that the information content of these images has absolutely nothing to do with what the images are of, or what they communicate to a viewer. The only issue is the number of bits that it takes to fully create the image. If the image is “busy”, full of noisy variations, there’s a lot of information in it.

From what I can see of KC entropy, it’s basically a “quantity” measurement. It says nothing about what the information does or how well it does it. If you want to compress an image file (or any other file for that matter), if it’s got a high KC entropy it doesn’t compress very well. It has nothing to do with the function of the file.

White Rabbit (Greg Goebel) http://www.vectorsite.net/gblog.html
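The same comparison can be run without any real image files, compressing raw pixel buffers directly with zlib as a rough stand-in for PNG’s lossless compression (a sketch with synthetic ‘images’):

```python
import os
import zlib

W = H = 300  # 300 x 300 "pixels", 3 bytes per pixel (24-bit colour), uncompressed

flat    = bytes([200, 80, 30]) * (W * H)                # one solid colour
checker = bytes(255 if (x // 10 + y // 10) % 2 else 0   # simple block pattern
                for y in range(H) for x in range(W) for _ in range(3))
noise   = os.urandom(W * H * 3)                         # random coloured dots

for label, buf in [("flat", flat), ("checker", checker), ("noise", noise)]:
    print(f"{label:8s}: {len(buf):,} raw bytes -> {len(zlib.compress(buf, 9)):,} compressed")
# The flat and checkerboard buffers squeeze down to a tiny fraction of their size;
# the random-noise buffer barely shrinks at all.
```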

Good point, the many repeats in non-coding DNA may also help explain why its information content is not that dissimilar from coding DNA. Which returns me to a population measure of information. Take N human genomes and align them. Find the distribution of nucleotides at every single site and use the Shannon information concept to assign a number between 0 and 2 bits depending on how conserved the site is. This would at least link information to conserved sites, which are likely to be related to function (unless we have a recent bottleneck).

It’s the intractable nature of ‘information’ which makes ID’s arguments so vacuous. If ID cannot even address the information content of, let’s say, the formation of a protein, how can it make any arguments other than stating, in Dembskian fashion, that ‘it looks complex’, thus evolutionary processes cannot explain it, and even if they can, they still need an information-rich source? Of course, when it is pointed out that the environment provides such a source, much like breeders do in artificial selection, the argument becomes: but where does the information in the environment come from? As if the information provided by intelligent designers does not require a similar explanation.

Who are they fooling?


Ugh. I’ve just read that paper of Kirk Durston’s, above, and I can’t believe they’re still trying the tornado-in-a-junkyard ploy (or as Kirk says, he ‘assumes that evolution is a random walk’). Still, while they’re not coming up with new stuff, it’s easier to debunk I guess.

I find their work hardly novel, as scholars like Kimura and, more recently, Adami, Schneider, Ofria and others have long since proposed the use of Shannon information. The problem is with converting the number of bits to probabilities, which assumes a random search rather than something which more accurately represents evolutionary processes.

For instance, the authors suggest that for 26 bits of information to arise, 4 x 10^19 trials would be needed. That of course ignores any evolutionary processes, and the work by Schneider has shown that the number of actual trials can be much lower. In fact, Dembski and Marks, in their paper addressing Schneider’s Ev, made similar errors, compounded by additional errors, to conclude that a random search outperforms an evolutionary search. Anyone who understands the mathematics involved would have frowned at such a conclusion. And yet it took the work of Schneider and a person with the alias “2ndclass” to find these errors.
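As a toy sketch of why this matters (this is not Schneider’s Ev, just a made-up 26-bit target in a much smaller search space): cumulative selection, keeping single-bit improvements, reaches the target in a few hundred trials, whereas blind random sampling of the same toy space would need on the order of 2^26 draws.

```python
import random

random.seed(42)
BITS = 26
TARGET = [random.randint(0, 1) for _ in range(BITS)]  # an arbitrary 26-bit target

def matches(candidate):
    return sum(c == t for c, t in zip(candidate, TARGET))

# Cumulative selection: flip one random bit per trial, keep it if fitness improves.
genome = [random.randint(0, 1) for _ in range(BITS)]
trials = 0
while matches(genome) < BITS:
    trials += 1
    mutant = genome[:]
    mutant[random.randrange(BITS)] ^= 1
    if matches(mutant) > matches(genome):
        genome = mutant

print("cumulative selection hit the 26-bit target after", trials, "mutations")
print("a blind random search of the same space needs about", 2 ** BITS, "draws on average")
```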

SteveF said:

Interesting discussion PvM.

I wouldn’t be surprised if we see an appearance by creationist Kirk Durston at this point…

Yes, not very novel indeed but wrapped inside just enough ‘scientific’ sounding language that it may confuse the uninformed reader as to its relevance.

Venus Mousetrap said:

Ugh. I’ve just read that paper of Kirk Durston’s, above, and I can’t believe they’re still trying the tornado-in-a-junkyard ploy (or as Kirk says, he ‘assumes that evolution is a random walk’). Still, while they’re not coming up with new stuff, it’s easier to debunk I guess.

And then, like the professionals they are, they go straight to claiming that there must be a conspiracy against them which is preventing their work being accepted.

Even without the alarmingly large amount of evidence that there IS something incredibly fishy behind the ID scenes (Wedge Document, presentations to Christian groups, association with creationists, creationist arguments, creationist websites, etc)… even without all that, they won’t accept that their failings are entirely their own.

Great Discussion so far, and great comments from PvM and Joe.

I’ve always found this topic pretty interesting, mostly because of the absolutely horrendous abuses of math in general, and probability and information theory in particular, by guys like Dembski. Do any of the more knowledgeable folks here, like Joe, have an opinion of this “Functional Information” as used by Szostak (and, I am presuming, misused by Durston)? Is it useful at all?

I think PvM is quite right about the difficulty of properly probabilistically modeling the growth of information content of a genome, in that it isn’t really a random walk, although one supposes that it does have a random walk-like element to it. Some sort of bounded random walk/Markov process would better emulate an evolutionary search pattern of that type, in my opinion.

I would also suggest that, as well as population-level information measures of a given gene, measures across evolutionary diversity are also useful, and we already use that sort of entropy score in a Shannon information sense when looking at aligned homologs from diverse taxa.

Venus Mousetrap said:

And then, like the professionals they are, they go straight to claiming that there must be a conspiracy against them which is preventing their work being accepted.

Even without the alarmingly large amount of evidence that there IS something incredibly fishy behind the ID scenes (Wedge Document, presentations to Christian groups, association with creationists, creationist arguments, creationist websites, etc)… even without all that, they won’t accept that their failings are entirely their own.

Such as the fact that the primary reason why Intelligent Design “papers” are not published is because Intelligent Design proponents have expressed absolutely no desire to do any research for any paper, pro Intelligent Design or otherwise, in the first place?

iml8 said: From what I can see of KC entropy, it’s basically a “quantity” measurement. It says nothing about what the information does or how well it does it. If you want to compress an image file (or any other file for that matter), if it’s got a high KC entropy it doesn’t compress very well. It has nothing to do with the function of the file.

Understanding this definition also demonstrates that the secondary creationist argument - that nature cannot produce information - is just complete bunkum. It’s not even logically self-consistent, as the exact same point substitution in different places can lead to more or less compressibility. Consider a toy example, substituting CGC for CAC in the following two strings:

cgcgCACgc (the substitution makes it more compressible)

cacaCACac (the substitution makes it less compressible)
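Scaling that toy example up so that a general-purpose compressor can register the effect (a sketch using zlib; at this length the differences are only a few bytes, but the direction is what matters):

```python
import zlib

def compressed_size(s: str) -> int:
    return len(zlib.compress(s.encode(), 9))

# The same point substitution (CAC -> CGC) in two different contexts:
before_1 = "cg" * 200 + "cac" + "gc" * 200   # CAC interrupts an otherwise perfect cg-repeat
after_1  = "cg" * 200 + "cgc" + "gc" * 200   # the substitution restores the repeat

before_2 = "ca" * 200 + "cac" + "ac" * 200   # here CAC fits the ca-repeat perfectly
after_2  = "ca" * 200 + "cgc" + "ac" * 200   # the same substitution now breaks the repeat

print("cg context:", compressed_size(before_1), "->", compressed_size(after_1), "bytes")
print("ca context:", compressed_size(before_2), "->", compressed_size(after_2), "bytes")
# The identical mutation makes one string more compressible and the other less so.
```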

Venus Mousetrap said: And then, like the professionals they are, they go straight to claiming that there must be a conspiracy against them which is preventing their work being accepted.

Being rejected by two fields is clear evidence that the Biological-Industrial Complex has gotten to mathematicians, too. :)

eric

although one supposes that it does have a random walk-like element to it.

Especially in the areas not constrained by having an effect on reproductive success (i.e., subject to natural selection).

Daniel Gaston said:

Great Discussion so far, and great comments from PvM and Joe.

I’ve always found this topic pretty interesting, mostly because of the absolutely horrendous abuses of math in general, and probability and information theory in particular, by guys like Dembski. Do any of the more knowledgeable folks here, like Joe, have an opinion of this “Functional Information” as used by Szostak (and, I am presuming, misused by Durston)? Is it useful at all?

I think Hazen, Griffin and Szostak’s “functional information” is essentially the same as the concept of “specified information” developed by Leslie Orgel, and used by Dembski. I also (in 1978) described an “adaptive information” that is similar. I think these are useful, though Dembski’s proofs using them happen to be wrong.

The disagreements over whether a big stretch of DNA that is basically random (say a megabase of total junk) has lots of information or little information depend on what you expect the calculation to do for you. A message of that length is 2,000,000 bits, so Shannon-wise it has lots of information. A program that computes that 2,000,000-bit number has to be almost 2,000,000 bits long, so the Kolmogorov complexity is large. But if it is random stuff, it has almost no “functional” or “specified” or “adaptive” information, as it has no joint information about phenotypes that make you highly fit. So in that sense it carries little information.

Do two identical DNA molecules have twice the information, or the same information?

Neither. Two identical molecules have two bits more information than one, two bits being sufficient to represent the number “two”. (Actually one can quibble that in this case it’s only 1 bit extra, but that’s not very important in the context of a few megabytes.) 42 identical copies would have 6 bits more information than a single copy, and so on.

What he’s actually computed is the difference in information between memory with data and zeroed memory.

No, he has computed the difference in entropy between a genome where the binding sites are uniformly distributed (not ‘zeroed’ memory) and the entropy where the binding sites become ‘fixed’ or non-uniformly distributed.

Perhaps this will clarify

Adaptation = increase in the mutual information between the system and the environment.

“Evolution increases the amount of information a population harbors about its niche” (Adami)

I(Environment; Population) = Entropy(Population) - Entropy(Population | Environment)

= entropy in the absence of selection (maximum population entropy) - diversity tolerated by selection in the given environment

= how much data can be stored in the population - how much data irrelevant to the environment is stored

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

where H(X,Y) is the joint entropy and H(X|Y) is the conditional entropy.
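Here is a minimal sketch of that bookkeeping in code, using a made-up joint distribution over (environment, allele) pairs in place of real population data:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with the joint distribution given as a
    dict mapping (x, y) pairs to probabilities."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

# Hypothetical joint distribution of environment vs. which allele is common:
joint = {
    ("cold", "allele_A"): 0.40, ("cold", "allele_B"): 0.10,
    ("warm", "allele_A"): 0.10, ("warm", "allele_B"): 0.40,
}
print(round(mutual_information(joint), 3), "bits")
# A positive value means the population's genetic makeup carries information about its niche.
```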

Excellent. Thank you very much

Given the way you have addressed the information content of the human genome in this article, could one address the information content of the 32-volume 2010 edition of the Encyclopaedia Britannica in the same way?

And what sort of value would one arrive at for the EB’s information content?
