# Icons of ID: Probability as information

In this episode of Icons of ID I take a quick look at how the definition of information used by ID proponents is nothing more than an argument from probability. When ID proponents claim that chance and regularity cannot create complex specified information (CSI), all they are saying is that such pathways, as far as we know, are improbable. If a probable pathway is found, the measure of information, which is confusingly linked to probability, decreases. In fact, I argue, intelligent designers likewise cannot generate complex specified information, since the probability of intelligent designers designing is close to 1.

#### Information and probability

Elsberry and Wilkins on CSI:

Then again, the choice of the term “complex specified information” is itself extremely problematic, since for Dembski “complex” means neither “complicated” as in ordinary speech, nor “high Kolmogorov complexity” as understood by algorithmic information theorists. Instead, Dembski uses “complex” as a synonym for “improbable”.

So how does Dembski define information?

I(X) = -log2 P(X)

So in other words, information is the negative logarithm of the probability. But which probability is this? Others have shown that Dembski is unclear on this issue and often moves between uniform probabilities and actual probabilities, whichever seems better at the time. Dembski mentions in NFL that this measure of information is similar to Shannon information. In fact Shannon’s entropy is the average of Dembski’s information measure. This confusion about information and entropy is not limited to Dembski’s writings, however, so let’s look at Shannon entropy and information in more detail.
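To make the relationship concrete, here is a small sketch in Python (the function names and the example distribution are mine, chosen purely for illustration): Dembski’s measure assigns -log2 p to a single event, while Shannon’s entropy is the probability-weighted average of that same quantity over all messages.

```python
import math

def self_info(p):
    """Dembski-style 'information' of an event with probability p, in bits: -log2 p."""
    return -math.log2(p)

def shannon_entropy(dist):
    """Shannon entropy: the probability-weighted average of the self-information."""
    return sum(p * self_info(p) for p in dist if p > 0)

# A source with three messages of probability 1/2, 1/4 and 1/4:
dist = [0.5, 0.25, 0.25]
print(self_info(0.25))        # 2.0 bits for a single event of probability 1/4
print(shannon_entropy(dist))  # 1.5 bits: the average of 1, 2 and 2 bits
```

Note that the entropy is a property of the whole distribution, while the self-information is a property of one event; conflating the two is exactly the confusion discussed below.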

#### Claude Shannon: A mathematical theory of communication

In 1948 Shannon published his seminal paper A Mathematical Theory of Communication.

Shannon shows that the logarithm is the natural choice for expressing the concept of information. Entropy, a weighted measure of information, is basically the expected value of the information present. In other words:

If there are n messages with probabilities p1, p2, …, pn, then the Shannon entropy of this set is defined as:

H = -(p1 log2 p1 + p2 log2 p2 + … + pn log2 pn)

or in other words, the probability-weighted average of -log2 pi over all the messages.

Entropy is maximum when all values are equiprobable.
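A quick numerical check of this claim, using an arbitrary example distribution of my own choosing:

```python
import math

def entropy(dist):
    # Shannon entropy in bits; terms with p = 0 contribute nothing.
    return sum(-p * math.log2(p) for p in dist if p > 0)

uniform = [0.25] * 4            # four equiprobable outcomes
skewed  = [0.7, 0.1, 0.1, 0.1]  # the same four outcomes, biased

print(entropy(uniform))  # 2.0 bits: the maximum, log2(4), for four outcomes
print(entropy(skewed))   # about 1.36 bits: strictly less
```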

Information is defined as:

IS(X) = Hmax - H(X)

Information in the Shannon sense is defined as the change in entropy before and after a particular event has taken place. Shannon information, also known as surprise, is any form of data which is not already known. In fact, when rare events occur, they generate a lot of information.

Tom Schneider has some good resources:

So what we have learned so far is that Dembski’s information measure is nothing more than a probability measure, one that resembles Shannon’s entropy measure rather than Shannon’s information measure.

The choice of the term “information” is therefore quite unfortunate, since the quantity has more in common with entropy than with Shannon information.

So let’s try to understand why Dembski argues that regularity and chance cannot create CSI. The answer is simple: if such processes have a high probability of being successful, their Dembski information measure will be low.

But the same problem applies to intelligent designers. Given a particular ‘intelligently designed’ event, its probability is high and thus its information is low. In other words, according to Dembski’s own measure, nothing can create CSI other than pure chance.
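A back-of-the-envelope illustration of the point (the probabilities here are hypothetical, chosen only to show how the measure behaves):

```python
import math

def dembski_info(p):
    # Dembski's measure: -log2 of the probability of the event, in 'bits'.
    return -math.log2(p)

# A 'chance' pathway that succeeds with high probability carries almost no information...
print(dembski_info(0.99))       # about 0.014 bits
# ...and neither does a designer who designs with near-certainty:
print(dembski_info(0.999))      # about 0.0014 bits
# Only a wildly improbable event scores high on this measure:
print(dembski_info(2 ** -500))  # 500 bits
```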

Not much of a useful tool, and the poor choice of information measure has caused much unnecessary confusion, when in fact all Dembski was doing was repeating the age-old creationist argument that evolution or abiogenesis is improbable.

Talkorigins has some good FAQs on what’s wrong with these arguments.

It seems that ID is not only theoretically flawed in its claims but also empirically flawed in that it has failed to be scientifically relevant. To these flaws we can add Dembski’s probability-based argument, obscured by the confusing use of the term ‘information’ where ‘entropy’ would have been appropriate. By Dembski’s own measure, the intelligent designer is just as powerless to create CSI; or, put the other way around, an intelligent designer is just as capable of creating CSI as regular processes are.

Tom Schneider attracted Dembski’s ire for showing how the simple processes of variation and selection can actually increase the information in a genome.

Dembski’s complexity measures have many problems.

Surprisingly, various ID proponents, such as Fred Heeren, seem to have taken Dembski’s claims too seriously.

Heeren quotes another unsupported, and in fact falsified, claim by Dembski:

William Dembski puts it this way: “Specified complexity powerfully extends the usual mathematical theory of information, known as Shannon information. Shannon’s theory dealt only with complexity, which can be due to random processes as well as to intelligent design. The addition of specification to complexity, however, is like a vise that grabs only things due to intelligence. Indeed, all the empirical evidence confirms that the only known cause of specified complexity is intelligence.”

Careless terminology, contradictory statements and examples, and inflated claims all seem to have made the design inference ‘quite problematic’.

Understanding what “regularity,” “chance,” and “design” mean in Dembski’s framework is made more difficult by some of his examples. Dembski discusses a teacher who finds that the essays submitted by two students are nearly identical (46). One hypothesis is that the students produced their work independently; a second hypothesis asserts that there was plagiarism. Dembski treats the hypothesis of independent origination as a Chance hypothesis and the plagiarism hypothesis as an instance of Design. Yet, both describe the matching papers as issuing from intelligent agency, as Dembski points out (47). Dembski says that context influences how a hypothesis gets classified (46). How context induces the classification that Dembski suggests remains a mystery.

Elsberry and Shallit have written an excellent paper, “Information Theory, Evolutionary Computation, and Dembski’s ‘Complex Specified Information’”. They address Dembski’s fallacious reliability claims, present the differences between rarefied design and ordinary design, and discuss the problems with apparent versus actual complex specified information (CSI).

Intelligent design advocate William Dembski has introduced a measure of information called “complex specified information”, or CSI. He claims that CSI is a reliable marker of design by intelligent agents. He puts forth a “Law of Conservation of Information” which states that chance and natural laws are incapable of generating CSI. In particular, CSI cannot be generated by evolutionary computation. Dembski asserts that CSI is present in intelligent causes and in the flagellum of Escherichia coli, and concludes that neither have natural explanations. In this paper we examine Dembski’s claims, point out significant errors in his reasoning, and conclude that there is no reason to accept his assertions.

Dembski has backed off his “Law of Conservation of Information”. Immunoglobulin genes (for example) are information creating machines. Dembski recognizes this and now claims that natural systems can’t create “complex information”. The boundary between “simple information” and “complex information” is vague. The phrase “complex specified information” returns zero (of 13 million) articles in Science Citation Index.

rick pietz Wrote:

I’m really torn by your post. On the one hand, I think the very idea that anyone spends time refuting people like Dembski is a nonproductive expenditure of intellectual capital. On the other hand, if crap like this isn’t refuted, it grows a life of its own. On the third hand, the people who buy into this crap in the first place aren’t ever going to read or hear the argument against it, and if they do, they’ll accept Dembski’s explanations, and it still takes on a life of its own.

Pre-‘urban legends’ die even harder than the new ones.

I agree, and I am struggling with these issues as well. I have found from past experience that although having correct data available may not convince committed creationists, it may prompt some to investigate further. As such I believe that presenting the arguments for why the ID approach(es) do not work in an accessible manner is important.

Dembski’s put up another essay on his site. Just giving you guys the heads up. Enjoy!

Information as a Measure of Variation By William Dembski

http://www.designinference.com/docu[…]ormation.pdf


I think that’s an important debate we need to have more of. I don’t think it’s clear that smart, educated people arguing with willfully-ignorant creationists is a productive use of time and energy. I think the basic reasons to argue are 1) it might do some good to the creationist 2) bystanders can be informed of how stupid the antievolution arguments are. In my opinion, 1 is not an efficient use of energy because creationists can argue the sky is plaid for decades. You can’t argue with blind faith. 2 might have some merit, but any reasonable person can already see that only a tiny fringe of scientists pretend it could be wrong. Further, I’m partial to the Gould (or is it Dawkins?) argument that arguing with them gives people the misleading impression that it’s an open scientific question. I think there’s an outside chance that ridiculing people who make insulting, ridiculous arguments (“all scientists are lying, they really believe in creationism”) might at least embarrass them into shutting up.

It’s endlessly funny to me that lots of Cold Fusion papers were worth publishing, and no ID ones are. The IDiots can’t even meet the bar of basic competence Cold Fusion research met.


PvM Wrote:

So how does Dembski define information?

I(X) = -log2 P(X)

This definition is not peculiar to Dembski. I have seen several texts on information theory which refer to the quantity -log2 P(X) as the amount of information one obtains when one learns that the event X has occurred. It is sometimes called the self-information of X. It is true, as Dembski notes in his latest opus, that the mathematical development of information theory makes very little direct use of this quantity. In textbooks its role seems to be confined to motivating the definition of entropy, and many (perhaps most) don’t even mention it.

Some writers on information theory (Tom Schneider, for one) object to this terminology and prefer to call the quantity -log2 P(X) the surprisal of X instead. However, such writers have formed a very small minority of the authors of the literature on information theory that I have read. I consider the objections to the usage favoured by Dembski to be unreasonable. In fact, it’s a usage I prefer myself.

Information is defined as

IS(X) = Hmax - H(X)

Information in the Shannon sense is defined as the change in entropy before and after a particular event has taken place.

Defined by whom? In the first place, if you want to follow Schneider’s terminology, this should be Hbefore - Hafter, where Hafter is the conditional entropy of the quantities still unknown after the event has occurred, given the quantities whose values became known during the event. The first term of this expression will only be equal to Hmax in the special case when the distribution of the quantities unknown before the event has taken place is uniform. But more to the point is the fact that this definition carries no more weight of authority behind it than the one which Dembski uses. So criticising him for using one and not the other seems to me to be completely unwarranted.
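As a toy illustration of the Hbefore - Hafter reading discussed here (my own sketch, not taken from either commenter):

```python
import math

def entropy(dist):
    # Shannon entropy in bits over a probability distribution.
    return sum(-p * math.log2(p) for p in dist if p > 0)

# Four equally likely two-bit messages: 00, 01, 10, 11.
H_before = entropy([0.25] * 4)  # 2.0 bits unknown before the event
# The 'event': we learn the first bit. Two equally likely messages remain,
# so H_after is the conditional entropy of what is still unknown.
H_after = entropy([0.5, 0.5])   # 1.0 bit still unknown afterwards
print(H_before - H_after)       # 1.0 bit of information gained
```

In this toy case the prior distribution is uniform, so Hbefore happens to equal Hmax; with a non-uniform prior the two would differ, which is exactly the point being made above.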

By conflating information with probability, Dembski has introduced quite a difficulty, namely that information is not what we commonly consider it to mean. Rather than information being a measure of ‘surprise’, information becomes very similar to the concept of entropy. Because he uses probability as an information measure, he is faced with the problem that neither regularity/chance nor intelligent designers can create complex specified information. The definition of information I chose is indeed for a uniform distribution, which is not a bad assumption for initially random distributions and matches Shannon’s usage of these concepts.

See for instance Randomness, Order and Replication

But you are right, the definition can easily be generalized further. In Dembski’s latest opus he is somewhat more careful in his definitions but his usage of ‘self information’ or probability for information generates a lot of confusion and seems self-contradictory.

Pim van Meurs Wrote:

By conflating information with probability Dembski has introduced quite a difficulty namely that information is not what we commonly consider it to mean.

Dembski does not conflate information with probability (not, at least, in any of his writings that I have seen). A logarithm of a probability, which, as you have noted, is what Dembski uses as his definition of information, is not the same thing as a probability. As far as I am aware, Dembski has nowhere conflated these two things. As I also pointed out in my previous post, the definition is one that is commonly used in textbooks on information theory, so the way Dembski uses it seems perfectly reasonable to me.

The one objection I would have to his introduction of the term into The Design Inference, say, is that it is completely gratuitous. The technical arguments in The Design Inference are all about statistical inference and have very little to do with information theory. The one concept from information theory that Dembski introduces—namely, the definition of information—is completely irrelevant to his arguments.

As to the matter of the definition’s not corresponding to “what we commonly consider” “information” to mean, that happens to be a feature of every definition of information used in information theory. The definitions are chosen so that the quantities used as measures of the “amounts of information” produced by message sources will have some of the more obvious properties we intuitively expect them to. As a result, however, we find that they must also have other properties which some people find quite counterintuitive. If you have any objections on this score, you need to take them up with the founders of information theory. Dembski had nothing whatever to do with it.

Rather than information being a measure of ‘surprise’ information becomes very similar to the concept of entropy.

I’m afraid I don’t understand what your objection is here. All writers on information theory whom I have ever seen use the term “surprise” or “surprisal” in a technical sense have always used it to refer to the log of probability. So if you think information should be a “measure of ‘surprise’” (as you appear to do here), then Dembski’s choice of the log of probability is precisely what you want.

The definition for information I chose is indeed for a uniform distribution which is not a bad assumption for initially random distributions and matches Shannon’s usage of these concepts.

See for instance Randomness, Order and Replication

But the quantity R = log τ - H which Prof Lee calls “Shannon information” in this presentation has nothing whatever to do with “change in entropy before and after a particular event has taken place”. In fact, neither does it correspond to anything which Shannon ever gave the name “information” to, and Lee should be shot for calling it “Shannon information”. The quantity log τ appearing in the expression is simply the maximum average rate per symbol at which a source with an alphabet of τ symbols can produce information. The quantity H is the entropy rate per symbol of the source under consideration, or, in other words, the actual average rate per symbol at which it produces information.

As Lee notes in a little inset box on slide 18, Shannon called the quantity R/log τ the redundancy of the source. He had very good reasons for choosing this terminology, and it’s a much better choice than what Lee has decided to replace it with.
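For concreteness, here is a small sketch of these quantities (the alphabet size and symbol probabilities are hypothetical, chosen so the numbers come out cleanly):

```python
import math

def entropy(dist):
    # Entropy rate per symbol, in bits, of a memoryless source.
    return sum(-p * math.log2(p) for p in dist if p > 0)

tau = 4                           # alphabet size of the hypothetical source
dist = [0.5, 0.25, 0.125, 0.125]  # assumed symbol probabilities
H = entropy(dist)                 # actual rate: 1.75 bits per symbol
R = math.log2(tau) - H            # shortfall from the maximum: 0.25 bits
print(R / math.log2(tau))         # the redundancy of the source: 0.125
```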


Erik 12345 Wrote:

Pim van Meurs wrote in the blog entry above:

“ Dembski mentions in NFL that this measure of information is similar to Shannon information. In fact Shannon’s entropy is the average of Dembski’s information measure.”

About Dembski’s definition, David Wilson commented:

“This definition is not peculiar to Dembski. I have seen several texts on information theory which refer to the quantity -log2 P(X) as the amount of information one obtains when one learns that the event X has occurred. It is sometimes called the self-information of X. It is true, as Dembski notes in his latest opus, that the mathematical development of information theory makes very little direct use of this quantity. In textbooks its role seems to be confined to motivating the definition of entropy, and many (perhaps most) don’t even mention it.”

The first statement is, for reasons to be explained below, mathematically ill-defined and the second is partially, but not completely, wrong.

Since you quote two statements by Pim van Meurs, and several by me, it’s not at all clear to me which two of them you are referring to. I assume that “The first” refers to the following statement of Mr van Meurs’s:

In fact Shannon’s entropy is the average of Dembski’s information measure.

It is true that, when taken in isolation, this sentence is ambiguous. However, the fault here is not Mr van Meurs’s, but mine. In extracting the quotation from his original post, I lifted it out of a context where it was unambiguous and accurate. My apologies for the confusion.

Erik 12345 Wrote:

Now, it is true that both Dembski and some information theory textbooks define “information” and “self-information”/”surprisal”, respectively, in an event A as

-log(Pr(A))

The difference between the definitions is in the restrictions on the A’s we are allowed to plug into the formula. This is a difference that most people probably don’t think too much about, but it is a very important difference nonetheless. Dembski allows us to plug in any event A. Information theory textbooks, on the other hand, require us to first partition the possible outcomes into non-overlapping events A1, A2, …, An. The events we are allowed to plug into the formula for “self-information”/”surprisal” are then restricted to one of these partitions.

I categorically dispute these last two statements. For the purposes of defining “information”/”self-information”/”surprisal” or whatever else you want to call -log(Pr(A)), no restriction on the event A is necessary other than that it belong to the domain of definition of the probability measure Pr, and I have never come across any information theory text which imposes one. If you think otherwise, I invite you to cite such a text.

Erik 12345 Wrote:

This restriction is crucial, because without it, it is meaningless to speak about average “self-information”/”surprisal” (note that Dembski makes no such restriction, and it is consequently meaningless to speak about the average of his information). We can meaningfully average over all outcomes or over all partitions, but not over all events.

I assume that by “over all partitions” here you really meant “over all events in a partition”. It makes no more sense to average over all partitions than it does to average over all events.

Yes, of course, if you’re going to talk about an average of self-information then you will have to specify the set of events over which you’re taking the average. But I’m afraid I can see no reason why this should require you to place any restrictions at all on the events for which you are allowed to define self-information.

Until the recent appearance of his latest opus Dembski has never, as far as I know, tried to take an average of his “information”, so he has never needed to specify a partition over which to take it. In his latest paper, just referred to, where he does take such an average, it is, as far as I can see, perfectly well-defined.

Erik 12345 Wrote:

… Dembski’s definition is NOT the same as the “self-information”/”surprisal” introduced in some presentations of information theory.

I am afraid I am still mystified as to why you think this. Here is Dembski’s definition of “measure of information” as given on page 127 of No Free Lunch:

Thus, we define the measure of information in an event of probability p as -log2 p.

Please tell me how this differs in any essential respect from the definition of self-information as given, for instance, in the entry on Information Theory in the Encyclopedic Dictionary of Mathematics published by MIT press.

I Wrote:

“In fact Shannon’s entropy is the average of Dembski’s information measure.”

It is true that, when taken in isolation, this sentence is ambiguous. However, the fault here is not Mr van Meurs’s, but mine. In extracting the quotation from his original post, I lifted it out of a context where it was unambiguous and accurate. My apologies for the confusion.

Oops. On rereading my original post, I find I have not quoted this text at all. I had actually snipped it out of my quotation before posting, so I don’t need to assume any blame for whatever problems Erik 12345 has with it.

Given that the statement appears a couple of paragraphs before Mr van Meurs gives the equations that make sense of it, I would replace “the average” with “an average”. But apart from that, I can’t see anything much wrong with it.

David Wilson Wrote:

I categorically dispute these last two statements. For the purposes of defining “information”/”self-information”/”surprisal” or whatever else you want to call -log(Pr(A)), no restriction on the event A is necessary other than that it belong to the domain of definition of the probability measure Pr, and I have never come across any information theory text which imposes one. If you think otherwise, I invite you to cite such a text.

My local library is closed until Monday, so I can’t cite any textbook right now. What I can do is recall how logarithms of probabilities arise in information theory (by which I mean communication theory):

The goal of information theory is to take a probabilistic description of data and design a way to encode the data so that it can be reliably and efficiently transmitted, stored and decoded. In the simplest case, we have a partition {A1, A2, …, An} of the possible outcomes, and the objective is to associate a codeword with each event Ak. This encoding should satisfy the following demands:

(i) Each codeword must be unique.
(ii) No codeword can be a prefix of any other.
(iii) The average codeword length should be minimal, subject to the constraints (i) & (ii).

If we denote the number of binary digits in the codeword assigned to Ak by L(Ak), then the constraints (i) & (ii) turn out to be equivalent to the constraint

(*) 2^-L(A1) + 2^-L(A2) + … + 2^-L(An) ≤ 1

while the quantity to be minimized, the average codeword length, is

L(A1) Pr(A1) + L(A2) Pr(A2) + … + L(An) Pr(An).

Minimizing the average codeword length subject to (*), but ignoring the fact that the codeword lengths are integers, one finds that the optimal code should assign codewords with lengths

L(Ak) = -log(Pr(Ak)).

With this choice, the average codeword length is exactly the Shannon entropy of the partition. Since codeword lengths are necessarily integers, the Shannon entropy is really only a lower bound on the average codeword length, and the true integer-valued optimal codeword lengths can be quite different from -log(Pr(Ak)). However, there always exists a nearly optimal code that assigns no more than -log(Pr(Ak))+1 binary digits to Ak. This is how logarithms arise in information theory.
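The derivation above can be checked numerically; this sketch uses an arbitrary four-event partition of my own choosing:

```python
import math

# Hypothetical probabilities for a partition {A1, A2, A3, A4}.
probs = [0.4, 0.3, 0.2, 0.1]
H = sum(-p * math.log2(p) for p in probs)  # Shannon entropy of the partition

# Nearly optimal integer codeword lengths: ceil(-log2 Pr(Ak)) per event.
lengths = [math.ceil(-math.log2(p)) for p in probs]

# A Kraft sum <= 1 guarantees a prefix-free code with these lengths exists.
kraft = sum(2.0 ** -L for L in lengths)
print(kraft)  # 0.6875, so constraint (*) is satisfied

avg_len = sum(p * L for p, L in zip(probs, lengths))
print(H, avg_len)  # the average length lies between H and H + 1
```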

In misguided attempts to present the above derivation more pedagogically, some authors have named the quantity -log(Pr(Ak)) “self-information” or “surprisal”. And the operational meaning of the Shannon entropy of a partition—the very reason for its introduction in information theory—has been obscured by the omission of the above optimization problem and its solution. Only in the context of the above optimization problem is it legitimate to attach the unit “binary digits” to the logarithms -log(Pr(Ak)), because only in this context do the logarithms represent codeword lengths.

It should be clear from all this that “self-information”/”surprisal” must be defined relative to a partition and that it would lose all its meaning if this relativization was dropped. We may infer from the fact that “self-information”/”surprisal” is given the unit “binary digits” or “bits” that it is restricted to a partition. (It may be remarked that Dembski also likes to give his logarithms the unit “bits”, but since Dembski clearly does not restrict attention to a partition, we should instead infer that he is unaware of the necessity of such a restriction or that he uses the word “bits” for purely rhetorical purposes.)

David Wilson Wrote:

I assume that by “over all partitions” here you really meant “over all events in a partition”. It makes no more sense to average over all partitions than it does to average over all events.

Yes. English is not my native language and for some reason I have great difficulty remembering whether “a partition” is a set of mutually exhaustive and disjoint events or an element of this set.

David Wilson Wrote:

Yes, of course, if you’re going to talk about an average of self-information then you will have to specify the set of events over which you’re taking the average. But I’m afraid I can see no reason why this should require you to place any restrictions at all on the events for which you are allowed to define self-information.

Reason #1: The “self-information”/”surprisal” is introduced for the single purpose of being averaged. Reason #2: Only with the restriction to events from a given partition is it legitimate to attach the unit “binary digits” to the “self-information”/”surprisal”. In information theory, that is the unit attached to it.

David Wilson Wrote:

Until the recent appearance of his latest opus Dembski has never, as far as I know, tried to take an average of his “information”, so he has never needed to specify a partition over which to take it. In his latest paper, just referred to, where he does take such an average, it is, as far as I can see, perfectly well-defined.

I agree that Dembski hasn’t made that mistake. He doesn’t take such an average in his paper on “variational information” — the average he does take (an average of the squared ratio of two probability distributions/measures) is, as you note, well-defined, though. The unit of “generalized binary digits”, on the other hand, is pretty bogus.

David Wilson Wrote:

Please tell me how this differs in any essential respect from the definition of self-information as given, for instance, in the entry on Information Theory in the Encyclopedic Dictionary of Mathematics published by MIT press.

Quote the definition and tell me what unit, if any, is attached to the quantity and if codes/codewords are discussed. Then I’ll be able to tell you if it is different from Dembski’s definition (which is inspired by the philosophical literature, where one is uninterested in good codes and interested in the properties of “information”, such as additivity for independent events, and its relation to knowledge).


For what it’s worth, here’s a link to something I wrote on the subject a while ago:

http://www.talkorigins.org/design/f[…]nfl/#shannon

I should add that I don’t claim to be any expert on this subject.


Dr. 12345: There is an evil undocumented feature in the preview script that can completely change the meaning of a comment.

Yes. I’ve noticed that. I think the evil lurks in the “less than” sign. I think it works if you don’t preview, but once you do, it aborts everything that follows.

I Wrote:

Erik 12345 wrote:

“… Information theory textbooks, on the other hand, require us to first partition the possible outcomes into non-overlapping events A1, A2, … , An. The events we are allowed to plug into the formula for “self-information”/”surprisal” are then restricted to one of these partitions.”

I categorically dispute these last two statements. For the purposes of defining “information”/”self-information”/”surprisal” or whatever else you want to call -log(Pr(A)), no restriction on the event A is necessary other than that it belong to the domain of definition of the probability measure Pr, and I have never come across any information theory text which imposes one. If you think otherwise, I invite you to cite such a text.

My memory played me false here, and I’ll have to eat some humble pie. On browsing through a selection of expositions on information theory I find that there are indeed plenty of them which do only define “self-information” for events restricted to lie within particular partitions, including the Encyclopedic Dictionary of Mathematics, which I had firmly believed did not. It would be a slight exaggeration to say that they “imposed a restriction” on the definition, in the sense of sermonising against any generalisation of it to cover all events, as Erik has done. Nevertheless, it is still a fact that they did not find it necessary to adopt that generalisation themselves.

That said, the assertion of Erik’s which I have requoted above is also false. I was easily able to find plenty of texts on information theory which did not “require us to first partition the possible outcomes into non-overlapping events A1, A2, … , An.” These texts did give definitions which allowed any event to be plugged into the formula. I’ll give a list of some of these texts later on tonight (Oz time) and quote one of the definitions for Erik’s analysis of how it differs (or not) from Dembski’s.


David Wilson Wrote:

Here is the list of texts:

Amiel Feinstein, Foundations of Information Theory, McGraw-Hill, New York, 1958, p. 2

Norman Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963, p. 12

J. Aczél and Z. Daróczy, Measures of Information and Their Characterizations, Academic Press, New York, 1975, p. 71ff

Richard W. Hamming, Coding and Information Theory, 2nd edn., Prentice-Hall, Englewood Cliffs, N.J., 1986, p. 104ff

Klaus Krippendorff, Information Theory: Structural Models for Qualitative Data, Sage, Beverly Hills, 1986, p. 14

John B. Anderson and Seshadri Mohan, Source and Channel Coding: An Algorithmic Approach, Kluwer Academic Publishers, Boston, 1991, p. 5

Michael A. Nielsen and Isaac L. Chuang, Quantum Computation and Quantum Information, Cambridge University Press, Cambridge, 2000, pp. 501-502

Feinstein’s text does not belong in your list, since he explicitly restricts attention to partitions.

Abramson’s text, however, does belong in your list. I find his distinction between a “bit” and a “binit” unclear. He writes: “We note, also, that if P(E) = 1/2, then I(E) = 1 bit. That is, one bit is the amount of information we obtain when one of the possible equally likely alternatives is specified. Such a situation may occur when one flips a coin or examines the output of a binary communication system.” (p. 13, emphasis in original)

I find it tempting to interpret this as saying that the event E is restricted to one of the events in a partition of the sample space into two halves. However, all things considered, I’m inclined to interpret Abramson’s introduction of I(E) as not relativized on a partition of the sample space. I suppose that Abramson can avoid my factual objection (e.g. that there can be infinitely many events of probability 1/2 and that it is absurd to simultaneously associate each of these with its own binary digit) by distinguishing a binary unit (a “bit” in Abramson’s terminology) from a binary digit (a “binit” in Abramson’s terminology). This would avoid a factual error at the price of irrelevance, for what good is a binary unit in those situations when it doesn’t correspond to a binary digit? Communication theory is about binary digits. Abramson, like many other textbook authors (e.g. Cover & Thomas write something to the effect that it is irresistible to play around with axioms for “information measures” and take on faith that what results is actually useful), does make clear that the considerations discussed during the introduction of the definitions are unrelated to the actual justification for the definitions.

To be clear, I concede that some textbooks on information theory do not relativize what we here call “self-information” to a choice of partition. What I don’t agree with is that this is a good idea or that those who follow it should not be criticized. It is perverse to introduce definitions via discussions that are irrelevant to the subject at hand. And one should distinguish between stuff that is introduced only as a result of these perversions and stuff that is actually of use in information theory.
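To keep the terms of this dispute concrete: “self-information” is -log p, and the Shannon entropy of a partition is the probability-weighted average of the self-information of its events. A minimal Python sketch (the partition probabilities are made up for illustration):

```python
import math

def self_information(p, base=2):
    """Self-information -log(p) of an event with probability p, in bits by default."""
    return -math.log(p, base)

# A partition of the sample space into non-overlapping events with
# probabilities summing to 1 (example values are made up).
partition = [0.5, 0.25, 0.25]

# Shannon entropy: the probability-weighted average of the
# self-information of the events in the partition.
entropy = sum(p * self_information(p) for p in partition)
print(entropy)  # 1.5 bits
```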

David Wilson Wrote:

These questions are obviously rhetorical, but I’m not sure I understand what you’re driving at. If you’re suggesting that my comment about “sermonising” was gratuitous, then you’re right, and I apologise. “Post in haste, repent at leisure” as they say.

There is no need for apologies, as my question was neither rhetorical nor an irrational reaction to the word “sermonise”.

I don’t know whether you agree that the quantity -log(Pr(E)) is of relevance only when E is taken from some partition of the sample space. By asking what you make of the absence of sermons against, say, definitions of -log(x) as the “information” associated with the positive real number x (whether a probability or not), I am simply trying to get you to either embrace the absurdity or commit to some relevance demands (which will make it hard to retain -log(Pr(E)) for arbitrary events, since it is quite irrelevant).

David Wilson Wrote:

But it seems to me that Feinstein himself has already assumed that his I-function can take probabilities of arbitrary events (at least of ones of non-zero probability) as arguments. He states explicitly that I( ) is defined over the interval (0, 1].

Arbitrary probabilities of events from a partition, not probabilities of arbitrary events!

David Wilson Wrote:

If I(px) is only defined for events belonging to a partition, how does he manage to prove that I( ) must be proportional to a negative logarithm? As far as I am aware this can’t be done without somewhere using the identity I(px∩y) = I(px) + I(py) for some independent events x, y of positive probability less than 1. That identity does not make sense unless I( ) is at least defined for the probabilities of the three events x, y, and x∩y, which cannot possibly form a partition.

Your identity is not used by Feinstein. He uses the standard axioms, where the Shannon entropy of the partition {x1,…,xn} is expressed recursively like this:

H(x1,…,xn) = H(x1, x1’) + (1-px1)H(x2,…,xn | x1’),

where x1’ is the complement of x1 (the union of x2,…,xn) and the second term is the entropy of a partition of a smaller probability space obtained by conditioning on the event x1’. This, of course, imposes demands on the I-function. You could still point out that expanding the r.h.s. gives an expression involving the I-function of x1, x2, …, and x1’. These events could not possibly be part of the same partition, but the point is that the I-function is evaluated only as a means to determine entropies of partitions. The recursive formula relates the Shannon entropies of different partitions and therefore also the I-functions for events from different partitions. But this does not prevent the I-function from being relativized on a fixed, but arbitrary, partition any more than it prevents the Shannon entropy from being relativized in the same way.
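The recursion can be checked numerically. A minimal sketch (the partition probabilities are made up for illustration):

```python
import math

def H(probs):
    """Shannon entropy (in bits) of a partition with the given probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The grouping recursion: split off x1, then weight the entropy of the
# remaining events, conditioned on the complement x1', by Pr(x1').
p = [0.5, 0.25, 0.125, 0.125]          # made-up partition probabilities
p1 = p[0]
cond = [q / (1 - p1) for q in p[1:]]   # probabilities given x1'
lhs = H(p)
rhs = H([p1, 1 - p1]) + (1 - p1) * H(cond)
print(lhs, rhs)  # 1.75 1.75
```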

Dembski, in contrast, has introduced his “information” via the philosopher’s argument that -log(Pr(E)) is the only continuous function that is decreasing in Pr(E) and additive for independent events. That is, Dembski does not express his requirements of “information” in terms of the Shannon entropy of a partition.

David Wilson Wrote:

Chacun à son goût. I happen to like this approach. I certainly think it helped me gain a better understanding of why something like mutual information, for instance, should be defined in the way it is. Nevertheless, the only purpose it ever serves in communication theory seems to be pedagogical. So if it really is a large contributing cause to misunderstandings that would certainly be a good reason to avoid using it for pedagogical purposes.

Which genuine insights about mutual information did you gain from this approach that would have required a larger effort if you had only had access to the communication theoretic context (the average number of binary digits by which the optimal average codeword length can be reduced by exploiting that the value of another stochastic variable is known) and the statistical context (a viable test statistic in tests for independence)?

It probably isn’t a cause for misunderstandings among those who read textbooks themselves, but it is, I infer, a contributing cause of misleading popular and interdisciplinary presentations. At the popular level, there seems to be a widespread misconception that you’re doing information theory simply by computing logarithms of probabilities. At the interdisciplinary level, there seems to be a lack of interest in distinguishing the use of Shannon entropies and Kullback-Leibler divergences for data coding, for obtaining Bayesian a priori-probabilities via the Maximum Entropy Principle, and as test statistics in non-Bayesian hypothesis testing. The functions used in information theory seem to have acquired an air of authority that makes them the unquestioned choice even in applications outside the field for which they were introduced.
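The statistical context mentioned earlier can be sketched in a few lines: mutual information vanishes for independent variables, and scaled by 2N it is the G statistic in a test for independence (both example tables below are made up):

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats for a joint distribution given as a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

# Independent variables: mutual information is zero.
indep = [[0.25, 0.25], [0.25, 0.25]]
assert abs(mutual_information(indep)) < 1e-12

# Counts from a made-up 2x2 contingency table with N observations: the
# G statistic for an independence test is 2 * N * I(X;Y) of the
# empirical joint distribution (with I measured in nats).
counts = [[30, 10], [10, 50]]
N = sum(sum(row) for row in counts)
joint = [[c / N for c in row] for row in counts]
G = 2 * N * mutual_information(joint)
print(G)  # a large G is strong evidence against independence
```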

David Wilson Wrote:

However, the fact remains that the definition Dembski has used for “information” is well-established in the information theory literature and as far as I can see he has accurately reproduced it. My opinion remains that criticism of him merely for doing this is unreasonable.

My opinion is that the reproduction of discussions, similar to those which some authors include for purely pedagogical reasons, as if they were the actual basis for information theory is evidence of a lack of understanding that is a legitimate target for criticism. This applies especially to mathematicians like Dembski, who should be able to figure out that the criterion for doing information theory has little to do with log-transforming probabilities per se and everything to do with data encoding.
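The connection to data encoding can be illustrated with Shannon code lengths. A minimal sketch with made-up source probabilities: lengths ⌈-log2 p⌉ satisfy the Kraft inequality, so a prefix code with those lengths exists, and its expected length lands within one bit of the entropy. It is this coding fact, not the log formula on its own, that gives -log p an operational meaning.

```python
import math

# Shannon code lengths l(x) = ceil(-log2 p(x)) for a made-up source.
p = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(-math.log2(q)) for q in p]

# Kraft inequality: a prefix code with these lengths exists.
kraft = sum(2.0 ** -l for l in lengths)
assert kraft <= 1

# The expected codeword length falls within one bit of the entropy.
entropy = -sum(q * math.log2(q) for q in p)
avg_len = sum(q * l for q, l in zip(p, lengths))
assert entropy <= avg_len < entropy + 1
print(lengths, round(entropy, 3), avg_len)
```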

Russel Wrote:

Dr. 12345: There is an evil undocumented feature in the preview script that can completely change the meaning of a comment.

Yes. I’ve noticed that. I think the evil lurks in the “less than” sign. I think it works if you don’t preview, but once you do, it aborts everything that follows.

Yeah, I reached the same conclusion.

Just for the record, to aid those who evaluate my posts by weighing my authority against the authority of my opponents, I’ll note that I don’t have a PhD in any field.

David Wilson Wrote:

Incidentally, his “variational information” is not “additive” as he claims on page 9. His alleged proof contains a couple of errors. I posted a counterexample to the claim on talk.origins a couple of weeks ago. Here it is in a slightly more readable form.

I understand that the Radon-Nikodym derivatives w.r.t. the counting measure c are

dμ/dc = f = (1/4, 1/4, 1/4, 1/4)
dν/dc = g = (1/5, 1/5, 1/5, 2/5)

since summing f(x) and g(x) for all x in an event A gives μ(A) and ν(A), respectively. This gives me dμ/dν as f(x)/g(x), since summing such terms weighted by ν gives the measure μ. But how did you determine dμ1/dν?

Anyway, although I don’t understand all the ingredients in your counterexample, I think your conclusion is right. It seems to me that some modification is required, such as requiring the reference measure ν to factorize in the same manner as μ.
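On this four-point space the claim that dμ/dν is the pointwise ratio f/g can be checked exhaustively. A minimal sketch in exact fractions (the point labels are mine):

```python
from fractions import Fraction as F
from itertools import chain, combinations

# Densities w.r.t. the counting measure on the four-point space
# {00, 01, 10, 11} (point labels are mine).
f = {'00': F(1, 4), '01': F(1, 4), '10': F(1, 4), '11': F(1, 4)}  # dmu/dc
g = {'00': F(1, 5), '01': F(1, 5), '10': F(1, 5), '11': F(2, 5)}  # dnu/dc

# Candidate Radon-Nikodym derivative dmu/dnu: the pointwise ratio f/g.
d = {x: f[x] / g[x] for x in f}

# Verify the defining property mu(A) = integral over A of (dmu/dnu) dnu
# for every event A, i.e. every subset of the space.
points = sorted(f)
for A in chain.from_iterable(combinations(points, r) for r in range(5)):
    assert sum(f[x] for x in A) == sum(d[x] * g[x] for x in A)

print(d['00'], d['11'])  # 5/4 5/8
```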

Dr. 12345: “Just for the record, to aid those who evaluate my posts by weighing my authority against the authority of my opponents, I’ll note that I don’t have a PhD in any field.”

Yes, and Dembski has two, which speaks volumes about the significance of “PhD”. I’m going to continue to think of “doctor” in its etymological sense (“teacher”).

I gave up trying to follow the Wilson - 12345 discussion. It’s over my head. But can we summarize for the masses? As I understand it DW said that one or a few or some of the technical indictments of Dembski’s work are unwarranted. Much discussion between DW and E1 later: sort of yes, sort of no.

Big picture now: are there any mathematicians reading this who find Dembski’s arguments, specifically with respect to biology, compelling?

I find his understanding of biology so ludicrous that I’m not strongly motivated to educate myself on the mathematical legerdemain he uses to rationalize it.

(He reminds me of a math prof at my college who had a “mathematical proof” that all numbers are equal to 47 (our school’s numerical mascot). Only that that prof knew it was a joke.)

I don’t think it reflects poorly on the PhD degree. I have known many science PhDs, and they are all very intelligent. Unfortunately, intelligence is not always homogeneously distributed throughout someone’s range of thinking. Some people are smart in everything they do, some are a little smarter in some things than others, and some people are intelligent in some respects and crazy in others. There’s no doubt Kurt Gödel was among the brightest all-time logicians. And bright at math, bright at physics. Einstein enjoyed talking physics with him. Yet he was also somewhat crazy. He died of starvation because he thought everyone was out to poison him. Lots of people are bright at some things, crazy at others. You can have a brilliant mathematician who thinks communism’s a good idea. A brilliant journalist who thinks Sun Myung Moon is the second coming. It seems like especially on religious topics, some bright people can turn off their minds and keep believing nonsense. Like Shermer said, it’s not that smart people are without stupid beliefs, but they’re really good at coming up with justifications.

Steve:

I’m certainly not suggesting that a PhD is negatively correlated with having worthwhile thoughts to share. I am proposing, though, that you can garner any number of degrees without ever having a worthwhile thought to share.

In the case of our ID friends, it may be that there’s some Gödel-like island of competence that I’m not aware of. (Well, rhetoric. I’d have to grant they’re good at that.)

But worthwhile thoughts?

Same difference :-)

(to use an oxymoron, like Loving God (w/r/t the biblical one))

Erik 12345 Wrote:

Feinstein’s text does not belong in your list, since he explicitly restricts attention to partitions.

I disagree. But since there are still plenty of other texts on the list I can’t be bothered arguing the point any further.

I Wrote:

“… It would be a slight exaggeration to say that they “imposed a restriction” on the definition, in the sense of sermonising against any generalisation of it to cover all events, as Erik has done. …”

Erik 12345 replied:

“Did they sermonize against a generalization of the formula -log(Pr(A)) from probabilities of partitions to any positive number (whether probabilities or not)? If not, what do you make of the absence of such sermons?”

Me:

“These questions are obviously rhetorical, but I’m not sure I understand what you’re driving at. If you’re suggesting that my comment about “sermonising” was gratuitous, then you’re right, and I apologise. “Post in haste, repent at leisure” as they say.”

Erik 12345:

“There is no need for apologies, as my question was neither rhetorical nor an irrational reaction to the word “sermonise”.

… By asking what you make of the absence of sermons against, say, definitions of -log(x) as the “information” associated with the positive real number x (whether a probability or not), I am simply trying to get you to either embrace the absurdity or commit to some relevance demands (which will make it hard to retain -log(Pr(E)) for arbitrary events, since it is quite irrelevant).”

Ok, what I was missing was your intention of referring to an association of the word “information” with -log(x). The difference between such an association and that of the word “information” with -log(Pr(E)) is that (as far as I know) no one has ever attempted to make it, and I can see no reason why anyone should want to. I can thus also see no reason why it should even occur to anyone to object to it.

However, as documented by my list of references, there are plenty of writers on information theory who do adopt a definition similar to Abramson’s without imposing any restrictions on the event E, and most of whom attempt to justify it by appealing to conditions they claim one might intuitively expect such a definition to satisfy. I don’t find that absurd at all.

Erik 12345 Wrote:

I don’t know if you agree with that the quantity -log(Pr(E)) is of relevance only when E is taken from some partition of the sample space. …

No, I don’t. But even if I did, any event E always belongs to some partition of the sample space (if nothing else, it belongs to the partition { E, E’}, where E’ is its complement). Moreover, while it only makes sense to take averages of -log(Pr(E)) over partitions of the sample space, there is no reason why such a partition can’t be completely arbitrary. So, if you’re going to give the quantity -log(Pr(E)) a name at all, there seems to me to be no point imposing restrictions on the events E for which the name will be defined.

Erik 12345 Wrote:

Which genuine insights about mutual information did you gain from this approach that would have required a larger effort if you had only had access to the communication theoretic context (the average number of binary digits by which the optimal average codeword length can be reduced by exploiting that the value of another stochastic variable is known) and the statistical context (a viable test statistic in tests for independence)?

That’s impossible for me to judge, since I have never seen an exposition which attempts to motivate the definition of mutual information by appealing to either of the items you mention. In all of the expositions I have seen, the definition is motivated by giving some argument that it represents the average amount of “information” which one random variable provides about another. It is only after the definition has been made that theorems relating it to coding rates are then proved.

I agree however that it is the theorems on coding rates which do provide the ultimate justification for the definition. Heuristic arguments for motivating a definition are all well and good, but unless you can get around to doing something useful with it, they don’t amount to much.

Erik 12345 Wrote:

My opinion is that the reproduction of discussions, similar to those which some authors include for purely pedagogical reasons, as if they were the actual basis for information theory is evidence of a lack of understanding that is a legitimate target for criticism. This applies especially to mathematicians like Dembski, who should be able to figure out that the criterion for doing information theory has little to do with log-transforming probabilities per se and everything to do with data encoding.

I guess we are mostly in agreement on this point. The only use Dembski seems to make of the definition is to wave it around as an excuse for claiming he is doing “information theory”. However, my criticism would not be that he has chosen a wrong definition of “information”, but that he hasn’t done anything useful or interesting with it. I am also very sceptical that he can do anything useful or interesting with it. I am, however, willing to be convinced otherwise.

Erik 12345 Wrote:

I understand that the Radon-Nikodym derivatives w.r.t. the counting measure c are

dμ/dc = f = (1/4, 1/4, 1/4, 1/4)
dν/dc = g = (1/5, 1/5, 1/5, 2/5)

since summing f(x) and g(x) for all x in an event A gives μ(A) and ν(A), respectively. This gives me dμ/dν as f(x)/g(x), since summing such terms weighted by ν gives the measure μ. But how did you determine dμ1/dν?

I was assuming that by dμi/dν Dembski actually meant dμi/dνi, where νi is the restriction of ν to the σ-algebra Ai. The standard definition of Radon-Nikodym derivative requires the two measures to be defined on the same σ-algebra. If μ and ν are defined on the σ-algebra A, and μ is absolutely continuous with respect to ν, the Radon-Nikodym derivative dμ/dν is the unique ν-integrable function which satisfies the equation:

μ(S) = ∫_S (dμ/dν) dν

for all events S in A. If μ and ν are defined on different σ-algebras, A1 and A2 say, then the above equation only makes sense if S belongs to both A1 and the ν-completion of A2. And then for dμ/dν to be uniquely defined, it has to be measurable with respect to the ν-completion of A1 ∩ A2. In the case of my counterexample, requiring dμi/dν to be Ai-measurable means that dμ1/dν has to be constant on the sets {00,01} and {10,11} while dμ2/dν has to be constant on the sets {00,10} and {01,11}. Applying the above equation for these functions and sets gives dμi/dν(x) = μ(S)/ν(S) for all x in any set S on which dμi/dν is required to be constant.
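This computation can be checked directly for the σ-algebra generated by the first bit. A minimal sketch in exact fractions (the point labels are mine):

```python
from fractions import Fraction as F

# The counterexample's measures on {00, 01, 10, 11}: mu uniform,
# nu = (1/5, 1/5, 1/5, 2/5) (point labels are mine).
mu = {'00': F(1, 4), '01': F(1, 4), '10': F(1, 4), '11': F(1, 4)}
nu = {'00': F(1, 5), '01': F(1, 5), '10': F(1, 5), '11': F(2, 5)}

# A1 is generated by the first bit, so dmu1/dnu must be constant on
# {00,01} and {10,11}; on each such set S its value is mu(S)/nu(S).
blocks = [('00', '01'), ('10', '11')]
d1 = {}
for S in blocks:
    mu_S = sum(mu[x] for x in S)
    nu_S = sum(nu[x] for x in S)
    for x in S:
        d1[x] = mu_S / nu_S

# Check mu(S) = integral over S of (dmu1/dnu) dnu on the generating sets.
for S in blocks:
    assert sum(mu[x] for x in S) == sum(d1[x] * nu[x] for x in S)

print(d1['00'], d1['10'])  # 5/4 5/6
```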

Erik 12345 Wrote:

Anyway, although I don’t understand all the ingredients in your counterexample, I think your conclusion is right. It seems to me that some modification is required, such as requiring the reference measure ν to factorize in the same manner as μ.

Yes, the result does hold in that case. The proof is not all that difficult, but I have to admit it gave me a lot more trouble than it should have.