# Shannon entropy applied

#### Introduction to Shannon entropy

Motivated by what I perceive to be a deep misunderstanding of the concept of entropy I have decided to take us onto a journey into the world of entropy. Recent confusions as to how to calculate entropy for mutating genes have will be addressed in some detail.

I will start with referencing Shannon’s seminal paper on entropy and slowly expand the discussion to include the formulas relevant for calculating the entropy in the genome.

But first some warnings

Okay, now that I have given these useful warnings lets have a look Shannon’s paper.

“A Mathematical Theory of Communication”, Reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656, July, October, 1948. By C. E. SHANNON

Suppose we have a set of possible events with probability the entropy for such a situation is defined by:

I will apply this formula to a simplified example in which there are two events with probability p and q where q=(1-p) since the sum of the probabilities adds up to 1. Theh entropy in the case of two possibilities with probabilities p and q = 1 is given by:

Figure 1: Entropy for a system with two states. The horizontal axis describes the probability p and the vertical axis the entropy as calculated using the formula above

Some important observations:

• Entropy is zero when probability for p=0 or p=1
• Entropy is maximum when p=q=0.5

Observation 1 can be expressed more generally by observing that if the probability for any one of the variables is 1 that the entropy is minimal.

Observation 2 can be made more universal by observing that the entropy for n variables is maximum when or in other words, a uniform distribution.

#### Extending Shannon entropy to the genome

Various people have taken the work by Shannon and applied it, quite succesfully, to the genome.

Tom Schneider’s website contains a paper called evolution of biological information which shows how to apply Shannon entropy to entropy calculcations for the genome. In addition Schneider provides us with a more in-depth introduction into Information theory.

For those interested in how to correctly apply entropy calculations to the genome, I refer to the excellent introduction by Tom Schneider.

Schneider has succesfully applied Shannon entropy to identify and predict binding sites

Some successes include: N. D. Herman and T. D. Schneider, High Information Conservation Implies that at Least Three Proteins Bind Independently to F Plasmid incD Repeats, J. Bact., 174, 3558-3560, 1992.

T. D. Schneider and G. D. Stormo, Excess Information at Bacteriophage T7 Genomic Promoters Detected by a Random Cloning Technique, Nucl. Acids Res., 17, 659-674, 1989.

Schneider’s 1997 paper Information content of individual genetic sequences published in J. Theor. Biol. 189(4): 427-441, 1997 shows an example of application of Shannon entropy.

Now we can extend this using Adami’s 2002 paper Adami C. (2002) “What is complexity?”, BioEssays 24, 1085-1094.

The entropy of an ensemble of sequences X, in which sequences occur with probabilities is calculated as

where the sum goes over all different genotypes i in X. By setting the base of the logarithm to the size of the monomer alphabet, the maximal entropy (in the absence of selection) is given by , and in fact corresponds to the maximal information that can potentially be stored in a sequence of length L. The amount of information a population X stores about the environment E is now given by:

The entropy of an ensemble of sequences is estimated by summing up the entropy at every site along the sequence. The per-site entropy is given by:

for site j, where denotes the probability to find nucleotides i at position j. The entropy is now approximated by the sum of per-site entropies:

so that an approximation for the physical complexity of a population of sequences with length L is obtained by

From this it should be obvious that is zero when a particular mutation becomes fixated in the genome since one of the probabilities will be 1 and the others zero.

For those who wonder:

Schneider has some interesting results which show what happens to the entropy in the genome with and No selection. It should be obious that, not surprisingly, the information remains constant around zero.

Figure 2 (from Schneider’s talk): Notice how the information in the genome increases under selection but disappears once selective constraints are removed. Once selection is added, the information quickly approaches the theoretical values

#### From Jerry’s to Shannon

Jerry’s entropy using the following transformations

For the human genome we look at a particular location and observe over many samples and we find , , and , we calculate the total number of states to be:

using Stirling’s approximation we find

where

and

and

and

QED

#### Some relevant papers

“Significantly Lower Entropy Estimates for Natural DNA Sequences”, David Loewenstern, Journal of Computational Biology (under revision)

Abstract Wrote:

Abstract: If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap by more than five-fold (1.66) and may lead to better performance in microbiological pattern recognition applications.

One of our main contributions, and the principle source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts which are then combined into a single prediction. The structure of this combination is novel and its parameters are learned using Expectation Maximization (EM).

Experiments are reported using a wide variety of DNA sequences andcompared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches, are implemented and experimentally compared.

We also report lower entropy estimates for coding regions extracted from a large collection of non-redundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: i) predict the next amino acid based on inexact polypeptide matches, and ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of non-redundant coding sequences.

Entropy in biological sciences

Using Molecular Evolution and Information Theory to Improve Drug

Complexity Club at Caltech

In this handout the same formula is derived starting from Boltzmann entropy

S_B= k_b log W

Where W refers to the number of microstates.

The entropy can also be related to the probability distribution that the system is in the ith state, written P_i.Consider a large system consisting of N_v subsystems. These subsytems might be particles, if we are interested in excitations in a crystal, or subvolumes if we are interested dilute gas particles in a box. Each of the subsytems can be in any one of several states, e.g. quantum states (particles) or occupied/unoccupied (subvolumes). The total number of ways of arranging N_v indistinguishable particles in one of two states is a result from the Binomial distribution

W= N_v! /(N_v-n_1)!n_1!

where N_v −n_1 is the number in the zeroth state and n_1 is the number in the first state.

For an arbitrary number of states this becomes:

W = N_v!/(n_1!n_2!n_3!…) and using the Stirling approximation we recover

S = - k_b Sum p_i log p_i

Simple. Well not really and this is where people tend to incorrectly apply entropy.

The following points cannot be emphasized forcefully enough: 1. Every quantity to which we can associate a probability distribution has its own Shannon entropy. Every time we see the term “Shannon entropy” we must therefore ask “Shannon entropy of what, i.e. to which quantity does this Shannon entropy refer?”. 2. The Shannon entropy of a quantity is only as meaningful as the probabilities used to compute it. In the case when the quantity is a gene sequence, the probabilities may represent the researcher’s subjective knowledge about the gene sequence of a particular individual. Another possibility is that the probabilities represent the relative frequencies with which particular gene sequences occur in a given population. 3. There is nothing magical about weighted sums of logarithms that make them automatically relevant to the application you are interested in! If there is to be any point in computing the Shannon entropy of a quantity, one must first figure out its significance for the particular application. For instance, communication engineers study the Shannon entropy of messages because it provides a lower bound on the number of data symbols that must be transmitted for messages to be fully recoverable at the receiving end. Proponents of Maximum Entropy Inference, on the other hand, take interest in Shannon entropies for a completely different reason, namely because they think the right way translate knowledge into probabilities is to choose, of all probability distributions consistent with what is known, the one that maximizes the Shannon entropy. It is unfortunate that the theoretical framework used by communication engineers, the philosophy of Maximum Entropy Inference, as well as other kinds of frameworks all claim to be named “Information Theory”.

The last point should be reinforced by the existence of (infinitely many!) generalizations of the expression for the Shannon entropy. One such generalization is the Tsallis entropy. The Tsallis entropy of the quantity X, with the associated probability distribution p, is defined as

S(X; q) = (1 - SUM p(k)^q) / (q - 1).

With the choice q = 1 we recover the Shannon entropy of X (the expression looks mathematically nasty, since we divide by zero, but the limit is well-defined and equal to the Shannon entropy). Other choices of the value of q give other “entropies”. Interdisciplinary researchers should ask themselves: “Assuming for the sake of the argument that an ‘entropy’ is relevant to my application, why is the relevant entropy the Tsallis entropy with q = 1 (aka ‘Shannon entropy’) rather than, say, the Tsallis entropy with q = 2.378?”

Excercise: Investigate Schneider’s and Adami’s articles with respect 1-3 above. The reader will find that their different answers to the questions raised in 1-3 are not entirely transparent in their papers (personally, I had to read Schneider’s Ev paper several times). Bonus excercise: Is there any particular reason why Schneider should make use of the Tsallis entropy with the particular choice q = 1? Same question for Adami?

Thanks for your comments Erik. I have to admit that I am not familiar with “Tsallis entropy” but some searching uncovered an interesting issue indeed.

Tsallis Entropy

As I understand, the reason for chosing q=1 is that Tsallis entropy is only additive for q=1.

In addition the probability distribution function for q=1 seems to be Gaussian while the probability distribution for the Tsallis entropy can be used in cases where the probability distribution functions have different tails.

Although I understand that a transformation exists which turns Tsallis into additive entropy, or Renyi entropy

However for Tsallis entropy we find

S(AB:q)= S(A:q)+S(B:q)+(1-q)S(A:q)S(B:q)

Pim van Meurs Wrote:

: As I understand, the reason for chosing q=1 is that Tsallis entropy is only additive for q=1.

That, of course, immediately raises another question: Why is additivity so essential? Or is it?

The communication theorist is interested in obtaining useful theoretical limits for the efficiency of data transmission and could not care less whether a function that provides such a bound is additive or not. It just so happens that the most famous such bound, the Shannon entropy, is additive, but this additivity is not the reason for the communication theorist’s interest.

The supporter of Maximum Entropy Inference has in one way or another been convinced that of all probability distributions consistent with what you know, the one that maximizes the Shannon entropy is the rational choice. Some of the arguments in favour of this position starts with the stipulation that the probability distribution of rational choice should maximize “unbiasedness” or “uncertainty”, and then it is postulated that any “unbias measure” or “uncertainty measure” should satisfy certain desiderata, of which additivity may be one. For the supporters of Maximum Entropy Inference that became convinced by such arguments, it probably is important not to violate additivity. (Although I note that some Tsallis entropy enthusiasts prefer to generalize the Maximum Entropy Inference philosophy to maximize Tsallis entropy instead of Shannon entropy.)

The bottom line is that whether or not additivity is important depends entirely on what you are trying to do. The many publications on Tsallis entropy indicate that not all researchers find additivity essential for all problems.

Pim van Meurs Wrote:

: In addition the probability distribution function for q=1 seems to be Gaussian while the probability distribution for the Tsallis entropy can be used in cases where the probability distribution functions have different tails.

The Tsallis can be computed for any probability distribution, so I assume that you are here specifically referring to the probability distributions which maximize the Tsallis entropy. The probability distribution that maximizes the Tsallis entropy for q = 1, subject to the constraint that certain expectation values must take on given values, is an exponentional distribution similar to the Maxwell-Boltzmann distribution (although it isn’t restricted to thermodynamic quantities, of course), not a Gaussian. Changing the exponential in that distribution to something called a q-exponential will give you the probability distribution that maximizes the general Tsallis entropy, subject to the same constraints. (A Gaussian probability distribution does maximize a continuous version of the Shannon entropy, subject to the constraint that the variance should take on a given value.)

It is unclear to me why you regard the form of the probability distributions that maximize (subject to certain constraints) the Tsallis entropy with q != 1 as an argument in favour of the choice q = 1. In any case, my reason for mentioning the Tsallis entropy is not to convince people to change from q = 1 to some other choice. Rather, I want people to evaluate the use of Shannon entropy on a case-by-case basis with attention to the significance (or lack thereof) of the Shannon entropy of a quantity in the particular application. There seems to be fairly widespread, tacit assumption that the Shannon entropy is somehow almost universally relevant no matter what the application is. Knowing that there are infinitely many other entropies may help to challenge that assumption.

Misc: The R�nyi entropy is indeed additive, as you note. There is a sign error in the last term in your equation expressing the non-additivity of the Tsallis entropy.

Syntax Error: mismatched tag at line 5, column 2, byte 459 at /usr/local/lib/perl5/site_perl/mach/5.18/XML/Parser.pm line 187.

Erik: The bottom line is that whether or not additivity is important depends entirely on what you are trying to do.

Your words of warning are well taken and in fact I appreciate you pointing out to me one of the many facts of entropy I have yet to become aware of.

As far as additivity is concerned, while under certain circumstances I can see why this may be of less interest, for an uncorrelated genome it certainly seems to make sense. I am wondering though if the same holds for a genome where nucleotide sequences are correlated.