#### Introduction

Entropy may at first seem a simple concept, but when trying to apply it correctly one invariably runs into frustrating issues and areas of confusion. In this posting I intend to explore some of these confusions, and I hope to explain how to apply entropy calculations correctly.

In its simplest form entropy can be described as

$$S = \log_2 W$$

where $W$ is the number of microstates of the system and, with base-2 logarithms, $S$ is measured in bits. The difference between macrostates and microstates is quite important, and confusing the two can lead to erroneous calculations.

Feynman warned against this confusion when he remarked that

“So we now have to talk about what we mean by disorder and what we mean by order. … Suppose we divide the space into little volume elements. If we have black and white molecules, how many ways could we distribute them among the volume elements so that white is on one side and black is on the other? On the other hand, how many ways could we distribute them with no restriction on which goes where? Clearly, there are many more ways to arrange them in the latter case.

We measure “disorder” by the number of ways that the insides can be arranged, so that from the outside it looks the same. The logarithm of that number of ways is the entropy. The number of ways in the separated case is less, so the entropy is less, or the “disorder” is less.”

Now let’s apply this to tossing a coin 4 times. From a macro perspective there are 5 possible outcomes: 4 Heads; 3 Heads and 1 Tail; 2 Heads and 2 Tails; 1 Head and 3 Tails; and 4 Tails. The number of microstates for each outcome can be calculated from the formula already presented in other threads:

$$W = \frac{(N+M)!}{N!\,M!}$$

where $N$ and $M$ are the numbers of Heads and Tails.

Let’s put this all in a table:

| Macrostate | Number of Microstates | Microstates | Formula | S |
|---|---|---|---|---|
| 4 Heads | 1 | HHHH | 4!/(4!0!) | 0 |
| 3 Heads, 1 Tail | 4 | THHH, HTHH, HHTH, HHHT | 4!/(3!1!) | 2 |
| 2 Heads, 2 Tails | 6 | HHTT, HTHT, HTTH, THTH, TTHH, THHT | 4!/(2!2!) | 2.58 |
| 1 Head, 3 Tails | 4 | HTTT, THTT, TTHT, TTTH | 4!/(3!1!) | 2 |
| 4 Tails | 1 | TTTT | 4!/(4!0!) | 0 |

Macrostates and microstates for throwing a coin 4 times. Adapted from this tutorial. Remember that 0! is by definition 1.
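The counts in the table can be checked by brute force. The following is a minimal sketch, not part of the original tutorial: it enumerates all 16 toss sequences, groups them by number of Heads, and compares each group size with the binomial coefficient. The helper name `macrostate_table` is made up for this illustration.

```python
from collections import defaultdict
from itertools import product
from math import comb, log2


def macrostate_table(n_tosses):
    """Group every toss sequence (microstate) by its number of Heads (macrostate)."""
    groups = defaultdict(list)
    for seq in product("HT", repeat=n_tosses):
        groups[seq.count("H")].append("".join(seq))
    return dict(groups)


table = macrostate_table(4)
for heads in sorted(table, reverse=True):
    microstates = table[heads]
    w = len(microstates)
    # The enumerated count should agree with the binomial coefficient 4!/(heads! tails!).
    assert w == comb(4, heads)
    print(f"{heads} Heads, {4 - heads} Tails: W = {w}, S = {log2(w):.2f} bits")
```

Enumeration is only practical for a handful of tosses; for longer sequences the factorial formula is the way to count.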

So what can we learn from this simple example? First, that the number of macrostates is not the same as the number of microstates. In fact, the number of microstates can become quite large compared to the number of macrostates. Let’s now assume that the throw is still 2 Heads and 2 Tails, but in addition we now know that a ‘mutation’ has fixed one of the Tails in the first position.

What is the number of microstates now? Well, the only valid microstates are

**TTHH, THTH, THHT**, or 3!/(2!1!) = 3. As we can see, fixing one of the Tails has decreased the entropy from 2.58 to 1.58, a drop of 1.

This is not surprising: the state of 2 Heads and 2 Tails is one of maximum entropy (a uniform distribution of Heads and Tails will always have maximum entropy), and fixing one of the throws to always be Tails has increased the order, or decreased the entropy.
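The effect of fixing a position can be checked the same way. This is a small sketch under the same assumptions as the coin example (4 tosses, 2 Heads, entropy in bits); `count_microstates` and its `fixed` argument are hypothetical names introduced only for illustration.

```python
from itertools import product
from math import log2


def count_microstates(n_tosses, n_heads, fixed=None):
    """Count sequences with n_heads Heads; `fixed` maps position -> required face."""
    fixed = fixed or {}
    count = 0
    for seq in product("HT", repeat=n_tosses):
        if seq.count("H") != n_heads:
            continue  # wrong macrostate
        if all(seq[pos] == face for pos, face in fixed.items()):
            count += 1
    return count


w_free = count_microstates(4, 2)                   # no position constrained
w_fixed = count_microstates(4, 2, fixed={0: "T"})  # first toss forced to Tails
entropy_drop = log2(w_free) - log2(w_fixed)
print(w_free, w_fixed, entropy_drop)
```

Halving the number of admissible microstates costs exactly one bit, which is where the drop of 1 comes from.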

I would like to point out that in this example I have not used any of the approximations used by Shannon but have merely focused on the possible microstates for any given macrostate. The relevance, you may ask? When 2 positions become fixed in a genome of 1000 uniformly distributed nucleotides, the number of microstates drops, contrary to intuition.

Other websites point out the same thing

Excellent: Thermodynamics of Equilibrium

Check out the formula for the number of microstates: with the correct substitutions, it is exactly the same as the one above. I’ll leave that as an exercise for the reader.

And a more in-depth view

and another way of looking at this issue

Update:

A quick comment: The model used here assumes that the probabilities for Heads and Tails are equal (0.5). When a particular location is assumed to be fixed, the entropy for that site is 0. Shannon’s model is a first-order model in which the probabilities can vary. Entropy is calculated based on the frequencies of occurrence. For the genome these frequencies are obtained from looking at many genomes. This means that for each location in the genome, the probabilities for the various nucleotides can vary. The first-order model assumes that the probabilities for the nucleotides at one location are independent of the other locations. If there is a dependency between locations, then a higher-order model needs to be applied to correct for the correlation.
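A first-order calculation of this kind might look like the sketch below. It is illustrative only: the four short sequences in `genomes` are made-up toy data, `site_entropy` is a hypothetical helper, and real estimates would need many more genomes per site.

```python
from collections import Counter
from math import log2


def site_entropy(column):
    """Shannon entropy (bits) of one alignment column, from observed frequencies."""
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# Toy alignment (assumption): each string is one genome, each column one location.
genomes = ["ACGT", "AAGT", "ATGC", "AGGA"]

# First-order model: every location is treated independently of the others.
per_site = [site_entropy(column) for column in zip(*genomes)]
total_entropy = sum(per_site)
print(per_site, total_entropy)
```

Locations where every genome carries the same nucleotide contribute 0 bits, and a location where all four nucleotides are equally frequent contributes the maximum of 2 bits. Correlations between locations are ignored here, which is exactly the limitation of the first-order model.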

From this excellent page on the theory of data compression

Features, Patterns, Correlations in DNA and Protein Texts is a good website to explore.

A good paper on distribution of di- and tri-nucleotides

Study of statistical correlations in DNA sequences

The plots of C(l) for prokaryotic genomes show that, at short scales (below the characteristic size of genes) correlations are dominated by the non-uniform base composition in the three codon positions. At larger scales we observe both behaviours, genomes for which C(l) almost vanishes (e.g. M. tuberculosis ) and genomes for which C(l) is significantly different from zero in a broad range of sizes (e.g. B. subtilis ). In the former, the behaviour beyond the characteristic size of genes is similar to what could be observed in a random sequence, thus implying that these genomes are essentially homogeneous at large scales (a commonly accepted idea, Rolfe and Meselson, 1959; Sueoka, 1959). Nevertheless, the latter class of prokaryotic genomes presents correlations implying the presence of heterogeneities that cannot be explained in terms of nonuniform base composition in the three codon positions and could be related to a massive lateral transfer of compositionally biased genes from other genomes or even to natural selection. In addition, we observe power-law correlations in these genomes which, in some cases extend to more than four orders of magnitude, in agreement with previous results (de Sousa Vieira, 1999). Thus, the results obtained for such genomes clearly questions the assumption of homogeneity in prokaryotic DNA.

#### Onwards to Shannon entropy

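In the above example I have used the formula $S = \log_2 W$. I will show that for a large number of coin tosses the formula simplifies to the Shannon form by using the Stirling approximation $\ln n! \approx n \ln n - n$.

First we assume that we have $N+M$ coin tosses with $N$ Heads and $M$ Tails. The number of microstates is given by $W = \frac{(N+M)!}{N!\,M!}$, thus the entropy is:

$$S = \log W = \log (N+M)! - \log N! - \log M!$$

Rewrite this as

$$S = \log (N+M)! - \log \big(p(N+M)\big)! - \log \big(q(N+M)\big)!$$

with $p = \frac{N}{N+M}$ and $q = \frac{M}{N+M}$.

With the Stirling approximation we find:

$$S \approx (N+M)\log(N+M) - p(N+M)\log\big(p(N+M)\big) - q(N+M)\log\big(q(N+M)\big)$$

(the linear terms of the approximation cancel because $p(N+M) + q(N+M) = N+M$, so the result holds for any base of the logarithm), which simplifies to

$$S \approx (N+M)\big[(1 - p - q)\log(N+M) - p\log p - q\log q\big]$$

and further, since $p + q = 1$,

$$S \approx -(N+M)\,(p\log p + q\log q)$$

Per toss this is $-(p\log p + q\log q)$, which is the Shannon form.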
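The large-number limit can be checked numerically. The sketch below compares the exact value of log2[(N+M)!/(N!M!)] with (N+M)(-p log2 p - q log2 q) for a growing number of tosses. The function names are made up for this illustration, and it assumes both N and M are at least 1 so that the logarithms are defined.

```python
from math import comb, log2


def exact_entropy(n_heads, n_tails):
    """log2 of the exact microstate count (N+M)!/(N! M!)."""
    return log2(comb(n_heads + n_tails, n_heads))


def shannon_entropy_total(n_heads, n_tails):
    """(N+M) * (-p log2 p - q log2 q); requires n_heads >= 1 and n_tails >= 1."""
    total = n_heads + n_tails
    p, q = n_heads / total, n_tails / total
    return -total * (p * log2(p) + q * log2(q))


for n in (4, 100, 10_000):
    heads = n // 2
    exact = exact_entropy(heads, n - heads)
    approx = shannon_entropy_total(heads, n - heads)
    print(f"n = {n}: exact = {exact:.2f}, Shannon form = {approx:.2f}, "
          f"ratio = {exact / approx:.4f}")
```

The ratio approaches 1 as the number of tosses grows, although the absolute gap does not vanish, because the simple form of Stirling's approximation used here drops a term that grows logarithmically with n. For 4 tosses the approximation is poor, which is why the exact count was used in the table.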

When I use Stirling’s Approximation to calculate log[(N+M)!/(N!M!)], I get (N+M)(-p log p - q log q), where p = N/(N+M) and q = M/(N+M). Why is the (N+M) dropped in entropy calculations?

How does log[(N+M)!/(N!M!)] factor into the question of how many bits are needed to encode a signal?

The binomial coefficients arise only because we compute the number of ways to arrange a fixed number of objects into different slots (in the case of statistical mechanics the “slots” are ultimately energy levels, other kinds of slots are approximations or pedagogical analogies). We must keep in mind that the above blog entry is concerned with the reasoning of statistical mechanics exported to systems that are more familiar and in some ways analogous to the systems studied in statistical mechanics. Statistical mechanics relies on other assumptions (e.g. conservation laws pertaining to energy and particle number) and has other goals than communication theory, so it is not necessarily sensible to seek to interpret the calculations in the framework of communication theory.

Here are a few examples of correct (but not necessarily interesting) interpretations of a few combinatorial calculations related to DNA sequences (including Pim van Meurs’s example in the above blog entry). All calculations are exact and I have not used Stirling’s approximation.

Question: How many DNA sequences of length 1000 are consistent with the constraint that there are exactly 250 nucleotides of each type (adenine, guanine, cytosine, and thymine)?

Answer: W_{1} = 1000! / (250! 250! 250! 250!) = 3.6838 * 10^{597} different DNA sequences are consistent with that composition.

Interpretation in terms of data transmission: Given that both the sender and the receiver know that all DNA sequences that are sent over the communication channel have exactly 250 nucleotides of each type and given that all DNA sequences with this composition are equally probable, a nearly attainable lower bound for the number of binary digits that need to be sent is log_{2}(W_{1}) = 1985.0722 bits/DNA sequence.

Question: How many DNA sequences of length 1000 are consistent with the constraint that there are exactly 250 nucleotides of each type and that two of the adenines are constrained to be in specific positions?

Answer: W_{2} = 998! / (248! 250! 250! 250!) = 2.2955 * 10^{596} different DNA sequences are consistent with that composition and the additional constraint.

Interpretation in terms of data transmission: Given that both the sender and the receiver know the constraints imposed on the DNA sequences and given that all DNA sequences consistent with the constraints are equally probable, a nearly attainable lower bound for the number of binary digits that need to be sent is log_{2}(W_{2}) = 1981.0679 bits/DNA sequence.

Question: How many DNA sequences of length 1000 are consistent with the constraint that there are exactly 250 nucleotides of each type and that one adenine and one cytosine are constrained to be in specific positions?

Answer: W_{3} = 998! / (249! 249! 250! 250!) = 2.3047 * 10^{596} different DNA sequences are consistent with that composition and the additional constraints.

Interpretation in terms of data transmission: Given that both the sender and the receiver know the constraints imposed on the DNA sequences and given that all DNA sequences consistent with the constraints are equally probable, a nearly attainable lower bound for the number of binary digits that need to be sent is log_{2}(W_{3}) = 1981.0737 bits/DNA sequence.

Question: How many DNA sequences of length 1000 are there?

Answer: W_{4} = 4^1000 = 1.1481 * 10^{602} different DNA sequences exist.

Interpretation in terms of data transmission: Given that all DNA sequences are equally probable, an attainable lower bound for the number of binary digits that need to be sent is log_{2}(W_{4}) = 2000.0000 bits/DNA sequence.

Question: How many DNA sequences of length 1000 are consistent with the constraint that two given positions hold adenine?

Answer: W_{5} = 4^998 = 7.1758 * 10^{600} different DNA sequences are consistent with that constraint.

Interpretation in terms of data transmission: Given that both the sender and the receiver know the constraint imposed on the DNA sequences and given that all DNA sequences consistent with the constraint are equally probable, an attainable lower bound for the number of binary digits that need to be sent is log_{2}(W_{5}) = 1996.0000 bits/DNA sequence.
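These five counts can be reproduced with exact integer arithmetic. The following is a sketch rather than the commenter's own calculation: `multinomial` is a helper written for this illustration, and the labels W1 to W5 follow the five questions in order.

```python
from math import factorial, log2


def multinomial(total, counts):
    """Number of sequences of length `total` with the given per-symbol counts."""
    assert sum(counts) == total
    result = factorial(total)
    for c in counts:
        result //= factorial(c)  # each intermediate quotient is an exact integer
    return result


cases = {
    "W1": multinomial(1000, [250, 250, 250, 250]),  # fixed composition only
    "W2": multinomial(998, [248, 250, 250, 250]),   # two adenines pinned
    "W3": multinomial(998, [249, 249, 250, 250]),   # one adenine, one cytosine pinned
    "W4": 4 ** 1000,                                # no constraint at all
    "W5": 4 ** 998,                                 # two positions pinned, free composition
}
bits = {name: log2(w) for name, w in cases.items()}
for name, value in bits.items():
    print(f"{name}: log2 = {value:.4f} bits/DNA sequence")
```

Python integers have arbitrary precision, so no approximation enters until the final logarithm. The printed values should agree with the figures quoted above up to rounding in the last digit.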


Erik, great posting. Thanks. I have updated your posting per your request to change 1986.0000 bits/DNA sequence to 1996.0000 bits/DNA sequence.

Mike Hopkins: I fixed the error you pointed out and changed the second microstates to macrostates.

Thanks

Pim,

Your two “4!/(3!0!)” formulas should be “4!/(4!0!).”

Man… all this checking and still errors… Thanks, I have updated the posting.
