I am a computational evolutionary geneticist, and in my research I develop software to analyze genetic data and study evolutionary questions. As part of my research, I work with a lot of simulation programs to generate evolutionary datasets. My most widely used program is Dawg, and I am currently putting the finishing touches on a new version.

In simulating molecular sequences, you start by simulating the ancestral sequence at the root of a phylogenetic tree and then evolve that sequence upwards, making point mutations and indels as you go. Depending on type of sequences being generated, the root would be a string of nucleotides, amino acids, or codons. To simulate the root sequence, we draw its characters from a discrete, stochastic distribution. For example, lets say that in your system the frequencies of A, C, G, and T are 26%, 23%, 24%, and 27%. In order to create a root sequences of length *k*, you simply sample *k* nucleotides from this distribution, e.g. AAGTGCA or GATTACA.

Therefore, the key step in the simulation of the root sequence is sampling repeatedly from an arbitrary discrete distribution. While I have been doing this for years, I recently went searching for doing it better and came across the following excellent article: Darts, Dice, and Coins, written by Keith Schwarz, a lecturer at Stanford. In this article, he describes many different methods for sampling from a discrete distribution and analyzes their performance. It turns out the best method is the Alias Method, first described in the 1970s and improved by M. Vose in 1991. I will describe it below, but before we get there, here are some alternatives.

Imagine that you want to sample from the following discrete distribution of nucleotides:

Let the heights of these bars be *h _{0}*,

*h*,

_{1}*h*, and

_{2}*h*. Since these heights correspond to the probability that a random base is A, C, G, or T, the total area of the histogram is 1. Now to sample from this histogram, you can draw a uniform random number—call it

_{3}*u*—between 0 and 1 and find which bar corresponds to that number. If

*u < h*, you sample an A. If

_{0}*u < h*, you sample a C, etc. This works, but is clearly inefficient since it requires you to search through the histogram from left to right every time you sample a nucleotide. Imagine if you were sampling from 64 codons.

_{0}+h_{1}A more efficient method is to sample a point uniformly in a box that contains all the bars:

The (*x*, *y*) coordinates of the sampled point are defined by uniform random numbers between [0, 4) and [0,3/8) respectively. If this point lands in the white space, you reject it and try again. This approach is more efficient because once you have chosen *x*, you only need to compare *y* against *h _{floor(x)}*.
However, the white space makes up one third of the space in this box, and thus on average you will have to sample 1.5 points for every sampled nucleotide.

With the Alias Method, we can bring this down to 1 point for every sampled nucleotide. Start by constructing the following box, which has an area of 1.

Notice that the amount of white space within the box is equal to the amount of filled space outside it. Thus it is possible to cut the histogram into blocks and rearrange them such that the entire box is filled.

Thus every bar is defined by two regions, the lower, normal region, and the upper, “alias” region. Now we can draw a point (*x*, *y*) uniformly inside this box; *x* determines which bar we look at, and *y* determines whether we return the normal or the alias value. Since there is no white space, we only have to draw 1 point per sampled nucleotide.

/* * C++ code for sampling from an alias table */ uint32_t a[64]; //aliases uint64_t h[64]; //bar heights uint32_t get_codon() { uint64_t u = rand_uint64(); uint32_t x = u&63; return (u < h[x]) ? x : a[x]; }

Now the hard part comes in slicing and moving the blocks around. That is where Vose’s method come in. Basically keep a list of bars above or below the top line of the box, called large and small. Then pop an element off of both lists, and use the large element to fill in the small element. If the large element is still above the line, put it back in the large list, otherwise put it in the small list. Repeat until the box is filled. The process looks a bit like this:

While Vose’s method is typically implemented using arrays, (see Darts, Dice, and Coins), array manipulation can be expensive. However, it is possible to implement Vose’s method without using temporary arrays to hold large and small elements. In the code below, I use indices to walk through the histogram linearly. Only three indices are required: *g* for the current large bar, *m* for the current small bar, and *mm* for the next possible small bar. While *g* and *mm* keep track of the positions that we’ve examined and never decrease, *m* sometimes will jump backwards if a large bar has become small. This is the code that Dawg 2.0 is currently using to construct an alias table for codon simulations, and is more efficient than previous solutions.

/* * Initialize alias table from discrete distribution, pp */ void create_alias_table(const double *pp) { // normalize pp and copy into buffer double f=0.0, p[64]; for(int i=0;i<64;++i) f += pp[i]; f = 64.0/f; for(int i=0;i<64;++i) p[i] = pp[i]*f; // find starting positions std::size_t g,m,mm; for(g=0; g<64 && p[g] < 1.0; ++g) /*noop*/; for(m=0; m<64 && p[m] >= 1.0; ++m) /*noop*/; mm = m+1; // build alias table until we run out of large or small bars while(g < 64 && m < 64) { // convert double to 64-bit integer, control for precision h[m] = (static_cast<uint64_t>( ceil(p[m]*9007199254740992.0)) << 11); a[m] = g; p[g] = (p[g]+p[m])-1.0; if(p[g] >= 1.0 || mm <= g) { for(m=mm;m<64 && p[m] >= 1.0; ++m) /*noop*/; mm = m+1; } else m = g; for(; g<64 && p[g] < 1.0; ++g) /*noop*/; } // any bars that remain have no alias for(; g<64; ++g) { if(p[g] < 1.0) continue; h[g] = std::numeric_limits<boost::uint64_t>::max(); a[g] = g; } if(m < 64) { q[m] = std::numeric_limits<boost::uint64_t>::max(); a[m] = m; for(m=mm; m<64; ++m) { if(p[m] > 1.0) continue; h[m] = std::numeric_limits<boost::uint64_t>::max(); a[m] = m; } } }

Note that if the values in pp were not immutable, then we would not need p.

**Note 2012/11/25:** updated to fix typos.

**Note 2013/05/03:** fixed typo.

I would be very hesitant to use the low-order bits of a pseudo-random number the way that you have here. What technique does your rand_uint64() generator use and have you tested the low order bits for randomness?

That is a topic for another post. I use an 64-bit xorshift prng coupled with a Weyl prng. Both the high 32-bits and the low 32-bits pass the BigCrush tests.

What would be the problem with simply partitioning the numbers from 1 to 100 into ranges that correspond to the relative probabilities of A, C, G, & T?

Assuming one has confidence in the uniform random

integergenerator, any number 1 through 26 is A, a number 27 through 50 is a G, 51 through 73 is a G, and 74 through 100 is a T.Would it be the cost of the comparisons?

Another possibility is a uniform random sample of a string made up of 27 As, 23 Cs, 24 Gs, and 27 Ts. If one is confident in the random number generator, the letters could simply be in the order ACGT; otherwise, they could be randomly shuffled after the string has been generated and one then samples randomly from that string.

You are talking about the “Loaded Die from Fair Die” method described in “Darts, Dice, and Coins”.

There are several problems with it.

1) It doesn’t work for arbitrary levels of precision. If A has a frequency of 26.66%, you need to draw random numbers from 0 to 10000. Thus as your frequencies become more precise, the amount of memory you need to store the table increase, as does the time you need to initialize it.

2) Also drawing an integer between [0,max) is not as efficient as drawing a number between [0,2^n).

I am not sure I understand the necessity for such precision in sampling. There is apparently already a built-in assumption about chemical reaction rates. Are such rates already folded into those ACGT probabilities?

It depends on the model. In most cases, you would use frequencies estimated from data, which can easily provide precisions over 1 ppm.

For codon models, which contain 64 possibilities, precision is really important because there can be a lot of variation between codon frequencies. Let’s say the rarest (non-zero) codon has a scaled frequency of 1 and the most common 100. You could then round the frequencies to the nearest integer and construct an array, which will have a size below 6400. Okay, that is doable.

But, what do you do if the second rarest has a frequency of 1.5? If you round that to the nearest 1 or 2, you are either increasing or decreasing it’s frequency by about 33%. That isn’t good option.

Regardless, it is not good practice to limit the precision of the simulation because you don’t know what your users will want to test. As I have it now implemented, the alias table supports 53-bits of precision, the same amount as contained in a double floating point number.

Ah; got it!

That’s very impressive. I didn’t realize one could pull that kind of precision out of such data.

You are not using the low order 32 bits in the example. You are using the low order SIX bits (&63). Xor shift registers typically produce lousy low-order bit randomness. Combined with table lookup from another generator it might be fine, but I do not have enough information about your algorithm to evaluate it from your description. Anyway, the low-order 32-bits might be great; I would never assume that the low order SIX bits were just as good.

Renee

You can see the code that I’m using here:

http://scit.us/projects/dawg/browse[…]orshift_64.h

I will reverse the bits to see if it still passes the BigCrush tests.

Renee, I was thinking about your comments, and conceptually, it is better that the table lookup is from the high bits and the alias test uses the lower bits. This is because significance matters less for the alias test than the table lookup.

Producing the following lookup algorithm:

For this specific problem, a binary search works:

u = rand(); return (u < 0.49) ? ((u < 0.26) ? A : C) : ((u < 0.73) ? G : T);

Roy

A binary search does work, but it is not as efficient as the alias method.

Although the randomized data you generate may look much like that in real data sets, the software generation of a tree from an initial set of codons may be random but unrealistic. There may be many biases that are not simulated, say, codon bias, or some epigenetic effects of apparently non-coding sequences correlated with selection of the phenotype. Thus, a randomized generation of tree-data is okay for some kind of ideal, but in any one particular data set there may be very strong biases on part of the data that disallow a random model as a research tool.

Query: How does one test for randomness in a series of real data, say of apparent non-coding sequences?

Update