I am a computational evolutionary geneticist, and in my research I develop software to analyze genetic data and study evolutionary questions. As part of my research, I work with a lot of simulation programs to generate evolutionary datasets. My most widely used program is Dawg, and I am currently putting the finishing touches on a new version.
In simulating molecular sequences, you start by simulating the ancestral sequence at the root of a phylogenetic tree and then evolve that sequence upwards, making point mutations and indels as you go. Depending on type of sequences being generated, the root would be a string of nucleotides, amino acids, or codons. To simulate the root sequence, we draw its characters from a discrete, stochastic distribution. For example, lets say that in your system the frequencies of A, C, G, and T are 26%, 23%, 24%, and 27%. In order to create a root sequences of length k, you simply sample k nucleotides from this distribution, e.g. AAGTGCA or GATTACA.
Therefore, the key step in the simulation of the root sequence is sampling repeatedly from an arbitrary discrete distribution. While I have been doing this for years, I recently went searching for doing it better and came across the following excellent article: Darts, Dice, and Coins, written by Keith Schwarz, a lecturer at Stanford. In this article, he describes many different methods for sampling from a discrete distribution and analyzes their performance. It turns out the best method is the Alias Method, first described in the 1970s and improved by M. Vose in 1991. I will describe it below, but before we get there, here are some alternatives.
Imagine that you want to sample from the following discrete distribution of nucleotides:
Let the heights of these bars be h0, h1, h2, and h3. Since these heights correspond to the probability that a random base is A, C, G, or T, the total area of the histogram is 1. Now to sample from this histogram, you can draw a uniform random number—call it u—between 0 and 1 and find which bar corresponds to that number. If u < h0, you sample an A. If u < h0+h1, you sample a C, etc. This works, but is clearly inefficient since it requires you to search through the histogram from left to right every time you sample a nucleotide. Imagine if you were sampling from 64 codons.