Recently in Lab Notes Category

Tips on Lab Meetings


As I prepare for another lab meeting, I realize that I have no idea what I am doing.

Any experienced PIs have any tips about putting together successful lab meetings?

Keep in mind that my lab consists of about 7 undergraduates, a grad student, and a postdoc.

I am a computational evolutionary geneticist, and in my research I develop software to analyze genetic data and study evolutionary questions. As part of my research, I work with a lot of simulation programs to generate evolutionary datasets. My most widely used program is Dawg, and I am currently putting the finishing touches on a new version.

In simulating molecular sequences, you start by simulating the ancestral sequence at the root of a phylogenetic tree and then evolve that sequence upwards, making point mutations and indels as you go. Depending on type of sequences being generated, the root would be a string of nucleotides, amino acids, or codons. To simulate the root sequence, we draw its characters from a discrete, stochastic distribution. For example, lets say that in your system the frequencies of A, C, G, and T are 26%, 23%, 24%, and 27%. In order to create a root sequences of length k, you simply sample k nucleotides from this distribution, e.g. AAGTGCA or GATTACA.

Therefore, the key step in the simulation of the root sequence is sampling repeatedly from an arbitrary discrete distribution. While I have been doing this for years, I recently went searching for doing it better and came across the following excellent article: Darts, Dice, and Coins, written by Keith Schwarz, a lecturer at Stanford. In this article, he describes many different methods for sampling from a discrete distribution and analyzes their performance. It turns out the best method is the Alias Method, first described in the 1970s and improved by M. Vose in 1991. I will describe it below, but before we get there, here are some alternatives.

Imagine that you want to sample from the following discrete distribution of nucleotides:


Let the heights of these bars be h0, h1, h2, and h3. Since these heights correspond to the probability that a random base is A, C, G, or T, the total area of the histogram is 1. Now to sample from this histogram, you can draw a uniform random number—call it u—between 0 and 1 and find which bar corresponds to that number. If u < h0, you sample an A. If u < h0+h1, you sample a C, etc. This works, but is clearly inefficient since it requires you to search through the histogram from left to right every time you sample a nucleotide. Imagine if you were sampling from 64 codons.

About this Archive

This page is an archive of recent entries in the Lab Notes category.

Journal Club is the previous category.

Legal Issues is the next category.

Find recent content on the main index or look in the archives to find all content.



Author Archives

Powered by Movable Type 4.381

Site Meter