Much of the content on this page is based on Chapters 6 and 7 in Essential Bioinformatics by Jin Xiong.
- Bootstrapping: A statistical technique that tests the sampling errors of a phylogenetic tree.
- Homoplasy: The obscuring of evolutionary distance which occurs because of several consecutive mutations at the same nucleotide positions.
- Among-site variation/among-site heterogeneity: Differences in evolutionary rates among nucleotide/amino acid positions. Generally, a portion of sites are variant and the rest are invariant. The distribution of substitution rates across the variant sites is commonly modelled with a gamma distribution.
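The bootstrapping entry above can be made concrete with a small sketch. It assumes the common nonparametric form, in which the columns of the alignment are resampled with replacement to build pseudo-replicate alignments; a tree is then estimated from each replicate. The function name and the toy alignment are illustrative only.

```python
import random

def bootstrap_alignment(alignment, rng=None):
    """Return one pseudo-replicate by resampling alignment columns with replacement.

    alignment: list of aligned sequences (strings of equal length).
    """
    rng = rng or random.Random()
    n_sites = len(alignment[0])
    # Draw column indices with replacement; a column may be picked several times.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Toy alignment (made-up sequences).
alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
replicate = bootstrap_alignment(alignment, random.Random(0))
```

Each replicate has the same dimensions as the original alignment, and the fraction of replicate trees that contain a given clade is usually reported as that clade's bootstrap support.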
Newick format: A tree representation format using linear nested parentheses. Taxa are separated by commas. For scaled trees, branch lengths are indicated immediately after the taxon name, separated from it by a colon. Examples:
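As an illustration of the format, the two strings below describe the same four-taxon topology, first unscaled and then scaled. The taxon names (A to D) and the branch lengths are made up, and the small helper that pulls out the leaf names is a sketch, not a full Newick parser.

```python
import re

# Same topology, without and with branch lengths (hypothetical values).
unscaled = "((A,B),(C,D));"
scaled = "((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);"

def leaf_names(newick):
    """Return leaf labels: a label follows '(' or ',' and ends at ':', ',' or ')'."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

print(leaf_names(unscaled))
print(leaf_names(scaled))
```

Both calls should list the same four taxa, since branch lengths do not change which leaves are present.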
- 16S RNA
- 16S–23S ITS
Statistical models used to correct homoplasy are called substitution models or evolutionary models.
Jukes-Cantor model:
Assumes that all nucleotides are substituted with equal probability (unrealistic).
- Can only handle reasonably closely related sequences.
d_AB = -(3/4) ln [1-(4/3)p_AB]
d_AB: Evolutionary distance between sequences A and B. p_AB: Observed sequence distance, measured by proportion of substitutions over the entire length of the alignment.
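The formula above can be sketched in a few lines. The helper names are illustrative, and the sketch assumes two aligned, gap-free sequences of equal length.

```python
import math

def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)

def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - (4/3) p)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)

p_ab = p_distance("ACGTACGTAC", "ACGTACGTTT")  # 2 differences in 10 sites
d_ab = jukes_cantor(p_ab)
```

The corrected distance is always at least as large as the observed one, and the gap widens as p_AB approaches 0.75, where the logarithm is no longer defined.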
Formula corrected for among-site variation:
d_AB = (3/4) alpha [(1 - (4/3) p_AB)^(-1/alpha) - 1]
alpha: The gamma correction factor.
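A minimal sketch of the gamma-corrected Jukes-Cantor distance, assuming the standard form d = (3/4) alpha [(1 - (4/3) p)^(-1/alpha) - 1]. The function name is illustrative. As alpha grows, rates become nearly uniform across sites and the value approaches the uncorrected Jukes-Cantor distance, which gives a quick sanity check.

```python
import math

def jukes_cantor_gamma(p, alpha):
    """Gamma-corrected Jukes-Cantor distance for observed proportion p."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return 0.75 * alpha * ((1 - (4 / 3) * p) ** (-1 / alpha) - 1)

p = 0.3
uncorrected = -0.75 * math.log(1 - (4 / 3) * p)
strong_heterogeneity = jukes_cantor_gamma(p, alpha=0.5)   # small alpha: rates vary a lot
nearly_uniform = jukes_cantor_gamma(p, alpha=1e6)         # large alpha: close to plain JC
```

Smaller values of alpha (stronger rate variation among sites) give larger corrected distances for the same observed p.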
Kimura two-parameter model:
Mutation rates for transitions and transversions are assumed to be different (more realistic than Jukes-Cantor).
d_AB = -(1/2) ln(1- 2 p_ti - p_tv) - (1/4) ln (1-2 p_tv)
p_ti: Observed frequency for transition. p_tv: Observed frequency for transversion.
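The two observed frequencies and the resulting distance can be computed as in the sketch below. It assumes aligned, gap-free DNA sequences in upper case; the function names are illustrative.

```python
import math

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_proportions(seq_a, seq_b):
    """Return (p_ti, p_tv): observed transition and transversion frequencies."""
    n = len(seq_a)
    ti = sum((a, b) in TRANSITIONS for a, b in zip(seq_a, seq_b))
    tv = sum(a != b and (a, b) not in TRANSITIONS for a, b in zip(seq_a, seq_b))
    return ti / n, tv / n

def kimura_2p(p_ti, p_tv):
    """d = -(1/2) ln(1 - 2 p_ti - p_tv) - (1/4) ln(1 - 2 p_tv)."""
    return -0.5 * math.log(1 - 2 * p_ti - p_tv) - 0.25 * math.log(1 - 2 * p_tv)

p_ti, p_tv = ti_tv_proportions("AACC", "GACA")  # one transition, one transversion
d = kimura_2p(0.1, 0.05)
```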
Formula adjusted for among-site variation:
d_AB = (alpha/2) [(1 - 2 p_ti - p_tv)^(-1/alpha) + (1/2)(1 - 2 p_tv)^(-1/alpha) - 3/2]
alpha: The gamma correction factor.
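A sketch of the gamma-corrected two-parameter distance, assuming the form d = (alpha/2) [(1 - 2 p_ti - p_tv)^(-1/alpha) + (1/2)(1 - 2 p_tv)^(-1/alpha) - 3/2]. This form is used here because it reduces to the uncorrected two-parameter formula as alpha becomes large, which the example checks numerically. The function name is illustrative.

```python
import math

def kimura_2p_gamma(p_ti, p_tv, alpha):
    """Gamma-corrected Kimura two-parameter distance."""
    a = (1 - 2 * p_ti - p_tv) ** (-1 / alpha)
    b = (1 - 2 * p_tv) ** (-1 / alpha)
    return (alpha / 2) * (a + 0.5 * b - 1.5)

p_ti, p_tv = 0.1, 0.05
uncorrected = -0.5 * math.log(1 - 2 * p_ti - p_tv) - 0.25 * math.log(1 - 2 * p_tv)
nearly_uniform = kimura_2p_gamma(p_ti, p_tv, alpha=1e6)  # should approach uncorrected
heterogeneous = kimura_2p_gamma(p_ti, p_tv, alpha=0.5)   # larger than uncorrected
```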
Kimura model for protein distance:
d = -ln(1- p -0.2p^2)
p: Observed pairwise distance between two sequences.
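The protein formula translates directly into code. The sketch below assumes p is the observed proportion of differing amino acid positions; the function name is illustrative.

```python
import math

def kimura_protein(p):
    """Kimura protein distance: d = -ln(1 - p - 0.2 p^2)."""
    arg = 1 - p - 0.2 * p * p
    if arg <= 0:
        raise ValueError("observed distance too large for this correction")
    return -math.log(arg)

d = kimura_protein(0.2)  # slightly larger than the observed 0.2
```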
More advanced models: TN93, HKY, GTR. These take more parameters into consideration, but are not normally used in practice (complex calculation, high variance).
Three estimation methods
Unweighted Pair Group Method Using Arithmetic Average (UPGMA):
- The simplest clustering method.
- Basic assumption: All taxa evolve at a constant rate and are equally distant from the root ("molecular clock" hypothesis). Unlikely to hold for real data.
- Fast calculation speed.
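The clustering idea behind UPGMA can be sketched as follows: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. This is a minimal sketch that returns the topology only (branch lengths are omitted for brevity); the function name, the input layout, and the toy distances are assumptions made for illustration.

```python
def upgma(names, dist):
    """Cluster taxa with UPGMA and return the topology as a Newick string.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    sizes = {n: 1 for n in names}   # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        new = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to each remaining cluster: size-weighted arithmetic average.
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        # Drop distances that involve the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[new] = na + nb
    return next(iter(sizes)) + ";"

# Toy distance matrix (made-up values).
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 4.0,
    frozenset(("A", "D")): 6.0,
    frozenset(("B", "D")): 6.0,
    frozenset(("C", "D")): 6.0,
}
tree = upgma(["A", "B", "C", "D"], dist)
```

With these distances A and B are joined first, then C, then D. Because every merge assumes equal rates along both lineages, the result is only reliable when the molecular clock assumption roughly holds.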
Neighbour Joining (NJ)
- The most widely used tree estimation method.
Minimum Evolution (ME)
Maximum Parsimony (MP)
- One of the first methods applied to phylogenetic tree construction.
Maximum Likelihood (ML)
Bayesian Inference (BI)
- Phylogram (scaled tree): The branch lengths represent the amount of evolutionary divergence.
- Cladogram (unscaled tree): Branch lengths have no phylogenetic meaning.
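In Newick terms, the difference between the two tree types is simply whether branch lengths are present. The sketch below strips the lengths from a scaled tree to obtain the corresponding unscaled topology; the example tree is made up and the regular expression assumes plain numeric branch lengths.

```python
import re

def to_cladogram(newick):
    """Remove ':<length>' annotations, keeping only the branching order."""
    return re.sub(r":\s*[-+0-9.eE]+", "", newick)

phylogram = "((A:0.1,B:0.2):0.3,C:0.4);"
cladogram = to_cladogram(phylogram)
```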
Examples from the literature of (proposals for) reclassifications of taxonomies:
A daily-updated tree of (sequenced) life as a reference for genome research: http://www.nature.com/srep/2013/130618/srep02015/full/srep02015.html
Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies: http://nar.oxfordjournals.org/content/41/1/e1.full?sid=e66b42ac-a309-47cf-8cd1-94e1229a098e
Molecular phylogenetics: State of the art methods for looking into the past. Trends Genet. 17:262-72.
Phylogenetic Trees Made Easy - a how-to manual. Fourth edition. Barry G Hall.
Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer Associates.
The Phylogenetic Handbook: http://www.amazon.com/dp/0521730716/ref=rdr_ext_tmb
Jin Xiong: Essential Bioinformatics