User:Jarle Pahr/Phylogenetics

From OpenWetWare

Much of the content on this page is based on Chapters 6 and 7 in Essential Bioinformatics by Jin Xiong.


  • Bootstrapping: A statistical technique that tests the sampling errors of a phylogenetic tree.
  • Homoplasy: The obscuring of evolutionary distance which occurs because of several consecutive mutations at the same nucleotide positions.
  • Among-site variation/among-site heterogeneity: Differences in evolutionary rates among nucleotide/amino acid positions. Generally, a portion of the sites are variable and the rest are invariant. The rates at the variable sites are commonly modeled with a gamma distribution.
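
To make the bootstrapping entry above more concrete, here is a minimal sketch of the resampling step only: alignment columns are drawn with replacement to build one pseudo-replicate of the same length. The function name `bootstrap_alignment` and the toy alignment are illustrative assumptions; building a tree from each replicate and counting how often each clade recurs is left out.

```python
import random

def bootstrap_alignment(alignment, rng=None):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (all equal length).
    Columns are sampled with replacement, so the replicate keeps the
    original length but may repeat or omit individual columns.
    """
    rng = rng or random.Random()
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

# Hypothetical toy alignment (placeholder taxa A, B, C).
toy = {"A": "ACGTAC", "B": "ACGTTC", "C": "ACCTAC"}
replicate = bootstrap_alignment(toy, random.Random(0))
```

In practice the resampling is repeated many times and a tree is estimated from every replicate.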

Newick format: A tree representation format using linear nested parentheses. Taxa are separated by commas. For scaled trees, branch lengths are indicated immediately after the taxon name, following a colon. Examples:
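
The strings below are illustrative Newick trees written for this sketch (they are not the page's original examples), with placeholder taxa A to D. The helper `leaf_names` is a hypothetical convenience that pulls out the leaf labels with a regular expression; it assumes simple labels without quotes, spaces or colons.

```python
import re

# Hypothetical four-taxon trees.
unscaled = "((A,B),(C,D));"                           # topology only
scaled = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"   # number after ':' is a branch length

def leaf_names(newick):
    """Return leaf labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

print(leaf_names(scaled))
```

For real data a dedicated parser is a safer choice than a regular expression.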


Phylogenetic markers

  • 16S rRNA
  • RpoB
  • GyrB
  • EF-Tu
  • pgk
  • dnaK
  • 16S–23S ITS





Substitution models

Statistical models used to correct for homoplasy are called substitution models or evolutionary models.

Jukes-Cantor model:

Assumes that all nucleotides are substituted with equal probability (unrealistic).

  • Can only handle reasonably closely related sequences.


d_AB = -(3/4) ln [1-(4/3)p_AB]

d_AB: Evolutionary distance between sequences A and B. p_AB: Observed sequence distance, measured by proportion of substitutions over the entire length of the alignment.
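
As a small worked sketch of the Jukes-Cantor correction, the code below computes the observed proportion of differences and then applies d_AB = -(3/4) ln[1 - (4/3) p_AB]. The function names and the toy sequences are assumptions made for illustration, and the sequences are assumed to be aligned, of equal length and gap-free.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)

def jukes_cantor(p):
    """Jukes-Cantor distance; the logarithm is undefined for p >= 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical toy sequences differing at 2 of 10 sites (p = 0.2).
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
d = jukes_cantor(p)  # roughly 0.233, slightly larger than the observed 0.2
```

The corrected distance exceeds the observed one because some multiple substitutions at the same site are assumed to be hidden.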

Formula corrected for among-site variation:

d_AB = (3/4) alpha [(1 - (4/3) p_AB)^(-1/alpha) - 1]  (The formula is printed incompletely in Xiong's book; this is the standard gamma-corrected Jukes-Cantor form.)

alpha: The gamma correction factor.
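
A minimal sketch of the gamma-corrected Jukes-Cantor distance, assuming the standard form d_AB = (3/4) alpha [(1 - (4/3) p_AB)^(-1/alpha) - 1]. The function name is an illustrative choice. As alpha grows large the result approaches the uncorrected Jukes-Cantor distance, which gives a quick sanity check.

```python
def jukes_cantor_gamma(p, alpha):
    """Gamma-corrected Jukes-Cantor distance.

    p: observed proportion of differing sites, in [0, 0.75).
    alpha: gamma shape parameter (> 0); smaller values mean stronger
    rate variation among sites.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)

d_strong = jukes_cantor_gamma(0.2, 1.0)   # strong rate variation
d_weak = jukes_cantor_gamma(0.2, 1e6)     # close to the uncorrected value
```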

Kimura model:

Mutation rates for transitions and transversions are assumed to be different (more realistic than Jukes-Cantor).


d_AB = -(1/2) ln(1- 2 p_ti - p_tv) - (1/4) ln (1-2 p_tv)

p_ti: Observed frequency for transition. p_tv: Observed frequency for transversion.
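
The following sketch counts transitions (purine to purine or pyrimidine to pyrimidine) and transversions between two sequences and applies the Kimura formula above. The function names and toy sequences are assumptions for illustration; the sequences are assumed to be aligned, gap-free DNA in upper case.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def transition_transversion_freqs(seq_a, seq_b):
    """Return (p_ti, p_tv) for two aligned, gap-free DNA sequences."""
    ti = tv = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        # A change within one chemical class is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    n = len(seq_a)
    return ti / n, tv / n

def kimura_2p(p_ti, p_tv):
    """d = -(1/2) ln(1 - 2 p_ti - p_tv) - (1/4) ln(1 - 2 p_tv)."""
    return -0.5 * math.log(1 - 2 * p_ti - p_tv) - 0.25 * math.log(1 - 2 * p_tv)

# Hypothetical toy pair: one transition (A/G) and one transversion (A/C) in 10 sites.
p_ti, p_tv = transition_transversion_freqs("AACCGGTTAA", "GACCGGTTAC")
d = kimura_2p(p_ti, p_tv)  # roughly 0.234
```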

Formula adjusted for among-site variation:

d_AB = (alpha/2)[(1 - 2 p_ti - p_tv)^(-1/alpha) + (1/2)(1 - 2 p_tv)^(-1/alpha) - 3/2]

alpha: The gamma correction factor.
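
A short sketch of the gamma-adjusted Kimura distance, assuming the form d_AB = (alpha/2)[(1 - 2 p_ti - p_tv)^(-1/alpha) + (1/2)(1 - 2 p_tv)^(-1/alpha) - 3/2], which is the version that reduces to the unadjusted Kimura formula as alpha grows large. The function name is illustrative.

```python
def kimura_2p_gamma(p_ti, p_tv, alpha):
    """Gamma-adjusted Kimura two-parameter distance.

    p_ti, p_tv: observed transition and transversion frequencies.
    alpha: gamma shape parameter (> 0).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    a = (1 - 2 * p_ti - p_tv) ** (-1 / alpha)
    b = (1 - 2 * p_tv) ** (-1 / alpha)
    return (alpha / 2) * (a + 0.5 * b - 1.5)

d_strong = kimura_2p_gamma(0.1, 0.1, 1.0)   # strong rate variation
d_weak = kimura_2p_gamma(0.1, 0.1, 1e6)     # close to the unadjusted value
```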

Kimura model for protein distance:

d = -ln(1- p -0.2p^2)

p: Observed pairwise distance between two sequences.
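
The protein formula can be sketched the same way. The function name is an assumption, and the guard reflects that the logarithm is only defined while 1 - p - 0.2p^2 stays positive.

```python
import math

def kimura_protein(p):
    """Kimura protein distance d = -ln(1 - p - 0.2 p^2)."""
    arg = 1 - p - 0.2 * p * p
    if p < 0 or arg <= 0:
        raise ValueError("p is outside the range where the formula is defined")
    return -math.log(arg)

d = kimura_protein(0.1)  # roughly 0.108 for 10 % observed differences
```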

More advanced models (TN93, HKY, GTR) take more parameters into consideration, but are not normally used in practice because of their complex calculations and high variance.

Tree estimation methods

Clustering-based methods:

Unweighted Pair Group Method Using Arithmetic Average (UPGMA):

  • The simplest clustering method.
  • Basic assumption: All taxa evolve at a constant rate and are equally distant from the root ("molecular clock" hypothesis). Unlikely to hold for real data.
  • Fast calculation speed.
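
The clustering idea behind UPGMA can be sketched as follows: repeatedly join the two closest clusters, place the new node at half their distance, and recompute distances to the remaining clusters as size-weighted averages. This is a minimal illustration under the molecular clock assumption, not an optimized implementation; the function name, the input layout (a dict keyed by name pairs) and the three-decimal formatting are all choices made for this sketch.

```python
def upgma(names, dist):
    """Build an ultrametric tree with UPGMA and return it as a Newick string.

    names: list of taxon labels.
    dist: dict {(a, b): distance} with one entry per unordered pair of labels.
    """
    def get(d, a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    size = {n: 1 for n in names}       # number of leaves in each cluster
    height = {n: 0.0 for n in names}   # distance from a cluster's node to its leaves
    d = dict(dist)
    active = list(names)               # clusters are identified by their Newick text
    while len(active) > 1:
        # Find the closest pair among the active clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(active) for y in active[i + 1:]),
            key=lambda pair: get(d, *pair),
        )
        h = get(d, a, b) / 2           # new node sits halfway between the clusters
        merged = f"({a}:{h - height[a]:.3f},{b}:{h - height[b]:.3f})"
        # Size-weighted average distance from the merged cluster to the others.
        for c in active:
            if c not in (a, b):
                d[(merged, c)] = (
                    size[a] * get(d, a, c) + size[b] * get(d, b, c)
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        height[merged] = h
        active = [c for c in active if c not in (a, b)] + [merged]
    return active[0] + ";"

# Hypothetical distances: A and B are close, C is equally far from both.
tree = upgma(["A", "B", "C"], {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6})
print(tree)
```

Every leaf ends up at the same distance from the root, which is exactly the constant-rate assumption noted above.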

Neighbour Joining (NJ)

  • The most widely used tree estimation method.

Optimality-based methods:

Fitch-Margoliash (FM)

Minimum Evolution (ME)

Character-based methods:

Maximum Parsimony (MP)

  • One of the first methods applied to phylogenetic tree construction.

Maximum Likelihood (ML)

Bayesian Inference (BI)

Tree representation

  • Phylogram (scaled tree): The branch lengths represent the amount of evolutionary divergence.
  • Cladogram (unscaled tree): Branch lengths have no phylogenetic meaning.
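
In Newick terms, the difference between the two representations is whether branch lengths are present. The sketch below turns a scaled tree string into a topology-only one by removing every ":<number>" annotation. The function name and the example tree are hypothetical, and labels are assumed not to contain colons.

```python
import re

def strip_branch_lengths(newick):
    """Remove ':<number>' annotations, leaving only the tree topology."""
    return re.sub(r":[0-9.eE+-]+", "", newick)

# Hypothetical scaled tree with placeholder taxa A to D.
phylogram = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"
cladogram = strip_branch_lengths(phylogram)
print(cladogram)
```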


Examples from the literature of (proposals for) reclassifications of taxonomies:




A daily-updated tree of (sequenced) life as a reference for genome research:

Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies:

Molecular phylogenetics: State of the art methods for looking into the past. Trends Genet. 17:262-72.


Phylogenetic Trees Made Easy - a how-to manual. Fourth edition. Barry G Hall.

Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer Associates.

The Phylogenetic Handbook:

Jin Xiong: Essential Bioinformatics