User:R. Eric Collins/MBL/Codon


Codon models

  • "a model is an INTENTIONAL simplification..." Daniel L. Hartl
  1. parameter estimation
  2. hypothesis testing
  3. site identification
  • t is the mean number of substitutions per CODON site
    • t=3 indicates saturation
  • assumptions/simplifications about molecular evolution:
    • codon frequencies unbiased (1/61 usage)
    • all point mutations equally likely (no transition/transversion bias)
    • single/multiple mutations at one point
  • models
    • Nei and Gojobori 1986 should be avoided at all cost
    • others should be used with caution BASED ON KNOWING YOUR DATA
    • Goldman-Yang (GY)
      • substitution rate proportional to the frequency of the target codon
    • Muse-Gaut (MG)
      • substitution rate proportional to the frequency of the target nucleotide
      • no effect of context: (mammals yes, drosophila no...)
  • likelihood
    • of the data at a site: sum over all possible ancestral codons of the product of that codon's equilibrium frequency and the probabilities of transition from it to each observed codon
    • of entire alignment: product of likelihoods at each site
    • in some examples, if we average over the tree, we do NOT detect positive selection
  • Problem: averaging over a pair has very low power if the questions are about “when” or “where”!
    • Solution: Phylogenetic estimation of selection pressure
      • variable ω over branches (when?)
      • variable ω over sites (where?)
      • variable ω over branches and sites (when and where?)
  • branch models (time)
    • episodic large-scale changes
    • site-wise: usable if you have hundreds of sequences
    • M1a + M2a (plus LRT) is very robust
      • best model FOR WHAT? 'best' model overall may not be the best for my purposes
      • M1a doesn't allow positive selection
      • M2a does allow for positive selection
      • so LRT between them can demonstrate positive selection
      • for most well-behaved datasets, empirical codon bias (pi) can be used
      • M2a is not the best explanation of the data, but it has better LRT properties because it is more robust to changes in assumptions
    • M7 + M8
      • M7 uses beta distribution (0..1) that explains 'boring' stuff
      • M8 uses beta distribution (0..1) plus omega (>1) to account for positive selection
      • M8 is 'best' model but is more sensitive to changes in assumptions
    • chi-squared & boundary problem
      • the LRT statistic does not follow the χ² distribution
      • no other mixture distribution works well
      • too costly to do parametric bootstrapping/simulations for each LRT
      • so using chi-square anyway (naively), how bad could it be?
      • even still, the LRT is conservative AND powerful
      • there's a window of sequence divergence where these tests work; outside that window (very short or very long branches) they may not
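
The naive chi-square LRT described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular package: the function name `lrt_df2` and the log-likelihood values are hypothetical, invented for the example. It assumes 2 degrees of freedom, since M2a adds two free parameters to M1a (and M8 adds two to M7), and it uses the plain χ² distribution despite the boundary problem noted in the list.

```python
import math

def lrt_df2(lnl_null, lnl_alt):
    """Naive LRT for nested models that differ by two free parameters.

    lnl_null: maximized log-likelihood of the null model (e.g. M1a or M7)
    lnl_alt:  maximized log-likelihood of the alternative (e.g. M2a or M8)
    Returns (test statistic, p-value under chi-square with df = 2).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    # Optimizer noise can make the statistic slightly negative; clamp at zero.
    stat = max(stat, 0.0)
    # For df = 2 the chi-square survival function is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical log-likelihoods, made up for illustration only.
stat, p = lrt_df2(lnl_null=-2500.0, lnl_alt=-2493.0)
print(f"2*dlnL = {stat:.2f}, p = {p:.4g}")
```

Hard-coding df = 2 keeps the sketch free of external dependencies; a comparison with a different number of extra parameters would need a general chi-square survival function instead.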

  • examples
    • need to know data, but how many possible types of "data" are there?
      • how many ways are there for a protein family to evolve?
    • GFP
      • color variation, purifying selection due to antagonistic host interactions
    • proteorhodopsin
    • Listeria
      • looking at variations in omega along the tree, isolating groups/clusters of genes/networks under selection

average of tree, average of sites, variation in sites and tree
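
As a rough illustration of the GY and MG entries in the models list, the sketch below computes a relative substitution rate between two codons. It assumes the standard genetic code and a common parameterization with a transition/transversion ratio `kappa` and a nonsynonymous/synonymous ratio `omega`; the function name `rate` and all parameter values are hypothetical. The only difference between the two model styles here is what is passed as `target_freq`: the frequency of the target codon (GY-style) or of the target nucleotide (MG-style).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODE.items() if aa != "*"]  # the 61 sense codons
PURINES = {"A", "G"}

def rate(src, dst, kappa, omega, target_freq):
    """Relative instantaneous rate from codon src to codon dst.

    target_freq is the equilibrium frequency of the target codon (GY-style)
    or of the target nucleotide (MG-style); the caller chooses which.
    """
    diffs = [(a, b) for a, b in zip(src, dst) if a != b]
    # Only single-nucleotide changes between sense codons have nonzero rate.
    if len(diffs) != 1 or CODE[src] == "*" or CODE[dst] == "*":
        return 0.0
    a, b = diffs[0]
    r = target_freq
    if (a in PURINES) == (b in PURINES):  # purine<->purine or pyrimidine<->pyrimidine
        r *= kappa                        # transition
    if CODE[src] != CODE[dst]:
        r *= omega                        # nonsynonymous change
    return r

# Hypothetical parameter values for illustration.
gy = rate("TTT", "TTC", kappa=2.0, omega=0.5, target_freq=1 / 61)  # codon frequency
mg = rate("TTT", "TTC", kappa=2.0, omega=0.5, target_freq=0.25)    # nucleotide frequency
print(len(SENSE), gy, mg)
```

With unbiased codon usage (1/61) and equal nucleotide frequencies (1/4), the two calls differ only by a constant factor; the models diverge once empirical frequencies are used.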