User:R. Eric Collins/MBL/Codon

From OpenWetWare


Codon models

"a model is an INTENTIONAL simplification..." Daniel L. Hartl

parameter estimation
hypothesis testing
site identification

t is the mean number of substitutions per CODON site
- t=3 indicates saturation (about one substitution per nucleotide position)

assumptions/simplifications about molecular evolution:
- codon frequencies unbiased (1/61 usage)
- all point mutations equally likely (transition/transversion bias ignored)
- single/multiple mutations at one point

models
- Nei and Gojobori 1986 should be avoided at all cost
- others should be used with caution BASED ON KNOWING YOUR DATA
- Goldman-Yang (GY)
  - substitution rate proportional to frequency of the target codon
- Muse-Gaut (MG)
  - substitution rate proportional to frequency of the target nucleotide
  - no effect of context: (mammals yes, drosophila no...)
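
The GY/MG difference above can be sketched as a single rate function. This is a minimal illustration, not either paper's exact parameterization: the names `kappa` (transition/transversion ratio), `omega` (nonsynonymous/synonymous ratio), `codon_freq` and `nuc_freq` are assumptions introduced here, and the MG branch uses one set of nucleotide frequencies for all three codon positions as a simplification.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))
SENSE = [c for c in CODONS if CODE[c] != "*"]  # the 61 sense codons

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def rate(i, j, kappa, omega, codon_freq=None, nuc_freq=None):
    """Relative (unnormalized) rate of change from codon i to codon j.

    Pass codon_freq for a GY-style rate (scaled by the target codon frequency)
    or nuc_freq for an MG-style rate (scaled by the target nucleotide frequency).
    """
    diffs = [(a, b) for a, b in zip(i, j) if a != b]
    if len(diffs) != 1:
        return 0.0  # only single-nucleotide changes happen instantaneously
    r = 1.0
    if diffs[0] in TRANSITIONS:
        r *= kappa  # transition/transversion bias
    if CODE[i] != CODE[j]:
        r *= omega  # nonsynonymous change
    if codon_freq is not None:
        r *= codon_freq[j]  # GY-style
    elif nuc_freq is not None:
        r *= nuc_freq[diffs[0][1]]  # MG-style
    return r


# Unbiased codon usage (1/61), as in the simplifying assumption above.
uniform = {c: 1.0 / len(SENSE) for c in SENSE}
gy_rate = rate("TTT", "TTC", kappa=2.0, omega=0.5, codon_freq=uniform)
mg_rate = rate("TTT", "TTC", kappa=2.0, omega=0.5,
               nuc_freq={"T": 0.1, "C": 0.4, "A": 0.3, "G": 0.2})
```

With unbiased frequencies the two styles differ only by a constant; they diverge once codon usage or base composition is skewed, which is one reason to know your data before choosing.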

likelihood
- of the data at a site: sum over each possible ancestral codon of the product of its equilibrium frequency and the likelihood of transition from that codon to the observed codons
- of entire alignment: product of likelihoods at each site
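
A minimal sketch of the two likelihood steps above, assuming a star tree (all sequences joined at one ancestor) and codons encoded as integers 0..60. The transition function `toy_transition_prob` is a placeholder introduced here in which every codon change is equally likely; a real analysis would use transition probabilities from a codon model such as GY or MG instead.

```python
import math


def toy_transition_prob(i, j, t, n_states=61):
    """Placeholder equal-rates model: P(state i -> state j) after t expected
    substitutions per codon site. NOT a real codon model."""
    decay = math.exp(-t * n_states / (n_states - 1))
    if i == j:
        return 1.0 / n_states + (n_states - 1) / n_states * decay
    return (1.0 - decay) / n_states


def site_likelihood(observed, branch_lengths, freqs, trans_prob):
    """Sum over ancestral codons of: equilibrium frequency times the
    probability of evolving into each observed codon along its branch."""
    total = 0.0
    for anc, pi in enumerate(freqs):
        term = pi
        for obs, t in zip(observed, branch_lengths):
            term *= trans_prob(anc, obs, t)
        total += term
    return total


def alignment_log_likelihood(columns, branch_lengths, freqs, trans_prob):
    """Product of site likelihoods, accumulated as a sum of logs."""
    return sum(
        math.log(site_likelihood(col, branch_lengths, freqs, trans_prob))
        for col in columns
    )


freqs = [1.0 / 61] * 61          # unbiased codon usage
branches = [0.1, 0.2, 0.3]       # hypothetical branch lengths
columns = [[5, 5, 5], [5, 5, 12], [40, 7, 7]]  # hypothetical alignment columns
lnl = alignment_log_likelihood(columns, branches, freqs, toy_transition_prob)
```

Working in logs avoids numerical underflow, since the product over hundreds of sites quickly becomes too small to represent directly.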
- in some examples, if we average ω over the whole tree, we do NOT detect positive selection

Problem: averaging over a pair has very low power if the questions are about “when” or “where”!
- Solution: Phylogenetic estimation of selection pressure
  - variable ω over branches (when?)
  - variable ω over sites (where?)
  - variable ω over branches and sites (when and where?)

branch models (time)
- episodic large-scale changes
- site-wise: if you have hundreds of sequences, you can use this

- M1a + M2a (plus LRT) is very robust
  - best model FOR WHAT? 'best' model overall may not be the best for my purposes
  - M1a doesn't allow positive selection
  - M2a does allow for positive selection
  - so LRT between them can demonstrate positive selection
  - for most well-behaved datasets, empirical codon bias (pi) can be used
  - M2a is not best explanation but has better LRT properties because more robust to changes in assumptions

- M7 + M8
  - M7 uses beta distribution (0..1) that explains 'boring' stuff
  - M8 uses beta distribution (0..1) plus omega (>1) to account for positive selection
  - M8 is 'best' model but is more sensitive to changes in assumptions
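
Both M2a and M8 treat sites as a mixture of classes, one of which has omega > 1. A minimal sketch of how such a mixture can feed site identification: the site likelihood is the proportion-weighted sum over classes, and Bayes' rule gives the posterior probability that a site belongs to each class. The function name and all numbers below are hypothetical, and the per-class likelihoods are assumed to have been computed elsewhere.

```python
def site_class_posteriors(class_likelihoods, class_proportions):
    """Return (mixture likelihood of the site, posterior probability per class).

    class_likelihoods[k]  : likelihood of the site's data under class k's omega
    class_proportions[k]  : estimated proportion of sites in class k
    """
    weighted = [p * l for p, l in zip(class_proportions, class_likelihoods)]
    total = sum(weighted)  # likelihood of the site under the mixture
    return total, [w / total for w in weighted]


# Hypothetical M2a-like setup: classes with omega < 1, omega = 1, omega > 1.
proportions = [0.7, 0.2, 0.1]
likelihoods = [1e-6, 2e-6, 8e-6]
mix_lik, post = site_class_posteriors(likelihoods, proportions)
prob_positive = post[2]  # posterior probability of the omega > 1 class
```

This treats the estimated proportions and omegas as known; it does not account for uncertainty in those estimates.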

- chi-squared & boundary problem
  - the LRT statistic does not follow the χ2 distribution
  - no other mixture distribution works well
  - too costly to do parametric bootstrapping/simulations for each LRT
  - so using chi-square anyway (naively), how bad could it be?
  - even still, the LRT is conservative AND powerful
  - there's a window of sequence divergence where these tests work; outside it (very short or very long branches) they may not
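
A minimal sketch of the naive chi-square comparison described above, applied to M1a vs M2a. It assumes 2 degrees of freedom (M2a adds a site-class proportion and an omega > 1 to M1a), for which the chi-square tail probability has the closed form exp(-x/2). The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_m1a_vs_m2a(lnl_m1a, lnl_m2a):
    """Likelihood ratio test of M1a (null) against M2a (alternative).

    Returns the test statistic 2*(lnL_M2a - lnL_M1a) and a naive p-value
    from the chi-square distribution with 2 degrees of freedom.
    """
    stat = max(0.0, 2.0 * (lnl_m2a - lnl_m1a))
    p_value = math.exp(-stat / 2.0)  # chi-square survival function, df = 2
    return stat, p_value


# Hypothetical maximized log-likelihoods from two model fits.
stat, p = lrt_m1a_vs_m2a(lnl_m1a=-2450.0, lnl_m2a=-2443.5)
significant = p < 0.05
```

Because of the boundary problem, this p-value is only an approximation; per the notes above it tends to err on the conservative side.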

examples
- need to know data, but how many possible types of "data" are there?
  - how many ways are there for a protein family to evolve?
- GFP
  - color variation, purifying selection due to antagonistic host interactions
- proteorhodopsin
- Listeria
  - looking at variations in omega along the tree, isolating groups/clusters of genes/networks under selection

average of tree; average of sites; variation in sites and tree
