User:Carl Boettiger/Notebook/Comparative Phylogenetics/2010/02/03

Comparative Phylogenetics

Launching the Open Lab Notebook for Comparative Phylogenetics

Welcome to my lab notebook. This project had its roots in discussions with Chris Martin from the Wainwright lab in April 2009, and began in earnest as my thesis proposal in October 2009, so this notebook still has some catching up to do. Much of that was broadly exploratory and preparing for my Qualifying Exam for Ph.D. Candidacy, which took place on January 13th, 2010, which marks a more effective start date for the actual implementation of the proposal.

So lets pick up from where the proposal left off, starting with a look at what's done so far:

First fundamental task is calculating the likelihood (joint probability of the observed data) under different models. Existing methods have been based entirely on linear models because their solutions are multivariate normal, and the moments can be calculated analytically. The general definition of the joint probability of a model on the tree requires integrating over all the unknown internal nodes. My first task was to demonstrate that I could do this suitably fast and accurately given the transition probabilities along each branch. Initially I simply specified the transition density functions and attempted adaptive mesh integration up the tree. By choosing linear models for the transition densities, I could compare the results of this general approach to their analytic solutions. This approach works but scales very poorly for large trees.

I have since replaced this approach by discretizing the transition probabilities along a finite grid. This turns the integration into matrix multiplications. My initial implementation of this was also flawed, as it calculated the probability of each tip independently. Luckily, calculating the joint probability is in fact computationally simpler, involving using the transition matrix from each branch only once. I've tested this approach against the analytic solution and find that it agrees, often to machine precision, with the analytical results even for small grid sizes. Most of the deviation is introduced by having the range of the grid too small rather than by having too few grid cells. By far the slowest part of this approach is filling out the matrices initially, which will only become slower when I consider the nonlinear (i.e. piecewise linear regimes model) model unless I discover the appropriate tricks. This was the status as of Friday.

Over the weekend I implemented a data structure to handle all the matrices for the regimes approach. In this approach for s regimes, each internal node requires [math]\displaystyle{ S^2 }[/math] matrices, while each tip requires [math]\displaystyle{ S }[/math] vectors if we assume that both it's state and trait value are given.

There are several ways to fill out these matrices under the regime shifts model. The matrix gives the probability of a transition from a node starting with character value x_i starting in regime s_i to character value x_j in regime s_j. Hence on a grid of size [math]\displaystyle{ n }[/math] this involves the [math]\displaystyle{ S^2 }[/math] matrices of size [math]\displaystyle{ n^2 }[/math]. Each of these elements can be computed under the assumption that transitions occur anywhere along branches (as I originally proposed for my quals) or that the occur only at the nodes. The later is computationally much simpler, since I believe it avoids the matrix exponentiation but also ignores branch length. I still need to implement the function for this simpler case. In the former I've needed to make the simplifying assumption that jumps between regimes assume values from the stationary distribution, otherwise I lose the Markov property as trait value depends on path. The equation for this case appears in my proposal and I have implemented in the code. The calculation is already slower than I might have liked.

The current matrix implementation of integrate-up-the-tree uses only a single transition matrix per branch, since it wasn't written for the regimes case. This will require modifying my recursive function a bit to handle having multiple possible transition matrices along each branch (corresponding to different possible states) that I will have to sum over.

I've been investigating APIs for RJMCMC so I don't have to start coding from scratch. It appears that the MCMC routine in MR. Bayes is written from the ground up, rather than as its own library. David Hastie, former Ph.D. student to Peter Green (who invented the method), has some older software called AutoMix for this. I've been in touch with David, who is moving to a new position in April and will be working full time on a replacement MCMC API and may be interested in collaborating on this project.

For the primary objective of the project - handling multiple niches so that the distribution of traits is not always multivariate normal - the crux is still specifying an appropriate model. Transitions at nodes only seems most appealing as a starting point. Of course if I can generate the transition density fast enough from the forward kolmogorov equation of a nonlinear model, that too would be appealing, though that landscape model has its own interpretation challenges.

Another early step would be to use the observed distribution of traits as the stationary distribution and test that model against the linear ones. Of course this requires parameterizing the stationary distribution and accepting this model would imply the lack of any phylogenetic signal -- i.e. branches were sufficiently long enough for the process to reach stationary state.

So far I've focused entirely on calculating the likelihood of the model itself. I've implemented classic multidimensional minimizers -- Nelder-Meade simplex algorithm and a simulated annealing algorithm -- to search over the parameter space for the maximum likelihood parameter set. Eventually I plan to implement this search as Reversible Jump Markov Chain Monte Carlo instead. Any of these approaches still requires the ability to quickly determine the likelihood of a given model, so of course that is my primary focus at this point.

Meanwhile, I'm also starting to explore the Bayesian MCMC approach simply for choosing between BM and various OU models. Eventually I'd like to extend this framework to include the mixed OU models of Butler and King (2004) implemented in OUCH. This should be a good warm-up for the MCMC of my full model.

Comments or suggestions on any of this? Leave a message on the talk page or contact me.