# Discussion on phylogenetic methods

Major overhaul (15 April 2009): I reorganized the wiki pages on phylogenetic methods; the main page is Phylogenetic Methods, and this page is reserved for discussion. I have also broken up this discussion page into several pages because it was getting unwieldy, and I didn't feel like I knew what was here anymore.

If you add something to this page, consider **just putting a summary here and starting a new page for the full content.**

-- Sam Riesenfeld

## Links to relevant discussions

- **Which methods do we try?** We have compiled a list of methods for alignment and phylogeny-building that we may want to compare.
- **Discussion on Jonathan's blog**: There is a large discussion that is very relevant on Jonathan's blog.
- **Initial conference call discussion**: Notes
- **Sequence data and simulated data**: Complete sequences are from ComboDB. We want to test methods on simulated metagenomic data.

## JAE Sent Email to Multiple People

See his e-mails. See all responses.

### Summary of Responses

- Brad Shafer suggests using supermatrix approach and asking Bob Thomson. Robert Thomson says success depends on the pattern of missing data, advises against supertrees, and asks for more details.
- Peter Wainwright says ask Brad and Sanderson, thinks supermatrix is right idea.
- Junhyong Kim says "some kind of supertree method" and gives references to previous work, including work on alignment assembly, which he says has been superseded by work by Lior Pachter.
- Jonathan Badger and postdoc John Mccrow suggest building reference tree of full length sequences and scoring each possible additive position in reference tree for each fragment.
- Brian O'Meara says use supermatrix approach, though there may be issues if fragments don't overlap enough. He has ideas about estimating branch lengths correctly for fragments.
- Mike Sanderson says: Explore the data with parsimony as a concatenated alignment and then MRP supertrees before going on to model-based methods.
- Bruce Rannala says he'll get back to you.
- Mark Blaxter says to represent nonoverlappers by unresolved polytomies and suggests Bayesian approaches, in particular, MrBayes + Tracer. He gives a full outline of how to do the alignments also.
- Rutger Vos (in response to Jenna rather than Jonathan) says to make a tree out of complete sequences and also a tree out of all sequences and try to figure out how to deal with unstable parts of full tree.
- Lior Pachter never responded (to e-mails from Sam and Katie)

## Katie talked with Steven Brenner about this problem

The Brenner lab (with Bob Edgar and Michael Jordan) have thought about this problem. They shared a grant proposal on the topic of alignments and phylogenetic tree building with metagenomic data. Their main tree building ideas were (1) subset the alignment into blocks without terminal gaps and do super-tree phylogenies on these blocks, and (2) use distance matrix (e.g. NJ) tree methods on a matrix where each pairwise distance is computed using only the columns that overlap for that pair of sequences. They have some new ideas on this problem and want to set up a meeting.
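Idea (2) can be illustrated with a small sketch. This is only an assumption about what such a computation might look like, not code from the Brenner lab proposal: the function names, the `min_overlap` parameter, the gap characters, and the toy alignment are all made up for illustration, and a plain p-distance stands in for whatever corrected distance would really be used.

```python
from itertools import combinations

# Characters treated as "no data" in an aligned sequence (an assumption).
MISSING = set("-.?")

def overlap_p_distance(seq_a, seq_b, min_overlap=1):
    """Proportion of mismatches over columns where BOTH sequences have data.

    Returns None when fewer than min_overlap columns are shared, so that
    non-overlapping pairs are flagged instead of getting a fake distance.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    shared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue
        shared += 1
        if a != b:
            mismatches += 1
    if shared < min_overlap:
        return None
    return mismatches / shared

def overlap_distance_matrix(alignment, min_overlap=1):
    """Nested dict of pairwise distances; None marks pairs without enough overlap."""
    names = list(alignment)
    dist = {name: {name: 0.0} for name in names}
    for x, y in combinations(names, 2):
        d = overlap_p_distance(alignment[x], alignment[y], min_overlap)
        dist[x][y] = dist[y][x] = d
    return dist

# Toy alignment: two full-length references and two non-overlapping reads.
alignment = {
    "ref1":  "ACGTACGTAC",
    "ref2":  "ACGTTCGTAC",
    "readA": "ACGAA-----",
    "readB": "-----CGTAC",
}
matrix = overlap_distance_matrix(alignment, min_overlap=3)
```

Note that the `None` entries (here, `readA` vs. `readB`) are exactly the cells a standard NJ implementation cannot handle directly, which is one of the practical issues with this approach.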

## A recent paper by Cheng et al addresses this problem

Cheng et al consider gappy alignments (from ESTs or metagenomics). They propose a method for tree building called Subdividing Incomplete Alignments (SIA). This is basically a supermatrix approach. Distance matrices are built for subalignments without much missing data. Then, these distance matrices are combined into a single distance matrix using a linear model and Bayesian methods. Distances are assumed to be normally distributed.
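To make the "combine distance matrices under a normal assumption" step concrete, here is a deliberately simplified stand-in. It is not the SIA linear model: it only shows the textbook result that, for independent normally distributed estimates of one pairwise distance with known variances and a flat prior, the posterior mean is the precision-weighted average. The function name and the input format are hypothetical.

```python
def combine_distance_estimates(estimates):
    """Combine estimates of ONE pairwise distance from several subalignments.

    estimates: list of (distance, variance) tuples; distance may be None when
    a subalignment does not contain both sequences.
    Returns (combined_distance, combined_variance), or None if nothing is usable.
    Assumes independent, normally distributed errors and a flat prior.
    """
    usable = [(d, v) for d, v in estimates if d is not None and v > 0]
    if not usable:
        return None
    precision = sum(1.0 / v for _, v in usable)        # total inverse variance
    mean = sum(d / v for d, v in usable) / precision   # precision-weighted mean
    return mean, 1.0 / precision

# Two subalignments cover this pair; a third does not contain both sequences.
combined = combine_distance_estimates([(0.10, 0.001), (0.30, 0.004), (None, 0.002)])
```

The estimate from the subalignment with the smaller variance dominates, which is the behavior one would want when some blocks contain far fewer overlapping columns than others.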

## FastTree

Morgan Price (Adam Arkin's lab) shared a manuscript about an algorithm called FastTree. There is a discussion about this on Jonathan's blog. FastTree avoids building a full distance matrix and uses heuristics to explore candidate joins in a very local way. It looks like it performs surprisingly well. The methods include local bootstrap estimates. Performance optimization (speed and memory) is emphasized: FastTree will work on thousands of sequences, where likelihood and distance methods break down.

## Missing data misleads some ML and Bayesian methods

See the full discussion of the Lemmon et al paper on the effect of missing data.

Summary: Aaron received a prepress copy of the Lemmon et al paper on the effect of missing data on ML and Bayesian phylogenetics that will appear in Systematic Biology. He summarized the take-home messages as: treating alignment gaps as missing data can positively mislead inference of topology and branch length. A second type of misleading influence occurs when performing inference on a concatenated alignment of genes that have evolved at different rates. For most of the tests, the ML methods hold up much better than Bayesian methods. In the presence of missing data the Bayesian priors end up strongly shaping the posterior distribution over topologies. Modeling gamma-distributed rate heterogeneity can force sequences with missing data to cluster erroneously.

## Katie discussed the problem with David Swofford (at Duke)

David is interested in the problem, and thinks it is not trivial. He might like to collaborate. I pointed him to the blog and some of the papers we have been reading. His initial thoughts/hunches were:

- supermatrix approaches might not work with so much missing data
- likelihood-based (ML or Bayesian) tree building methods will do better than distance matrix-based methods (e.g. NJ)
- treating reads as taxa (leaves) makes sense
- using a guide tree (AMPHORA or rRNA) is a good idea

## Aaron's harebrained scheme

Aaron has attempted to specify a model for phylogeny of incomplete data. See: Likelihood for incomplete data.

## Robin Kodner and Erick Matsen

Robin and Erick are coming to Jonathan's lab meeting Thursday 2/19/09.

See Sam's summary of Robin's talk from Metagenomics 08. The phylogenetic method they are using:

0) Build a reference alignment and phylogeny with reference sequences.

1) For each metagenomic fragment **separately**:
   a) align the fragment to the reference alignment;
   b) build a phylogeny using the fragment and the subsequences of the reference sequences that align with this fragment.

2) Combine each phylogeny (of one fragment + reference subsequences) into one big phylogeny, leaving polytomies unresolved.

Note: PHYML used for steps 0 and 1.
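The data preparation for step 1b can be sketched as follows. This is a guess at the bookkeeping involved, not Robin and Erick's code: the function names and the `fragment_name` parameter are hypothetical, the fragment is assumed to be already aligned to the reference columns (step 1a), and the tree building itself (PHYML in their pipeline) is not shown.

```python
GAP = "-"

def fragment_columns(aligned_fragment):
    """Indices of alignment columns where the fragment has a residue."""
    return [i for i, c in enumerate(aligned_fragment) if c != GAP]

def reference_subalignment(reference_alignment, aligned_fragment,
                           fragment_name="fragment"):
    """Cut the reference alignment down to the region spanned by one fragment.

    reference_alignment: dict of name -> aligned sequence (all equal length).
    aligned_fragment: the fragment padded with gaps to the same length.
    Returns a dict holding the reference subsequences plus the fragment,
    ready to be handed to a tree-building program.
    """
    if fragment_name in reference_alignment:
        raise ValueError("fragment name collides with a reference name")
    cols = fragment_columns(aligned_fragment)
    if not cols:
        raise ValueError("fragment has no aligned residues")
    start, end = cols[0], cols[-1] + 1   # span from first to last covered column
    sub = {name: seq[start:end] for name, seq in reference_alignment.items()}
    sub[fragment_name] = aligned_fragment[start:end]
    return sub

references = {"r1": "ACGTACGTAC", "r2": "ACGTTCGTAC"}
sub = reference_subalignment(references, "---TAC----", fragment_name="read1")
```

Each fragment yields its own small alignment, so step 1 is embarrassingly parallel; the hard part is step 2, combining the per-fragment trees.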

**Collaborating with Robin and Erick**: Katie, Sam, and Jonathan have talked to Robin and Erick about using their program ("pplacer") as a starting point for developing a new method.

## Meeting of Katie and Sam with Steven Brenner, Bob Edgar, Mike Jordan, David Soergel

This was a really informative, good discussion; see the complete notes for details on methods we talked about and possible issues we considered.

Summary:
We talked about several existing and novel approaches to building trees. Roughly, these include:

- Full Maximum Likelihood (ML), treating gaps as missing data: Probably not computationally feasible.
- Iterative approach where reads are assigned to branches (e.g. by a distance method, like Erick Matsen's) and then ML is performed on subtrees/clades
- Subalignment extension

## Albert Wu (LBNL) visited the Pollard Lab

Albert works on alignment-free phylogenetic methods. He has some nice results with large DNA viruses, as well as eukaryotes, the tree of life, and texts. A copy of his slides is here.
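For readers unfamiliar with the alignment-free idea, here is a generic illustration based on k-mer frequency profiles. It is not taken from Albert's slides and should not be read as his method: the choice of k, the Euclidean distance, and the function names are all assumptions made for this sketch.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k):
    """Relative frequency of every k-mer occurring in seq (no alignment needed)."""
    seq = seq.upper()
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))

# The resulting pairwise distances could be fed to any distance-based tree method.
d_same = profile_distance(kmer_profile("ACGTACGT", 2), kmer_profile("ACGTACGT", 2))
d_diff = profile_distance(kmer_profile("AAAA", 2), kmer_profile("CCCC", 2))
```

Because the input can be any string, the same computation applies to texts as well as sequences.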

## Bob Edgar is working on alignment tools that might be useful

Katie followed up with Bob after our meeting with the Brenner Lab. He has two new tools and is wondering whether we would be interested in using or testing them.

See the complete details and discussion. In summary, his tools are:

1) de novo multiple sequence alignment tool for metagenomic data that performs well on fragmented, incomplete sequences

2) frameshift detection tool

Do we want to try either of these? Post your comments here.

**Follow-up:** Bob has sent code to us. Sam will test out his tools on a few data sets. If any look promising, we may do larger evaluations with simulations or will provide simulated data sets to Bob for his own benchmarking.

## Karen Cranston (EOL/Sanderson lab)

I spoke with Karen Cranston, a former Sanderson postdoc who recently moved to EOL at the Field Museum. She works a lot with EST data, which are similar to metagenomic data (short reads aligned to some full-length reference sequences), and has found that you can often do a good job of recovering phylogenetic information when you have reference data to align the short sequences to. She suggested that the methods that people use to work with EST data ought to work for metagenomic data (a variety of ML/Bayesian methods), but it's really important to assess uncertainty in the placement of reads using some measure of support (bootstrap or Bayesian posterior probabilities), since there will probably be a lot of uncertainty.

## Alexis Stamatakis has implemented short read placement methods in RAxML

The latest alpha version of RAxML (7.2.1) implements an algorithm for placing short reads on a reference phylogeny using ML. The method can use bootstrap resampling to evaluate the likelihood of read placement at each node on a reference phylogeny. This sounds similar to PPlacer, but it is based on evaluating the likelihood of placing each read at each node in the tree. The output is a list of bootstrap placements of reads on the tree, with a confidence value for each read.
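The idea of bootstrap confidence for a read placement can be shown with a toy example. This is not RAxML's algorithm and uses no RAxML options: instead of ML placement on the nodes of a tree, it assigns the read to the nearest reference sequence by mismatch proportion, purely to show how resampling alignment columns turns one placement into a support value. All names and parameters here are hypothetical.

```python
import random
from collections import Counter

def nearest_reference(read, references, columns):
    """Name of the reference closest to the read over the given columns.

    Only columns where both sequences have data are compared; ties go to the
    reference listed first. Returns None if no column is shared with any reference.
    """
    best_name, best_dist = None, None
    for name, ref in references.items():
        shared = mismatches = 0
        for i in columns:
            if read[i] == "-" or ref[i] == "-":
                continue
            shared += 1
            mismatches += read[i] != ref[i]
        if shared == 0:
            continue
        d = mismatches / shared
        if best_dist is None or d < best_dist:
            best_name, best_dist = name, d
    return best_name

def bootstrap_placement(read, references, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each reference is the best match."""
    rng = random.Random(seed)
    n = len(read)
    votes = Counter()
    for _ in range(replicates):
        # Resample alignment columns with replacement, as in a standard bootstrap.
        columns = [rng.randrange(n) for _ in range(n)]
        votes[nearest_reference(read, references, columns)] += 1
    return {name: count / replicates for name, count in votes.items()}

references = {"refA": "ACGTACGTAC", "refB": "TGCATGCATG"}
support = bootstrap_placement("--GTACGT--", references, replicates=100, seed=1)
```

A key of `None` in the result counts replicates in which the resampled columns happened to miss the read entirely, which becomes more common as reads get shorter relative to the alignment.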