From OpenWetWare

Placing metagenomic reads on a reference phylogeny

This is a place to discuss the method I've been trying for building phylogenies from metagenomic reads.

Individual gene families

For an individual gene family, the idea is simply to take the AMPHORA reference sequences for that family plus the metagenomic reads that AMPHORA assigned to it, and run a phylogenetic analysis on the reference and query sequences together.

There are scripts to take AMPHORA reference sequences plus metagenomic reads from individual gene families and build a phylogeny for that gene family here.

Multiple gene families

For multiple gene families, the idea is similar, but now we take the reference sequences from all 31 gene families in AMPHORA and concatenate them. We then 'tile' the metagenomic reads that were aligned to each gene family into the same alignment, so that each read occupies only the columns of its own gene family and is gaps everywhere else. We end up with something that looks like this (for a case where we have three gene families A, B, and C; only the metagenomic reads are shown):

  • mgseqA1 --XXXX-----------------------
  • mgseqA2 -XXXX------------------------
  • mgseqA3 ----XXXXX--------------------
  • mgseqB1 ----------XXXX---------------
  • mgseqB2 -------------XXXX------------
  • mgseqB3 ----------XXXXX--------------
  • mgseqC1 --------------------XXXXXXXXX
  • mgseqC2 ---------------------XXXXX---
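
The tiling step illustrated above can be sketched in a few lines of Python. This is only a rough illustration, not the actual scripts: the function name `tile_reads`, the block widths, and the toy reads are all made up for the example, and it assumes each read has already been aligned to its own gene family so that its row is as wide as that family's block.

```python
from typing import Dict, List, Tuple


def tile_reads(
    gene_lengths: List[Tuple[str, int]],
    reads: Dict[str, Tuple[str, str]],
) -> Dict[str, str]:
    """Pad each read out to the full width of the concatenated alignment.

    gene_lengths: (gene_name, aligned_width) pairs in concatenation order.
    reads: read_name -> (gene_name, row), where row is already aligned to
           that gene family and has the same width as the family's block.
    """
    # Work out where each gene family's block starts in the concatenation.
    offsets = {}
    total = 0
    for gene, width in gene_lengths:
        offsets[gene] = (total, width)
        total += width

    tiled = {}
    for name, (gene, row) in reads.items():
        start, width = offsets[gene]
        if len(row) != width:
            raise ValueError(f"{name}: row width {len(row)} != {gene} block width {width}")
        # Gaps before the block, the read itself, gaps after the block.
        tiled[name] = "-" * start + row + "-" * (total - start - width)
    return tiled


# Toy example (hypothetical widths): blocks of 10, 10 and 9 columns.
gene_lengths = [("A", 10), ("B", 10), ("C", 9)]
reads = {
    "mgseqA1": ("A", "--XXXX----"),
    "mgseqB1": ("B", "XXXX------"),
    "mgseqC1": ("C", "XXXXXXXXX"),
}
tiled = tile_reads(gene_lengths, reads)
for name, row in tiled.items():
    print(name, row)
```

The reference sequences need no padding, since after concatenation they already span every block.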

We build a phylogeny from this alignment using maximum likelihood (ML) methods, either by inferring a tree from the entire alignment or by using the methods in RAxML 7.2.1 to iteratively place the query sequences on the AMPHORA genome tree.
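
For the placement route, here is a rough sketch of how the RAxML call might be assembled from Python. It assumes that the placement method meant above corresponds to RAxML's `-f v` option (classify query sequences onto a reference tree), that the binary is called `raxmlHPC`, and that a protein model such as `PROTGAMMAWAG` is appropriate. The file names and the helper `raxml_placement_command` are hypothetical, so check the options against the manual for the RAxML version you have installed.

```python
import shlex
from typing import List, Optional


def raxml_placement_command(
    alignment: str,
    reference_tree: str,
    run_name: str,
    partition_file: Optional[str] = None,
    model: str = "PROTGAMMAWAG",   # assumed model; not stated in the text
    binary: str = "raxmlHPC",      # assumed binary name
) -> List[str]:
    """Build (but do not run) a RAxML command line for read placement."""
    cmd = [
        binary,
        "-f", "v",             # place query sequences on a reference tree
        "-s", alignment,       # alignment with reference + query sequences
        "-t", reference_tree,  # tree containing only the reference taxa
        "-m", model,
        "-n", run_name,
    ]
    if partition_file is not None:
        cmd += ["-q", partition_file]  # per-gene partitions
    return cmd


# Hypothetical file names, for illustration only.
cmd = raxml_placement_command(
    "combined.phy", "amphora_genome_tree.tre", "placement", "genes.partitions"
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.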

The phylogenetic signal in the reference sequences is used to anchor the metagenomic reads on the resulting tree. Imagine that query sequences A1 and B1 came from the same organism refspp1 on the AMPHORA genome tree. Even though these sequences are from different gene families, they should be placed as sister to each other and refspp1 on the resulting tree, and be separated by a very short or zero branch length.


Differences in rates of evolution among different genes will be accounted for by fitting the gamma rate heterogeneity parameter and by partitioning by gene family in the ML analysis. So it will be meaningful to compare branch lengths among sequences that came from different gene families.
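
As a sketch of what partitioning by gene family looks like in practice, the snippet below writes the column ranges of a concatenated alignment in the `MODEL, name = start-end` style that RAxML reads with its `-q` option. The gene names, block widths, and the choice of `WAG` are assumptions for illustration; the real ranges come from the widths of the 31 AMPHORA alignments.

```python
from typing import List, Tuple


def raxml_partition_lines(
    gene_lengths: List[Tuple[str, int]],
    model: str = "WAG",  # assumed substitution model for every partition
) -> List[str]:
    """Return one partition line per gene family, with 1-based inclusive ranges."""
    lines = []
    start = 1
    for gene, width in gene_lengths:
        end = start + width - 1
        lines.append(f"{model}, {gene} = {start}-{end}")
        start = end + 1
    return lines


# Toy example with three hypothetical gene families.
partition_lines = raxml_partition_lines([("geneA", 10), ("geneB", 10), ("geneC", 9)])
print("\n".join(partition_lines))
```

With the gamma model selected, each partition gets its own rate heterogeneity parameter, which is what lets branch lengths be compared across gene families.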

I have this analysis working for the HOT/ALOHA data set, but there are only a few hundred sequences in that data set. It may not be possible to run a full ML analysis for larger metagenomic data sets such as GOS, which has ~150,000 sequences. In that case, we may be stuck using neighbor-joining (NJ) methods to build the tree from the combined multi-gene alignment. Is there some way we can account for heterogeneity in rates of evolution among sites or genes in this kind of analysis? Or should we restrict ourselves to single-gene-family analyses in this case? Even that may not be ideal, since there will be among-site variation in rates of evolution even within a single gene family.

Jonathan suggested scaling branch lengths for each metagenomic sequence by the rate of evolution of that gene. Other suggestions?
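
One possible reading of Jonathan's suggestion, as a rough sketch only: divide the branch leading to each metagenomic read by the relative rate of the gene family it came from, with the rates normalised so their mean is 1. How the per-gene rates would be estimated is left open here. The function name and all numbers below are invented for the example.

```python
from typing import Dict


def scale_branch_lengths(
    branch_lengths: Dict[str, float],
    read_gene: Dict[str, str],
    gene_rates: Dict[str, float],
) -> Dict[str, float]:
    """Rescale each read's branch length by its gene family's relative rate.

    branch_lengths: read_name -> length of the branch leading to that read.
    read_gene: read_name -> gene family the read was assigned to.
    gene_rates: gene family -> estimated rate (any units; normalised below).
    """
    mean_rate = sum(gene_rates.values()) / len(gene_rates)
    scaled = {}
    for read, length in branch_lengths.items():
        relative_rate = gene_rates[read_gene[read]] / mean_rate
        # Fast genes get their branches shortened, slow genes lengthened.
        scaled[read] = length / relative_rate
    return scaled


# Invented numbers: gene C evolves three times faster than gene A.
scaled = scale_branch_lengths(
    branch_lengths={"mgseqA1": 0.1, "mgseqB1": 0.2, "mgseqC1": 0.3},
    read_gene={"mgseqA1": "A", "mgseqB1": "B", "mgseqC1": "C"},
    gene_rates={"A": 0.5, "B": 1.0, "C": 1.5},
)
print(scaled)
```

In this toy case the three reads end up at roughly the same scaled distance. This corrects only for rate differences among genes, not among sites within a gene.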