Measuring phylogenetic diversity using metagenomic data

The general question is how to measure phylogenetic diversity from metagenomic data. The approach currently in use:

Identify sequences from different gene families (using AMPHORA)
Align the metagenomic reads identified by AMPHORA to the AMPHORA reference sequence database
Build a tree from the combined metagenomic/reference sequences
- Several tree inference methods are being tried (maximum likelihood, Bayesian, minimum evolution, etc.)
Use the resulting phylogeny to estimate various measures of phylogenetic diversity and community structure

Project: Estimating phylogenetic diversity using AMPHORA marker genes

Option 1: build 31 separate trees (one per gene family)
Option 2: concatenate/tile the sequences into a single large alignment (supermatrix) and build one tree
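
The concatenation step can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `concatenate_alignments`, the dict-based alignment format, and the toy sequences are all assumptions made for the example. A sequence that is absent from a gene family (for example, a read that hits only one marker) is padded with gap characters so that every row of the supermatrix has the same length.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping sequence name -> aligned string
    (hypothetical in-memory format chosen for this sketch).
    Sequences missing from a gene are filled with gap characters.
    """
    # Collect every sequence name seen in any gene, in a stable order.
    names = sorted({name for aln in gene_alignments for name in aln})
    parts = {name: [] for name in names}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        width = lengths.pop()
        for name in names:
            # Pad with gaps when this sequence has no data for this gene.
            parts[name].append(aln.get(name, gap * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy data: two gene families, one read that covers only the first.
gene_a = {"ref1": "MKV", "ref2": "MRV", "read1": "M-V"}
gene_b = {"ref1": "LLGA", "ref2": "LIGA"}
matrix = concatenate_alignments([gene_a, gene_b])
for name, row in matrix.items():
    print(name, row)
```

Because each read typically covers only part of one gene family, most of a read's row in the supermatrix is gaps; the reference sequences provide the columns that tie the gene families together.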

Sample size issues
- For individual gene family alignments, most genes have only a few sequences, which makes phylogenetic diversity hard to estimate
- The combined alignment with all 31 gene families is large (7381 sites, 1075 sequences) but can be analyzed in reasonable time using RAxML (~12-24 hours on 8 cores).

After tree inference, prune the reference sequences so that only the metagenomic reads remain on the tree. The tree itself was still built with the reference sequences acting as a phylogenetic scaffold.
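
A minimal sketch of the pruning step, assuming a small hand-rolled tree structure rather than any particular phylogenetics library. The `Node` class, the `prune` function, and the tip names are hypothetical. When a reference tip is removed and leaves an internal node with a single child, that node is collapsed and its branch length is added to the child, so distances between the remaining reads are unchanged.

```python
class Node:
    """Tree node; `length` is the length of the branch above the node."""

    def __init__(self, name=None, length=0.0, children=None):
        self.name = name
        self.length = length
        self.children = children or []


def prune(node, keep):
    """Return a copy of the tree restricted to tips named in `keep`, or None."""
    if not node.children:
        return Node(node.name, node.length) if node.name in keep else None
    kept = [c for c in (prune(child, keep) for child in node.children) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # Collapse the unary node: merge its branch into the surviving child.
        only = kept[0]
        only.length += node.length
        return only
    return Node(node.name, node.length, kept)


def tip_names(node):
    """List the tip names below a node, left to right."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in tip_names(child)]


# Toy scaffold tree: ((ref1:0.1,read1:0.2):0.3,(read2:0.4,ref2:0.5):0.6)
tree = Node(children=[
    Node(length=0.3, children=[Node("ref1", 0.1), Node("read1", 0.2)]),
    Node(length=0.6, children=[Node("read2", 0.4), Node("ref2", 0.5)]),
])
pruned = prune(tree, keep={"read1", "read2"})
print(tip_names(pruned))
print([round(child.length, 3) for child in pruned.children])
```

In the toy tree the path between read1 and read2 has length 1.5 both before and after pruning, which is the property that lets the reference sequences serve as a scaffold without contributing to the diversity estimates.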

This seems to be working well: most metagenomic reads are placed fairly close to some sequence in the reference tree. Phylogenies will be posted here.

Total/mean branch length within/among samples
Test whether samples that are close together in space/environment are more phylogenetically similar
- i.e. cluster samples based on the similarity of their community phylogenetic diversity
Test whether alpha diversity/phylogenetic diversity changes across space/environment
Compare taxonomic (16S OTU) and phylogenetic diversity
Compare taxonomic assignments (from AMPHORA) to taxonomic assignments from COGs/etc.
Identify nodes on the tree that are over- or underrepresented in different samples (to model habitat evolution)
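
The first two analyses above can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the project: the nested-tuple tree format, the function names, and the toy samples are hypothetical. Total branch length for a sample is taken here as the sum of all branches with at least one of the sample's reads below them (Faith's phylogenetic diversity, measured down to the root). The between-sample measure is the fraction of that branch length found in only one of the two samples, which is the idea behind unweighted UniFrac and is swapped in here as one possible choice.

```python
def collect_branches(tree):
    """Return [(branch_length, frozenset_of_tips_below), ...] for every branch.

    `tree` is a nested tuple (name, length, children); tips have children == ().
    The entry for a node's own branch is always last in the returned list.
    """
    name, length, children = tree
    if not children:
        return [(length, frozenset([name]))]
    result = []
    tips = frozenset()
    for child in children:
        sub = collect_branches(child)
        result.extend(sub)
        tips |= sub[-1][1]  # the child's own branch carries all of its tips
    result.append((length, tips))
    return result


def total_branch_length(branch_list, sample):
    """Sum the branches that lead to at least one read in the sample."""
    return sum(length for length, tips in branch_list if tips & sample)


def unique_fraction(branch_list, sample_a, sample_b):
    """Fraction of branch length present in exactly one of the two samples."""
    either = unique = 0.0
    for length, tips in branch_list:
        in_a, in_b = bool(tips & sample_a), bool(tips & sample_b)
        if in_a or in_b:
            either += length
            if in_a != in_b:
                unique += length
    return unique / either if either else 0.0


# Toy tree of metagenomic reads: ((a:1.0,b:2.0):1.0,(c:1.5,d:1.0):0.5)
tree = ("root", 0.0, (
    ("n1", 1.0, (("a", 1.0, ()), ("b", 2.0, ()))),
    ("n2", 0.5, (("c", 1.5, ()), ("d", 1.0, ()))),
))
branch_list = collect_branches(tree)
sample_1 = {"a", "c"}  # reads from one hypothetical sample
sample_2 = {"b", "c"}  # reads from another
print(total_branch_length(branch_list, sample_1))
print(total_branch_length(branch_list, sample_2))
print(unique_fraction(branch_list, sample_1, sample_2))
```

A matrix of such pairwise values across all samples could then feed a standard clustering method to ask whether samples from similar environments group together.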