Functional Convergence between Communities
Initiator: Thomas Sharpton
This page collects rough notes on identifying convergent gene family evolution between communities. More structured, formal documentation will follow at a later date. For now, these should be viewed as scratch notes.
If top-down selection pressures have shaped the metagenome, then taxonomically disparate but environmentally similar communities should exhibit more similar phylogenetic diversity patterns (counts, UniFrac, Fst, etc.) than environmentally dissimilar but taxonomically similar communities.
Apply the genome comparison method of Lozupone et al. (PNAS 2008) to metagenomic data. In summary, take at least three communities, at least two of which share environmental conditions, and do the following:
- Run pairwise UniFrac on the 16S rRNA sequences to build a distance matrix of UniFrac values
- Run pairwise UniFrac on a functionally relevant gene family (or families) of interest to build a corresponding distance matrix
- Cluster environments by taxonomy and then by gene family, and evaluate whether the two trees are concordant
- If they are not, map gene family functional traits onto the 16S tree and determine whether the phylogeny exhibits homoplasies (parsimony)
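The clustering and concordance check in the third step can be sketched roughly as follows. This is only an illustration under stated assumptions: UPGMA (average linkage) is an assumed choice of clustering method, the community names A, B, C and all distance values are made up, and "concordant" is taken to mean that both trees contain the same set of clades. The helper names (`to_dist`, `upgma_clades`) are hypothetical.

```python
from itertools import combinations


def to_dist(pairs):
    """Turn {(a, b): d} into a lookup keyed by unordered pairs."""
    return {frozenset(k): v for k, v in pairs.items()}


def upgma_clades(names, dist):
    """Average-linkage (UPGMA) clustering of communities.

    Returns the set of non-trivial clades (as frozensets of names),
    which is enough to compare two tree topologies.
    """
    clusters = [frozenset([n]) for n in names]
    clades = set()

    def avg(c1, c2):
        # Mean distance over all cross-cluster pairs of communities.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the two closest clusters.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        if len(merged) < len(names):  # skip the root, which is always shared
            clades.add(merged)
    return clades


# Made-up UniFrac values: A and B share an environment,
# A and C are taxonomically similar.
names = ["A", "B", "C"]
dist_16s = to_dist({("A", "B"): 0.7, ("A", "C"): 0.2, ("B", "C"): 0.8})
dist_gene = to_dist({("A", "B"): 0.1, ("A", "C"): 0.6, ("B", "C"): 0.7})

clades_16s = upgma_clades(names, dist_16s)
clades_gene = upgma_clades(names, dist_gene)
concordant = clades_16s == clades_gene
print("concordant:", concordant)
```

With these toy values the 16S tree groups A with C while the gene family tree groups A with B, so the trees are discordant, which is the case that would send the analysis on to the homoplasy step.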
UniFrac is actually a Python module, and it is used either through an interactive session or by running a Python script. For automation purposes, it makes the most sense to write a wrapper that generates a Python script leveraging the UniFrac module and then runs that script. To get the wrapper off the ground, I'll begin with a tool that creates a Python script that executes the following functions:
- read in a tree file (loadTreeFromFile)
- read in an environment file (loadEnvs)
- remove tree branches that have no assigned environment (pruneTree)
- generate a distance matrix for environments using UniFrac metric (makeEnvDistanceMatrix)
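A minimal sketch of such a script generator is shown below. The four function names come from the list above, but everything else is an assumption: the module path is passed in by the caller because the actual import location is not given here, and the argument lists in the generated script (for example `pruneTree(tree, envs)`) are placeholders that should be checked against the UniFrac module before use. The names `build_unifrac_script` and `write_script` are hypothetical.

```python
# Template for the generated driver script.
# ASSUMPTION: the import path and the argument lists below are placeholders;
# only the four function names are taken from the notes.
SCRIPT_TEMPLATE = """\
# Auto-generated UniFrac driver script (call signatures are assumed).
from {module} import loadTreeFromFile, loadEnvs
from {module} import pruneTree, makeEnvDistanceMatrix

tree = loadTreeFromFile({tree_path!r})        # read in a tree file
envs = loadEnvs({env_path!r})                 # read in an environment file
tree = pruneTree(tree, envs)                  # drop branches with no environment
matrix = makeEnvDistanceMatrix(tree, envs)    # UniFrac distances between environments
print(matrix)
"""


def build_unifrac_script(module, tree_path, env_path):
    """Return the text of a Python script that runs the four UniFrac steps."""
    return SCRIPT_TEMPLATE.format(
        module=module, tree_path=tree_path, env_path=env_path
    )


def write_script(script_text, out_path):
    """Write the generated script to disk so it can be run separately."""
    with open(out_path, "w") as handle:
        handle.write(script_text)
    return out_path


if __name__ == "__main__":
    # 'unifrac_module_placeholder' is a stand-in, not a real import path.
    text = build_unifrac_script(
        "unifrac_module_placeholder", "communities.tree", "communities.env"
    )
    print(text)
```

The wrapper could then launch the written file with the Python interpreter (for example through the standard `subprocess` module) once the placeholder module path and signatures have been replaced with the real ones.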