Availability and dependencies
The scripts are available as an archive. The data included with the scripts are the results of running AMPHORA on the HOT/ALOHA metagenomic data set. Download the scripts and associated data here: File:MgTreeBuildingScripts.zip
These scripts require Perl, the Bio::Phylo and BioPerl modules for Perl, MUSCLE, and RAxML-HPC-multithreaded.
This directory contains scripts and data files to automate the process of inferring phylogenetic relationships among metagenomic reads from different gene families, as identified by AMPHORA.
Many of the scripts are kludgy: they make assumptions about the data and perform no error checking. You will need to modify them to work with datasets other than the output generated by running AMPHORA on the HOT/ALOHA data set, and to run optimally on systems other than an 8-core multithreaded processor, although this should be easy to do.
Three methods are available to infer phylogenetic relationships among metagenomic reads in each gene family. Tree inference is performed using RAxML.
- A tree is inferred from the metagenomic sequences alone. (mg)
- A tree is inferred from the metagenomic sequences, plus the sequences in the AMPHORA reference alignment. The reference sequences are then pruned out of the tree, leaving just the metagenomic sequences as the tips. (mgr)
- A tree is inferred from the metagenomic sequences, plus the sequences in the AMPHORA reference alignment, and tree topology is constrained by the AMPHORA reference phylogeny. The reference sequences are then pruned out of the tree, leaving just the metagenomic sequences as the tips. (mgrConstrained)
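The difference between the three methods can be sketched as a small planning function. This is a minimal illustration, not the code in the archive: the file names, the substitution model (PROTGAMMAWAG), the random seed, and the example gene family rpoB are all assumptions made for the example. The RAxML options shown (-T threads, -m model, -p seed, -s alignment, -n run name, -g multifurcating constraint tree) are standard RAxML flags, but whether the archived scripts pass the constraint with -g or -r is not stated here. The default of 8 threads mirrors the 8-core system mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional

METHODS = ("mg", "mgr", "mgrConstrained")


@dataclass
class TreePlan:
    method: str                     # one of METHODS
    include_reference: bool         # add AMPHORA reference sequences to the alignment?
    constraint_tree: Optional[str]  # reference tree used as a topology constraint, if any
    prune_reference: bool           # prune reference tips after inference?
    raxml_args: List[str]           # command line that would be handed to RAxML


def plan_tree(method: str, family: str, threads: int = 8,
              model: str = "PROTGAMMAWAG", seed: int = 12345) -> TreePlan:
    """Describe how one gene family would be analysed under a given method.

    All file names below are hypothetical placeholders.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")

    # mgr and mgrConstrained both add the reference sequences, then prune them.
    include_reference = method in ("mgr", "mgrConstrained")
    # Only mgrConstrained fixes the topology to the reference phylogeny.
    constraint_tree = f"{family}.reference.tre" if method == "mgrConstrained" else None

    args = [
        "raxmlHPC-PTHREADS",
        "-T", str(threads),              # number of threads
        "-m", model,                     # substitution model (assumed)
        "-p", str(seed),                 # parsimony random seed (assumed)
        "-s", f"{family}.{method}.phy",  # input alignment (hypothetical name)
        "-n", f"{family}.{method}",      # run name used for output files
    ]
    if constraint_tree is not None:
        args += ["-g", constraint_tree]  # multifurcating constraint tree

    return TreePlan(method, include_reference, constraint_tree,
                    prune_reference=include_reference, raxml_args=args)


if __name__ == "__main__":
    for m in METHODS:
        plan = plan_tree(m, "rpoB")
        print(m, plan.prune_reference, " ".join(plan.raxml_args))
```

Lowering the threads argument is the kind of change needed to run on a machine with fewer cores.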
Files and directories included in the archive
- text file containing the list of gene families to process (by default, the 31 families in AMPHORA)
- directory containing metagenomic reads identified by AMPHORA
- script to infer phylogenetic relationships among metagenomic reads
- script to infer phylogenetic relationships among metagenomic reads based on combined ML analysis of reference alignments plus metagenomic reads
- script to infer phylogenetic relationships among metagenomic reads based on combined ML analysis of reference alignments plus metagenomic reads, with tree topology constrained by the AMPHORA reference tree topology for each gene
- AMPHORA reference alignments for each gene family
- AMPHORA reference phylogeny for each gene family
- Perl scripts used by the tree building script
Directories created by the scripts
- This directory contains the RAxML working files and logs for each gene family
- This directory contains the results of the analyses. Exact filenames vary depending on whether the tree was built with metagenomic sequences only (mg), metagenomic sequences plus reference alignment (mgr), or metagenomic sequences plus reference alignments constrained by reference tree topology (mgrConstrained).
Results files created by the scripts
Files in /results include (where X is a gene family and Y is one of mg, mgr, mgrConstrained):
- aligned sequences in fasta format
- aligned sequences in relaxed Phylip format
- metagenomic sequence occurrence in environmental samples
- this is a Phylocom-formatted sample file (http://phylodiversity.net/phylocom)
- each row contains sampleID<tab>Abundance<tab>sequenceID
- the phylogenetic tree linking all metagenomic reads in a gene family
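As an illustration of the sample file layout described above (sampleID<tab>Abundance<tab>sequenceID), the following sketch reads such a file into a nested dictionary. It assumes abundances are whole numbers and that there is no header row. The sample and sequence IDs in the example are made up and do not come from the HOT/ALOHA data.

```python
from typing import Dict


def parse_sample_file(text: str) -> Dict[str, Dict[str, int]]:
    """Parse Phylocom-style sample text into {sampleID: {sequenceID: abundance}}."""
    samples: Dict[str, Dict[str, int]] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated fields, "
                             f"got {len(fields)}")
        sample_id, abundance, sequence_id = (f.strip() for f in fields)
        counts = samples.setdefault(sample_id, {})
        # Sum abundances if the same sequence is listed twice for one sample.
        counts[sequence_id] = counts.get(sequence_id, 0) + int(abundance)
    return samples


if __name__ == "__main__":
    # Hypothetical rows: two samples, three reads.
    example = "sampleA\t1\tread_001\nsampleA\t1\tread_002\nsampleB\t2\tread_001\n"
    for sample_id, counts in parse_sample_file(example).items():
        print(sample_id, sum(counts.values()), sorted(counts))
```

A structure like this makes it straightforward to check that every sequence ID in the sample file also appears as a tip in the corresponding tree.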