Introduction to ss-rRNA Taxonomy Assigning Pipeline (STAP)
The comparative analysis of ss-rRNA sequences is one of the most powerful approaches for studying phylogenetic relationships among all organisms. Our ss-rRNA Taxonomy Assigning Pipeline (STAP) combines publicly available packages such as, PHYML, BLASTN, and CLUSTALW with our own automatic alignment masking and tree parsing programs. STAP makes automatic taxonomic assignments for ss-rRNAs based on neighbor-joining or maximum likelihood phylogenetic trees rather than on the top BLASTN hits, and thus its results are more reliable and accurate. Most importantly, the automation yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts.
First, ss-rRNA sequences obtained either by PCR of environmental samples or by metagenomic shotgun sequencing are searched against our ss-rRNA database by BLASTN to select a related data set. STAP then automatically generates, masks, and trims the multiple sequence alignments. Next, it builds a phylogenetic tree by either the maximum likelihood or neighbor-joining method. Automated analysis of the tree yields taxonomic assignments for each query sequence.
A paper describing STAP has been published (7/2/08) in PLoS One.
- Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS ONE 3(7): e2566. doi:10.1371/journal.pone.0002566
There are two files needed to be downloaded:
db.tar.gz : the STAP database package, available from http://bobcat.genomecenter.ucdavis.edu/STAP/db.tar.gz [84.6 MB before unzipping, 652.4 MB after opening]
rRNA_scripts.tar.gz: the STAP program file package, available from http://bobcat.genomecenter.ucdavis.edu/STAP/rRNA_scripts.tar.gz [24 kb before unzipping, 100 kb after opening]
These are the requirements to run the STAP pipeline:
- Linux OS (kernel version 2.6 or later)
- Perl (version 5.0 or later)
- NCBI blast package (download ncbi.tar.gz from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/)
- PHYML (more information from http://atgc.lirmm.fr/phyml/scripts/binaries.php) [Note: You might need to change the permissions of this file, using "chmod 577"to execute it.]
- CLUSTALW (download from www.ebi.ac.uk/Tools/clustalw)
- Perl Module: BioPerl 1.4 or later (more information from http://www.bioperl.org/wiki/Main_Page)
- Perl Module: XML::DOM (more information from http://search.cpan.org/~tjmather/XML-DOM-1.44/lib/XML/DOM.pm)
Note: If you are using MacOS, you might find it is easiest to install both Bioperl and XML-DOM using fink: http://www.finkproject.org/download/index.php. For more information, see: http://www.bioperl.org/wiki/Getting_BioPerl#MacOS_X_using_fink. Both Bioperl and XML-DOM will be installed if you use this. The one possible problem is that fink puts the bioperl libraries in /sw/lib, so you might need to add this to your path (e.g., export PERL5LIB=$PERL5LIB:/sw/lib/perl5/5.8.6)
- download database db.tar.gz into the desirable directory
tar -xzvf db.tar.gz
cd db_dirWrite down the full path of the database directory db_dir
- Download rRNA_scripts.tar.gz into the desirable directory
tar -xzvf rRNA_scripts.tar.gz
- Launch the perl setup program.
perl setup.plAnswer the questions when prompt to set the pipline options.
- To run STAP, run this script:
rRNA_pipeline_for_one.plThis script can be located anywhere you want, but the directory rRNA_pipeline_scripts and its content should be kept intact.
Run the Program
- Prepare the input sequence file in fasta format, one sequence per input file
Example: file test.seq >accession TTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGAAACGAGAAGTAGCTTGCTACTTCGGCGT CGAGCGGCGGACGGGTGAGTAATGCATAGGAAGTTGCCCAGTAGAGGGGGATAACCATTGGAAACGATGG CTAATACCGCATAACCTCTTCGGAGCAAAGCGGGGGACCTTCGGGCCTCGCGCTACTGGATACGCCTATG TGGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGATC AGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGG CGAAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTTGTGA GGAAGGTTTCGTAGTTAATAACTGCGTTGCTTGACGTTAGCAACAGAAGAAGCACCGGCTAACTCCGTGC CAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGG TTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAAGGTCATTTGAAACTGGCAAACTAGAGTA CTGTAGAGGGGGGTAGAATTTCAGGTGTAGCGGTGAAATGCGTAGAGATCTGAAGGAATACCAGTGGCGA AGGCGGCCCCCTGGACAGATACTGACACTCAGATGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCT GGTAGTCCACGCCGTAAACGATGTCTACTTGGAGGTTGTGGCCTTGAGCCGTGGCTTTCGGAGCTAACGC GTTAAGTAGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAATGAATTGACGGGGGCCCGCACAA GCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCAGAGAACTT TCCAGAGATGGATTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGT GAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTGTTTGCCAGCGAGTAATGTCGGGAAC TCCAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAG TAGGGCTACACACGTGCTACAATGGCGCATACAGAGGGCAGCAAGCTAGCGATAGTGAGCGAATCCCAAA AAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGA TCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGC AAAAGAAGTAGGTAGTTTAACCTTCGGGAGGACGCTTACCACTTTGTGGTTCATGACTGGGGTGAAGTCG TAACAAGGTAGCCCTAGGGGAACC
- Run the Program:
rRNA_pipeline_for_one.pl -i [input]where [input] is replaced by the name of the input sequence file.Example:
rRNA_pipeline_for_one.pl -i test.seqMore parameters:
-i (required) Input format (one sequence per file) -n (optional) The number of unclassified database ss-rRNA nearest-neighbors to ignore when identifying the nearest-neighbor in the first tree Allowed entries: 0-20 Default: 5 -t (optional) When building the second tree, the number of taxonomic levels to step up from the taxonomic assignment made in the first tree. Default: 1 -o (optional) Output directory Default: the current directory -d (optional) Domain of the input sequence Allowed entries: P (prokaryote) E (eukaryote) If left blank, the program will assign the domain automatically.
Read the results
The following files will be generated by the STAP program:
accession.seq: sequence file accession.mltree1: the first tree accession.mltree1.with_taxonomy: The first tree with taxonomy infomation attached accession.mltree1.tab: the parsing output of the first tree (based upon mid-point rooting) accession.mltree2: the second tree accession.mltree2.with_taxonomy: the second tree with taxonomy infomation attached accession.mltree1.tab: the parsing output of the scond tree (rooted by an outgroup defined by the first tree) accession.results: the summary of the taxonomy assgnments
Format of the tree parsing results:
Each line is tab-delimited with the information illustrated in the following figure: The output file is a tab-delimited text file, i.e one record per line, the data fields separated by tabs. The contents of each column is explained in the following list and figure:
Column 1: Accession of the input sequences Column 2: Accession of a database sequence Column 3: Distance between the input and the database entry (highlighted by the blue line) Column 4: Distance between the nodes that the input and the database entry are attached to (highlighted by the orange line) Column 6: The branching position of the database sequence (marked by the red circle) along the path that connects the input and the outgroup (highlighted by the red line) Cloumn 7: The taxonomy annotation of the database entry
The parsing results are sorted by column 6, 5, 4 and 3 to identify the nearest neighbor of the input sequence.
Format of the final result summary file:
The results summary file is also a tab-delimited text file. The contents of each column is illustrated in the following list and figure:
Column 1: Accession of the input sequence Column 2: The top Blast hit and its taxonomy Column 3: The nearest neighbor in the first tree and its taxonomy Column 4: The nearest neighbor in the second tree and its taxonomy. This taxonomy is the final taxonomy assignment to the query. Column 5: The domain of the query sequence if automatic domain identification process was triggered. Domain is determined by a representative neighbor joining tree illustrated by the following figure. The branching position of the query related to the domain braching point (indicated by the red triangle) in terms of number of nodes (indicated by the rec circles) is also reported in this column
Increasing the speed of the analysis
Two approaches can be used to decrease the analysis time.
- If the user knows the domain of a query sequence, the -d option should always be defined to skip the timeconsuming domain-identification step.
- A linux cluster is required for processing massive datasets. The STAP program can be run in parallel on nodes of a linux cluster (see http://www.rocklinux.org/wiki/Main_Page for more details).
An extra program to align large ss-rRNA sequences to each other
In the package, we include a script to align large ss-rRNA sequences against each other. The script is align_to_profile.pl. It takes a ss-rRNA multifasta format and builds alignments of all the sequences in the input file using the clustalw profile aligning method.
Usage: align_to_rRNA_profile.pl -i [input] -d [domain] -o [output] Input parameters: -i: sequence file in multifasta format -d: domain (A for archaeal, B for bacteria, E for eukaryotes, P for prokaryotes, X for all domain; Default X) -o: output file (in aligned multifasta format)
For large amount of sequences, it is recommended to break the multifasta file into smaller sections and run align_to_profile.pl on a linux cluster. As long as the same domains are defined for all the processes, the output files of align_to_profile.pl can be concatenated into a large aligned multifasta file in which all the sequences are aligned to each other. In the package, we include the script trim_all_gap.pl to remove the all gapped columns of the concatenated multiple alignments.
Usage: trim_all_gap.pl -i [input] -o [output]