Simulation pipeline

From OpenWetWare

The simulation pipeline generates specific kinds of metagenomic data sets and alignments. Details about dependencies, how to use the script, what files are output, etc., are covered in the README (see the Downloads section).
See my personal iseem page for information on known issues and ongoing updates to the simulation pipeline code or simulation data sets.
Feel free to e-mail me for more information.
--Sam Riesenfeld

A high-level view of the simulation pipeline

Update (June 2009): Some steps have changed, especially at the end of the pipeline, where AMPHORA scripts are used rather than calling hmmalign directly.

Roughly speaking, the pipeline goes through the following steps:

  1. Pick a gene, say rpoB.
    1. If a reference database has not already been built (or if a new one is desired), build a reference database for rpoB by sampling some number, say 100, of the sequences from the AMPHORA Reference Sequences.
    2. Otherwise, use the existing reference database for rpoB in the following steps. (Note that this is usually not the same as the file of AMPHORA Reference Sequences.)
  2. Sample some number, say 15, of the rpoB peptide sequences from the reference database created in step (1) or in a previous run.
  3. Create a taxonomic profile for the sequences (uniform, for now) to be used by MetaSim.
  4. Run a MetaSim simulation on the chosen sequences to generate some number, say 100, of reads.
  5. Optionally, parse these reads so that there is at most one read per complete gene sequence (e.g., 15 reads).
  6. Format a blast database, if one hasn't already been formatted, from the AMPHORA Reference Sequences for rpoB.
  7. Run blastx on these metagenomic reads and figure out which frames they should be translated in.
  8. Run transeq on the reads to translate them in the correct frames into peptide reads.
  9. Use AMPHORA scripts to align the peptide reads with the AMPHORA profile HMM for rpoB, and then trim the alignment according to the (non-strict) mask for rpoB in AMPHORA.
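
As a rough illustration of the sampling in steps 1, 2, 3 and 5, here is a minimal Python sketch. It is not the pipeline's actual code: the function names, the use of a seeded random generator, and the choice to keep the first read seen per source sequence are all assumptions made for the example, and how a source sequence id is recovered from a MetaSim read header is left open.

```python
import random


def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, k, seed=None):
    """Draw k records uniformly at random, without replacement (steps 1 and 2)."""
    if k > len(records):
        raise ValueError(f"cannot sample {k} records from {len(records)}")
    return random.Random(seed).sample(records, k)


def uniform_profile(records):
    """Give every sampled sequence the same relative abundance (step 3)."""
    weight = 1.0 / len(records)
    return {header: weight for header, _ in records}


def one_read_per_source(reads):
    """Keep the first read seen for each source sequence (step 5).

    `reads` is an iterable of (source_id, read_sequence) pairs. Keeping the
    first read is an assumption; the pipeline may choose differently.
    """
    kept = {}
    for source_id, read in reads:
        kept.setdefault(source_id, read)
    return kept
```

With a pool of reference sequences loaded by `read_fasta`, `sample_records(pool, 100)` would stand in for building the reference database and `sample_records(database, 15)` for picking the sequences handed to MetaSim.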

You can skip ahead to almost any point in the pipeline, using options to specify the files you want to use rather than generating them on the fly.
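
To make step 7 more concrete, the following Python sketch shows one way a translation frame could be read off blastx results. It assumes the hits are available in the standard 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score), that reverse-strand hits are reported with query start greater than query end, and that the best hit is the one with the highest bit score. The function names are hypothetical, and the pipeline itself may pick frames differently.

```python
def blast_frame(qstart, qend, qlen):
    """Return a BLAST-style reading frame (+1..+3 or -1..-3) for one hit.

    qstart and qend are 1-based nucleotide coordinates on the read;
    qstart > qend is taken to mean a reverse-strand hit. qlen is the
    read length, needed to number reverse frames from the read's end.
    """
    if qstart <= qend:
        return (qstart - 1) % 3 + 1
    return -((qlen - qstart) % 3 + 1)


def best_frames(tabular_lines, read_lengths):
    """For each read, return the frame of its highest-bit-score hit.

    tabular_lines: iterable of tab-separated blastx hit lines.
    read_lengths: dict mapping read id -> read length in nucleotides.
    """
    best = {}  # read id -> (bit score, frame)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        read_id = cols[0]
        qstart, qend = int(cols[6]), int(cols[7])
        bitscore = float(cols[11])
        frame = blast_frame(qstart, qend, read_lengths[read_id])
        if read_id not in best or bitscore > best[read_id][0]:
            best[read_id] = (bitscore, frame)
    return {read_id: frame for read_id, (_, frame) in best.items()}
```

The frames returned here follow the BLAST numbering convention. Before passing them to transeq in step 8, check the transeq documentation for how it numbers negative frames, since the two tools are not guaranteed to agree.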


  • E-mail me for a current version of the command-line executable and related files (e.g., README file), since they are changing quite often.
  • Example files: File:Sim1.tgz contains most files for one run of the full pipeline (these files were created by an older version of the pipeline -- new simulation files should be up soon). See the README for a guide on file names/extensions.