Paper 16S Based OTU Identification Pipeline

From OpenWetWare

The text here is by no means set in stone, but is intended to provide momentum towards the development of this manuscript. All ideas, changes and suggestions are welcome!


We have developed a generalized workflow capable of identifying operational taxonomic units (OTUs) from SSU-rRNA sequences found in metagenomic sequence libraries. The primary goal of this paper is to communicate

  1. the inner workings of the computational workflow,
  2. its ability to produce reasonable data, and
  3. that it is available for general use.

Additional manuscripts (e.g. James, Josh) will describe the application of the workflow to GOS data. We may yet use this as a place to describe OTU diversity across multiple metagenomic datasets. We will not release a web-based GUI at this time; CAMERA will adopt this software development and release at a later date (communication with CAMERA in progress).

A progress report of current OTU data can be found here.

Current Draft

Old Drafts

Potential Journals for Publication

Ideally, we publish this work in a high impact journal read by researchers from a variety of fields, including bioinformatics, evolution, ecology, and environmental microbiology.

Some potential journals, in no particular order:

  1. Nature Methods
  2. ISME Journal
  3. Genome Research
  4. PLoS Computational Biology
  5. Bioinformatics
  6. BMC Bioinformatics



  • Importance of identifying/classifying microbial taxonomic diversity
    • JG: This is what I'd like to chat the most about. Why is it important? Everyone does it, but why? Taxonomic diversity has been the key diversity measure for plant and animal studies, but in this case the species unit (often) captures information about traits. That is not necessarily true for sequence similarity based microbe OTU measures. The primary reason we want OTU measures is for comparing diversity among different studies/habitats/systems. What are some of the other reasons?
    • Baseline way of comparing diversity across different habitats
    • SJR: One reason seems to be that it permits work like Josh's research on understanding how different types of microbes are distributed in geographical or niche spaces. This doesn't argue for a particular definition of OTU, I guess, just that the notion is useful.
  • Brief history of OTUs, SSU-rRNA, and targeted sequencing approaches (including gene chips and targeted pyrosequencing)
  • Potential of metagenomics in improving knowledge of microbial taxonomic diversity and current limitations
    • JG: another point worth mentioning is that it is a source of data that will become increasingly common.
      • Metagenomics will be primary data source, but no way of identifying OTUs from it
  • Current methods and limitations
    • Replace read with best full length 16S sequence hit and do PID based methods
      • SJR: Would it be worthwhile to explicitly compare what we are doing with this approach for at least a few of the simulations? So far, we have just compared our approach to using PID on the full-length sequences for the metagenomic reads (which would not be known for a real sample).
        • TS: It's an interesting idea, but we'd have to use sequences other than the simulations, as those are all found in GenBank and the best hit will be biased towards the source taxon. That said, there will be some cases where the top hit is *not* the source taxon, and it might be interesting to characterize that. I think this is already a well-documented limitation of BLAST, so we could just cite it.
    • Rusch et al assembled reads into 16S contigs - requires deep sequencing and may lead to chimeras
    • Something about PID and how it can't work for metagenomics, may not even work for pyrotags
  • Introduction to our procedure and its availability
    • JG: what are the main points you want to highlight here? We need to say here how our procedure fills in the major gaps outlined above. Do you think it is important to emphasize that we have a phylogenetically-informed approach to estimating OTUs? Or do you want to steer away from that?

Note: Do we need to discuss binning tools like MEGAN and PhyloPythia here?

  • JG: would doing so support your primary thesis, that the OTU pipeline is a much needed tool?
  • Do we want to bring up some of the problems regarding Percent ID approach, its overestimation of diversity and error
  • Do we want to highlight phylogenetic distance method?
  • Do we have a way of showing that Phylogenetic method is better than PID
  • TS: If we run with why phylogenetic analysis is best, we might opt to compare how various parts of 16S tree out - this may be a way to compare V1-V2 pyrotags to V6 pyrotags (can't do direct comparison with current methods)
  • What are some hot papers with OTUs in one of the figures


Workflow and Simulation

  • Description of workflow
    • See methods
    • Figure: Workflow
    • Figure: Ability of blast score per nucleotide to discriminate between domains?
      • JG: Can you remind me what this figure looks like, where it is?
    • Describe workflow decision logic (why split into domains, why INFERNAL, why build a tree, etc.)
    • Mention somewhere around here the scaling cutoffs (point to appropriate sections) and read length filters
    • Differential modules
      • Phylogenetic program
      • Clustering methodology
  • Sensitivity/Specificity analysis from simulation study
    • Figure: ROC Curve and accuracy curves for few cutoffs, accuracy based scaling curve
    • Better than random tree structure
    • Reads co-cluster
    • Sources of error/noise
      • 16S sequence variation is not a source of error
      • Alignment position is not a source of error
      • Alignment column information is not a source of error
      • Read length may be a source of error
  • Comparison to PID
    • SJR: As noted above, what about explicitly comparing with the PID of top blast hits, which seems to be the other obvious approach, rather than just the real full-length sequences for the metagenomic reads?
  • Comparison to taxonomy terms
    • JG: what is this again?
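
One way to make the sensitivity/specificity item concrete is to score how often simulated reads co-cluster with reads from the same source taxon. The sketch below is only an illustration: the pairwise definition of a "positive" (two reads from the same source taxon) and the names `source_taxon`, `otu_assignment` and `coclustering_rates` are assumptions made here, not the definitions used in the simulation study.

```python
from itertools import combinations


def coclustering_rates(source_taxon, otu_assignment):
    """Return (TPR, TNR) over all pairs of simulated reads.

    Assumed definitions (not taken from the source):
      positive pair = both reads were drawn from the same source taxon
      predicted positive = both reads were placed in the same OTU
    """
    tp = fn = tn = fp = 0
    for read_a, read_b in combinations(sorted(source_taxon), 2):
        same_source = source_taxon[read_a] == source_taxon[read_b]
        same_otu = otu_assignment[read_a] == otu_assignment[read_b]
        if same_source and same_otu:
            tp += 1
        elif same_source:
            fn += 1
        elif same_otu:
            fp += 1
        else:
            tn += 1
    # Guard against simulations with no positive or no negative pairs.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Toy simulation: five reads from two taxa, clustered into two OTUs.
source_taxon = {"r1": "A", "r2": "A", "r3": "B", "r4": "B", "r5": "A"}
otu_assignment = {"r1": 1, "r2": 1, "r3": 2, "r4": 2, "r5": 2}
tpr, tnr = coclustering_rates(source_taxon, otu_assignment)
```

Repeating this calculation across a range of clustering cutoffs would give the points for an ROC-style curve, assuming the cutoff is the parameter being varied.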

Analysis of Real Data

  • Application to real data using adjusted methods
    • GOS
      • Comparison of GOS OTUs to Assembled Metagenomics and to PCR
        • Figure: Rarefaction curves + OTU discovery Venn Diagram
      • Novel taxa from metagenomics
        • Figure: novel clade found in many GOS locations, never seen in any PCR based study (using GreenGenes)
        • SJR: Just for our own reassurance, I am wondering about rebuilding this tree with RAxML just to see if you get the same novel clade(s)? It could just run while we write the paper?
    • Distal Gut
      • Rarefaction curves
      • PCR OTUs v. Metagenomics OTUs
  • Try simulated pyrotags from known taxa? quick and dirty would be V6 region or the metaHIT tags, pop into pipeline and compare to PID v. taxonomy terms
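
For the rarefaction curves listed above, a minimal sketch of the standard analytic rarefaction formula (the expected number of OTUs observed in a random subsample of n reads) is given below. The outline does not say how the curves will actually be computed, so both the choice of the analytic formula and the function name `rarefaction_curve` are assumptions made for illustration.

```python
from math import comb


def rarefaction_curve(otu_counts, depths):
    """Expected number of OTUs seen when subsampling reads without replacement.

    otu_counts: number of reads assigned to each OTU
    depths: subsample sizes at which to evaluate the curve
    """
    total_reads = sum(otu_counts)
    curve = []
    for depth in depths:
        if not 0 <= depth <= total_reads:
            raise ValueError("subsample size must be between 0 and the total read count")
        all_subsamples = comb(total_reads, depth)
        # An OTU is missed only if every sampled read comes from the other OTUs.
        expected_otus = sum(
            1 - comb(total_reads - count, depth) / all_subsamples
            for count in otu_counts
        )
        curve.append((depth, expected_otus))
    return curve


# Toy library: three OTUs with 2, 1 and 1 reads.
toy_curve = rarefaction_curve([2, 1, 1], depths=[1, 2, 4])
```

Computing such curves separately for the PCR-derived OTUs and the metagenomic OTUs would allow the comparison proposed in the outline.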


  • Overview and contributions made by the method
    • JG: let's discuss again if we have covered all the main strengths/contributions
      • Phylo distance handles non-overlapping reads
    • JG: worthwhile saying that Phylodistance may be more biologically informative?
      • Modular pipeline and method for adjusting cutoffs can be pegged to one another
      • JG: I don't understand this
      • High throughput method to find diversity
      • PCR bias absent
  • Caveats, limitations
  • GOS and distal gut OTUs are publicly available
  • Shoutout to ongoing iSEEM work regarding OTU identification with non-traditional markers (AMPHORA, etc).


  • Identification of SSU-rRNA from metagenomic data
    • Phylogenetically informed STAP database for each domain of life
    • Blastn sequence search
    • Partition hits into domains
  • Multiple sequence alignment of SSU-rRNA sequences
    • RDP INFERNAL rRNA models
    • cmalign (INFERNAL) independently threads each sequence through model
    • stitch RDP reference alignment used for building model to alignment of hits
  • Phylogenetic analysis
    • FastTree plus pseudocounts. There is too much data for RAxML to evaluate in a reasonable time.
  • Identification of OTUs
    • Conversion of the tree into a distance matrix with an R script that uses functions from the ape package
    • MOTHUR clusters sequences using this matrix with a cutoff defined by the user
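
The last two steps (tree to distance matrix, then clustering at a user-defined cutoff) can be sketched as follows. This is a plain-Python illustration, not the pipeline's R/ape script or MOTHUR itself. It assumes the tree is supplied as a list of `(node, node, branch_length)` edges, that the distance between two reads is the sum of branch lengths on the path between them, and that clustering is complete-linkage (furthest neighbor). The function names and the linkage choice are assumptions made here.

```python
from collections import defaultdict
from itertools import combinations


def patristic_distances(edges, leaves):
    """Sum branch lengths along the tree path between every pair of leaves."""
    adjacency = defaultdict(list)
    for node_a, node_b, length in edges:
        adjacency[node_a].append((node_b, length))
        adjacency[node_b].append((node_a, length))

    distances = {}
    for start in leaves:
        # Walk outward from this leaf; a tree has exactly one path to each node.
        path_length = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor, length in adjacency[node]:
                if neighbor not in path_length:
                    path_length[neighbor] = path_length[node] + length
                    stack.append(neighbor)
        for other in leaves:
            if other != start:
                distances[frozenset((start, other))] = path_length[other]
    return distances


def cluster_otus(leaves, distances, cutoff):
    """Complete-linkage clustering: merge while the farthest pair stays within cutoff."""
    clusters = [{leaf} for leaf in leaves]

    def linkage(first, second):
        return max(distances[frozenset((a, b))] for a in first for b in second)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > cutoff:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Toy tree: reads r1/r2 hang off internal node n1, reads r3/r4 off n2.
toy_edges = [("r1", "n1", 0.01), ("r2", "n1", 0.02), ("n1", "n2", 0.10),
             ("r3", "n2", 0.01), ("r4", "n2", 0.015)]
toy_leaves = ["r1", "r2", "r3", "r4"]
toy_distances = patristic_distances(toy_edges, toy_leaves)
toy_otus = cluster_otus(toy_leaves, toy_distances, cutoff=0.05)
```

Because the distances come from the tree rather than from pairwise alignment, two reads that do not overlap in the alignment still receive a distance, which is the property the outline highlights as a strength of the phylogenetic approach.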

To Do

Thoughts from 04-27-2010 Conference Call

  • What is the ideal threshold to use in the pipeline?
  • Should we filter reads by length?
  • How does TPR/TNR perform when we use our "default" cutoff (above)?
  • RAxML w/ Sam James and cluster methods
  • Should this include number of clusters or cluster sizes?

Note: we can slip other modules into the pipeline

  • other phylogenetic methods
  • other clustering methods (see James/Sam work)

Translating between full-length sequence IDs and short reads may be a big deal.

What are the main takeaways?

  • Phylo distance handles non-overlapping reads
  • Modular pipeline and method for adjusting cutoffs can be pegged to one another
  • High throughput method to find diversity
  • PCR bias absent

Important Papers Related to this Project

This list is in no particular order and will be updated continuously. Please feel free to contribute.

Recent applications of OTU identification