Moore Notes 11 11 09

Subgroup call about OTU pipeline

Conversion of large tree into a distance matrix
- TS: recursive calculation from leaves to root
- TS/SR: create table storing distance from a node to all descendants, and between all pairs of leaves below the node
- SR: every edge is in N calculations where N is the number of leaves below the node
- JL and KP: could parallelize by splitting at some high-up nodes
- JG: what language writing in? TS: Perl for now
- SK: is this BioPerl? Try to track its memory use and compute time. TS: Yes, BioPerl. Memory seems OK. Can run a test comparing I/O time vs. distance calculation time.
  - read and write of 15,000 leaf tree isn't too slow
  - distance on 20 leaf tree is ~20 seconds
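
A minimal sketch of the leaves-to-root idea discussed above, written in Python rather than the Perl/BioPerl the group is using. The dict-based tree encoding and the name `leaf_distances` are assumptions made for illustration, not the actual pipeline code. Each node keeps a table of distances to the leaves below it; at an internal node, leaves from different child subtrees are paired, then the tables are merged and passed up.

```python
from itertools import combinations


def leaf_distances(children, lengths, root):
    """Return {frozenset({leaf_a, leaf_b}): path length} for all leaf pairs.

    children: dict mapping a node to a list of its child nodes
              (leaves are simply absent from the dict). Hypothetical encoding.
    lengths:  dict mapping a node to the branch length above it.
    """
    # Pre-order walk with an explicit stack (avoids recursion limits on
    # large trees); reversing it visits every child before its parent.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    down = {}      # node -> {leaf: distance from node down to that leaf}
    pairwise = {}
    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            down[node] = {node: 0.0}
            continue
        # Lift each child's table by the length of the edge to that child.
        tables = [
            {leaf: d + lengths[kid] for leaf, d in down.pop(kid).items()}
            for kid in kids
        ]
        # Leaves in different child subtrees meet at this node.
        for left, right in combinations(tables, 2):
            for leaf_a, dist_a in left.items():
                for leaf_b, dist_b in right.items():
                    pairwise[frozenset((leaf_a, leaf_b))] = dist_a + dist_b
        merged = {}
        for table in tables:
            merged.update(table)
        down[node] = merged
    return pairwise


# Toy tree: root -> A (1.0), root -> X (0.5), X -> B (2.0), X -> C (3.0)
toy_children = {"root": ["A", "X"], "X": ["B", "C"]}
toy_lengths = {"A": 1.0, "X": 0.5, "B": 2.0, "C": 3.0}
toy = leaf_distances(toy_children, toy_lengths, "root")
```

The output has one entry per leaf pair, so a 15,000-leaf tree yields roughly 112 million distances regardless of how the traversal is organized. Splitting at high-up nodes, as JL and KP suggest, would correspond to running the per-subtree work independently and combining the tables at the split point.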

CAMERA
- KP: workflow would be great, get them to do it
- JG: should we try using their computers? Maybe. UCSF cluster working OK.
- TS: they should maintain
- Should be ready to hand off to CAMERA by end of year
- JG: will draft email to Paul, send to Katie

Pipeline paper
- would be good to have a workflow at publication

How much optimization of code should we do?
- Slow places (so far)
  - alignment of reads to profile (not too bad)
  - alignment QC/filtering (really slow)
  - tree construction (only 1 hour)
  - distance matrix from tree
- CAMERA can do some of this
- How many times will we run these types of analyses?
- JG: Let's cross that bridge when we come to it
- KP: Looks like filtering is the first thing to improve if we do more optimization (besides distance matrix - working on now)
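
One way to decide where optimization is worth it is to record time and peak memory per pipeline stage. The sketch below is only an illustration in Python, using the standard library's `time.perf_counter` and `tracemalloc`; the helper name `profile_stage` is hypothetical, and the real stages (profile alignment, QC/filtering, tree construction, distance matrix) would be wrapped in place of the toy call.

```python
import time
import tracemalloc


def profile_stage(name, func, *args, **kwargs):
    """Run func once and report wall-clock seconds and peak Python memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"stage": name, "seconds": elapsed, "peak_bytes": peak}


# Toy stand-in for a pipeline stage.
toy_result, toy_stats = profile_stage("toy_sum", sum, range(1000))
```

Note that `tracemalloc` only sees memory allocated by Python itself, so it would not capture external tools launched as subprocesses.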

MOTHUR
- SK: did some tests
  - will load a large distance matrix
  - but will not always associate read group (i.e. community) information
    - JL: why do we need this?
    - SK: don't necessarily need it (e.g. for OTUs) unless you want to do more community analyses in MOTHUR (e.g. rarefaction, Chao estimators)
  - fails to load a large FASTTREE tree file (their example trees work)
- Josh's wrapper
  - Does the script have the needed arguments?
    - JO: do you have the read OTU command? It uses the groups file to connect reads to samples
    - JG: what is left out?
    - TS: community analyses (per JO's question), e.g. rarefaction, Chao
    - SK: OTU downstream analyses, alignment/sequence manipulations
    - KP: Maybe don't need more than OTUs for pipeline (as long as files we output could be read back into MOTHUR for other analyses)
  - JL: Issue with it doing all cutoffs above the specified cutoff
- SK: How does this fit into pipeline?
  - KP: after distance matrix calculation, it is a batch mode way to run MOTHUR
  - KP: Are there other ways to do this last step that we should include?
    - TS: Yes, STAP.
    - JG: Sequence similarity (% ID).
  - SK: We want to use the tree (need to for reads, because they don't overlap)
  - TS: MEGAN does phylotyping - not exactly the same
  - SK: Everybody basically uses MOTHUR/DOTUR
  - JG: Does MOTHUR make sense with our tree distance?
    - JO/TS: We feel good about the approach
    - tree gives distance that is in units of expected substitutions (normalized by length)
  - JG: What about different tree building methods (PCR-based vs. metagenomic read-based 16S tree)?
    - compare phylogeny vs. % identity distances on PCR/full length data
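
For the % identity side of that comparison, a minimal Python sketch is given below. It assumes the two sequences are already rows of the same alignment and that gaps are written as `-`; the function name `percent_identity` and the choice to skip gapped columns are illustrative assumptions, not how MOTHUR or STAP define identity.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where both aligned sequences have a base.

    Returns None when the two sequences share no ungapped column.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == gap or base_b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        return None
    return 100.0 * matches / compared


overlapping = percent_identity("ACGT-A", "ACGA-A")
non_overlapping = percent_identity("AC--", "--GT")
```

The `None` case is the situation SK describes: reads covering different regions of the gene have no shared columns, so % identity is undefined for them while a tree-based distance is still available.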

Sam's TO DO list for simulations
- reminder: sign up for items and comment/edit
- some items can happen in parallel
- will try two different read lengths
