Moore Notes 11 11 09

Subgroup call about OTU pipeline

  • Conversion of large tree into a distance matrix
    • TS: recursive calculation from leaves to root
    • TS/SR: create table storing distance from a node to all descendants, and between all pairs of leaves below the node
    • SR: every edge is in N calculations where N is the number of leaves below the node
    • JL and KP: could parallelize by splitting at some high-up nodes
    • JG: what language are you writing in? TS: Perl for now
    • SK: is this BioPerl? Try to track its memory/compute time. TS: Yes, BioPerl. Think memory is OK. Can do a test on I/O vs. distance calc.
      • read and write of 15,000-leaf tree isn't too slow
      • distance on 20-leaf tree is ~20 seconds
  • CAMERA
    • KP: workflow would be great, get them to do it
    • JG: should we try using their computers? Maybe. UCSF cluster working OK.
    • TS: they should maintain
    • Should be ready to hand off to CAMERA by end of year
    • JG: will draft email to Paul, send to Katie
  • Pipeline paper
    • would be good to have a workflow at publication
  • How much optimization of code should we do?
    • Slow places (so far)
      • alignment of reads to profile (not too bad)
      • alignment QC/filtering (really slow)
      • tree construction (only 1 hour)
      • distance matrix from tree
    • CAMERA can do some of this
    • How many times will we run these types of analyses?
    • JG: Let's cross that bridge when we come to it
    • KP: Looks like filtering is the first thing to improve if we do more optimization (besides distance matrix - working on now)
  • MOTHUR
    • SK: did some tests
      • will load a large distance matrix
      • but will not always associate read group (i.e. community) information
        • JL: why do we need this?
        • SK: don't necessarily need it (e.g. for OTUs) unless you want to do more community analyses in MOTHUR (e.g. rarefaction, Chao estimators)
      • fails to load a large FASTTREE tree file (their example trees work)
    • Josh's wrapper
      • Does the script have the needed arguments?
        • JO: do you have the read OTU command? It uses the groups file to connect reads to samples
        • JG: what is left out?
        • TS: community analyses (per JO's question), e.g. rarefaction, Chao
        • SK: OTU downstream analyses, alignment/sequence manipulations
        • KP: Maybe don't need more than OTUs for pipeline (as long as files we output could be read back into MOTHUR for other analyses)
      • JL: Issue with it doing all cutoffs above the specified cutoff
    • SK: How does this fit into pipeline?
      • KP: after distance matrix calculation, it is a batch mode way to run MOTHUR
      • KP: Are there other ways to do this last step that we should include?
        • TS: Yes, STAP.
        • JG: Sequence similarity (% ID).
      • SK: We want to use the tree (need to for reads, because they don't overlap)
      • TS: MEGAN does phylotyping - not exactly the same
      • SK: Everybody basically uses MOTHUR/DOTUR
      • JG: Does MOTHUR make sense with our tree distance?
        • JO/TS: We feel good about the approach
        • tree gives distance that is in units of expected substitutions (normalized by length)
      • JG: What about different tree building methods (PCR-based vs. metagenomic read-based 16S tree)?
        • compare phylogeny vs. % identity distances on PCR/full length data
  • Sam's TO DO list for simulations
    • reminder: sign up for items and comment/edit
    • some items can happen in parallel
    • will try two different read lengths
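
Sketch of the tree-to-distance-matrix idea from the first item (TS's leaves-to-root recursion, with the per-node table TS/SR described). This is a minimal illustration in Python, not the group's Perl/BioPerl code. The tree representation (a dict mapping each internal node to a list of (child, branch length) pairs), the function name `leaf_distances`, and the three-leaf example tree are all assumptions made up for the example.

```python
from itertools import combinations


def leaf_distances(tree, node, pairwise):
    """Return {leaf: distance from `node` down to that leaf}.

    As a side effect, fill `pairwise` with the distance between every
    pair of leaves whose path passes through `node` as its topmost
    point, i.e. pairs sitting in different child subtrees of `node`.
    """
    children = tree.get(node, [])
    if not children:
        # A leaf is at distance 0 from itself.
        return {node: 0.0}

    # One table per child: distances from `node` to the leaves below it.
    per_child = []
    for child, length in children:
        below = leaf_distances(tree, child, pairwise)
        per_child.append({leaf: d + length for leaf, d in below.items()})

    # Leaves in different child subtrees are joined through `node`.
    for left, right in combinations(per_child, 2):
        for a, da in left.items():
            for b, db in right.items():
                pairwise[frozenset((a, b))] = da + db

    # Merge the child tables into the table for this node.
    merged = {}
    for table in per_child:
        merged.update(table)
    return merged


# Hypothetical example: ((A:0.2, B:0.3):0.1, C:0.4)
tree = {
    "root": [("n1", 0.1), ("C", 0.4)],
    "n1": [("A", 0.2), ("B", 0.3)],
}
pairwise = {}
leaf_distances(tree, "root", pairwise)
for pair, dist in sorted(pairwise.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), round(dist, 3))
```

Each leaf pair is filled in exactly once, at the node where their two paths meet, so the total work is proportional to the number of leaf pairs. The plain recursion is only for readability: on a deep, unbalanced tree with thousands of leaves it could hit Python's recursion limit, so an explicit post-order traversal would be safer there. The JL/KP parallelization idea would correspond to running the function separately on the subtrees below a few high-up nodes and then combining the returned tables.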