Moore Notes 11 11 09
From OpenWetWare
Subgroup call about OTU pipeline
- Conversion of large tree into a distance matrix
- TS: recursive calculation from leaves to root
- TS/SR: create table storing distance from a node to all descendants, and between all pairs of leaves below the node
- SR: every edge is in N calculations where N is the number of leaves below the node
- JL and KP: could parallelize by splitting at some high-up nodes
- JG: what language are you writing in? TS: Perl for now
- SK: is this BioPerl? Try to track its memory/compute time. TS: Yes, BioPerl. Think memory is OK. Can do a test on I/O vs. distance calc.
- read and write of 15,000 leaf tree isn't too slow
- distance on 20 leaf tree is ~20 seconds
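
The leaves-to-root recursion that TS and SR describe above can be sketched as follows. This is a minimal Python illustration, not the group's Perl/BioPerl code: the nested-tuple tree encoding and the names `leaf_distances` and `distance_matrix` are assumptions made for the example. Each call returns the table of distances from a node to the leaves below it, and records the distance for every pair of leaves that first meet at that node.

```python
from itertools import product

# Hypothetical tree encoding (not BioPerl's): every node is a pair
# (payload, branch_length_to_parent). A leaf's payload is its name (str);
# an internal node's payload is a list of child nodes.

def leaf_distances(node, pairs):
    """Return {leaf: distance from this node down to that leaf} and record
    in `pairs` the distance between every pair of leaves below the node."""
    payload, _ = node
    if isinstance(payload, str):
        return {payload: 0.0}          # a leaf is at distance 0 from itself
    seen = {}                          # leaves from children processed so far
    for child in payload:
        below = leaf_distances(child, pairs)
        # Extend each distance by the edge joining the child to this node.
        depth = {leaf: d + child[1] for leaf, d in below.items()}
        # Leaves from different children have this node as their join point.
        for (a, da), (b, db) in product(seen.items(), depth.items()):
            pairs[frozenset((a, b))] = da + db
        seen.update(depth)
    return seen

def distance_matrix(root):
    """Collect all pairwise leaf distances, keyed by frozenset({a, b})."""
    pairs = {}
    leaf_distances(root, pairs)
    return pairs

# Toy tree: A joins the root by an edge of 1.0; B and C (edges 2.0 and 3.0)
# share an internal node that joins the root by an edge of 0.5.
tree = ([("A", 1.0), ([("B", 2.0), ("C", 3.0)], 0.5)], 0.0)
matrix = distance_matrix(tree)
```

Two caveats for real data: a tree with 15,000 leaves has about 112 million leaf pairs, so the output itself is the main memory cost, and a very deep tree could exceed Python's default recursion limit, in which case an explicit post-order traversal would be needed. Because subtrees are processed independently until their parent is reached, this layout is also compatible with the idea of splitting the work at high-up nodes.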
- CAMERA
- KP: workflow would be great, get them to do it
- JG: should we try using their computers? Maybe. UCSF cluster working OK.
- TS: they should maintain
- Should be ready to hand off to CAMERA by end of year
- JG: will draft email to Paul, send to Katie
- Pipeline paper
- would be good to have a workflow at publication
- How much optimization of code should we do?
- Slow places (so far)
- alignment of reads to profile (not too bad)
- alignment QC/filtering (really slow)
- tree construction (only 1 hour)
- distance matrix from tree
- CAMERA can do some of this
- How many times will we run these types of analyses?
- JG: Let's cross that bridge when we come to it
- KP: Looks like filtering is the first thing to improve if we do more optimization (besides the distance matrix, which is being worked on now)
- MOTHUR
- SK: did some tests
- will load a large distance matrix
- but will not always associate read group (i.e. community) information with it
- JL: why do we need this?
- SK: don't necessarily need it (e.g. for OTUs) unless you want to do more community analyses in MOTHUR (e.g. rarefaction, Chao estimators)
- fails to load a large FASTTREE tree file (their example trees work)
- Josh's wrapper
- Does the script have the needed arguments?
- JO: do you have the read OTU command? It uses the groups file to connect reads to samples
- JG: what is left out?
- TS: community analyses (per JO's question), e.g. rarefaction, Chao
- SK: OTU downstream analyses, alignment/sequence manipulations
- KP: Maybe don't need more than OTUs for pipeline (as long as files we output could be read back into MOTHUR for other analyses)
- JL: Issue with it doing all cutoffs above the specified cutoff
- SK: How does this fit into pipeline?
- KP: after distance matrix calculation, it is a batch mode way to run MOTHUR
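
To make the step after the distance matrix concrete, here is a small Python sketch of clustering leaves into OTUs at a single distance cutoff. It is not MOTHUR or DOTUR code and does not reproduce their file formats or options. Furthest-neighbor (complete-linkage) clustering is assumed here for illustration, and the function name `furthest_neighbor_otus` and the `frozenset`-keyed distance table are hypothetical.

```python
def furthest_neighbor_otus(names, dist, cutoff):
    """Greedy complete-linkage clustering. `dist` maps frozenset({a, b}) to
    a distance. Clusters are merged while the largest distance between
    their members stays at or below `cutoff`."""
    clusters = [{name} for name in names]

    def linkage(c1, c2):
        # Furthest neighbor: the worst-case distance between two clusters.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under the linkage above.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break                      # nothing left that fits the cutoff
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Toy distances in substitutions per site (made-up values).
toy = {
    frozenset(("A", "B")): 0.01, frozenset(("C", "D")): 0.02,
    frozenset(("A", "C")): 0.20, frozenset(("A", "D")): 0.20,
    frozenset(("B", "C")): 0.20, frozenset(("B", "D")): 0.25,
}
otus = furthest_neighbor_otus(["A", "B", "C", "D"], toy, cutoff=0.03)
```

This version stops at one cutoff and rescans all cluster pairs on every merge, so it is only suitable for small examples. A production run on thousands of leaves would rely on the existing tools.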
- KP: Are there other ways to do this last step that we should include?
- TS: Yes, STAP.
- JG: Sequence similarity (% ID).
- SK: We want to use the tree (need to for reads, because they don't overlap)
- TS: MEGAN does phylotyping - not exactly the same
- SK: Everybody basically uses MOTHUR/DOTUR
- JG: Does MOTHUR make sense with our tree distance?
- JO/TS: We feel good about the approach
- tree gives distance that is in units of expected substitutions (normalized by length)
- JG: What about different tree building methods (PCR-based vs. metagenomic read-based 16S tree)?
- compare phylogeny vs. % identity distances on PCR/full length data
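
For the proposed comparison of tree distances with % identity, one possible bridge is to convert the observed fraction of mismatches into expected substitutions per site, the same units the tree reports. The Python sketch below uses the Jukes-Cantor correction as an assumed model; the notes do not say which correction, if any, the group intends to use, and the function names are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of mismatches over aligned columns where neither sequence
    has a gap ('-'). % identity is 100 * (1 - p_distance)."""
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        # Two reads covering different regions have nothing to compare,
        # which is the non-overlap problem raised for metagenomic reads.
        raise ValueError("no overlapping ungapped columns")
    return sum(a != b for a, b in compared) / len(compared)

def jukes_cantor(p):
    """Jukes-Cantor estimate of substitutions per site from p-distance."""
    if p >= 0.75:
        return math.inf                # correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("ACGTACGTAC", "ACGTACGTAA")   # one mismatch in ten columns
d = jukes_cantor(p)                          # slightly larger than p
```

On PCR or full-length data every pair of sequences overlaps, so both distances can be computed and compared directly. For short reads the `ValueError` branch shows why a pairwise identity is not always available and why the tree is needed.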
- Sam's TO DO list for simulations
- reminder: sign up for items and comment/edit
- some items can happen in parallel
- will try two different read lengths