Moore Notes 11 17 10 PI

From OpenWetWare

Group Call

  • PhylOTU manuscript is resubmitted
  • Niche mapping figures
    • Range maps are really just for abundant organisms
    • Hard limit of occurring at 10 sites in the world to do niche mapping
      • What global abundance does this correspond to? Can we quantify it?
    • Worse for microbes, but happens with other organisms too
    • Would be interesting to run on OTUs, but there are a lot of issues with that
    • Omission rates figure
      • Only picks up taxa that are "closely" related to things we've seen before (50% RDP cutoff)
      • Wouldn't be a problem if the rare taxa / endemics scaled spatially with total richness (not likely to be true; it isn't for macroorganisms)
      • Main issue is whether the patterns we describe may be biased by only using common taxa
      • Focus on order or family most likely
        • But what can we say about the range of a taxon at this high level of taxonomy?
        • Higher level (order) for summary/meta-maps vs. finer level for maps of individual taxa
        • Scale of the map is important too
    • There is literature on how biogeographic patterns change at different levels of taxonomic resolution (e.g., species/genus ratios)
      • Not just abundant organisms
      • Could we make quantitative comparisons with marine macroorganisms?
    • Will use an RDP bootstrap cutoff of 50%
      • 99% of sequences are V6 pyrotags, and this is a good cutoff for them
      • Agrees well with Greengenes
    • Guillaume is error checking the code
    • Writing up paper
  • AMPHORA-2
    • BLAST step to pull out sequences in a family
      • HMMER3 is fast enough to use the profile HMM, but is less sensitive then
      • Currently using MCL clustering to identify groups within the list of sequences, then using one sequence per group
      • Other ideas:
        • Build tree and pick representatives using maxPD
          • Can oversample clades with long branches
        • Emit sequences from HMM
          • Do the simulated sequences represent the family well? (since sequences are "averages", not real sequences)
        • Kimmen's software should do this, but is geared towards clustering at a higher level (subfamilies within a superfamily)
          • May have some of the same problems as maxPD
        • Prune tree from inside out - need an algorithm
        • Would node imbalance statistic be useful here?
        • Build a simple, fast feature-based classifier
      • Goal: to pick sequences that are optimal for detecting different phylogenetic lineages within a family
    • Focus on input and outputs for each module, so that other methods can be swapped in at any step
      • Build a modular workflow that can be easily extended
    • Pplacer is running to build trees
    • Next
      • More families
      • Better modules
    • Steve can be a tester when the time comes
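
A rough sketch of the "build tree and pick representatives using maxPD" idea from the AMPHORA-2 notes above, to make the long-branch concern concrete. This is not the AMPHORA-2 code: the function name `pick_max_pd`, the parent-map tree format, and the toy tree are all made up for illustration, and it uses rooted phylogenetic diversity (branches counted back to the root) with a simple greedy selection.

```python
def pick_max_pd(parent, k):
    """Greedily pick k leaves that add the most rooted phylogenetic diversity.

    parent maps every non-root node to (parent_name, branch_length).
    (Hypothetical tree format, chosen only to keep the sketch self-contained.)
    """
    internal = {p for p, _ in parent.values()}
    leaves = sorted(n for n in parent if n not in internal)
    covered = set()  # nodes whose branch to their parent is already counted
    chosen = []

    def gain(leaf):
        # Branch length this leaf would add: walk up until we reach the
        # root or a branch that an earlier pick has already covered.
        total, node = 0.0, leaf
        while node in parent and node not in covered:
            total += parent[node][1]
            node = parent[node][0]
        return total

    for _ in range(min(k, len(leaves))):
        # Ties go to the alphabetically first leaf (leaves are sorted).
        best = max((leaf for leaf in leaves if leaf not in chosen), key=gain)
        chosen.append(best)
        node = best
        while node in parent and node not in covered:
            covered.add(node)
            node = parent[node][0]
    return chosen


# Toy family: clade A has short branches, clade B has long ones.
toy_tree = {
    "A": ("root", 0.1), "B": ("root", 0.1),
    "a1": ("A", 0.1), "a2": ("A", 0.1),
    "b1": ("B", 1.0), "b2": ("B", 0.9),
}

print(pick_max_pd(toy_tree, 2))  # ['b1', 'b2']
```

With two representatives, both picks come from the long-branched clade B and clade A is not represented at all, which is the oversampling issue noted under maxPD.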