Moore Notes 5 26 15

Action Items for Response to PCB


  • The question of novelty is the biggest barrier to publication. Here's our response:
    • 1. The simulation results, which clarify not only how we should annotate metagenomes but also how metagenomic analyses should be set up (e.g., read depth). I don't think reviewers disagree here, and the additional analyses (assuming we do them) only serve to expand this argument. Highlight how we systematically break metagenome annotation down into its smallest components.
    • 2. Identification and analysis of IBD function across studies (assuming we do this). Tom to check the MGS paper. Curtis metagenome paper?
    • 3. Adaptive classification. Take reviewer 3's suggestion of adapting thresholds based on per-read properties, and make clear that (between this new addition and the --auto option) shotmap learns from the data to automatically apply the settings that maximize accuracy.
      • Accuracy of adaptive classification?
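
A minimal sketch of what per-read adaptive thresholds could look like, to make the idea concrete for the response. It assumes the read property is read length and that cutoffs come from a lookup table calibrated on simulations. The bin edges, cutoff values, and function names are hypothetical and are not shotmap's actual settings or API.

```python
from bisect import bisect_right

# Hypothetical calibration table (illustrative values only, not from shotmap):
# read-length bin edges in bp, and the bitscore cutoff assumed best per bin.
LENGTH_EDGES = [100, 200, 400]               # bins: <100, 100-199, 200-399, >=400
BITSCORE_CUTOFFS = [25.0, 31.0, 40.0, 55.0]  # one cutoff per bin


def cutoff_for_read(read_length):
    """Return the bitscore cutoff for a read of the given length."""
    return BITSCORE_CUTOFFS[bisect_right(LENGTH_EDGES, read_length)]


def classify(hits):
    """Assign each read to the family of its best hit that clears the
    read-specific cutoff.

    hits: iterable of (read_id, read_length, family, bitscore)
    returns: dict mapping read_id -> family (unclassified reads are absent)
    """
    best = {}
    for read_id, read_length, family, bitscore in hits:
        if bitscore < cutoff_for_read(read_length):
            continue  # hit fails the cutoff for a read of this length
        if read_id not in best or bitscore > best[read_id][1]:
            best[read_id] = (family, bitscore)
    return {read_id: family for read_id, (family, _) in best.items()}
```

The open question above still applies: the accuracy of any such scheme would have to be measured on simulated data before we claim it in the response.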


    • SN walked us through the PDF deck. Fig 1 shows Table 1 as a barplot.
    • Figure 2 - Simulated longer reads. As lengths increase, classifying by ORFs is critical. Also, alignment length really matters for longer reads.
    • Figure 3 - Updated figure on score cutoffs. Sequence error (within normal ranges) doesn't have a big impact on performance. More to the point, you can use the same bitscore threshold. Note that in these figures, error bars correspond to the bitscores you would select that would produce abundance estimates within 1% error of the optimum.
    • Figure 3 also describes the taxonomic exclusion analysis. Old analysis was leave-one-out at the genome level; this one is at the taxonomic level. Iterates over different taxonomic levels. Finds that optimal thresholds are generally robust to community composition, but relative abundance error can be way off. Why are genus/family and order/class showing up so similarly? Suggests that the database needs to phylogenetically reflect the community. Note a limitation of taxonomic restriction in the simulations: can't classify into a lineage if it contains only one genome and that genome has been excluded.
    • Maybe Figure 4 y-axis should be Bray-Curtis Dissimilarity Error
    • Figure 5 shows empirical false positive rates using negative data
    • Deposit data in public repo - westway
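
A rough sketch of how the Figure 3 error bars could be computed: given the abundance error at each bitscore threshold, find the optimum and the contiguous run of thresholds around it whose error stays within the tolerance. This assumes "within 1% error" means an absolute difference of 0.01 in relative abundance error; if it is meant as relative to the optimum, the comparison would need to change. The function name and inputs are hypothetical.

```python
def tolerance_range(thresholds, errors, tolerance=0.01):
    """Return (best, low, high) bitscore thresholds.

    best is the threshold with minimum error; low and high bound the
    contiguous run of thresholds around it whose error exceeds the
    minimum by no more than `tolerance` (assumed to be absolute).
    """
    pairs = sorted(zip(thresholds, errors))
    best_idx = min(range(len(pairs)), key=lambda i: pairs[i][1])
    best_err = pairs[best_idx][1]

    # Walk outward from the optimum while the error stays within tolerance.
    lo = hi = best_idx
    while lo > 0 and pairs[lo - 1][1] - best_err <= tolerance:
        lo -= 1
    while hi < len(pairs) - 1 and pairs[hi + 1][1] - best_err <= tolerance:
        hi += 1
    return pairs[best_idx][0], pairs[lo][0], pairs[hi][0]
```

Requiring a contiguous run means an isolated threshold far from the optimum that happens to score well does not stretch the error bar.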


    • Adaptive classification (considers read-specific properties)
    • Streamlined installation
    • Virtual machine
    • Having a hard time finding a PBS server. Maybe an Amazon implementation of the VM is OK?
    • Code will go in CPAN and CRAN ultimately
    • Not sure what to do about the "multiple-database analysis" comment. Could create a wrapper that dumps results for a series of databases defined by the user. May ignore for now.
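
If we do pursue the multiple-database wrapper, it could be as small as the sketch below. The actual annotation step is passed in as a callable, `run_annotation`, which is a hypothetical stand-in (it would wrap whatever shotmap invocation we settle on); the output file naming is also an assumption.

```python
import csv
from pathlib import Path


def run_over_databases(run_annotation, metagenome, databases, out_dir):
    """Run one annotation per database and dump each result to its own TSV.

    run_annotation: hypothetical callable (metagenome, db_path) returning a
        dict mapping family -> abundance
    databases: dict mapping a user-chosen database name -> database path
    returns: dict mapping database name -> Path of the file written
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for name, db_path in databases.items():
        abundances = run_annotation(metagenome, db_path)
        out_path = out_dir / f"{name}.abundance.tsv"
        with out_path.open("w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(["family", "abundance"])
            for family in sorted(abundances):
                writer.writerow([family, abundances[family]])
        written[name] = out_path
    return written
```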

Data analysis

    • IBD cohort novelty
    • AGS and IBD - How does it impact results, and is the difference in AGS in CD supported in the literature?
    • DB - is the input database different?
    • Optimal threshold? What was used previously? What was used here?
    • Katie comments: rarefaction only to abundant enough families
    • Are p-values highly correlated across studies? Yes.
    • Do we want to perform tests on all KOs or just MetaHIT?
    • AGS: n_reads_sampled should be number of reads analyzed in the metagenome
  • Will request extension without new reviewers
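
For the write-up, the cross-study p-value comparison could be summarized with a rank correlation over the KOs tested in both studies. The sketch below is one way to do that, not necessarily how the check above was run. It assumes per-KO p-values are held in dicts keyed by KO and uses Spearman's correlation (Pearson on average ranks); the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pvalue_rank_correlation(pvals_a, pvals_b):
    """Spearman correlation of p-values over KOs present in both studies."""
    shared = sorted(set(pvals_a) & set(pvals_b))
    ranks_a = average_ranks([pvals_a[ko] for ko in shared])
    ranks_b = average_ranks([pvals_b[ko] for ko in shared])
    return pearson(ranks_a, ranks_b)
```

Restricting to shared KOs also bears on the all-KOs versus MetaHIT question, since the set of KOs tested determines which p-values can be compared.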