Moore Notes 1 26 11

From OpenWetWare

Group Call

  • Niche mapping
    • James: GLM results from Josh (occupation probabilities at the genus level) used to make a genus-area plot
    • Tom: Richness plots http://docpollard.org/marine_diversity/index.html
      • Jonathan - put it in Dryad http://datadryad.org/
      • Morgan - should plots have the same color scale? Or at least group taxa into sets with a common color scale
      • Pelagibacter has a low AUC (i.e., the model doesn't predict its occurrence well; because Pelagibacter is found in many places, the environmental data carry little information about its presence)
    • Josh: Sending around an updated outline of the paper with new figures
  • Protein family database histograms
    • Profile HMMs' ability to re-classify their own sequences correctly using all hits, not just the best hit
      • 246,000 families have perfect recall
      • 47,000 families have perfect precision and perfect recall
    • Compare families to PFAMs and TIGRFAMs
      • identify families with similar PFAM distributions (could be members of a superfamily)
      • annotate domains present in each family and how many sequences in the family have them
    • Proceed with 47,000 perfect families to see how that goes in next steps of analysis
      • What are their size distribution and phylogenetic diversity? (Range is 2 to 400, mostly very small.) Plot.
    • Tom: how to improve precision?
      • Global alignment
    • Next steps:
      • Read classifier with database capability
        • Assignment of reads to families (based on hmmsearch results and simulations from genomes or sub-sampling metagenomic data)
        • Alignment of reads to profiles
        • Integration of reads into family phylogenies or at least some taxonomy assignment
      • Add new genome sequences (e.g., HMP) - round 2 of family production using these
        • Jonathan - would it be more productive to search metagenomic data (e.g., with CD-HIT)? Could be better to finish the pipeline on the families we have.
      • A user should be able to provide their own families
      • Release Guillaume's tools for generating families (using seed families)
      • Release precision/recall evaluation methods
      • Morgan: annotating the families with functions (e.g., PFAMs, IMG annotations?, GO terms)
      • Algorithmic/computational issues to make searches run efficiently (especially for large, shotgun data sets)
      • Read length (Sanger-length reads work; short reads may be hard to assign with HMMER)
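
For reference, the per-family precision/recall evaluation discussed under "Protein family database histograms" could be sketched roughly as below. This is only an illustration, not the group's actual evaluation code: the function name, the input shapes (a sequence-to-family map and a list of (sequence, family) hits that already passed some score threshold), and the toy data are all assumptions. It counts every hit, not just the best hit per sequence.

```python
from collections import defaultdict

def family_precision_recall(true_family, hits):
    """Per-family precision and recall, counting all hits (not only best hits).

    true_family: dict mapping sequence id -> the family it belongs to (assumed input)
    hits: iterable of (sequence id, family) pairs that passed a threshold (assumed input)
    Returns a dict mapping family -> (precision, recall).
    """
    members = defaultdict(set)
    for seq, fam in true_family.items():
        members[fam].add(seq)

    hit_seqs = defaultdict(set)
    for seq, fam in hits:
        hit_seqs[fam].add(seq)

    stats = {}
    for fam, mem in members.items():
        found = hit_seqs.get(fam, set())
        true_pos = len(found & mem)
        # Precision: fraction of sequences hit by the profile that are real members.
        precision = true_pos / len(found) if found else 0.0
        # Recall: fraction of real members that the profile recovers.
        recall = true_pos / len(mem)
        stats[fam] = (precision, recall)
    return stats

# Toy example (made-up ids): family A also hits s3, which belongs to B.
true_family = {"s1": "A", "s2": "A", "s3": "B"}
hits = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s3", "B")]
stats = family_precision_recall(true_family, hits)
perfect = [f for f, (p, r) in stats.items() if p == 1.0 and r == 1.0]
```

In the toy data, family A has perfect recall but imperfect precision, while family B is perfect on both, which mirrors the distinction between the 246,000 and 47,000 family counts above.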