Moore Notes 2 12 14

From OpenWetWare

Subgroup to discuss SIMAP

  • Participants: Katie, Stephen, Tom, Patrick
  • What are our specific goals?
    • Stephen: gene catalog of gut microbiome genes (from shotgun metagenomes) and map to annotations
      • Tom: what about novel genes only in metagenomes?
      • Stephen: could do some assembly from shotgun metagenomes
      • Tom: need to avoid chimeras (see Bork paper)
      • Patrick: could SIMAP be used to identify chimeras?
      • Stephen: value is one mapping of reads from a sample to gene catalog and then to many gene families/annotations
      • Stephen: main challenge is partial-length sequences (could focus on full-length gene predictions, though these contain errors)
    • Katie: same for marine gene catalog
    • Tom: same for other phylogenetically diverse communities (i.e., divergent from databases)
    • Updating genome-based gene family resources (e.g., Sfams)
      • Tom: value of this is marginal potentially, though HMMs are useful for new genome annotation
      • Greatest value is probably for environments that are not well studied and do not have many metagenomes
    • Patrick: expanding smaller databases such as BioCyc
      • Will follow up with Peter Karp on Feb 28
  • Can these goals be fulfilled using existing protein family resources (eggNOG, FIGfams, BioCyc, KEGG, etc.)?
    • Not metagenomic gene catalogs
    • Versus updating Sfams
      • Benefit of Sfams is more diversity
      • Cost is lots of effort
  • If we were interested in a larger-scope project
    • Might be interesting to integrate other kinds of information into the protein family network.
      • e.g., protein family co-occurrence (within genomes, operons, environments, samples, etc.).
      • This could give us a hint as to the function of novel protein families (if they co-occur with a known family)
      • Could also help us discover completely novel gene clusters, protein complexes, metabolic pathways, etc.
    • Tom is writing a proposal about environmental co-occurrence
      • Depends on having a sufficient number of metagenomes
    • Genome co-occurrence and operon co-occurrence would be very useful for getting at function
      • Tom: also helps with genome annotation
      • Tom: different ways of defining function are useful
  • If we were to build families, which sequences would we be interested in?
    • Specific to an environment?
    • To a set of taxa?
    • Do they need to be full length?
    • Include metagenomic data?
      • We would likely need to come up with some new strategies for clustering partial-length sequences into families.
  • Would this be a limited-scope project mainly targeted for a specific application?
    • Example: identification of novel protein families in the gut? or surface ocean waters?
    • Versus larger project designed to provide a new kind of resource to the community
    • Focus on what we need to do our science
    • Start with a specific application for our lab then see if a resource is a good idea
  • What SIMAP searches might we need?
    • Genes vs. genes within catalog
    • Genes in catalog vs. other gene sets
      • For these two, SIMAP may be useful
      • Stephen will contact them
    • Reads vs. gene catalog
      • Probably not SIMAP, but Bowtie or something similar
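
Stephen's point above, that reads are mapped once to the gene catalog and the counts are then reused for many gene families/annotations, could be sketched roughly as follows. This is only an illustration: the function name `family_abundance`, the two input dictionaries, and all read, gene, resource, and family identifiers are hypothetical, and the read-to-gene assignments are assumed to come from whichever read mapper is chosen.

```python
from collections import Counter, defaultdict


def family_abundance(read_to_gene, gene_to_families):
    """Propagate per-gene read counts to every family a gene belongs to.

    read_to_gene: dict of read ID -> catalog gene ID (one mapping per sample).
    gene_to_families: dict of gene ID -> list of (resource, family) pairs.
    Both structures are hypothetical; they are not taken from any tool.
    """
    # Count reads per catalog gene (the single, expensive mapping step).
    gene_counts = Counter(read_to_gene.values())

    # Reuse those counts for each annotation resource.
    totals = defaultdict(lambda: defaultdict(int))
    for gene, n_reads in gene_counts.items():
        # Genes with no annotation (e.g., novel genes) contribute nothing here.
        for resource, family in gene_to_families.get(gene, []):
            totals[resource][family] += n_reads

    return {resource: dict(fams) for resource, fams in totals.items()}


if __name__ == "__main__":
    # Made-up example data.
    reads = {"read1": "gene1", "read2": "gene1", "read3": "gene2", "read4": "gene3"}
    annotations = {
        "gene1": [("resourceA", "famA1"), ("resourceB", "famB1")],
        "gene2": [("resourceA", "famA1")],
        # gene3 is unannotated.
    }
    print(family_abundance(reads, annotations))
```

The read-to-gene step is done once per sample; adding a new family resource only requires a new gene-to-family table, not a new round of read mapping.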