Moore Notes 2 12 14

From OpenWetWare

Subgroup to discuss SIMAP

Participants: Katie, Stephen, Tom, Patrick

What are our specific goals?
- Stephen: gene catalog of gut microbiome genes (from shotgun metagenomes) and map to annotations
  - Tom: what about novel genes only in metagenomes?
  - Stephen: could do some assembly from shotgun metagenomes
  - Tom: need to avoid chimeras (see Bork paper)
  - Patrick: could SIMAP be used to identify chimeras?
  - Stephen: value is one mapping of reads from a sample to gene catalog and then to many gene families/annotations
  - Stephen: main challenge is partial-length sequences (could focus on full-length gene predictions, though these contain errors)
- Katie: same for marine gene catalog
- Tom: same for other phylogenetically diverse communities (i.e., divergent from databases)
- Updating genome-based gene family resources (e.g., Sfams)
  - Tom: value of this is potentially marginal, though HMMs are useful for new genome annotation
  - Probably greatest value is for environments that are not well studied and do not have many metagenomes
- Patrick: expanding smaller databases such as BioCyc
  - Will follow up with Peter Karp on Feb 28
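
Stephen's point above (map reads to the gene catalog once, then reuse that mapping for many gene families/annotations) can be sketched as follows. This is only a minimal illustration, not a method agreed on in the meeting: the function name `family_abundance`, the input dictionaries, and all read, gene, source, and family identifiers are hypothetical placeholders, and it assumes each read has already been assigned to a single catalog gene.

```python
from collections import Counter, defaultdict


def family_abundance(read_to_gene, gene_to_families):
    """Roll read-level hits up to family-level counts per annotation source.

    read_to_gene: dict mapping read ID -> catalog gene ID (one mapping step).
    gene_to_families: dict mapping gene ID -> {source name: [family IDs]}.
    Both structures are assumptions made for this sketch.
    """
    # Count reads per catalog gene; this is the only per-sample step.
    gene_counts = Counter(read_to_gene.values())

    # Reuse the same gene counts for every annotation source.
    totals = defaultdict(Counter)
    for gene, n_reads in gene_counts.items():
        for source, families in gene_to_families.get(gene, {}).items():
            for family in families:
                totals[source][family] += n_reads
    return {source: dict(counts) for source, counts in totals.items()}


# Hypothetical toy data: three reads, two catalog genes, two annotation sources.
read_to_gene = {"read1": "geneA", "read2": "geneA", "read3": "geneB"}
gene_to_families = {
    "geneA": {"sourceX": ["famX1"], "sourceY": ["famY1"]},
    "geneB": {"sourceX": ["famX1", "famX2"]},
}
abundance = family_abundance(read_to_gene, gene_to_families)
```

Genes without any annotation (e.g., novel genes found only in metagenomes) simply contribute to no family here; tracking them separately would be a natural extension.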

Can these goals be fulfilled using existing protein family resources (eggNOG, FIGfams, BioCyc, KEGG, etc.)?
- Not metagenomic gene catalogs
- Versus updating Sfams
  - Benefit of Sfams is more diversity
  - Cost is lots of effort

If we were interested in a larger scope project
- Might be interesting to integrate other kinds of information into the protein family network.
  - e.g., protein family co-occurrence (within genomes, operons, environments, samples, etc.).
  - This could give us a hint as to the function of novel protein families (if they co-occur with a known family)
  - Could also help us discover completely novel gene clusters, protein complexes, metabolic pathways, etc.
- Tom is writing a proposal about environmental co-occurrence
  - Depends on having sufficient number of metagenomes
- Genome co-occurrence and operon co-occurrence would be very useful for getting at function
  - Tom: also helps with genome annotation
  - Tom: different ways of defining function are useful
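
One simple way to quantify the co-occurrence idea discussed above is a Jaccard index over the sets of genomes (or operons, samples, environments) in which each family is found. The sketch below is an assumption for illustration only: the notes do not specify a scoring method, and the function name `cooccurrence_jaccard`, the family names, and the genome IDs are hypothetical.

```python
from itertools import combinations


def cooccurrence_jaccard(family_to_units):
    """Score every pair of families by the Jaccard index of their presence sets.

    family_to_units: dict mapping family ID -> set of units (genomes, operons,
    samples, ...) in which the family was observed.
    Returns a dict keyed by (family_a, family_b) in sorted order.
    """
    scores = {}
    for fam_a, fam_b in combinations(sorted(family_to_units), 2):
        units_a, units_b = family_to_units[fam_a], family_to_units[fam_b]
        union = units_a | units_b
        if union:  # skip pairs where neither family was observed anywhere
            scores[(fam_a, fam_b)] = len(units_a & units_b) / len(union)
    return scores


# Hypothetical presence data: a novel family that tracks a known family.
presence = {
    "known_fam": {"g1", "g2", "g3"},
    "novel_fam": {"g1", "g2", "g3", "g4"},
    "other_fam": {"g5"},
}
scores = cooccurrence_jaccard(presence)
```

A high score between a novel family and a known one would be the kind of hint about function mentioned above. As the notes point out, any such score is only as reliable as the number of genomes or metagenomes behind it.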

If we were to build families, which sequences would we be interested in?
- Specific to an environment?
- To a set of taxa?
- Do they need to be full length?
- Include metagenomic data?
  - We would likely need to come up with some new strategies for clustering partial-length sequences into families.

Would this be a limited-scope project mainly targeted for a specific application?
- Example: identification of novel protein families in the gut? or surface ocean waters?
- Versus larger project designed to provide a new kind of resource to the community
- Focus on what we need to do our science
- Start with a specific application for our lab then see if a resource is a good idea

What SIMAP searches might we need?
- Genes vs. genes within catalog
- Genes in catalog vs. other gene sets
  - For these two, SIMAP may be useful
  - Stephen will contact them
- Reads vs. gene catalog
  - Probably not SIMAP, but Bowtie or something similar
