Moore Notes 2 12 14
From OpenWetWare
Subgroup to discuss SIMAP
- Participants: Katie, Stephen, Tom, Patrick
- What are our specific goals?
- Stephen: gene catalog of gut microbiome genes (from shotgun metagenomes) and map to annotations
- Tom: what about novel genes only in metagenomes?
- Stephen: could do some assembly from shotgun metagenomes
- Tom: need to avoid chimeras (see Bork paper)
- Patrick: could SIMAP be used to identify chimeras?
- Stephen: value is one mapping of reads from a sample to gene catalog and then to many gene families/annotations
- Stephen: main challenge is partial-length sequences (could focus on full-length gene predictions, though these contain errors)
- Katie: same for marine gene catalog
- Tom: same for other phylogenetically diverse communities (i.e., divergent from databases)
- Updating genome-based gene family resources (e.g., Sfams)
- Tom: value of this is potentially marginal, though HMMs are useful for new genome annotation
- Probably greatest value is for environments that are not well studied, do not have lots of metagenomes
- Patrick: expanding smaller databases such as BioCyc
- Will follow up with Peter Karp on Feb 28
- Can these goals be fulfilled using existing protein family resources (eggNOG, FIGfams, BioCyc, KEGG, etc.)?
- Not metagenomic gene catalogs
- Versus updating Sfams
- Benefit of Sfams is more diversity
- Cost is lots of effort
- If we were interested in a larger scope project
- Might be interesting to integrate other kinds of information into the protein family network.
- e.g., protein family co-occurrence (within genomes, operons, environments, samples, etc.).
- This could give us a hint as to the function of novel protein families (if they co-occur with a known family)
- Could also help us discover completely novel gene clusters, protein complexes, metabolic pathways, etc.
- Tom is writing a proposal about environmental co-occurrence
- Depends on having sufficient number of metagenomes
- Genome co-occurrence and operon co-occurrence would be very useful for getting at function
- Tom: also helps with genome annotation
- Tom: different ways of defining function are useful
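The co-occurrence idea above can be sketched in a few lines. This is only an illustration under assumptions: the family names, the toy genome contents, and the helper functions `cooccurrence_counts` and `partners_of` are all hypothetical, and a raw shared-unit count is used here in place of whatever statistic the group would actually choose. The same counting works whether the units are genomes, operons, or samples.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(family_sets):
    """Count, for each pair of families, how many units contain both.

    family_sets maps a unit name (genome, operon, sample, ...) to the
    set of protein family identifiers observed in that unit.
    """
    pair_counts = Counter()
    for families in family_sets.values():
        # Sort so each unordered pair is always stored as (smaller, larger).
        for a, b in combinations(sorted(set(families)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def partners_of(family, pair_counts, min_count=2):
    """List families that co-occur with `family` in at least min_count units."""
    partners = []
    for (a, b), n in pair_counts.items():
        if n >= min_count and family in (a, b):
            partners.append((b if a == family else a, n))
    # Most frequent partners first, ties broken by name.
    return sorted(partners, key=lambda p: (-p[1], p[0]))


# Toy, made-up data: which families occur in which genome.
genomes = {
    "genome_A": {"fam_known_1", "fam_known_2", "fam_novel_X"},
    "genome_B": {"fam_known_1", "fam_novel_X"},
    "genome_C": {"fam_known_2"},
}

counts = cooccurrence_counts(genomes)
print(partners_of("fam_novel_X", counts))
```

In this toy case the novel family is only linked to `fam_known_1`, which is the kind of hint about function discussed above. With few units the counts are unreliable, which matches the note that the approach depends on having enough metagenomes.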
- If we were to build families, which sequences would we be interested in?
- Specific to an environment?
- To a set of taxa?
- Do they need to be full length?
- Include metagenomic data?
- We would likely need to come up with some new strategies to deal with clustering partial-length sequences into families.
- Would this be a limited-scope project mainly targeted for a specific application?
- Example: identification of novel protein families in the gut? or surface ocean waters?
- Versus larger project designed to provide a new kind of resource to the community
- Focus on what we need to do our science
- Start with a specific application for our lab then see if a resource is a good idea
- What SIMAP searches might we need?
- Genes vs. genes within catalog
- Genes in catalog vs. other gene sets
- For these two, SIMAP may be useful
- Stephen will contact them
- Reads vs. gene catalog
- Probably not SIMAP, but Bowtie or something similar
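The "map reads once, annotate many ways" value that Stephen described can be sketched as follows. This is a minimal illustration, assuming reads have already been mapped to the gene catalog by some read mapper and summarized as per-gene read counts. The gene names, the annotation scheme names, and the `family_abundance` helper are hypothetical, and no normalization for gene length or sequencing depth is attempted.

```python
from collections import defaultdict


def family_abundance(gene_read_counts, gene_to_families):
    """Sum per-gene read counts into per-family totals.

    gene_read_counts: gene identifier -> number of reads mapped to it.
    gene_to_families: gene identifier -> list of family identifiers.
    Genes with no annotation in this scheme are skipped.
    """
    totals = defaultdict(int)
    for gene, n_reads in gene_read_counts.items():
        for fam in gene_to_families.get(gene, ()):
            totals[fam] += n_reads
    return dict(totals)


# One mapping of a sample's reads to the catalog (made-up counts).
gene_read_counts = {"gene_1": 10, "gene_2": 5, "gene_3": 7}

# Several independent annotation schemes for the same catalog (made up).
annotations = {
    "scheme_A": {"gene_1": ["A_fam1"], "gene_2": ["A_fam1"], "gene_3": ["A_fam2"]},
    "scheme_B": {"gene_1": ["B_fam9"], "gene_3": ["B_fam9", "B_fam3"]},
}

# The read mapping is reused unchanged for every scheme.
profiles = {
    name: family_abundance(gene_read_counts, mapping)
    for name, mapping in annotations.items()
}
print(profiles)
```

The design point is that the expensive step (reads vs. gene catalog) runs once per sample, while each additional annotation resource only adds a cheap lookup table from catalog genes to families.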