Moore Notes 1 26 11

Group Call

Niche mapping
- James: GLM results from Josh (occupation probabilities at Genus level) used to make a genus-area plot
  - What is the appropriate sampling depth result to consider per grid cell?
  - The plot follows a power law when sampling per grid cell is low, but saturates very quickly at higher sampling depth
    - 50 reads per grid cell plot
    - 100 reads per grid cell plot
- Tom: Richness plots http://docpollard.org/marine_diversity/index.html
  - Jonathan - put it in dryad http://datadryad.org/
  - Morgan - should plots have the same color scale? Or at least group taxa into sets with common color scale
  - Pelagibacter has a low AUC: the model does not predict its occurrence well, because Pelagibacter is found in so many places that the environmental data carry little information about its presence
- Josh: Sending around an updated outline of the paper with new figures
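
One way to explore the sampling-depth question above is to rarefy every grid cell to a fixed number of reads (e.g., 50 or 100) and count the genera that remain. The following is a minimal sketch, not the method used for the plots discussed in the call. The input layout (a dict mapping a grid cell ID to a list of per-read genus labels), the function names, and the toy labels `genus_A`, `genus_B`, `genus_C` are all assumptions made for illustration.

```python
import random


def rarefied_genus_count(read_genera, depth, rng):
    """Draw `depth` reads without replacement and count distinct genera.

    Returns None when the cell holds fewer than `depth` reads.
    """
    if len(read_genera) < depth:
        return None
    return len(set(rng.sample(read_genera, depth)))


def mean_rarefied_richness(cells, depth, n_draws=100, seed=0):
    """Average rarefied genus count per grid cell over `n_draws` subsamples.

    `cells` maps a (hypothetical) grid cell ID to a list with one genus
    label per read. Cells with fewer than `depth` reads are skipped.
    """
    rng = random.Random(seed)
    richness = {}
    for cell_id, reads in cells.items():
        if len(reads) < depth:
            continue
        draws = [rarefied_genus_count(reads, depth, rng) for _ in range(n_draws)]
        richness[cell_id] = sum(draws) / n_draws
    return richness


# Toy data, invented purely for illustration.
example_cells = {
    "cell_1": ["genus_A"] * 80 + ["genus_B"] * 40 + ["genus_C"] * 10,
    "cell_2": ["genus_A"] * 30,
}

for depth in (50, 100):
    print(depth, mean_rarefied_richness(example_cells, depth))
```

Comparing the output at 50 and 100 reads shows which cells drop out at each depth and how much the genus count changes, which is the trade-off behind choosing a depth per grid cell.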

Protein family database histograms
- Profile HMMs' ability to re-classify their own sequences correctly, using all hits, not just the best hit
  - 246,000 families have perfect recall
  - 47,000 families have perfect precision and perfect recall
- Compare families to PFAMs and TIGRFAMs
  - identify families with similar PFAM distributions (could be members of a superfamily)
  - annotate domains present in each family and how many sequences in the family have them
- Proceed with 47,000 perfect families to see how that goes in next steps of analysis
  - What are their size distribution and phylogenetic diversity? (sizes range from 2 to 400; most are very small) plot
- Tom: how to improve precision?
  - Global alignment
- Next steps:
  - Read classifier with database capability
    - Assignment of reads to families (based on hmmsearch results and simulations from genomes or sub-sampling metagenomic data)
    - Alignment of reads to profiles
    - Integration of reads into family phylogenies or at least some taxonomy assignment
  - Add new genome sequences (e.g., HMP) - round 2 of family production using these
    - Jonathan - would it be more productive to search metagenomic data (e.g., with CD-HIT)? It could be better to finish the pipeline on the families we have.
  - A user should be able to provide their own families
  - Release Guillaume's tools for generating families (using seed families)
  - Release precision/recall evaluation methods
  - Morgan: annotating the families with functions (e.g., PFAMs, IMG annotations?, GO terms)
  - Algorithmic/computational issues to make searches run efficiently (especially for large, shotgun data sets)
  - Read length (Sanger-length reads work; short reads may be hard to assign with HMMER)
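
The precision/recall evaluation mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the group's actual evaluation code: it assumes recall is the fraction of a family's own sequences hit by that family's profile, and precision is the fraction of all sequences hit by the profile that belong to the family, counting every hit rather than just the best hit per sequence. The function names, input layout, and toy identifiers are hypothetical.

```python
from collections import defaultdict


def family_precision_recall(members, hits):
    """Compute (precision, recall) for each family using all hits.

    members: dict mapping family ID -> set of member sequence IDs.
    hits: iterable of (family ID, sequence ID) pairs, one per profile hit
          that passed whatever score threshold was applied upstream.
    """
    hit_sets = defaultdict(set)
    for family, seq_id in hits:
        hit_sets[family].add(seq_id)

    scores = {}
    for family, true_members in members.items():
        found = hit_sets.get(family, set())
        true_pos = len(found & true_members)
        precision = true_pos / len(found) if found else 0.0
        recall = true_pos / len(true_members) if true_members else 0.0
        scores[family] = (precision, recall)
    return scores


def perfect_families(scores):
    """Families whose precision and recall are both exactly 1."""
    return sorted(f for f, (p, r) in scores.items() if p == 1.0 and r == 1.0)


# Toy data, invented for illustration: the fam2 profile also hits s1,
# a fam1 sequence, so fam2 keeps perfect recall but loses precision.
example_members = {"fam1": {"s1", "s2"}, "fam2": {"s3", "s4"}}
example_hits = [
    ("fam1", "s1"), ("fam1", "s2"),
    ("fam2", "s3"), ("fam2", "s4"), ("fam2", "s1"),
]
example_scores = family_precision_recall(example_members, example_hits)
print(example_scores)
print(perfect_families(example_scores))
```

In the toy data both families have perfect recall, but only `fam1` also has perfect precision, which mirrors the gap between the two family counts reported above.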
