Moore Notes 1 26 11
From OpenWetWare
Group Call
- Niche mapping
- James: GLM results from Josh (occupation probabilities at Genus level) used to make a genus-area plot
- What is the appropriate sampling depth result to consider per grid cell?
- A power law appears when sampling per grid cell is low, but the curve saturates very quickly at higher sampling depth
- Tom: Richness plots http://docpollard.org/marine_diversity/index.html
- Jonathan: put it in Dryad http://datadryad.org/
- Morgan: should plots have the same color scale? Or at least group taxa into sets with a common color scale
- Pelagibacter has a low AUC (i.e., the model doesn't predict its occurrence well: Pelagibacter is present in many places, so the environmental data carry little information about its presence)
- Josh: Sending around an updated outline of the paper with new figures
- Protein family database histograms
- Profile HMMs' ability to re-classify their own sequences correctly using all hits, not just the best hit
- 246,000 families have perfect recall
- 47,000 families have perfect precision and perfect recall
- Compare families to PFAMs and TIGRFAMs
- identify families with similar PFAM distributions (could be members of a superfamily)
- annotate domains present in each family and how many sequences in the family have them
- Proceed with 47,000 perfect families to see how that goes in next steps of analysis
- What is their size distribution? And phylogenetic diversity? (Range is 2 to 400, mostly very small.) Plot these.
- Tom: how to improve precision?
- Global alignment
- Next steps:
- Read classifier with database capability
- Assignment of reads to families (based on hmmsearch results and simulations from genomes or sub-sampling metagenomic data)
- Alignment of reads to profiles
- Integration of reads into family phylogenies or at least some taxonomy assignment
- Add new genome sequences (e.g., HMP) - round 2 of family production using these
- Jonathan: would it be more productive to search metagenomic data (e.g., with CD-HIT)? Could be better to finish the pipeline on the families we have.
- A user should be able to provide their own families
- Release Guillaume's tools for generating families (using seed families)
- Release precision/recall evaluation methods
- Morgan: annotating the families with functions (e.g., PFAMs, IMG annotations?, GO terms)
- Algorithmic/computational issues to make searches run efficiently (especially for large, shotgun data sets)
- Read length (Sanger-length reads work; short reads may be hard to assign with HMMER)
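For the genus-area plot discussed under niche mapping, one way to quantify the power-law regime seen at low sampling depth is a straight-line fit on log-log axes. The sketch below is only an illustration: it assumes the relationship has the form genera = c * area^z, the function name `fit_power_law` is hypothetical, and the data are synthetic rather than taken from the GLM results.

```python
import numpy as np

def fit_power_law(area, genera):
    """Fit genera = c * area**z by least squares on log-log axes.

    Returns (c, z). Assumes all inputs are strictly positive.
    """
    log_area = np.log(np.asarray(area, dtype=float))
    log_genera = np.log(np.asarray(genera, dtype=float))
    # polyfit returns coefficients highest degree first: [slope, intercept]
    z, log_c = np.polyfit(log_area, log_genera, 1)
    return float(np.exp(log_c)), float(z)

# Synthetic example data (not from the notes): an exact power law
area = np.array([1, 2, 4, 8, 16, 32], dtype=float)
genera = 5.0 * area ** 0.25

c, z = fit_power_law(area, genera)
print(f"c = {c:.3f}, z = {z:.3f}")
```

A fit like this only describes the unsaturated part of the curve; cells sampled deeply enough to saturate would need to be excluded or modeled separately.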
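The per-family precision and recall figures above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the evaluation method used by the group: it assumes recall is the fraction of a family's own sequences that hit its profile, and precision is the fraction of all sequences hitting the profile that belong to the family, counting every significant hit rather than only the best one. The function name and input layout are hypothetical.

```python
from collections import defaultdict

def family_precision_recall(true_family, hits):
    """Compute per-family (precision, recall) from all hits.

    true_family: dict mapping sequence ID -> family it belongs to
    hits: iterable of (sequence ID, family) pairs, one per significant
          hit of a sequence against a family profile (all hits, not
          just the best hit per sequence)
    """
    members = defaultdict(set)
    for seq, fam in true_family.items():
        members[fam].add(seq)

    hit_seqs = defaultdict(set)
    for seq, fam in hits:
        hit_seqs[fam].add(seq)

    scores = {}
    for fam, own in members.items():
        found = hit_seqs.get(fam, set())
        true_pos = len(found & own)
        # A family whose profile hits nothing gets precision 0.0 here
        precision = true_pos / len(found) if found else 0.0
        recall = true_pos / len(own)
        scores[fam] = (precision, recall)
    return scores

# Toy data (illustrative only): s3 belongs to B but also hits profile A
true_family = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
hits = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s3", "B"), ("s4", "B")]

scores = family_precision_recall(true_family, hits)
for fam, (p, r) in sorted(scores.items()):
    print(fam, round(p, 3), round(r, 3))
```

In this toy case family A has perfect recall but imperfect precision, while family B has both, mirroring the two categories counted in the notes.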