Moore Notes 7 21 10
General business
- Jess attended the Moore Foundation senior investigator meeting; the theme was metatranscriptomics. Lots of interest in whether iSEEM has made progress on building phylogenies using metagenomic data.
Morgan: clustering protein families and samples (Presentation)
- Pipeline - take all PFAMs at domain level (around 11K HMMs). Use HMMER to compare each PFAM against metagenomic data sets.
- CAMERA metagenomic proteins (43M proteins, around 100 samples)
- BGI Human gut assembled reads (6.5M reads, around 100 samples)
- Results in a matrix where rows are PFAMs and columns are samples; each cell is the count of reads assigned to that PFAM in that sample
- Normalize samples by column totals
- Measure correlations among PFAMs and samples using Pearson correlations
- Try other measures of similarity?
- Correlation assumes normality and linear relationships, which is probably not the case for these data
- e.g. Bray-Curtis or Sorensen distance among samples/PFAMs
- Can you specify a model that will in turn determine which similarity measure you use?
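The normalization and similarity steps above can be sketched in a few lines of Python. This is only an illustration on a made-up 3 x 3 count matrix, not the code used in the pipeline; the function names and the toy counts are assumptions, and only the standard library is used.

```python
from math import sqrt

def normalize_columns(counts):
    """Divide every cell by its column (sample) total to get relative abundances."""
    n_cols = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_cols)]
    return [[row[j] / totals[j] if totals[j] else 0.0 for j in range(n_cols)]
            for row in counts]

def pearson(x, y):
    """Pearson correlation; returns NaN if either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i), between 0 and 1."""
    denom = sum(x) + sum(y)
    if denom == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / denom

def column(matrix, j):
    """Extract column j (one sample's profile across all PFAMs)."""
    return [row[j] for row in matrix]

# Toy counts (invented): rows are PFAMs, columns are samples.
counts = [
    [10, 20, 0],
    [5, 10, 1],
    [0, 0, 9],
]

rel = normalize_columns(counts)
sample_a, sample_b, sample_c = (column(rel, j) for j in range(3))

print("Pearson a vs b:", pearson(sample_a, sample_b))
print("Bray-Curtis a vs b:", bray_curtis(sample_a, sample_b))
print("Bray-Curtis a vs c:", bray_curtis(sample_a, sample_c))
```

Samples a and b differ only in sequencing depth, so after column normalization their profiles are identical and the Bray-Curtis dissimilarity is zero. The same functions apply to rows if the question is similarity among PFAMs rather than among samples.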
- Visualize results in R/Cytoscape
- Figures to visualize distribution of correlations among PFAMs and samples
- Average among-PFAM correlation is close to zero
- i.e. most PFAMs don't overlap/co-occur at all
- Most among-sample correlations close to 1, especially for BGI data set
- sign of low variation among samples?
- may be a result of the correlation statistic - i.e. correlations between Poisson distributed variables will tend to be close to either 0 or 1.
- Heatmaps to visualize PFAM x sample occurrence - red is high abundance, black is low/zero
- Cytoscape visualization of clusters of PFAMs with high correlations (>0.9)
- See a few big clusters of abundant PFAMs that co-occur and lots of rare PFAMs that don't co-occur with other PFAMs very much
- Maybe try methods for distances in sparse matrices and/or co-occurrence methods that account for differences in relative abundance
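The thresholded network behind the Cytoscape view can be approximated as follows: connect two PFAMs when their Pearson correlation exceeds 0.9, then report the connected components as clusters. This is a rough sketch, not the procedure actually used for the figures; the family names (`fam_A` etc.) and abundance profiles are invented, and treating connected components as "clusters" is an assumption.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns NaN if either vector has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def correlation_clusters(profiles, threshold=0.9):
    """Link PFAMs whose correlation exceeds `threshold`; return connected components."""
    names = list(profiles)
    neighbors = {name: set() for name in names}
    for a, b in combinations(names, 2):
        # A NaN correlation compares False here, so no edge is added.
        if pearson(profiles[a], profiles[b]) > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters, seen = [], set()
    for name in names:
        if name in seen:
            continue
        # Depth-first search collects everything reachable from `name`.
        stack, component = [name], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Hypothetical abundance profiles across four samples.
profiles = {
    "fam_A": [10, 20, 30, 40],
    "fam_B": [11, 19, 31, 42],
    "fam_C": [5, 0, 0, 1],
    "fam_D": [0, 0, 3, 0],
}

clusters = correlation_clusters(profiles)
print(clusters)
```

In this toy example the two abundant, co-varying families end up in one cluster and the two rare families are singletons, which mirrors the pattern described above. An edge list built the same way could be exported for Cytoscape.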
- Visualization of clustering of samples within CAMERA
- See clustering mostly based on study
- Would be useful to combine different data sets (i.e. joint analysis of GOS/BGI data)
- Would be interesting to look at how PFAMs are assembled into communities/networks
- Link to community assembly literature - i.e. assembly and co-occurrence of PFAM networks in space, environment
- Should contact Ed Connor to discuss this data set/approach - he is very interested in these types of data
- What are effects of taxonomic/phylogenetic clustering on the patterns being observed?
- i.e. are differences among GOS studies due to PFAM differences or because different organisms are there?
- The PFAM approach is a 'bag of genes' approach that ignores taxonomic relationships among organisms
- There are ecological methods to deal with multiple data sets - i.e. compare taxonomic vs. PFAM signal in community structure
- iSEEM people have been discussing this type of approach
- How might this work? Bin reads into taxonomic groups/OTUs, plus PFAMs
- Could use '4th corner problem' approaches
- It would be useful to compare similarity among samples measured in numerous different ways
- Differences among protein/functional similarity vs. taxonomy vs. phylogeny vs. etc.
- When we look at taxonomic composition of a community we can calculate phylogenetic similarity among those taxa
- Is there an analogy for similarity of PFAMs? i.e. can PFAMs be grouped together into a phylogeny or other more continuous measure of variation among PFAMs?
- You could build a phylogeny for individual protein families, but you probably can't combine them into a single tree very easily
- You could assign PFAMs to (e.g.) KEGG and then use the KEGG hierarchical ontology as a hierarchical measure of similarity among PFAMs
- Basically, binning reads into PFAMs in this way creates a community data set, so you could apply any ecological community analysis to these data
- i.e. rarefaction or Chao estimator to predict total number of PFAMs in a sample (PFAM richness, estimated # of unobserved PFAMs)
- Any ecological similarity/co-occurrence measure among samples/PFAMs
- Indicator species analysis (identify PFAMs associated with a particular habitat)
- Find PFAM markers for different genomes or samples
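As one example of treating PFAM counts as a community data set, the Chao1 richness estimator can be computed directly from the per-PFAM read counts of a sample. The sketch below uses the classic form S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers of PFAMs seen exactly once and exactly twice, and falls back to the bias-corrected form when F2 is zero. The counts are invented, and whether Chao1's assumptions hold for read-to-PFAM assignments is an open question.

```python
def chao1(counts):
    """Chao1 estimate of total richness from per-category counts in one sample.

    Uses S_obs + F1^2 / (2 * F2); when there are no doubletons (F2 == 0),
    uses the bias-corrected form S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    if doubletons > 0:
        return observed + singletons ** 2 / (2 * doubletons)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Hypothetical read counts per PFAM in a single sample.
sample_counts = [12, 7, 1, 1, 1, 2, 2, 0, 0, 5]

observed = sum(1 for c in sample_counts if c > 0)
estimate = chao1(sample_counts)
print("Observed PFAMs:", observed)
print("Chao1 estimate:", estimate)
print("Estimated unobserved PFAMs:", estimate - observed)
```

The difference between the estimate and the observed count is the estimated number of unobserved PFAMs mentioned above. Because the estimator depends heavily on singleton and doubleton counts, it is sensitive to whether counts come from raw reads or from assemblies.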
- To get good counts it would be good to repeat with raw reads instead of protein predictions/assemblies
- Hard to do PFAM binning on raw Illumina reads due to their short length; could develop new scores that would work for fragments. Currently using assembled reads.
- Anyone interested in discussing further or playing with the data should email Morgan, and we can discuss how to proceed