Moore Notes 7 21 10

From OpenWetWare

General business

  • Jess attended the Moore Foundation senior investigator meeting - theme was metatranscriptomics. Lots of interest in whether iSEEM has made progress on building phylogenies using metagenomic data.

Morgan: clustering protein families and samples (Presentation)

  • Pipeline - take all PFAMs at domain level (around 11K HMMs). Use HMMer to compare each PFAM against metagenomic data sets.
    • CAMERA metagenomic proteins (43M proteins, around 100 samples)
    • BGI Human gut assembled reads (6.5M reads, around 100 samples)
  • Results in a matrix where rows are PFAMs and columns are samples; each cell is the count of reads assigned to that PFAM in that sample
  • Normalize samples by column totals
  • Measure correlations among PFAMs and samples using Pearson correlations
    • Try other measures of similarity?
    • Correlation assumes normality and linear relationships, which is probably not the case for these data
    • e.g. Bray-Curtis or Sorensen distance among samples/PFAMs
    • Can you specify a model that will in turn determine which similarity measure you use?
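The normalization and similarity steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the counts are made-up toy values (not CAMERA or BGI data), the helper names are hypothetical, and Bray-Curtis is shown simply as one of the alternatives raised in discussion, not as the measure used in the presentation.

```python
from math import sqrt

# Hypothetical toy counts (not real data): rows are PFAMs, columns are samples.
counts = [
    [10, 20, 0],    # pfam_A
    [30, 60, 5],    # pfam_B
    [60, 120, 95],  # pfam_C
]

def normalize_columns(matrix):
    """Divide each cell by its column (sample) total; assumes no empty samples."""
    totals = [sum(col) for col in zip(*matrix)]
    return [[cell / total for cell, total in zip(row, totals)] for row in matrix]

def column(matrix, j):
    """Return column j (one sample's profile across all PFAMs)."""
    return [row[j] for row in matrix]

def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom > 0 else float("nan")

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 0 = identical profiles, 1 = nothing shared."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

rel = normalize_columns(counts)
s1, s2, s3 = (column(rel, j) for j in range(3))

print("s1 vs s2: Pearson", pearson(s1, s2), "Bray-Curtis", bray_curtis(s1, s2))
print("s1 vs s3: Pearson", pearson(s1, s3), "Bray-Curtis", bray_curtis(s1, s3))
```

In this toy case samples 1 and 2 have identical relative abundances even though their raw totals differ, which is what column normalization is meant to achieve. Samples 1 and 3 still have a Pearson correlation above 0.9 while their Bray-Curtis dissimilarity is 0.35, so the two measures can give quite different pictures of the same pair.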
  • Visualize results in R/Cytoscape
  • Figures to visualize distribution of correlations among PFAMs and samples
    • Average among-PFAM correlation is close to zero
      • i.e. most PFAMs don't overlap/co-occur at all
    • Most among-sample correlations close to 1, especially for BGI data set
      • sign of low variation among samples?
      • may be a result of the correlation statistic - i.e. correlations between Poisson-distributed variables will tend to be close to either 0 or 1.
  • Heatmaps to visualize PFAM x sample occurrence - red is high abundance, black is low/zero
  • Cytoscape visualization of clusters of PFAMs with high correlations (>0.9)
    • See a few big clusters of abundant PFAMs that co-occur and lots of rare PFAMs that don't co-occur with other PFAMs very much
    • Maybe try methods for distances in sparse matrices and/or co-occurrence methods that account for differences in relative abundance
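One simple way to get clusters like those described above is to keep only PFAM pairs whose correlation exceeds the threshold and take connected components of the resulting graph. The sketch below assumes that definition of a cluster (the notes do not say how Cytoscape was configured), and the PFAM names and correlation values are hypothetical.

```python
def correlation_clusters(names, corr, threshold=0.9):
    """Group names into connected components of the graph whose edges are
    pairs with correlation above `threshold` (union-find with path halving)."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # shorten the path as we go
            name = parent[name]
        return name

    for (a, b), r in corr.items():
        if r > threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Largest clusters first; singletons are PFAMs with no strong partner.
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical pairwise correlations among five PFAMs (not real results).
names = ["pfam_A", "pfam_B", "pfam_C", "pfam_D", "pfam_E"]
corr = {
    ("pfam_A", "pfam_B"): 0.95,
    ("pfam_B", "pfam_C"): 0.92,
    ("pfam_A", "pfam_C"): 0.50,
    ("pfam_D", "pfam_E"): 0.10,
    ("pfam_A", "pfam_D"): 0.00,
}

clusters = correlation_clusters(names, corr)
print(clusters)
```

Note that pfam_A and pfam_C end up in the same cluster through pfam_B even though their own correlation is below the threshold; a stricter definition (e.g. requiring every pair in a cluster to exceed the threshold) would split them.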
  • Visualization of clustering of samples within CAMERA
    • See clustering mostly based on study
    • Would be useful to combine different data sets (i.e. joint analysis of GOS/BGI data)
  • Would be interesting to look at how PFAMs are assembled into communities/networks
    • Link to community assembly literature - i.e. assembly and co-occurrence of PFAM networks in space, environment
    • Should contact Ed Connor to discuss this data set/approach - he is very interested in these types of data
  • What are effects of taxonomic/phylogenetic clustering on the patterns being observed?
    • i.e. are differences among GOS studies due to PFAM differences or because different organisms are there?
    • The PFAM approach is a 'bag of genes' approach that ignores taxonomic relationships among organisms
    • There are ecological methods to deal with multiple data sets - i.e. compare taxonomic vs. PFAM signal in community structure
      • iSEEM people have been discussing this type of approach
      • How might this work? Bin reads into taxonomic groups/OTUs, plus PFAMs
      • Could use '4th corner problem' approaches
  • It would be useful to compare similarity among samples measured in numerous different ways
    • Differences among protein/functional similarity vs. taxonomy vs. phylogeny vs. etc.
  • When we look at taxonomic composition of a community we can calculate phylogenetic similarity among those taxa
    • Is there an analogy for similarity of PFAMs? i.e. can PFAMs be grouped together into a phylogeny or other more continuous measure of variation among PFAMs?
    • You could build a phylogeny for each individual protein family, but these probably can't be combined into a single tree very easily
    • You could assign PFAMs to (e.g.) KEGG and then use the KEGG hierarchical ontology as a hierarchical measure of similarity among PFAMs
  • Basically, binning reads into PFAMs in this way creates a community data set, so you could apply any ecological community analysis to these data
    • e.g. rarefaction or the Chao estimator to predict the total number of PFAMs in a sample (PFAM richness, estimated # of unobserved PFAMs)
    • Any ecological similarity/co-occurrence measure among samples/PFAMs
    • Indicator species analysis (identify PFAMs associated with a particular habitat)
    • Find PFAM markers for different genomes or samples
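As an illustration of treating PFAM counts as a community data set, the sketch below computes the Chao1 richness estimate for one sample. It assumes the standard Chao1 formula (observed richness plus F1^2 / (2 * F2), where F1 and F2 are the numbers of PFAMs seen exactly once and exactly twice, with the bias-corrected form when F2 is zero) and needs raw integer counts rather than normalized abundances. The counts shown are hypothetical.

```python
def chao1(counts):
    """Chao1 estimate of total richness from raw integer counts for one sample."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                        # PFAMs actually seen
    f1 = sum(1 for c in observed if c == 1)      # singletons
    f2 = sum(1 for c in observed if c == 2)      # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here only when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical read counts per PFAM in a single sample (not real data).
sample_counts = [12, 5, 1, 1, 1, 2, 2, 0, 0]

estimate = chao1(sample_counts)
observed_richness = sum(1 for c in sample_counts if c > 0)
print("observed PFAMs:", observed_richness)
print("Chao1 estimate:", estimate)
print("estimated unobserved PFAMs:", estimate - observed_richness)
```

Because the estimate depends entirely on singleton and doubleton counts, it is sensitive to how reads are counted, which is one reason counts from raw reads would be preferable to counts from assemblies.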
  • To get good counts it would be good to repeat with raw reads instead of protein predictions/assemblies
    • Hard to do PFAM binning on raw Illumina reads because of their short length; could develop new scores that would work for fragments. Currently using assembled reads.
  • Anyone interested in discussing further or playing with the data should email Morgan, and we can discuss how to proceed