Moore Notes 7 21 10

General business

Jess attended the Moore Foundation senior investigator meeting - theme was metatranscriptomics. Lots of interest in whether iSEEM has made progress on building phylogenies using metagenomic data.

Pipeline - take all PFAMs at domain level (around 11K HMMs). Use HMMER to compare each PFAM against metagenomic data sets.
- CAMERA metagenomic proteins (43M proteins, around 100 samples)
- BGI Human gut assembled reads (6.5M reads, around 100 samples)

Results in a matrix where rows are PFAMs and columns are samples; each cell is the count of reads assigned to that PFAM in that sample
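A minimal sketch of how such a matrix could be assembled, assuming the HMMER hits have already been reduced to (PFAM, sample) pairs. The function name, PFAM IDs and sample names here are placeholders for illustration, not the actual pipeline code or data.

```python
from collections import Counter

def build_count_matrix(hits):
    """Build a PFAM x sample count matrix from (pfam_id, sample_id) pairs.

    Returns (pfams, samples, matrix), where matrix[i][j] is the number of
    hits to pfams[i] observed in samples[j].
    """
    counts = Counter(hits)
    pfams = sorted({p for p, _ in counts})
    samples = sorted({s for _, s in counts})
    # Counter returns 0 for (pfam, sample) pairs that were never observed.
    matrix = [[counts[(p, s)] for s in samples] for p in pfams]
    return pfams, samples, matrix

# Toy input: one (PFAM, sample) pair per read/protein assigned to a PFAM.
hits = [
    ("PF00001", "sampleA"), ("PF00001", "sampleA"), ("PF00001", "sampleB"),
    ("PF00002", "sampleB"), ("PF00003", "sampleA"),
]
pfams, samples, matrix = build_count_matrix(hits)
for pfam, row in zip(pfams, matrix):
    print(pfam, row)
```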

Measure correlations among PFAMs and samples using Pearson correlations
- Try other measures of similarity?
- Correlation assumes normality and linear relationships, which is probably not the case for these data
- e.g. Bray-Curtis or Sorensen distance among samples/PFAMs
- Can you specify a model that will in turn determine which similarity measure you use?
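To make the comparison of measures concrete, here is a small sketch of Pearson correlation next to Bray-Curtis and Sorensen dissimilarity, written in plain Python. The two count vectors are made up: sample_b has the same composition as sample_a at 10x the depth, which is an assumption chosen to show how differently the measures react.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for constant vectors; 0.0 is a convention here
    return cov / math.sqrt(vx * vy)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total

def sorensen(x, y):
    """Sorensen dissimilarity on presence/absence: 1 - 2*shared / (n_x + n_y)."""
    present_x = {i for i, v in enumerate(x) if v > 0}
    present_y = {i for i, v in enumerate(y) if v > 0}
    if not present_x and not present_y:
        return 0.0
    shared = len(present_x & present_y)
    return 1 - 2 * shared / (len(present_x) + len(present_y))

# Hypothetical PFAM counts for two samples (same composition, 10x depth).
sample_a = [10, 0, 5, 1]
sample_b = [100, 0, 50, 10]

r = pearson(sample_a, sample_b)
bc = bray_curtis(sample_a, sample_b)
so = sorensen(sample_a, sample_b)
print(r, bc, so)
```

In this toy case Pearson and Sorensen both treat the samples as identical, while Bray-Curtis on raw counts is dominated by the depth difference, so counts would probably need to be normalized to relative abundance before using it.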

Heatmaps to visualize PFAM x sample occurrence - red is high abundance, black is low/zero

Cytoscape visualization of clusters of PFAMs with high correlations (>0.9)
- See a few big clusters of abundant PFAMs that co-occur and lots of rare PFAMs that don't co-occur with other PFAMs very much
- Maybe try methods for distances in sparse matrices and/or co-occurrence methods that account for differences in relative abundance
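A rough sketch of the thresholding step behind a network view like this: link PFAMs whose profiles correlate above 0.9, then report connected components as clusters. This is not the Cytoscape workflow itself; the PFAM names and counts are invented, and connected components are just one simple way to define a cluster.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_clusters(profiles, threshold=0.9):
    """Group PFAMs into connected components of the r > threshold graph.

    profiles maps a PFAM name to its list of counts across samples.
    """
    neighbors = {p: set() for p in profiles}
    for p, q in combinations(profiles, 2):
        if pearson(profiles[p], profiles[q]) > threshold:
            neighbors[p].add(q)
            neighbors[q].add(p)

    seen, clusters = set(), []
    for start in profiles:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this PFAM.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(component))
    return clusters

# Hypothetical profiles: two abundant co-occurring PFAMs and two rare ones.
profiles = {
    "PF_A": [10, 20, 30, 40],
    "PF_B": [11, 19, 31, 42],
    "PF_C": [5, 0, 0, 1],
    "PF_D": [0, 0, 3, 0],
}
clusters = correlation_clusters(profiles)
print(clusters)
```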

Visualization of clustering of samples within CAMERA
- See clustering mostly based on study
- Would be useful to combine different data sets (e.g. joint analysis of GOS/BGI data)

Would be interesting to look at how PFAMs are assembled into communities/networks
- Link to community assembly literature - i.e. assembly and co-occurrence of PFAM networks in space, environment
- Should contact Ed Connor to discuss this data set/approach - he is very interested in these types of data

It would be useful to compare similarity among samples measured in numerous different ways
- Differences among protein/functional similarity vs. taxonomy vs. phylogeny vs. etc.

When we look at taxonomic composition of a community we can calculate phylogenetic similarity among those taxa
- Is there an analogy for similarity of PFAMs? i.e. can PFAMs be grouped together into a phylogeny or other more continuous measure of variation among PFAMs?
- You could build a phylogeny for individual protein families, but can't really combine them into a single tree very easily?
- You could assign PFAMs to (e.g.) KEGG and then use the KEGG hierarchical ontology as a hierarchical measure of similarity among PFAMs
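One way the hierarchy idea might be turned into a number, as a sketch only: represent each PFAM by its path from the root of the ontology and score two PFAMs by the fraction of the path they share. The PFAM names and category assignments below are placeholders and not real PFAM-to-KEGG mappings.

```python
def shared_depth(path_a, path_b):
    """Number of leading hierarchy levels that two paths have in common."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

def hierarchy_similarity(path_a, path_b):
    """Shared prefix length divided by the longer path length (0.0 to 1.0)."""
    longest = max(len(path_a), len(path_b))
    if longest == 0:
        return 0.0
    return shared_depth(path_a, path_b) / longest

# Hypothetical assignments of PFAMs to a three-level ontology.
paths = {
    "pfam_x": ("Metabolism", "Carbohydrate metabolism", "Glycolysis"),
    "pfam_y": ("Metabolism", "Carbohydrate metabolism", "Citrate cycle"),
    "pfam_z": ("Genetic information processing", "Translation", "Ribosome"),
}

sim_xy = hierarchy_similarity(paths["pfam_x"], paths["pfam_y"])
sim_xz = hierarchy_similarity(paths["pfam_x"], paths["pfam_z"])
print(sim_xy, sim_xz)
```

A PFAM mapped to several categories would need some rule (e.g. take the maximum over all path pairs), which this sketch does not handle.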

Basically, binning reads into PFAMs in this way creates a community data set, so you could apply any ecological community analysis to these data
- e.g. rarefaction or Chao estimator to predict total number of PFAMs in a sample (PFAM richness, estimated # of unobserved PFAMs)
- Any ecological similarity/co-occurrence measure among samples/PFAMs
- Indicator species analysis (identify PFAMs associated with a particular habitat)
- Find PFAM markers for different genomes or samples
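As an illustration of the richness ideas above, a short sketch of the bias-corrected Chao1 estimator and of analytic rarefaction (expected number of PFAMs seen in a random subsample of n reads). The count vector is invented, and treating each read as an "individual" and each PFAM as a "species" is the assumption being made.

```python
from math import comb

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1)).

    F1 and F2 are the numbers of PFAMs seen exactly once and exactly twice.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def rarefied_richness(counts, n):
    """Expected number of PFAMs in a subsample of n reads drawn without
    replacement: sum over PFAMs of 1 - C(N - N_i, n) / C(N, n)."""
    total = sum(counts)
    if n > total:
        raise ValueError("subsample size exceeds total number of reads")
    denom = comb(total, n)
    # comb(k, n) is 0 when n > k, so abundant PFAMs contribute exactly 1.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)

# Hypothetical read counts per PFAM in one sample.
counts = [10, 5, 1, 1, 1, 2, 2, 0]

estimate = chao1(counts)          # 7 observed, F1 = 3, F2 = 2
full = rarefied_richness(counts, sum(counts))
half = rarefied_richness(counts, 11)
print(estimate, full, half)
```

Rarefying every sample to a common n before comparing PFAM richness is one way to deal with the very different sequencing depths across studies.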

To get good counts it would be good to repeat with raw reads instead of protein predictions/assemblies
- Hard to do PFAM binning on raw Illumina reads due to short length; could develop new scores that would work for fragments. Currently using assembled reads.

Anyone interested in discussing further or playing with the data should email Morgan, and we can discuss how to proceed