Moore Notes 4 27 15

  • Participants
    • Jonathan, Stacia, Stephen, Patrick, Josh, Ladan, Guillaume, Dongying
  • Jonathan: went to the JGI 5-year planning meeting. Someone from NSERC wrote to ask whether we had any uses for their compute time.
    • Task: think about how we might use that time. Jonathan will inquire to learn more about the details.
  • Stacia's presentation.
    • Fair warning: there are some areas where there may be a bug or misfiltered data.
    • This work is part of a subcontract that Katie has with EFI, which has the goal of taking unknown, unannotated protein families and using crystal structures together with in vivo and in vitro experiments to characterize them.
    • Our goal is to characterize protein families that are unannotated, abundant in metagenomes, and conserved across a broad phylogenetic range.
    • Working with unannotated SFams, profiling in HMP samples.
      • FunkSFams: start with all SFams, remove those with PFamA annotations, remove those with fewer than 3 members, and remove Type I and Type II families (over- and under-partitioned families). This leaves a total of 45,611 families.
      • To ascertain phylogenetic diversity (PD), calculated the LCA from the set of taxon strings of the proteins in each family to estimate its "span". The span is not weighted for the abundance of members at each taxonomic rank. Focused on span 5, 6, and 7 families because they correspond to families whose members come from different orders, phyla, or domains.
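The filtering and LCA steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline Stacia ran: the record fields (`has_pfam_a`, `partition_flag`, `taxa`), the semicolon-delimited taxon strings, and the seven-rank list are all assumptions made for the example. The sketch reports the rank at which a family's members first diverge rather than a numeric span score, because the notes do not define how span values are numbered.

```python
# Hypothetical rank order; the real taxon strings may use a different set of ranks.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def filter_families(families):
    """Keep families with no PFamA annotation, at least 3 members,
    and no over-/under-partitioning flag (all field names are assumed)."""
    kept = {}
    for fam_id, fam in families.items():
        if fam["has_pfam_a"]:
            continue
        if len(fam["taxa"]) < 3:
            continue
        if fam["partition_flag"] in ("type_I", "type_II"):
            continue
        kept[fam_id] = fam
    return kept


def lca_depth(taxon_strings, sep=";"):
    """Number of leading ranks shared by every member (0 means only the root is shared).
    Unweighted: each member counts once, regardless of abundance."""
    lineages = [[name.strip() for name in s.split(sep)] for s in taxon_strings]
    depth = 0
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break
        depth += 1
    return depth


def divergence_rank(taxon_strings):
    """Rank at which the members first differ, or None if they never differ."""
    depth = lca_depth(taxon_strings)
    return RANKS[depth] if depth < len(RANKS) else None


# Toy data, invented for illustration only.
families = {
    "fam_a": {"has_pfam_a": False, "partition_flag": None,
              "taxa": ["Bacteria;Firmicutes;Bacilli",
                       "Bacteria;Proteobacteria;Gammaproteobacteria",
                       "Bacteria;Firmicutes;Clostridia"]},
    "fam_b": {"has_pfam_a": True, "partition_flag": None,
              "taxa": ["Bacteria;Firmicutes;Bacilli"] * 4},
    "fam_c": {"has_pfam_a": False, "partition_flag": None,
              "taxa": ["Bacteria;Firmicutes;Bacilli"] * 2},
    "fam_d": {"has_pfam_a": False, "partition_flag": "type_I",
              "taxa": ["Bacteria;Firmicutes;Bacilli"] * 5},
}

kept = filter_families(families)
ranks = {fam_id: divergence_rank(fam["taxa"]) for fam_id, fam in kept.items()}
```

In the toy data only `fam_a` survives the filters, and its members share a domain but differ at the phylum level.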
  • Annotated SFams with PFamB via md5nr using Diamond. Mapped SFam to md5nr to UniProt ID, then to SwissProt/TrEMBL. As a result, families are annotated in a large number of database spaces.
    • Used the protein with the greatest number of annotated fields to annotate the family.
    • One challenge is that a family that is poorly annotated in one database space may be defined in another.
      • Some databases may be more trustworthy than others.
    • Might need to tune stringency. Preprocessing to compile information on annotations across proteins in a family.
      • Might need to verify some of the mappings across databases.
    • What about contacting EFI for guidance on which annotations are trustworthy?
    • Could consider annotations across a family to assess annotation confidence.
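As a rough sketch of the two ideas discussed above (annotating a family from its most richly annotated protein, and using agreement across members as a confidence signal), the following assumes each protein is a simple dictionary of annotation fields. The field names and values are made up for illustration, and tie-breaking by protein ID is an arbitrary choice, not something decided in the meeting.

```python
from collections import Counter


def n_filled(record):
    """Count annotation fields that are non-empty for one protein."""
    return sum(1 for value in record.values() if value)


def representative(proteins):
    """Protein ID with the most populated annotation fields.
    Ties go to the alphabetically first ID (arbitrary choice)."""
    return max(sorted(proteins), key=lambda pid: n_filled(proteins[pid]))


def annotation_agreement(proteins, field):
    """Most common non-empty value for a field, and the fraction of
    annotated members that carry it (a crude confidence score)."""
    values = [rec[field] for rec in proteins.values() if rec.get(field)]
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Toy family with placeholder annotation values.
proteins = {
    "p1": {"pfam_b": "PB_x", "swissprot": "", "go": ""},
    "p2": {"pfam_b": "PB_x", "swissprot": "putative kinase", "go": "GO_term_1"},
    "p3": {"pfam_b": "PB_y", "swissprot": "", "go": ""},
}

rep = representative(proteins)
top_value, fraction = annotation_agreement(proteins, "pfam_b")
```

Here `p2` is chosen as the representative, while the agreement score shows that only two of the three members support its PFamB assignment, which is the kind of disagreement that might justify tuning stringency.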
  • Compared FunkSFams to the HMP data. 3620 of the FunkSFams had at least 1270 reads mapped to them. Calculated RPKG for all families in all samples.
    • Note that the RPKG formula used here should use gene length in kilobases.
    • The prevalence of these families (the number of samples in which each was observed) was plotted. Roughly 1/3 of the families are found in only 1 sample.
    • Were any families found in no samples?
    • What about correcting for library depth or classification rate across samples?
    • Concerned that some annotated families may have slipped into this analysis.
    • Clustering based on scaled RPKG shows that these families tend to group by body site.
    • PCA plot shows that these families separate by body site.
    • Might try using IGC_tools/MetaQuery to assess distributions across a larger number of samples.
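To make the unit point about RPKG concrete, here is a minimal sketch of the calculation (reads per kilobase of gene per genome equivalent) and of prevalence as a count of samples. The function names, the use of a single representative length per family, and the detection threshold are assumptions for the example; how genome equivalents were estimated for the HMP samples is not stated in these notes.

```python
def rpkg(mapped_reads, gene_length_bp, genome_equivalents):
    """Reads per kilobase of gene per genome equivalent.
    The length is converted from base pairs to kilobases before dividing."""
    if gene_length_bp <= 0 or genome_equivalents <= 0:
        raise ValueError("length and genome equivalents must be positive")
    gene_length_kb = gene_length_bp / 1000.0
    return mapped_reads / gene_length_kb / genome_equivalents


def prevalence(rpkg_by_sample, threshold=0.0):
    """Number of samples in which a family's RPKG exceeds the threshold."""
    return sum(1 for value in rpkg_by_sample.values() if value > threshold)


# Toy numbers, invented for illustration.
example = rpkg(mapped_reads=300, gene_length_bp=1500, genome_equivalents=100)

family_profile = {
    "sample_1": rpkg(300, 1500, 100),
    "sample_2": rpkg(0, 1500, 80),
    "sample_3": rpkg(12, 1500, 40),
}
n_samples_observed = prevalence(family_profile)
```

Passing the length in base pairs without the division by 1000 would shrink every value by a factor of 1000, which is the error the note above warns about.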
  • Patrick is looking at associations between phenotypes and FunkSFams and is including nuisance variables (e.g., geographic location). Some families do appear to vary by diet across body sites (though the results are still early).
    • Question raised about whether this tells us something interesting about unusual sites or unusual functions. There could be many explanations for an overrepresentation of a FunkSFam at a particular site. A similar analysis of known functions will be critical to provide the contrast needed to identify those FunkSFams that may be especially interesting to explore.
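One simple way to account for a categorical nuisance variable such as geographic location is to compare phenotype groups within each stratum and then combine the per-stratum differences. The sketch below does this for a single family's abundance. It is a stand-in technique chosen for illustration and is not necessarily the model Patrick is fitting; the record keys (`stratum`, `phenotype`, `value`) and group labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratified_mean_difference(records, case, control):
    """Sample-size-weighted average of (case mean - control mean),
    computed separately within each nuisance stratum.
    Strata lacking either group are skipped."""
    by_stratum = defaultdict(lambda: {case: [], control: []})
    for rec in records:
        if rec["phenotype"] in (case, control):
            by_stratum[rec["stratum"]][rec["phenotype"]].append(rec["value"])

    weighted_sum = 0.0
    total_weight = 0
    for groups in by_stratum.values():
        if groups[case] and groups[control]:
            diff = mean(groups[case]) - mean(groups[control])
            weight = len(groups[case]) + len(groups[control])
            weighted_sum += diff * weight
            total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Toy abundance values for one family, invented for illustration.
records = [
    {"stratum": "site_1", "phenotype": "case", "value": 4.0},
    {"stratum": "site_1", "phenotype": "case", "value": 6.0},
    {"stratum": "site_1", "phenotype": "control", "value": 1.0},
    {"stratum": "site_1", "phenotype": "control", "value": 3.0},
    {"stratum": "site_2", "phenotype": "case", "value": 10.0},
    {"stratum": "site_2", "phenotype": "control", "value": 9.0},
    {"stratum": "site_3", "phenotype": "case", "value": 5.0},
]

adjusted_difference = stratified_mean_difference(records, "case", "control")
```

Running the same comparison on annotated families would give the baseline contrast mentioned above for judging whether a FunkSFam's difference is unusual.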