User:Morgan G. I. Langille/Notebook/Project management

From OpenWetWare


Need to make a list of things to be done for the Roche genome paper.

  1. Outline paper
  2. Organize files for Roche genomes and newly completed NCBI halophile genomes
  3. Redo CRISPR analysis for new NCBI genomes.
  4. Look at homologs of genes identified in new Science metabolic paper.


  • map pfams from Xingpeng's analysis to GO cellular component terms to figure out which component is the most represented (are they membrane bound?)
  • OR pull out the proteins that had these pfams and run them through PSORTb


  • need to think about ways to identify pfams whose counts differ from each other and from counts in whole genomes.
    • take pfam counts from all completed genomes, get a distribution, then ask whether a single count is normal or not, taking into account multiple test correction
    • Do we see pathways that are over/under-represented that are not expected based on:
      • genomes that are predicted from the metagenomics sample by taxon assignment (e.g. MEGAN, AMPHORA, etc.). This tells us whether something is missing/different compared with what taxonomy assignment alone would suggest.
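
Rough sketch of the "is a single count normal" idea above. Assumptions: the per-pfam counts from completed genomes are already loaded into plain lists, the ids and numbers in the usage lines are made up, and the z-score (normal approximation) plus Benjamini-Hochberg correction are only one possible choice. Count data may well need an empirical or Poisson-style test instead.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def two_sided_p(count, reference_counts):
    """Two-sided p-value for one count against a reference distribution.

    Uses a normal approximation (z-score), which is an assumption of this
    sketch and not something the notes prescribe.
    """
    mu = mean(reference_counts)
    sd = stdev(reference_counts)
    if sd == 0:
        # Degenerate reference: every genome has the same count.
        return 1.0 if count == mu else 0.0
    z = (count - mu) / sd
    return erfc(abs(z) / sqrt(2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: counts of two pfams across five completed genomes,
# and the counts seen in the genome or sample being tested.
reference = {"PF_A": [10, 12, 11, 9, 13], "PF_B": [5, 6, 5, 4, 5]}
observed = {"PF_A": 11, "PF_B": 40}

pfams = sorted(observed)
raw = [two_sided_p(observed[p], reference[p]) for p in pfams]
adjusted = dict(zip(pfams, benjamini_hochberg(raw)))
unusual = [p for p in pfams if adjusted[p] < 0.05]
```

With these made-up numbers only PF_B ends up in `unusual`; with thousands of pfams the correction step matters much more than it does here.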

Pfam Subtraction Pipeline

  1. Obtain taxon assignments for metagenomics sample
  2. Retrieve taxon id from name
  3. Look up pfam assignments for each taxon (pre-calculated) and multiply by the number of hits assigned to that taxon
  4. Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
  5. Subtract those pfams from total metagenomic pfam counts
  6. Look at leftover pfams and see what is interesting
  7. Possibly search through pre-computed genome pfam profiles to find a genome with similar pfam composition
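
Minimal sketch of steps 3 to 6, assuming the taxon assignments and the pre-calculated per-genome pfam profiles are already in dicts. The function names, the `scale` parameter (a stand-in for the unspecified scaling in step 4), and the taxon and pfam ids are all hypothetical.

```python
from collections import Counter


def expected_pfam_counts(taxon_hits, pfams_by_taxon, scale=1.0):
    """Pfam counts predicted from taxon assignments alone (steps 3 and 4).

    taxon_hits:     {taxon_id: number of hits assigned to that taxon}
    pfams_by_taxon: {taxon_id: {pfam_id: count in that genome}}
    scale:          hypothetical factor to shrink inflated hit counts
    """
    expected = Counter()
    for taxon_id, n_hits in taxon_hits.items():
        profile = pfams_by_taxon.get(taxon_id)
        if profile is None:
            continue  # no pre-calculated profile for this taxon
        for pfam, per_genome in profile.items():
            expected[pfam] += per_genome * n_hits * scale
    return expected


def subtract_pfams(metagenome_counts, expected):
    """Observed minus expected counts (step 5); zero differences are dropped."""
    leftover = {}
    for pfam in set(metagenome_counts) | set(expected):
        diff = metagenome_counts.get(pfam, 0) - expected.get(pfam, 0)
        if diff != 0:
            leftover[pfam] = diff
    return leftover


# Made-up example: two taxa, three pfams.
taxon_hits = {1: 2, 2: 1}
pfams_by_taxon = {1: {"PF_A": 3, "PF_B": 1}, 2: {"PF_A": 1}}
metagenome = {"PF_A": 10, "PF_B": 2, "PF_C": 4}

expected = expected_pfam_counts(taxon_hits, pfams_by_taxon)
leftover = subtract_pfams(metagenome, expected)
# Step 6: largest absolute leftovers first.
ranked = sorted(leftover.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Positive leftovers are pfams the taxonomy does not account for; negative ones are pfams the taxonomy predicts but the sample lacks, which may also just mean the scaling in step 4 is off.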

Protein family stuff with Steve

  • chat with Steve on Monday (ask about rarefaction curves)

Rough Ideas

Starting with PFAM counts across all GOS samples

  • Looking at samples
    • alpha diversity of GOS samples (measure total protein diversity in each sample)
      • provide a listing of the most diverse samples and indicate whether those are environmentally related
    • beta diversity of GOS samples (are the samples related...presumably yes)
      • show a tree and possibly a network describing the relatedness of the samples
    • estimate total number of different pfams in the ocean by generating a rarefaction curve and using the Chao estimator
  • Looking at families
    • alpha diversity: which families are the richest (not that interesting) and which are the most diverse (interesting and informative)
      • provide list of most diverse families and maybe suggest why those are so diverse?
    • beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
      • map to GO terms to see if similar function
    • chao index
      • estimate total number of proteins for each family in the ocean (what is the most prevalent)
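
Quick sketch of the alpha diversity, rarefaction, and Chao pieces above, assuming one sample is just a list of per-pfam counts. The Shannon index is used here as the diversity measure only as an example (the notes do not pick one), the Chao estimator shown is Chao1, and the counts are made up.

```python
import random
from math import log


def richness(counts):
    """Number of distinct pfams observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity index (natural log) of a count vector."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Chao1 estimate of total richness from singletons and doubletons."""
    s_obs = richness(counts)
    f1 = sum(1 for c in counts if c == 1)  # pfams seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # pfams seen exactly twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # No doubletons: fall back to the bias-corrected form.
    return s_obs + f1 * (f1 - 1) / 2


def rarefy(counts, depth, rng):
    """Distinct pfams seen in one random subsample of `depth` hits.

    Samples without replacement, so depth must not exceed sum(counts).
    """
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    return len(set(rng.sample(pool, depth)))


def rarefaction_curve(counts, depths, repeats=20, seed=0):
    """Mean observed richness at each depth, averaged over `repeats` draws."""
    rng = random.Random(seed)
    return [
        sum(rarefy(counts, d, rng) for _ in range(repeats)) / repeats
        for d in depths
    ]


# Made-up pfam counts for one sample.
sample = [5, 3, 1, 1, 2, 0]
curve = rarefaction_curve(sample, depths=[2, 6, 12])
estimated_total = chao1(sample)
```

The same functions work for the family view by swapping what the counts mean: counts of distinct proteins within one family across samples, instead of counts of pfams within one sample.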

Collaboration with Steve would be a comparison between diversity measurements based on taxonomic vs phylogenetic vs functional data