User:Morgan G. I. Langille/Notebook/Project management

Halophiles

Need to make a list of things to be done for the Roche genome paper.

map pfams from Xingpeng's analysis to GO cellular components to figure out which one is the most represented (are they membrane bound?)
OR pull out proteins that had these pfams and run them through psortb

need to think about ways to identify pfams that have different counts to each other and to whole genomes.
- take pfam counts from all completed genomes, get a distribution, then ask if a single count is normal or not, taking into account multiple test correction
- Do we see pathways that are over/under-represented that are not expected based on:
  - genomes that are predicted from the metagenomics sample by taxon assignment (e.g. megan, amphora, etc.). This lets us know if something is missing/different from the information provided by looking at only taxonomy assignment.
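
A minimal sketch of the "is this count normal" idea above, assuming an empirical two-sided p-value against the per-genome counts and a Benjamini-Hochberg correction across pfams. The choice of test and correction is an assumption (nothing is decided yet), and the function names and toy numbers are hypothetical.

```python
def empirical_p(observed, reference_counts):
    """Two-sided empirical p-value of `observed` against reference counts."""
    n = len(reference_counts)
    ge = sum(1 for c in reference_counts if c >= observed)
    le = sum(1 for c in reference_counts if c <= observed)
    # Add-one smoothing so the p-value is never exactly zero.
    p_high = (ge + 1) / (n + 1)
    p_low = (le + 1) / (n + 1)
    return min(1.0, 2 * min(p_high, p_low))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (hypothetical): counts of one pfam in nine completed genomes,
# and the count seen in the genome of interest.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p = empirical_p(10, reference)
```

One p-value per pfam would go into `benjamini_hochberg`, and only pfams with a small adjusted value would be flagged as unusual.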

Obtain taxon assignments for metagenomics sample
Retrieve taxon id from name
Look up pfam assignments for each taxon (pre-calculated) and multiply by the number of times that taxon was assigned
Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
subtract those pfams from total metagenomic pfam counts
Look at leftover pfams and see what is interesting
Possibly search through pre-computed pfam genomes to find genome with similar pfam composition
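
The steps above could be sketched roughly as follows, assuming the taxon assignments and the pre-calculated per-taxon pfam counts are already loaded into dictionaries. All names here (`taxon_hits`, `pfams_by_taxon`, `scale`) and the toy values are hypothetical, and the single `scale` factor is just one possible way to damp inflated hit counts.

```python
from collections import Counter


def expected_pfam_counts(taxon_hits, pfams_by_taxon, scale=1.0):
    """Expected pfam counts if the sample were only the assigned taxa.

    taxon_hits:     {taxon_id: number of assignments in the sample}
    pfams_by_taxon: {taxon_id: {pfam_id: count in that genome}}
    scale:          factor to shrink hit counts that look too large
    """
    expected = Counter()
    for taxon, hits in taxon_hits.items():
        # Taxa without a pre-calculated genome contribute nothing.
        for pfam, per_genome in pfams_by_taxon.get(taxon, {}).items():
            expected[pfam] += per_genome * hits * scale
    return expected


def leftover_pfams(observed, expected):
    """Observed minus expected count for every pfam seen in either set."""
    pfams = set(observed) | set(expected)
    return {p: observed.get(p, 0) - expected.get(p, 0) for p in pfams}


# Toy example with made-up identifiers and counts.
taxon_hits = {"taxA": 2, "taxB": 1}
pfams_by_taxon = {"taxA": {"P1": 3, "P2": 1}, "taxB": {"P1": 1}}
observed = {"P1": 10, "P2": 2, "P3": 4}

expected = expected_pfam_counts(taxon_hits, pfams_by_taxon)
leftover = leftover_pfams(observed, expected)
```

Large positive leftovers are pfams the taxonomy alone does not explain; negative values are pfams that the assigned genomes predict but the sample lacks.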

Starting with PFAM counts across all GOS samples

Looking at samples
- alpha diversity of GOS samples (measure total protein diversity in each sample)
  - provide a listing of the most diverse samples and indicate if those are environmentally related
- beta diversity of GOS samples (are the samples related...presumably yes)
  - show a tree and possibly a network describing the relatedness of the samples
- estimate total number of different pfams in the ocean by generating a rarefaction curve and using the Chao estimator
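
A rough sketch of the per-sample measures listed above, assuming each sample is just a list of pfam counts. Shannon entropy is used here as the alpha diversity measure and the bias-corrected Chao1 as the Chao estimator; both are assumptions about which variants would be used. The rarefaction value is the analytic expectation rather than a random subsample, and it needs Python 3.8+ for `math.comb`.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) of a list of pfam counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 estimate of the total number of pfams."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    # This form stays defined when there are no doubletons.
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def rarefy(counts, n):
    """Expected number of distinct pfams in a subsample of n proteins."""
    total = sum(counts)
    if n > total:
        raise ValueError("subsample size exceeds total count")
    denom = math.comb(total, n)
    # For each pfam: 1 - P(none of its members are drawn).
    return sum(1 - math.comb(total - c, n) / denom for c in counts if c > 0)


# Toy sample (made-up counts for six pfams).
sample = [5, 1, 1, 2, 0, 3]
curve = [rarefy(sample, n) for n in range(1, sum(sample) + 1)]
```

Plotting `curve` against the subsample size gives the rarefaction curve; if it has not flattened, the Chao1 value is better read as a lower bound.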

Looking at families
- alpha diversity. what fams are the most rich (not that interesting), diverse (interesting and informative)
  - provide list of most diverse families and maybe suggest why those are so diverse?
- beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
  - map to GO terms to see if similar function
- chao index
  - estimate total number of proteins for each family in the ocean (what is the most prevalent)
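
For the beta diversity groupings, one possible starting point is a pairwise dissimilarity matrix. The sketch below assumes Bray-Curtis dissimilarity on raw counts, which is only one of several reasonable measures; the profile names and numbers are made up. Each profile is a family's counts across samples, and transposing the table gives the sample-to-sample version.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(profiles):
    """Pairwise Bray-Curtis dissimilarities as {(name1, name2): value}."""
    names = sorted(profiles)
    return {
        (m, n): bray_curtis(profiles[m], profiles[n])
        for m in names
        for n in names
    }


# Toy profiles: counts of three hypothetical families in two samples.
profiles = {
    "famA": [2, 0],
    "famB": [1, 1],
    "famC": [0, 2],
}
dist = distance_matrix(profiles)
```

The resulting matrix could then be passed to a clustering or tree-building step, and the groups checked against GO terms.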

Collaboration with Steve would be a comparison between diversity measurements using taxon vs phylogenetic vs functional