User:Morgan G. I. Langille/Notebook/Project management

Halophiles
Need to make a list of things to be done for the Roche genome paper.
 * 1) Outline paper
 * 2) Organize files for Roche genomes and new NCBI completed halophile genomes
 * 3) Re-do CRISPR analysis for new NCBI genomes.
 * 4) Look at homologs of genes identified in new Science metabolic paper.

Darpa

 * map pfams from Xingpeng's analysis to GO cellular components to figure out which one is the most represented (are they membrane bound?)
 * OR pull out proteins that had these pfams and run them through psortb

Erebus

 * need to think about ways to identify pfams whose counts differ from each other and from whole genomes.
 * take pfam counts from all completed genomes, get a distribution, then ask if a single count is normal or not, taking into account multiple-testing correction
 * Do we see pathways that are over/under-represented that are not expected based on:
 * genomes that are predicted from the metagenomics sample by taxon assignment (e.g. MEGAN, AMPHORA, etc.). This lets us know if something is missing/different from the information provided by looking at only taxonomy assignment.
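
A rough sketch of the "is a single count normal or not" test above. It assumes an empirical two-sided p-value against the per-genome counts and Benjamini-Hochberg as the multiple-testing correction; neither choice is fixed in these notes. The function names and the toy counts are invented for illustration.

```python
def empirical_p(observed, reference_counts):
    """Two-sided empirical p-value of `observed` against reference counts."""
    n = len(reference_counts)
    ge = sum(1 for c in reference_counts if c >= observed)
    le = sum(1 for c in reference_counts if c <= observed)
    # Add-one smoothing keeps the p-value away from exactly zero.
    p_one_sided = (min(ge, le) + 1) / (n + 1)
    return min(1.0, 2 * p_one_sided)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (invented): pfam -> counts across completed genomes.
genome_counts = {
    "PF00005": [20, 25, 22, 30, 18, 27, 24, 21],
    "PF00072": [5, 7, 6, 4, 8, 6, 5, 7],
}
# Toy counts for the single genome or sample being tested.
observed = {"PF00005": 23, "PF00072": 40}

pfams = sorted(observed)
raw = [empirical_p(observed[p], genome_counts[p]) for p in pfams]
adjusted = dict(zip(pfams, benjamini_hochberg(raw)))
```

With n reference genomes the smallest attainable p-value here is 2/(n+1), so this only becomes useful with a large set of completed genomes.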

Pfam Subtraction Pipeline

 * 1) Obtain taxon assignments for metagenomics sample
 * 2) Retrieve taxon id from name
 * 3) Look up pfam assignments for each taxon (pre-calculated) and multiply by the number of hits to that taxon
 * 4) Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
 * 5) subtract those pfams from total metagenomic pfam counts
 * 6) Look at leftover pfams and see what is interesting
 * 7) Possibly search through pre-computed pfam genomes to find genome with similar pfam composition
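
Steps 3 to 6 above could be sketched roughly as follows. This assumes the taxon names have already been resolved to ids (step 2) and that the per-taxon pfam profiles are available as plain dictionaries. The function names, the `scale` parameter, the taxon ids and all counts are hypothetical placeholders.

```python
from collections import Counter


def expected_pfam_counts(taxon_hits, pfams_by_taxon, scale=1.0):
    """Steps 3-4: sum each taxon's pfam profile, weighted by its hit count.

    `scale` is a hypothetical knob for shrinking inflated hit counts,
    e.g. when every protein is counted as a separate taxon hit.
    """
    expected = Counter()
    for taxon_id, n_hits in taxon_hits.items():
        profile = pfams_by_taxon.get(taxon_id, {})
        for pfam, count in profile.items():
            expected[pfam] += count * n_hits * scale
    return expected


def subtract_pfams(observed, expected):
    """Step 5: observed minus expected per pfam (negative values are kept)."""
    pfams = set(observed) | set(expected)
    return {p: observed.get(p, 0) - expected.get(p, 0) for p in pfams}


# Toy inputs (all invented for illustration).
pfams_by_taxon = {
    "taxA": {"PF00005": 10, "PF00072": 4},
    "taxB": {"PF00005": 2},
}
taxon_hits = {"taxA": 2, "taxB": 1}
metagenome = {"PF00005": 30, "PF00072": 5, "PF01036": 7}

expected = expected_pfam_counts(taxon_hits, pfams_by_taxon)
leftover = subtract_pfams(metagenome, expected)

# Step 6: rank leftovers by how far they deviate from the taxonomy-based guess.
ranked = sorted(leftover.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Keeping negative leftovers (rather than clipping at zero) shows pfams that the taxonomy predicts but the sample lacks, which is the "missing" case mentioned under Erebus.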

Protein family stuff with Steve

 * chat with Steve on Monday (ask about rarefaction curves)

Rough Ideas
Starting with PFAM counts across all GOS samples
 * Looking at samples
 * alpha diversity of GOS samples (measure total protein diversity in each sample)
 * provide a listing of the most diverse samples and indicate if those are environmentally related
 * beta diversity of GOS samples (are the samples related...presumably yes)
 * show a tree and possibly a network describing the relatedness of the samples
 * estimate total number of different pfams in the ocean by generating a rarefaction curve and using the Chao estimator
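
A minimal sketch of the per-sample measurements listed above, treating each pfam as a "species" and each protein hit as an "individual" (that mapping is an assumption). Shannon entropy stands in for alpha diversity, and the bias-corrected Chao1 form is used for the richness estimate. Function names and counts are illustrative only.

```python
import math
import random


def shannon(counts):
    """Shannon diversity (natural log) of a list of pfam counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # pfams seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # pfams seen exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def rarefaction_curve(counts, depths, seed=0):
    """Number of distinct pfams seen in one random subsample per depth."""
    rng = random.Random(seed)
    # Expand counts into one entry per hit, labelled by pfam index.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    return [len(set(rng.sample(pool, d))) for d in depths]


# Toy pfam counts for one sample (invented).
sample_counts = [5, 3, 1, 1, 1, 2]
diversity = shannon(sample_counts)
richness_estimate = chao1(sample_counts)
curve = rarefaction_curve(sample_counts, depths=[1, 4, 8, 13])
```

Each depth must not exceed the total number of hits in the sample. A real curve would average many subsamples per depth; this draws one for brevity.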


 * Looking at families
 * alpha diversity: which families are the richest (not that interesting) and which are the most diverse (interesting and informative)
 * provide list of most diverse families and maybe suggest why those are so diverse?
 * beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
 * map to GO terms to see if similar function
 * chao index
 * estimate total number of proteins for each family in the ocean (what is the most prevalent)
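
For the beta-diversity items in the two lists above, one possible starting point is a pairwise Bray-Curtis dissimilarity matrix. The choice of Bray-Curtis is an assumption, not something decided in these notes, and the sample names and counts below are invented. The same function works for families if the dictionaries are keyed by sample instead of by pfam.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count dictionaries."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    return 1 - 2 * shared / total


def distance_matrix(profiles):
    """Return sorted names and the full pairwise dissimilarity matrix."""
    names = sorted(profiles)
    matrix = [
        [bray_curtis(profiles[x], profiles[y]) for y in names]
        for x in names
    ]
    return names, matrix


# Toy pfam counts per sample (invented).
samples = {
    "sample_1": {"PF00005": 2, "PF00072": 3},
    "sample_2": {"PF00005": 1, "PF01036": 4},
    "sample_3": {"PF00005": 2, "PF00072": 3},
}
names, matrix = distance_matrix(samples)
```

The resulting matrix could then be handed to a hierarchical clustering or network tool to draw the tree or network of samples.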

Collaboration with Steve would be a comparison between diversity measurements using taxonomic vs phylogenetic vs functional approaches