Moore Notes 10 14 09

Discussion of progress report sections

MicrobeDB: a database of information on organisms and their genomes. It includes genome characteristics plus the full genome sequence, a table of genes, etc. The data gets downloaded monthly and each download is assigned a version number. It is running on edhar. There's a wiki page with more info, and there is a Perl API.
Developed BioTorrents for sharing scientific data over BitTorrent.
Community profiling. Look for genes with similar profiles (abundance/presence) across different environmental samples. Find Pfam families in the GOS data and cluster them. Annotate hypothetical genes into these families. Look at the similarity of GO terms within clusters using G-SESAME, compared with random expectation.
- a.k.a. Functional profiling? Could look at functional similarity of communities?
- Aim is to annotate genes of unknown function in metagenomic data sets.
- Discussed synergy between this project and others, e.g. Tom is also looking at functional genes.
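
Sketch of the profile clustering idea above, for discussion only. The notes don't say which similarity measure or clustering method is used, so this assumes Pearson correlation between abundance profiles and simple single-linkage grouping above a cutoff. The family names, counts, and the `min_r` cutoff are all made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / sqrt(vx * vy)

def cluster_profiles(profiles, min_r=0.9):
    """Group families whose profiles correlate at >= min_r (single linkage).

    profiles: dict mapping family name -> list of abundances, one per sample.
    min_r is an assumed cutoff, not a value from the notes.
    """
    names = sorted(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_r:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

# Toy abundances across four samples (hypothetical families and counts).
toy_profiles = {
    "famA": [10, 0, 5, 1],
    "famB": [20, 1, 11, 2],
    "famC": [0, 8, 1, 9],
    "hyp1": [1, 15, 2, 18],  # stands in for a gene of unknown function
}
toy_clusters = cluster_profiles(toy_profiles)
```

In this toy case the unknown gene "hyp1" lands in the same group as "famC", which is the kind of co-occurrence that would suggest an annotation to follow up on.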

Aim is to classify metagenomic reads into protein families that have known function.
Use MicrobeDB genome data to identify distant and global homologs of proteins in sequenced genomes and group them into superfamilies. Find metagenomic sequences belonging to these superfamilies and align them to the reference sequences. Build a phylogeny from this alignment.
Can we find novel subfamilies within these phylogenetic trees?
Which families play a role in community assembly?
- Compare the phylogenetic distribution of a gene family across communities: cluster communities based on phylogenetic relatedness, compare with the taxonomic clustering of communities, and use this to test for convergence.
Tom will share his NSF proposal discussing this.
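
One possible way to do the phylogenetic vs. taxonomic comparison above is a Mantel-style permutation test on two community distance matrices. This is an assumption on our part, since the method was not settled in the discussion. The sketch below takes two symmetric distance matrices over the same communities (hypothetical toy values here) and asks whether their correlation is higher than expected when community labels are shuffled.

```python
import random
from math import sqrt

def upper_triangle(matrix):
    """Flatten the above-diagonal entries of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def correlation(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value).

    Rows and columns of dist_b are permuted together, which shuffles
    community labels while keeping the matrix a valid distance matrix.
    """
    rng = random.Random(seed)
    n = len(dist_a)
    flat_a = upper_triangle(dist_a)
    observed = correlation(flat_a, upper_triangle(dist_b))
    at_least_as_high = 0
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        shuffled = [[dist_b[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if correlation(flat_a, upper_triangle(shuffled)) >= observed:
            at_least_as_high += 1
    # Add one to both counts so the observed arrangement is included.
    return observed, (at_least_as_high + 1) / (permutations + 1)

# Hypothetical distances between four communities.
phylo_dist = [[0.0, 0.2, 0.7, 0.8],
              [0.2, 0.0, 0.6, 0.9],
              [0.7, 0.6, 0.0, 0.3],
              [0.8, 0.9, 0.3, 0.0]]
taxo_dist = [[0.0, 0.1, 0.8, 0.9],
             [0.1, 0.0, 0.7, 0.8],
             [0.8, 0.7, 0.0, 0.2],
             [0.9, 0.8, 0.2, 0.0]]
r, p = mantel(phylo_dist, taxo_dist)
```

Communities that group together by gene-family phylogeny but not by taxonomy would show up as a weak correlation here, which is the pattern of interest when testing for convergence. With only four communities the permutation test has very little power, so real use would need many more samples.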

We'll finish the discussion of the progress report next time.