Dec09 Kembel Notes

Aaron Darling

Illumina GAII can generate huge quantities of data cheaply, but the reads are short (80 bp)
We want to place these reads on a reference phylogeny, e.g. with pplacer
Deep sequencing means multiple reads can come from the same organism
We may want to group together these similar sequences from the same/similar organisms and place them as a group onto the reference phylogeny
Model linkage - a model of partitioning all reads into sets that are linked (can be thought of as coming from the same organism). Place each set of linked reads onto the reference phylogeny as a single leaf.
Evaluate the likelihood of different ways of partitioning the set of all reads into linkage groups. Model linkage group size as coming from a gamma distribution, estimate the parameters of the distribution by likelihood, and figure out the most likely way to combine reads into linkage groups.
Overall, this approach lets reads in the same linkage group share phylogenetic information, which can strengthen the phylogenetic signal and improve placement of the reads on the reference phylogeny
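
The group-size part of the linkage model could be sketched roughly as follows. This is only an illustration, since the approach is not yet implemented: the function names are hypothetical, the continuous gamma density is applied directly to integer group sizes, and the phylogenetic placement likelihood of each group (which a full model would also need) is left out. For a fixed shape, the maximum-likelihood scale of a gamma distribution is mean/shape, so the sketch profiles over a small grid of candidate shapes.

```python
import math


def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def partition_size_loglik(partition, shape, scale):
    """Sum of gamma log densities over the sizes of the linkage groups.

    `partition` is a list of groups, each group a list of read IDs.
    Assumption: only group sizes are scored here, not read placements.
    """
    return sum(gamma_logpdf(len(group), shape, scale) for group in partition)


def fit_gamma_profile(sizes, shapes):
    """Pick the best shape from a candidate grid (hypothetical helper).

    For each candidate shape the scale is set to its maximum-likelihood
    value, mean(sizes) / shape. Returns (loglik, shape, scale).
    """
    mean = sum(sizes) / len(sizes)
    best = None
    for shape in shapes:
        scale = mean / shape
        loglik = sum(gamma_logpdf(s, shape, scale) for s in sizes)
        if best is None or loglik > best[0]:
            best = (loglik, shape, scale)
    return best


# Two candidate ways of grouping the same five reads (made-up IDs).
partition_a = [["r1", "r2", "r3"], ["r4", "r5"]]
partition_b = [["r1"], ["r2"], ["r3"], ["r4"], ["r5"]]

# Compare the two partitions under one assumed set of parameters.
score_a = partition_size_loglik(partition_a, shape=2.0, scale=1.5)
score_b = partition_size_loglik(partition_b, shape=2.0, scale=1.5)
```

Comparing score_a and score_b shows which grouping the size model prefers under those parameters; a real search would have to explore many partitions and combine this term with the placement likelihood, which is where the compute cost comes from.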
Discussion points
- Not yet implemented
- Will require massive compute power - GSoC project using graphics card parallel computing may help
- What about reads with no similarity to known genomes? Can we use paired ends?
- Difference in rates of evolution
- Lateral gene transfer
- What does a linkage group mean - is it a 'species' or 'OTU'?
- Other sources of linkage information available?
- Will any of this matter with PacBio sequencing technology?

Should we set up a MySQL database that everyone can use to store, share and subset data?
This will make it easier to make sure that people are working from the same version of data
We will encourage people to provide metadata for all the data in the database
- Look into EML and metadata standards from NCEAS (ecoinformatics.org)
Especially important for sharing code - we are probably reinventing the wheel
- Let's share scripts on iSEEM SVN - currently they mostly live on genbeo.
- Document script names and descriptions on the wiki here.

IT issues
- Guillaume will be point man for issues related to genbeo/cluster use
Data / databases
Scripts / tools / software
Task tracking / management / jobs / working
- Ask Slashdot what to use?
CAMERA
- Guillaume will be point man for CAMERA interactions
Developing
Wiki
- We need to update the public wiki page on OpenWetWare - someone (Guillaume?) fix this, contact OWW
- Fine to start migrating content from private -> public
- We all need to update the wiki regularly with a summary of what we've been doing, with enough detail that others can figure out what each of us has been working on, so we can eliminate duplication of effort
CiteULike
- Please share papers you're reading on CiteULike - either in your own library or in the iseem group
  - There's a bookmarklet and a way to post stuff from Google Reader to CiteULike
Communication system
- Email list? Online forum? Gmail group?
Software packages and locales
- Let's set up a shared /bin directory on genbeo

Josh
- OTUs - from rRNA - from ocean PCR and metagenomic data
- OPFs from metagenomic data
James
- OTUs filtered by narrower taxonomic range (e.g. families)
- OTUs from both metagenomic and additional ocean PCR data
Sam
- Protein families
  - alignments, trees from metagenomic data for those families
- Simulations
  - Generating and analyzing trees from simulated data
Tom
- All vs. all clustering of protein sequences from whole genome data
  - Build the families
Dongying
- Merge families from different phyla
  - MySQL database of relationships between taxonomy and sequences (map JGI <-> MicrobeDB?)
- Install HMMER3 in centralized location
  - Get consensus sequences for HMM profiles to group HMMs/families into clusters
Morgan
- All vs. all BLASTs
- Access to many metagenomic data sets with metadata (grab from CAMERA)?
Steve
- Alignments of metagenomic reads to more gene families (using AMPHORA?)
  - For GOS data set
  - Additional marker gene families from Dongying
  - Additional markers for different taxonomic groups (archaea, Actinos, etc.)