Dec09 KembelNotes Kembel Notes

From OpenWetWare

Aaron Darling

Phylogenetic cobinning

  • Illumina GAII can generate huge quantities of data cheaply, but the reads are short (80 bp)
  • We want to place these reads on a reference phylogeny, e.g. with pplacer
  • Deep sequencing means multiple reads can come from the same organisms
  • We may want to group together these similar sequences from the same/similar organisms and place them as a group onto the reference phylogeny
  • Model linkage - a model for partitioning all reads into sets that are linked (i.e. can be thought of as coming from the same organism). Place each set of linked reads onto the reference phylogeny as a single leaf.
  • Evaluate the likelihood of different ways of partitioning the set of all reads into linkage groups. Model linkage group size as drawn from a gamma distribution, estimate the parameters of the distribution by likelihood, and find the most likely way to cobin reads into linkage groups.
  • Overall, this approach lets the reads share phylogenetic information with one another, which can strengthen the phylogenetic signal and improve the placement of reads in a phylogenetic context
  • Discussion points
    • Not yet implemented
    • Will require massive compute power - GSoC project using graphics card parallel computing may help
    • What about reads with no similarity to known genomes? Can we use paired ends?
    • Difference in rates of evolution
    • Lateral gene transfer
    • What does a linkage group mean - is it a 'species' or 'OTU'?
    • Other sources of linkage information available?
    • Will any of this matter with PacBio sequencing technology?
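
The group-size part of the model above can be illustrated with a rough Python sketch. It is not an implementation of the method (the notes say none exists yet). It scores only the gamma term for linkage group sizes, treats group size as a continuous value, and uses fixed parameters instead of estimating them by likelihood. A full model would also need the phylogenetic likelihood of each group's placement. The function names, read IDs and parameter values are hypothetical.

```python
import math


def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def partition_size_loglik(partition, shape, scale):
    """Sum of gamma log densities over the sizes of all linkage groups."""
    return sum(gamma_logpdf(len(group), shape, scale) for group in partition)


# Hypothetical read IDs and two candidate ways of cobinning the same reads.
reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
balanced = [["r1", "r2", "r3"], ["r4", "r5", "r6"]]
singletons = [[r] for r in reads]

# Assumed parameters (mean group size = shape * scale = 3 reads).
shape, scale = 3.0, 1.0
ll_balanced = partition_size_loglik(balanced, shape, scale)
ll_singletons = partition_size_loglik(singletons, shape, scale)

print("balanced:", ll_balanced)
print("singletons:", ll_singletons)
```

With these assumed parameters the size term favours the two groups of three reads over six singletons. In practice this term would be combined with the placement likelihood before comparing partitions.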

Data sets and metadata

  • Should we set up a MySQL database that everyone can use to store, share and subset data?
  • This will make it easier to make sure that people are working from the same version of data
  • We will encourage people to provide metadata for all the data in the database
  • Especially important for sharing code - we are probably reinventing the wheel
    • Let's share scripts on iSEEM SVN - currently they mostly live on genbeo.
    • Document script names and descriptions on the wiki here.
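
As a starting point for the database discussion above, here is a minimal sketch of one possible layout: a table of versioned data sets plus a table of metadata attribute/value pairs, with a query that subsets data sets by metadata. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table names, column names and example rows are all hypothetical, not an agreed schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a shared MySQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    version    TEXT NOT NULL,
    UNIQUE (name, version)
);
CREATE TABLE metadata (
    dataset_id INTEGER NOT NULL REFERENCES dataset(dataset_id),
    attribute  TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (dataset_id, attribute)
);
""")

# Hypothetical example rows.
conn.executemany(
    "INSERT INTO dataset (dataset_id, name, version) VALUES (?, ?, ?)",
    [(1, "example_metagenome", "v1"), (2, "example_pcr", "v1")],
)
conn.executemany(
    "INSERT INTO metadata (dataset_id, attribute, value) VALUES (?, ?, ?)",
    [
        (1, "data_type", "metagenomic"),
        (2, "data_type", "PCR"),
        (1, "environment", "ocean"),
        (2, "environment", "ocean"),
    ],
)


def datasets_with(conn, attribute, value):
    """Return (name, version) for every data set with the given metadata value."""
    rows = conn.execute(
        "SELECT d.name, d.version FROM dataset d "
        "JOIN metadata m ON m.dataset_id = d.dataset_id "
        "WHERE m.attribute = ? AND m.value = ? ORDER BY d.name",
        (attribute, value),
    )
    return rows.fetchall()


print(datasets_with(conn, "data_type", "PCR"))
```

Storing an explicit version with each data set is one way to check that everyone is working from the same version of the data.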

Informatics issues / jobs

  1. IT issues
    • Guillaume will be point man for issues related to genbeo/cluster use
  2. Data / databases
  3. Scripts / tools / software
  4. Task tracking / management / jobs / working
  5. CAMERA
    • Guillaume will be point man for CAMERA interactions
  6. Developing
  7. Wiki
    • We need to update the public wiki page on OpenWetWare - someone (Guillaume?) should fix this and contact OWW
    • Fine to start migrating content from private -> public
    • We all need to update the wiki regularly with a summary of what we've been doing, in enough detail that others can follow the work and avoid duplicating effort
  8. Citeulike
    • Please share papers you're reading on citeulike - either in your own library or in the iseem group
      • There's a bookmarklet and a way to post stuff from google reader to citeulike
  9. Communication system
    • Email list? Online forum? Gmail group?
  10. Software packages and locales
    • Let's set up a shared /bin directory on genbeo

Specific data requests

  • Josh
    • OTUs - from rRNA - from ocean PCR and metagenomic data
    • OPFs from metagenomic data
  • James
    • OTUs filtered by narrower taxonomic range (e.g. families)
    • OTUs from both metagenomic and additional ocean PCR data
  • Sam
    • Protein families
      • alignments, trees from metagenomic data for those families
    • Simulations
      • Generating and analyzing trees from simulated data
  • Tom
    • All vs. all clustering of protein sequences from whole genome data
      • Build the families
  • Dongying
    • Merge families from different phyla
      • MySQL database of relationships between taxonomy and sequences (map JGI <-> MicrobeDB?)
    • Install HMMER3 in centralized location
      • Get consensus sequences for HMM profiles to group HMMs/families into clusters
  • Morgan
    • All vs. all BLASTs
    • Access to many metagenomic data sets with metadata (grab from CAMERA)?
  • Steve
    • Alignments of metagenomic reads to more gene families (using AMPHORA?)
      • For GOS data set
      • Additional marker gene families from Dongying
      • Additional markers for different taxonomic groups (archaea, Actinos, etc.)
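
For the "all vs. all clustering ... build the families" request above, one simple option is single-linkage clustering: treat each significant pairwise hit as an edge and take connected components as families. The notes do not say which clustering method will be used, so this is only an illustrative sketch. The function name, the (query, subject, e-value) input format and the e-value cutoff are assumptions.

```python
def build_families(hits, max_evalue=1e-5):
    """Group sequences into families by single-linkage over pairwise hits.

    hits: iterable of (query_id, subject_id, evalue) tuples.
    Returns a sorted list of families, each a sorted list of sequence IDs.
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        # Only hits at or below the cutoff link two sequences.
        if evalue <= max_evalue and root_q != root_s:
            parent[root_q] = root_s

    families = {}
    for seq in list(parent):
        families.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in families.values())


# Hypothetical hits: A-B and B-C are significant, C-D is not.
example_hits = [
    ("A", "B", 1e-30),
    ("B", "C", 1e-10),
    ("C", "D", 0.5),
    ("D", "D", 0.0),
]
print(build_families(example_hits))
```

Single linkage is easy to compute but can chain unrelated sequences together through multi-domain proteins, so a stricter method may be needed in practice.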