Dec09 Kembel Notes
From OpenWetWare
Aaron Darling
Phylogenetic cobinning
- Illumina GAII can generate huge quantities of data cheaply, but the reads are short (80 bp)
- We want to place these reads on a reference phylogeny, e.g. with pplacer
- Deep sequencing means multiple reads can come from the same organism
- We may want to group together these similar sequences from the same/similar organisms and place them as a group onto the reference phylogeny
- Model linkage - a model of partitioning all reads into sets that are linked (can be thought of as coming from the same organism). Place each set of linked reads onto the reference phylogeny as a single leaf.
- Evaluate the likelihood of different ways of partitioning the set of all reads into linkage groups. Model linkage group size as coming from a gamma distribution, estimate the parameters of the distribution by likelihood, then find the most likely way to cobin reads into linkage groups.
- Overall, this approach lets reads in the same linkage group share phylogenetic information, which can strengthen the phylogenetic signal and improve placement of reads on the reference phylogeny
- Discussion points
- Not yet implemented
- Will require massive compute power - GSoC project using graphics card parallel computing may help
- What about reads with no similarity to known genomes? Can we use paired ends?
- Difference in rates of evolution
- Lateral gene transfer
- What does a linkage group mean - is it a 'species' or 'OTU'?
- Other sources of linkage information available?
- Will any of this matter with PacBio sequencing technology?
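A minimal Python sketch of the group-size part of the cobinning likelihood described above. It is an illustration only, since the approach is not yet implemented: the function names, toy read IDs and parameter values are hypothetical, the shape estimate uses a standard closed-form approximation to the maximum-likelihood fit rather than an exact optimizer, and the phylogenetic placement term of the likelihood is left out entirely.

```python
import math


def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def fit_gamma(sizes):
    """Approximate maximum-likelihood fit of (shape, scale) to group sizes.

    Uses a closed-form approximation for the shape; the scale then follows
    from mean = shape * scale. Requires at least two distinct sizes.
    """
    n = len(sizes)
    mean = sum(sizes) / n
    s = math.log(mean) - sum(math.log(x) for x in sizes) / n
    if s <= 0.0:
        raise ValueError("need at least two distinct group sizes")
    shape = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    return shape, mean / shape


def size_log_likelihood(partition, shape, scale):
    """Sum of gamma log densities over the sizes of the linkage groups.

    This is only the group-size term; the placement likelihood of each
    group on the reference phylogeny is NOT modelled here.
    """
    return sum(gamma_logpdf(len(group), shape, scale) for group in partition)


# Hypothetical toy data: six reads, two candidate ways of cobinning them.
partition_a = [["r1", "r2", "r3"], ["r4", "r5"], ["r6"]]
partition_b = [["r1"], ["r2"], ["r3"], ["r4"], ["r5"], ["r6"]]

# Hypothetical fixed parameters, chosen only for illustration.
SHAPE, SCALE = 2.0, 1.5
ll_a = size_log_likelihood(partition_a, SHAPE, SCALE)
ll_b = size_log_likelihood(partition_b, SHAPE, SCALE)
print("partition_a:", ll_a)
print("partition_b:", ll_b)
```

With these toy values, grouping the six reads into bins of 3, 2 and 1 scores higher than six singleton bins. A real search would need to explore many partitions and combine this term with the placement likelihood, which is where the compute cost raised in the discussion points comes from.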
Data sets and metadata
- Should we set up a MySQL database that everyone can use to store, share and subset data?
- This will make it easier to make sure that people are working from the same version of data
- We will encourage people to provide metadata for all the data in the database
- Look into EML and metadata standards from NCEAS (ecoinformatics.org)
- Especially important for sharing code - we are probably reinventing the wheel
- Let's share scripts on iSEEM SVN - currently they mostly live on genbeo.
- Document script names and descriptions on the wiki here.
Informatics issues / jobs
- IT issues
- Guillaume will be point man for issues related to genbeo/cluster use
- Data / databases
- Scripts / tools / software
- Task tracking / management / jobs / working
- CAMERA
- Guillaume will be point man for CAMERA interactions
- Developing
- Wiki
- We need to update the public wiki page on OpenWetWare - someone (Guillaume?) should fix this and contact OWW
- Fine to start migrating content from private -> public
- We all need to update the wiki regularly with a summary of what we've been doing, in enough detail that others can follow the work and avoid duplicating effort
- CiteULike
- Please share papers you're reading on CiteULike - either in your own library or in the iSEEM group
- There's a bookmarklet, and a way to post items from Google Reader to CiteULike
- Communication system
- Email list? Online forum? Gmail group?
- Software packages and locales
- Let's set up a shared /bin directory on genbeo
Specific data requests
- Josh
- OTUs from rRNA, from ocean PCR and metagenomic data
- OPFs from metagenomic data
- James
- OTUs filtered by narrower taxonomic range (e.g. families)
- OTUs from both metagenomic and additional ocean PCR data
- Sam
- Protein families
- Alignments and trees from metagenomic data for those families
- Simulations
- Generating and analyzing trees from simulated data
- Tom
- All vs. all clustering of protein sequences from whole genome data
- Build the families
- Dongying
- Merge families from different phyla
- MySQL database of relationships between taxonomy and sequences (map JGI <-> MicrobeDB?)
- Install HMMER3 in centralized location
- Get consensus sequences for HMM profiles to group HMMs/families into clusters
- Morgan
- All vs. all BLASTs
- Access to many metagenomic data sets with metadata (grab from CAMERA)?
- Steve
- Alignments of metagenomic reads to more gene families (using AMPHORA?)
- For GOS data set
- Additional marker gene families from Dongying
- Additional markers for different taxonomic groups (archaea, Actinos, etc.)
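For the all vs. all clustering and family-building items listed under Tom, one possible approach can be sketched in Python. The notes do not say which clustering method will be used, so single-linkage clustering over pairwise hits is an assumption here, and the function name, the `(query, subject, score)` tuple layout and the score threshold are all hypothetical.

```python
from collections import defaultdict


def cluster_hits(hits, min_score):
    """Single-linkage clustering of sequences from pairwise hits.

    hits: iterable of (query_id, subject_id, score) tuples, e.g. parsed
          from all vs. all BLAST output (layout assumed for illustration).
    min_score: hits scoring below this do not link two sequences.
    Returns a list of families (sorted ID lists), largest family first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, score in hits:
        root_q, root_s = find(query), find(subject)
        if score >= min_score and root_q != root_s:
            parent[root_q] = root_s

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: (-len(members), members))


# Hypothetical toy hits: the C-D hit is too weak to merge the two families.
toy_hits = [
    ("A", "B", 200), ("B", "C", 150), ("C", "D", 20),
    ("D", "E", 300), ("F", "F", 500),
]
toy_families = cluster_hits(toy_hits, min_score=50)
print(toy_families)
```

Single linkage is simple but tends to chain unrelated families together through one spurious hit, so the threshold matters, and a different clustering method may suit the real data better.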