ISEEM Progress June 2008
Integrating Statistical, Ecological and Evolutionary approaches to Metagenomics (iSEEM) Project Progress Report July 7, 2008
Progress made towards completion of goals
We continued recruitment of five postdocs and a bioinformatics engineer. Our ad was circulated again to our personal contacts and some community lists. Application materials were collected through the UC Davis Genome Center hiring “portal” at:
To date we have received and reviewed 70 postdoc applications and 21 bioinformatics engineer applications. From these, we produced a short list of 15 postdocs and 2 engineers. In April and May 2008 we conducted phone interviews with the short-listed candidates. We also had the opportunity to meet 7 candidates in person, because they were visiting or residing near a PI or attending a conference with a PI.
Each candidate was first interviewed by one PI (based on the most likely home lab). After a group discussion of each candidate, we selected the top 10 to be interviewed by a second PI. In most cases all three PIs spoke with the candidate. We also obtained references for the top 10 candidates. Each PI ranked the candidates for their own lab, and we discussed how to form a team with a good balance of skill sets within and among labs. We then made offers to the following candidates:
- Green Lab
- Pollard Lab
- Eisen lab
- Dr. Sourav Chatterji. He has been a postdoc in the Eisen lab for ~1.5 years. He has extensive experience in metagenomic and genomic analyses. He earned his PhD in Computer Science from U.C. Berkeley.
Kembel, O’Dwyer, Ladau and Chatterji have accepted. O’Dwyer has started. Kembel and Ladau will start in Fall 2008. Chatterji is starting July 2008. Sharpton is still considering the offer, and would start by January 2009. We have a ranked list of additional interviewed candidates, plus several promising-looking recent applications, should we need to make additional offers.
We agreed on two top candidates based on application materials. These two candidates were interviewed on the phone by one PI. Both also visited Davis (at their own expense) to meet with the Pollard and Eisen labs. We ranked the candidates as follows:
- Srijak Bhatnagar
- Brent Kronmiller
We have offered the position to Bhatnagar based on his experience with metagenomics data (he previously worked on a metagenomic informatics project in Josh Weitz's lab at Georgia Tech), and he has accepted. Kronmiller is also interested in the postdoc positions and is being considered for a position in the Eisen lab.
Full list of personnel involved
- Green Lab
- Liz Perry, PhD student
- James O'Dwyer, post-doc
- Steven Kembel, post-doc starting Fall 2008
- Pollard Lab
- Josh Ladau, post-doc starting Fall 2008
- Eisen Lab
- Martin Wu, Project Scientist
- Dongying Wu, Project Scientist
- Sourav Chatterji, Post-doc
- Marisano James, PhD student in Population Biology at UC Davis starting Fall 2008
- Srijak Bhatnagar, Bioinformatics Engineer
Comparative metagenomics (Liz Perry and James O’Dwyer):
We have begun analyzing beta-diversity patterns in metagenomic data using the GOS Sorcerer II dataset. Using the similarity metric published by the Venter group in their 2007 PLoS paper, we have found that genomic similarity shows significant distance-decay patterns along several important environmental gradients. The overall metagenomic similarity between samples is significantly correlated with geographic distance, temperature, chlorophyll a concentration, and ocean depth. Other environmental ‘metadata’ such as sample depth and salinity do not correlate with metagenomic similarity.
Additionally, we have identified several weaknesses in the similarity metric employed by the Sorcerer II team, and have devised a new metric that will more accurately describe the overall sequence similarity between two metagenomic samples. We have begun using network theory to develop a metric that will describe ‘alpha-diversity’ or sequence richness of individual metagenomic samples.
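The distance-decay analysis described above can be illustrated with a small permutation test in the style of a Mantel test. This is a minimal sketch, not the analysis code used in the project: the sample positions, the exponential similarity values, and the function names `pearson`, `pairs` and `mantel` are all hypothetical, and the similarity matrix here is toy data rather than the Sorcerer II metric.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairs(matrix, order):
    """Upper-triangle entries of a square matrix, with samples relabeled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(similarity, distance, permutations=999, seed=0):
    """Correlate pairwise similarity with pairwise distance and estimate a
    one-sided p-value by permuting the sample labels of the similarity matrix."""
    identity = list(range(len(similarity)))
    d = pairs(distance, identity)
    observed = pearson(pairs(similarity, identity), d)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        # Distance decay predicts a negative correlation, so count permutations
        # that are at least as negative as the observed value.
        if pearson(pairs(similarity, order), d) <= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy data: five samples along a transect, with similarity
# decaying exponentially with distance.
positions = [0.0, 1.0, 2.0, 4.0, 7.0]
distance = [[abs(a - b) for b in positions] for a in positions]
similarity = [[math.exp(-0.3 * d) for d in row] for row in distance]

r, p = mantel(similarity, distance)
print(f"r = {r:.3f}, one-sided p = {p:.3f}")
```

Permuting sample labels, rather than individual matrix entries, respects the fact that the pairwise values in a distance matrix are not independent of one another. The same function could be applied to matrices of temperature or depth differences in place of geographic distance.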
Review paper on “Computational Methods for Studying Microbial Diversity” (iSEEM team):
- We are jointly working on a review paper on “Computational Methods for Studying Microbial Diversity” for the journal PLoS Computational Biology.
- We are using this to get everyone in the project up to speed on what has been done in the broader scientific community in the area of the goals of our project.
- In addition, the private web page we are making within the wiki for writing this paper has links to web servers, available software, and journal publications in this field. We will make this web resource available through the public iSEEM web page (see below), through PLoS Computational Biology and through CAMERA.
- We anticipate submitting this review prior to the submission of our next quarterly report (October 2008).
STAP Pipeline (Eisen lab - Dongying Wu)
- Paper published July 2, 2008: Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS ONE 3(7): e2566.
- The paper describes software for performing automated alignments and phylogenetic analyses of rRNA sequences. The software was developed over many years in the Eisen lab and the final work on it was supported by this project. It will be used as a part of the rRNA analyses to be carried out for this project.
- We have been working with the JCVI (initially) and UCSD-CalIT2 (now) to implement this software within CAMERA.
Amphora phylogenomics software (Eisen lab - Martin Wu)
Martin Wu is finalizing the testing of software for automated phylogenomic analysis of genomes and metagenomes. The software is central to the gene family analyses to be done as part of the iSEEM project. This software, called AMPHORA, was originally developed with support from an NSF "Assembling the Tree of Life" grant to Jonathan Eisen. Recent work on the software has been supported by this grant and by the subcontract to the Eisen lab from the CAMERA group. A paper on the software has been submitted to Genome Biology and is undergoing revision.
A component of this software is a set of hidden Markov models of protein families that will be used as markers for surveying metagenomic data as part of the iSEEM project.
Protein family "pre-computes" based on complete genomes (Eisen Lab - Dongying Wu)
Dongying Wu has been working on an automated pipeline for identification of protein families in complete genomes. The method being used is a Markov clustering (MCL) approach that was developed by others and that the Eisen lab is also using for automated identification of phylotypes/OTUs in rRNA data. Test runs of the protein family clustering indicate that it is robust and should produce a catalog of protein families that can be used for the development of weights and scoring schemes for different gene families as part of this project.
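To make the clustering step concrete, the following is a minimal sketch of the general Markov clustering idea (alternating expansion and inflation of a column-stochastic matrix) on a toy similarity graph. It is not the pipeline's implementation: the graph, the edge weights, the inflation value and the function names are hypothetical, and a production analysis would normally rely on the published MCL software rather than this dense pure-Python version.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Dense square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster a weighted undirected graph given as a symmetric matrix."""
    n = len(weights)
    # Self-loops are a common choice in MCL to stabilize the iteration.
    m = normalize_columns([[weights[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iterations):
        expanded = matmul(m, m)  # expansion: flow spreads along paths
        # Inflation: strong flows are boosted and weak flows are suppressed.
        inflated = normalize_columns([[v ** inflation for v in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Rows that retain flow act as attractors; the columns they attract
    # form one cluster each.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy graph: two tightly linked groups of three proteins,
# joined by one weak similarity between proteins 2 and 3.
toy = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.05)]:
    toy[a][b] = toy[b][a] = w

print(mcl(toy))
```

The inflation parameter controls granularity: larger values tend to split the graph into more, smaller clusters, which is one reason test runs on known families are useful before building a catalog.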
Automated alignment masking (Eisen Lab - Martin Wu and Sourav Chatterji)
One of the challenges for the iSEEM project, and for any project attempting to carry out phylogenetic analysis of many different genes, is the need to assess the quality of the multiple sequence alignments that are used for the phylogenetic reconstructions. In the past, including in the development of the STAP and AMPHORA software outlined above, we have carried out extensive manual examination of alignments to identify regions of the alignment that may not be useful for phylogenetic analysis. Known as "masking", this manual work takes both significant expertise and time. We are now developing a completely automated approach that uses profile hidden Markov models to, in essence, assign a likelihood score to each alignment position that measures the probability that a particular amino acid or nucleotide residue is aligned correctly. From these scores we can generate a completely automated "mask" of a sequence alignment. We are still testing this approach, but first impressions indicate it may greatly accelerate the second phase of the iSEEM protein family analysis, in which we are to include a large number of protein families in our metagenomic analyses.
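The last step of this approach, turning per-residue scores into a mask, can be sketched as follows. This is an illustration under stated assumptions rather than the method under development: the per-residue confidence values are invented (in practice they would come from the profile hidden Markov model), and the averaging rule, the 0.8 threshold and the function names are hypothetical choices.

```python
def column_confidence(alignment, residue_scores):
    """Mean confidence of the non-gap residues in each alignment column."""
    names = list(alignment)
    length = len(alignment[names[0]])
    confidences = []
    for col in range(length):
        values = [residue_scores[name][col] for name in names
                  if alignment[name][col] != "-"]
        # A column made up entirely of gaps carries no signal.
        confidences.append(sum(values) / len(values) if values else 0.0)
    return confidences


def build_mask(confidences, threshold=0.8):
    """True marks columns to keep for phylogenetic analysis."""
    return [c >= threshold for c in confidences]


def apply_mask(alignment, mask):
    """Drop masked-out columns from every sequence."""
    return {name: "".join(ch for ch, keep in zip(seq, mask) if keep)
            for name, seq in alignment.items()}


# Hypothetical toy alignment and per-residue confidence scores.
alignment = {
    "seqA": "MKT-LV",
    "seqB": "MKSALV",
    "seqC": "MRT-IV",
}
residue_scores = {
    "seqA": [0.99, 0.95, 0.90, 0.00, 0.60, 0.97],
    "seqB": [0.98, 0.96, 0.85, 0.30, 0.55, 0.99],
    "seqC": [0.99, 0.90, 0.88, 0.00, 0.40, 0.96],
}

mask = build_mask(column_confidence(alignment, residue_scores))
masked = apply_mask(alignment, mask)
print(masked)
```

In this toy case the gap-rich fourth column and the poorly supported fifth column fall below the threshold and are removed, leaving four columns per sequence. Averaging over residues is only one possible rule; a stricter mask could require every residue in a column to pass.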
Metagenomic diversity scientific workflow (iSEEM team)
We have realized that it would be very helpful for our project to develop a formal scientific workflow for most or all of our computational analyses. By formal scientific workflow we mean something akin to the Kepler system (http://kepler-project.org/), whose description states that "Scientists in a variety of disciplines (e.g., biology, ecology, astronomy) need access to scientific data and flexible means for executing complex analyses on those data. Such analyses can be captured as 'scientific workflows' in which the flow of data from one analytical step to another is captured in a formal workflow language."
The realization that we need such a formal workflow came in part from discussions with the CAMERA group about the needs and goals of our project. One reason we need this workflow is to help organize the overall project. With multiple labs involved (with different levels of computational experience, and with different people working on different subprojects), a formalization of our workflow will help everyone know where they fit in. In addition, a formal workflow can allow people who are not bioinformatics experts to carry out analyses more easily. A third benefit of a formal workflow such as that of the Kepler system is that the workflow software itself can serve as an electronic lab notebook, recording all data sets and settings used in any analysis.
Thus, in conjunction with our work with CAMERA, we have begun conversations with some of the developers of the Kepler workflow system who are in the Genome Center at U.C. Davis. We note that, of all the workflow systems being developed, we particularly like the Kepler system for many reasons, including that it is a fully Open Source project. In addition, researchers with CAMERA have experience with this system.
Communications, Collaboration, Outreach and Education
iSEEM Web Page
An iSEEM web page is currently being created. The purpose of this page will be to serve as a readily accessible resource for our collaborators, colleagues, CAMERA, and members of the general public who wish to find information about the iSEEM project. Examples of information to be included in this web page are links to the individual sites of the iSEEM investigators, postings of project-related products such as open access manuscripts, and news items about past or upcoming project-related events such as workshops and conferences. For a sample web page that follows in spirit what we are generating for iSEEM, see http://fibr.kgi.edu/.
Conference and Seminar Activities
We have begun to incorporate iSEEM project-related topics into our seminar and conference activities:
- Gordon Research Conference “Metabolic Basis of Ecology” July 6-11, “Microbial Metabolic Diversity: Geography and Dimensions”, Jessica Green
- Biology of Genomes, Cold Spring Harbor Laboratory, May 6-10, “Rapid evolution upstream of ape-specific duplicated genes”, Katie Pollard
- Emerging Statistical Challenges in Genome & Translational Research, Banff International Research Station for Mathematical Innovation & Discovery, June 1-6, “Nonparametric approaches to QTL mapping”, Katie Pollard
Several potential and ongoing collaborations relevant to iSEEM have been initiated. These collaborations should significantly advance iSEEM research and also contribute to the broader metagenomics research community. These include:
- Potential collaborations between Jessica Green, Stephen Giovannoni, Alexandra Worden, and Craig Carlson through a pending grant titled “Metagenomic Analysis of the North Atlantic Spring Bloom” submitted to the Gordon and Betty Moore Foundation.
- Potential collaborations between Jessica Green and Forest Rohwer through a pending grant titled “Biodiversity and Biogeography of the Human Microbiome” submitted to the Howard Hughes Medical Institute. One component of this project would entail Green and Rohwer collaborating on the metagenomic analysis of microbial diversity in cystic fibrosis infected lungs.
- Jonathan Eisen taught a seminar on metagenomics and phylogenomics in Winter 2007 for students in the Population Biology graduate program.
- Jonathan Eisen taught a course in Spring 2008 on "Microbial phylogenomics" in which about half of the course focused on microbial diversity.
- Jonathan Eisen gave a lecture for the Bodega Bay Workshop in Applied Phylogenetics in March 2008 (see http://bodegaphylo.wikispot.org/Front_Page for more detail). A component of the lecture focused on microbial diversity and metagenomics.
Biweekly PI meetings
Eisen, Green and Pollard are meeting bi-weekly via conference call on Wednesdays, 9:30 – 10:30 AM. These meetings mostly deal with logistical aspects of the iSEEM project.
Biweekly iSEEM meetings
In the weeks that alternate with the PI meetings, we are holding iSEEM lab group meetings for all involved participants (including those at UCD, UCSF and UO). The purpose of these meetings is to discuss the scientific aspects of the iSEEM project and to foster collaborations and interdisciplinary efforts between the PI lab groups.
We held a quarterly meeting April 18, 2008 at CalIT2 on the UCSD campus as part of a meeting with the CAMERA team. The people from the iSEEM project at the meeting were: Jonathan Eisen, Martin Wu, Dongying Wu, Jessica Green, James O'Dwyer, Liz Perry, Katherine Pollard. Multiple personnel from the CAMERA team were also there. The meeting consisted of a discussion of the goals of CAMERA, the goals of the iSEEM project, and the CAMERA subcontract to the Eisen lab. We then discussed how the iSEEM team could work with CAMERA both to get the science done that is part of the iSEEM project and to implement in CAMERA any tools the iSEEM project develops. Overall the meeting was very helpful. It did, however, highlight some challenges in working with CAMERA to get exactly what the iSEEM project needs in terms of scientific resources. Follow-up discussions with people from CAMERA, including Paul Gilna and Mark Ellisman, have led to a strategy to move forward that seems likely to solve all of the perceived challenges.
Any unexpected challenges that imperil successful completion of the Outcome
There have been no unexpected challenges. As indicated in the last report in Spring 2008, the delay in Davis making the funds available delayed our hiring by a few months. As a result, we are a few months behind schedule in some of the spending associated with personnel. We are now catching up on the work of the project and believe that, although the hiring was somewhat delayed, we have hired some stellar postdoctoral researchers for the project, most of whom have started or will start in the fall.
We note that Dr. Pollard is moving from U.C. Davis to U. C. San Francisco. We do not believe this will hinder the project and we are in the process of setting up a subcontract from U. C. Davis to U. C. San Francisco for her portion of the project.