Moore Notes 6 03 09

1. iSEEM meeting to happen September 9-10 or 10-11. Jonathan e-mailing Kelly to see what dates she prefers.

2. Steve K. going to help set up a tracker for the future.

3. For future requests, let's post what you need on the iSEEM page.

4. CAMERA update: asked Jonathan if we need real compute resources. It looks like they are going to discuss what they will give us; at a minimum we will have access to sql series. What does having an account mean? They have a massive 1000-node Linux cluster. Four things would be useful: 1) accounts on the Linux farm, 2) the SQL-type database that contains all of the information in CAMERA, 3) use of their Linux cluster for HMM and BLAST searches, 4) in the event of building workflows for our software, being able to see the inside of their system as we go, as opposed to waiting until we are done, will be a better model. In addition, we could imagine using that system to store/transfer large files. Jeffrey Grethe is his contact; asked if we wanted to run our own analyses, etc. It is cc'd to their IT group, etc. Paul Gilna told them to do this for us. Kelly has been asking for updates, etc. This is not in their deliverables. They are supposed to be releasing CAMERA 2.0.

5. GOS data: peptide reads run through AMPHORA (align, build tree, assign phylotypes). GOS data too big to do a BLAST search on genbio, so he uploaded marker sequences and searched on the CAMERA server. This is where he got all of the candidates. Used the HMMs in AMPHORA to remove distantly related homologues. Got metadata from CAMERA for each sample, summarized phylotypes with metadata. For the Excel bar graph, just used the major groups (0.5% population). The heat map, for each group/sample, is like a microarray analysis: the horizontal (x) axis has the groups, the y axis has the samples, and each cell is the relative abundance of a particular group in a sample. Can cluster samples based on the relative abundance of each bacterial group (used Michael Eisen's Cluster software). A few samples did not have enough reads; removed those. To assign a read to a phylotype, do a BLAST search and assign the read to a marker gene. GOS too big to use HMMs. He used the E. coli rpoB protein sequence to pull out the rpoB sequences. He used very relaxed criteria. Used HMM search to align the reference sequences identified (using HMM align). Alignment is not time consuming. Then build a tree for each marker gene. Constrain the topology of each tree to be the topology of the genome tree. Place each read into the tree using maximum parsimony. In addition, do a bootstrap to assess confidence. Figure out phylotypes based on the tree. Reads are assigned to trees; reads are not compared to each other.
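The relative-abundance table behind the bar graph and heat map can be sketched as below. This is only an illustration, not the code used for the GOS analysis: the function names, the `min_reads` cutoff, the toy sample and group labels, and the reading of "0.5%" as a mean relative abundance across samples are all assumptions.

```python
from collections import Counter

def relative_abundance(assignments, min_reads=1):
    """assignments: dict of sample -> list of phylotype labels, one per read.
    Returns dict of sample -> {group: fraction of that sample's reads}.
    Samples with fewer than min_reads reads are dropped (hypothetical cutoff)."""
    table = {}
    for sample, groups in assignments.items():
        if len(groups) < min_reads:
            continue
        total = len(groups)
        table[sample] = {g: n / total for g, n in Counter(groups).items()}
    return table

def major_groups(table, threshold=0.005):
    """Groups whose mean relative abundance across samples is >= threshold.
    Assumption: 0.5% is applied to the mean over the retained samples."""
    if not table:
        return []
    groups = {g for row in table.values() for g in row}
    n = len(table)
    return sorted(
        g for g in groups
        if sum(row.get(g, 0.0) for row in table.values()) / n >= threshold
    )

def to_matrix(table, groups):
    """Rows are samples, columns are groups: the layout of the heat map."""
    samples = sorted(table)
    return samples, [[table[s].get(g, 0.0) for g in groups] for s in samples]

# Made-up reads, only to show the shape of the data.
assignments = {
    "sample1": ["Alpha"] * 3 + ["Cyano"],
    "sample2": ["Cyano"] * 2,
    "sample3": ["Alpha"],  # too few reads, dropped below
}
table = relative_abundance(assignments, min_reads=2)
groups = major_groups(table)
samples, matrix = to_matrix(table, groups)
```

The resulting sample-by-group matrix is the kind of input a clustering tool would take; exporting it to a specific tool's file format is left out here.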

6. Phylogenetic approach for binning data into OTUs: Andrew Martin 2002 paper, and a variety of other people. How do you bin a tree into OTUs? Can use branch-length distance. Also, how do you figure out what the name of a tip is? Eisen says nothing is implemented because the DOTUR method does a pretty good job; they say it is not necessary to build the tree. Eisen wrote an OTU clustering algorithm for comparing monophyletic groups. The amount of improvement when you use the tree is small, and not worth the amount of computational time. UniFrac is similar in concept to what we want to do. They reduce the complexity of the tree step by binning first, then building trees: bin into 99% similar OTUs, then make trees; that works pretty well. You only get differences between binning with DOTUR and using phylogeny if you have unequal rates of evolution.
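Distance-based binning without a tree can be sketched as furthest-neighbor clustering at a distance cutoff. This is a minimal illustration of the general idea, not DOTUR's actual implementation; the function name, the toy distances, and the use of a 0.01 cutoff to stand in for "99% similar" are assumptions.

```python
def furthest_neighbor_otus(names, dist, cutoff):
    """Bin sequences into OTUs by furthest-neighbor (complete-linkage) clustering.

    names:  list of sequence ids
    dist:   dict mapping frozenset({a, b}) -> pairwise distance
    cutoff: two clusters are merged only if every cross pair is within cutoff
    """
    clusters = [{n} for n in names]

    def linkage(a, b):
        # Furthest neighbor: the largest distance between any cross pair.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break  # the closest remaining pair is already too far apart
        clusters[i] |= clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Made-up distances, only to show the call.
names = ["a", "b", "c", "d"]
dist = {
    frozenset(("a", "b")): 0.005,
    frozenset(("a", "c")): 0.20,
    frozenset(("b", "c")): 0.20,
    frozenset(("a", "d")): 0.20,
    frozenset(("b", "d")): 0.21,
    frozenset(("c", "d")): 0.008,
}
otus = furthest_neighbor_otus(names, dist, cutoff=0.01)
```

The naive pairwise search is fine for a toy example but would be too slow for large data sets. One representative per OTU could then go into the tree-building step.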

7. Josh asked: how do you model the range of a species? Model a convex hull around that species? Josh will post results.
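The convex hull idea can be sketched with Andrew's monotone chain algorithm. This is a rough illustration and not Josh's method: it assumes the occurrence points for a species are plain planar (x, y) pairs, which ignores the curvature of the Earth and any wrap-around in longitude, and the coordinates are made up.

```python
def convex_hull(points):
    """Return the convex hull of 2-D points as a counterclockwise list of vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counterclockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

# Made-up sampling sites where a species was observed.
sites = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(sites)
```

The interior site (1, 1) drops out, leaving the four corner sites as the outline of the modeled range.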

8. Samantha asked Martin and Steve: for the alignments that are coming out, should I fully mask all of the states that are coming out? MetaSim has a bunch of problems.