Moore GreenLabNotesSept-3-08

From OpenWetWare

9.3.08

Conversation with Katie, Jonathan, James and Liz about Arrowhead project


1. The primary concern about using the GDD curve to infer eco-evo processes is that, even if we have a theoretical prediction for the distribution of whole-genome similarities, two issues will come into play when we take the metagenomic sample:

Because we are comparing pieces of genomes, the moderately high similarity scores in our histogram could reflect either comparisons of orthologous genes from divergent lineages or comparisons of paralogous genes that result from a history of duplication events within genomes. Our metric, by itself, will not be able to distinguish between these two types of comparisons. Also, certain areas of the genome might show patterns of repetition (long strings of single nucleotides, etc.) or might be especially prone to convergent evolution, and comparisons of these regions both within and across genomes will factor into our similarity scores even if they are not informative from a phylogenetic and/or functional/ecological perspective. Similarly, the low similarity scores that we record will be due both to comparisons of non-homologous areas of very similar genomes and to comparisons of 'homologous' regions between very distant taxa. The metric won't be able to distinguish between these two types of comparisons either, and we will have to take into account the fragmented nature of our data if we are to correctly predict theoretical patterns.
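To make the concern concrete, here is a minimal sketch of how a similarity histogram might be built from fragment pairs. It assumes equal-length, already-aligned fragments and a simple fraction-identity score; the function names `fraction_identity` and `similarity_histogram` are hypothetical and are not the project's actual metric. Only the scores reach the histogram, so nothing in the output records whether a pair was orthologous, paralogous, repetitive, or non-homologous.

```python
from itertools import combinations


def fraction_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length fragments.

    Assumption: fragments are already aligned and of equal length.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have equal length")
    if not a:
        raise ValueError("fragments must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def similarity_histogram(fragments, n_bins=10):
    """Count all pairwise similarity scores in n_bins equal-width bins on [0, 1].

    The histogram keeps only the scores: the kind of comparison that
    produced each score (ortholog, paralog, repeat, ...) is discarded.
    """
    counts = [0] * n_bins
    for a, b in combinations(fragments, 2):
        score = fraction_identity(a, b)
        # A score of exactly 1.0 goes into the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy fragments (made up for illustration, not real sequence data).
toy_fragments = ["ACGTACGTAC", "ACGTACGTAC", "TTTTTTTTTT"]
toy_counts = similarity_histogram(toy_fragments)
```

In the toy input, the two identical fragments could just as well be a paralogous pair within one genome as an orthologous pair across two genomes; the histogram looks the same either way.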


Because different genes evolve at different rates across the genome, we don't know if a high similarity score reflects a comparison between highly conserved genes shared by somewhat distantly related individuals OR fast-evolving genes shared by closely related individuals. Neutral predictions can tell us about abundances at the tips of phylogenetic trees and about time since divergence between species, but they do not account for the fact that different amounts of genetic divergence will accrue over the same time period in different parts of the genome. (Neutral theory doesn't allow for positive/negative selection.)
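The rate ambiguity can be illustrated with a small worked example. This sketch assumes the Jukes-Cantor (JC69) substitution model, which was not part of the conversation and is used here only because it gives a simple closed form; the rates and divergence times are made-up illustrative values. Under these assumptions, a slowly evolving gene compared across a long divergence time and a fast-evolving gene compared across a short one give the same expected identity.

```python
import math


def expected_identity_jc69(rate_per_site_per_year: float,
                           years_since_divergence: float) -> float:
    """Expected fraction of identical sites between two lineages under JC69.

    Assumptions: a constant rate on both lineages and equal base
    frequencies, as the Jukes-Cantor model requires.
    """
    # Both lineages accumulate substitutions, hence the factor of 2.
    d = 2.0 * rate_per_site_per_year * years_since_divergence
    # JC69: expected proportion of differing sites for distance d.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    return 1.0 - p_diff


# Hypothetical values chosen so that rate * time is equal in both cases.
slow_gene_distant_taxa = expected_identity_jc69(1e-9, 1e8)
fast_gene_close_taxa = expected_identity_jc69(1e-8, 1e7)

print(round(slow_gene_distant_taxa, 3), round(fast_gene_close_taxa, 3))
```

Because only the product of rate and time enters the formula, a single similarity score cannot separate the two scenarios without outside information about the rate of the gene being compared.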

In summary, how will the genomic similarity metrics be interpreted?

2. Question posed: how would you expect the GDD curve for whole genomic data to differ from a GDD curve based on a phylogenetic marker gene? What would make these two curves different? Would it be LGT, convergent evolution, gene diversification, gene duplication? This is a potentially interesting application of the analysis that we were not thinking of before. It sounds like JE is interested in comparative analyses using whole metagenomic versus marker gene genomic analysis to tease apart the influence of important evolutionary processes such as LGT. This idea was discussed under the umbrella of phylogenetic versus functional analyses.

From JE: This is one of the specific goals of the comparison of phylogenetic trees of "marker" genes vs. functional genes we are planning on doing. For example, we are planning on looking for how much phylogenetic "novelty" there is in marker genes vs. functional genes and to quantify this by sample. So if a sample has a lot of unusual functional genes but normal species, we should detect this.
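One simple way to put a number on how much two such curves differ is sketched below. It assumes both curves are available as histograms over the same similarity bins, and it uses the largest gap between their cumulative distributions (a Kolmogorov-Smirnov-style statistic). This choice of statistic and the function names are assumptions for illustration, not something proposed in the conversation.

```python
def cumulative_fractions(counts):
    """Convert histogram counts into a cumulative fraction per bin."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    running = 0
    fractions = []
    for c in counts:
        running += c
        fractions.append(running / total)
    return fractions


def max_cumulative_gap(counts_a, counts_b):
    """Largest absolute gap between two cumulative similarity curves.

    Returns 0.0 for identically shaped histograms and 1.0 when all the
    mass of one histogram lies in bins below all the mass of the other.
    Assumption: both histograms use the same bin edges.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must use the same bins")
    cum_a = cumulative_fractions(counts_a)
    cum_b = cumulative_fractions(counts_b)
    return max(abs(x - y) for x, y in zip(cum_a, cum_b))


# Toy histograms (made up): whole-genome fragments vs. a marker gene.
whole_genome_counts = [5, 3, 1, 1]
marker_gene_counts = [1, 1, 3, 5]
gap = max_cumulative_gap(whole_genome_counts, marker_gene_counts)
```

A statistic like this only says that the curves differ and by how much; attributing the difference to LGT, duplication, or convergence would still need the kind of tree-based comparison JE describes.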


3. Suggestions: look at the whole genomic similarity literature by Tiedje, talk to Marisano. DNA-DNA hybridization literature.

4. Potential databases to use:

- JGI has simulated metagenomic communities, mostly designed for testing metagenomic annotation, based on data from sequenced genomes.
- JE has done simulations/virtual genome sequences.
- JE has data for in vivo simulations: mixed cells, 10 different species (with sequenced genomes), with sequence reads available.
- Banfield acid mine drainage. More realistic in terms of actual co-occurring species. Simple system, well annotated.