# Future

# New Bioinformatics Tools for Marine Metagenomics

## Who is out there? (Sharpton, Koeppel, Darling)

### PHYLOTU: Identifying Microbial Taxa

Overview: One of the major challenges facing marine microbiologists is how best to leverage new sequencing technologies to improve molecular characterizations of microbial communities. Transitioning from using targeted, full-length 16S rRNA sequences to using shotgun-sequenced metagenomes or 16S pyrotags requires new tools for identifying taxa and estimating diversity from large, fragmentary data sets. Existing methods, which assume that sequences overlap, do not work on such data. This year, we developed a tool, called PHYLOTU, to meet this need. The current version has reasonable accuracy on 454-like (~400 bp) reads. Our goals for the next year are to improve the accuracy of PHYLOTU on shorter, Illumina-like (~100 bp) reads and to extend the algorithms to work with pyrotags.

#### Enabling characterization of microbial diversity from short-read metagenomic data

Motivation. Investigators frequently use overlapping 16S sequences to describe the number of microbial species in an environmental sample. We developed PHYLOTU to enable the characterization of microbial diversity from non-overlapping 16S metagenomic sequences. PHYLOTU is currently limited to 454- and Sanger-sequence metagenomic libraries. Many metagenomic project investigators are adopting short-read (less than 100 bp) sequence technology (e.g., Illumina). PHYLOTU must be extended to incorporate short sequences to ensure that microbial diversity can be described from these investigations. This work would collectively comprise PHYLOTU v2.0.

Approach. To facilitate high-throughput investigations, we originally implemented PHYLOTU with a phylogenetic method that sacrifices sensitivity for speed. Our initial investigations suggest this method is particularly imprecise when processing sequences shorter than 100 bp. To enable accurate analysis of short reads, we will develop PHYLOTU modules that incorporate slower, but more sensitive, maximum likelihood and Bayesian phylogenetic methods (e.g., RAxML, pplacer, and xrate). In addition, we will develop statistically guided quality control filters for short-read sequences. These developments will be evaluated via the same simulation approach we previously derived and described in the PHYLOTU manuscript. Once developed, PHYLOTU v2.0 will be used to analyze the diversity of microbes in publicly available metagenomic libraries.

Outcomes. We anticipate that PHYLOTU v2.0 will be widely adopted given that more investigators are moving towards short-read technology (e.g., Illumina). The software will be made freely available to the public, as will the results of processing publicly available metagenomic libraries with PHYLOTU v2.0.

#### Improving diversity estimates from 16S pyrotag sequences

Motivation. Many modern investigations characterize a microbial community by amplifying and sequencing 16S pyrotags. Percent identity (PID) based methods, originally developed using full-length 16S sequences, are used to characterize diversity from these pyrotags. Recent studies have revealed that PID approaches may be inappropriate for these relatively short sequences, calling into question many pyrotag-based diversity findings. PHYLOTU can accurately cluster sequences with lengths comparable to pyrotags (~400 bp); we therefore hypothesize that it may serve as an accurate alternative to PID-based analysis.

Approach. We can test this hypothesis directly by simulating pyrotags from full-length 16S sequences and clustering both the simulated tags and the corresponding full-length sequences via PHYLOTU. Our previously developed approach for assessing clustering accuracy can be applied to these results to determine the statistical robustness of PHYLOTU-based pyrotag investigations of microbial diversity. Publicly available pyrotag data will be processed once the accuracy is verified. Similar simulations will be conducted to evaluate the sensitivity with which PHYLOTU processes mixed libraries consisting of both metagenomic-derived and pyrotag-derived 16S sequences.
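The simulation step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PHYLOTU pipeline: the function names (`simulate_pyrotag`, `simulate_library`, `rand_index`) are hypothetical, a tag is modeled as an error-free contiguous substring, and the Rand index stands in for the clustering accuracy measure mentioned above, whose details are not given here.

```python
import random
from itertools import combinations


def simulate_pyrotag(full_length_seq, tag_length=400, start=None, rng=random):
    """Return a contiguous fragment of a full-length 16S sequence.

    A fixed `start` mimics amplification from a primer site; if `start`
    is None, the start position is drawn uniformly at random.
    """
    if tag_length > len(full_length_seq):
        raise ValueError("tag_length exceeds sequence length")
    max_start = len(full_length_seq) - tag_length
    if start is None:
        start = rng.randint(0, max_start)
    if not 0 <= start <= max_start:
        raise ValueError("start is out of range for this sequence")
    return full_length_seq[start:start + tag_length]


def simulate_library(reference_seqs, tag_length=400, start=None, seed=0):
    """Simulate one tag per named reference sequence (dict name -> sequence)."""
    rng = random.Random(seed)
    return {
        name: simulate_pyrotag(seq, tag_length, start, rng)
        for name, seq in reference_seqs.items()
    }


def rand_index(labels_a, labels_b):
    """Fraction of sequence pairs on which two clusterings agree.

    Both arguments map sequence name -> cluster label and must share keys.
    This is a stand-in agreement measure, assumed for illustration.
    """
    names = sorted(labels_a)
    if set(names) != set(labels_b) or len(names) < 2:
        raise ValueError("clusterings must cover the same two or more names")
    pairs = list(combinations(names, 2))
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)
```

In use, the simulated tags and the full-length sequences would each be clustered by the tool under evaluation, and the two resulting label dictionaries passed to `rand_index`; a value near 1 would indicate that clustering from tags recovers the full-length clustering.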

Outcomes. Because pyrotag analysis is fast becoming a standard approach to surveying microbial diversity, we expect this work to have profound implications should PHYLOTU process pyrotags more accurately than PID-based methods. Verifying PHYLOTU's unique ability to process mixed sequence libraries will enable in-depth explorations of diversity, as pyrotag and metagenomic sequences are subject to different methodological biases.

## Binning reads from next-generation sequencing (Darling, et al.)

Motivation

Taxonomic identification of metagenomic sequence reads, or "binning," has long been considered a hard computational problem. Current binning methods such as MEGAN, PhyloPythia, and PhymmBL typically use only a small number of the available information sources to determine the identity of the reads in the sample. Moreover, reads typically get classified independently, but in samples generated by high-throughput sequencing methods, many of the reads are likely to hail from the same or very similar organisms and could therefore be considered to share taxonomic information.

An updated approach for classifying metagenomic sequence reads will use traditional information sources such as sequence composition and phylogenetic signal in addition to information such as:

• Shared phylogenetic signal across reads
• Mate-pair information
• Shared changes in abundance across samples

Although the value of such information may seem obvious in retrospect, this has not always been the case. Modeling shared information among reads has the most value when the datasets contain an extremely large number of short reads, and the technology to produce such datasets has only existed for two years. The cost of producing such datasets is so low that they are extremely attractive for metagenomics. However, the resulting datasets can be especially difficult to analyze with the previous generation of metagenomic binning software due to the short read length. Modeling the shared information among reads may be the only feasible way to analyze short-read datasets.

Approach

We call our approach "metagenomic cobinning" because reads are "cobinned," or binned together. In its simplest form, our method takes as input an alignment of sequences $A$, some of which may be full-length reference sequences $s_1, s_2, \ldots, s_M \in S$, and some of which are metagenomic sequence reads to be classified, $r_1, r_2, \ldots, r_N \in R$. We assert that the metagenomic reads $R$ should be partitioned into $k$ taxonomic bins, where the membership of reads in bins is unknown a priori. Each bin defines a cluster of reads that come from the same taxon in the phylogenetic tree. We then assert that a phylogenetic tree $T$ relates the organisms in the sample. The tree, whose topology and branch lengths are also unknown a priori, has $M+k$ leaf taxa. That is, there is one taxon for every reference sequence and one taxon for every bin. We now describe how to infer the partitioning of the metagenomic reads $R$ into $k$ bins while also inferring the shape of the tree $T$ that relates the reads and the reference sequences.

We perform inference on the reads in a Bayesian phylogenetic setting. As described above, we are attempting to infer the parameters $T$ and a $k$-partition of the reads $R$, call it $\pi$. We can define the probability of any particular combination of tree $T$ and partition $\pi$ as $P(T,\pi \mid D)$, where $D$ is our data, namely the alignment $A$ composed of sequences from $S$ and $R$. Using Bayes' rule, $P(T,\pi \mid D) = P(T,\pi)\,P(D \mid T,\pi)/P(D)$. Here $P(T,\pi)$ is the prior probability of a particular configuration of tree and partition over sequences and $P(D)$ is the marginal probability of the data, averaged over all possible values of $T$ and $\pi$ weighted by their prior probabilities. Finally, $P(D \mid T,\pi)$ is the probability that a particular tree and partition generated the data. Since the partitioning is independent of the tree, we can write this as $P(D \mid T)\,P(D \mid \pi)$. $P(D \mid \pi)$ is always 1, and $P(D \mid T)$ is the usual phylogenetic likelihood function.

Given the stochastic model outlined above, we can employ standard Markov-chain Monte Carlo techniques to calculate the posterior probability distribution over the parameters $T$ and $\pi$. Such techniques are well-known, with widely used software implementations including PhyloBayes, MrBayes, and BEAST.
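The inference target can be written compactly. The block below restates the posterior from the text and adds the standard Metropolis-Hastings acceptance probability as one example of how an MCMC sampler could move between states. The proposal density $q$ and the primed symbols for a proposed state are notation introduced here for illustration; they are not taken from the prototype.

```latex
P(T,\pi \mid D) \;=\; \frac{P(T,\pi)\,P(D \mid T,\pi)}{P(D)}
  \;\propto\; P(T,\pi)\,P(D \mid T,\pi)

\alpha\bigl((T,\pi) \to (T',\pi')\bigr) \;=\; \min\!\left(1,\;
  \frac{P(T',\pi')\,P(D \mid T',\pi')\; q(T,\pi \mid T',\pi')}
       {P(T,\pi)\,P(D \mid T,\pi)\; q(T',\pi' \mid T,\pi)}\right)
```

Because $P(D)$ cancels in the ratio, such a sampler never needs to evaluate the marginal probability of the data.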

Extensions to this basic model include the incorporation of multiple alignments $A_1, \ldots, A_j$, each of which may have a different tree $T_1, \ldots, T_j$, and estimation of linkage among reads present in the various $A_i$ and $T_i$.

Outcomes

A prototype implementation of the simplest form of cobinning has been developed by extending the BEAST software, and it appears to work. More effort will be required to extend the model to include the other information sources and to carefully scrutinize how well the method works and where it fails. We anticipate this work to be especially valuable and relevant for the large datasets generated by inexpensive current-generation sequencing methods such as the Illumina and ABI SOLiD sequencing systems.

### Ecologically Defined Taxa

Overview: Species delineation practice in bacterial systematics has suffered from a reliance on universal molecular cut-offs, such as 70% hybridization of genome content, or more recently, 99% identity at the 16S rRNA locus. The principal problem with these cut-offs is that they are ultimately calibrated to deliver species groupings that correspond to species designations based on broad metabolic and morphological characteristics. As a result, the species delineated by universal cutoffs are known to encapsulate large quantities of genetic and ecological diversity, far more than would typically be found within plant or animal species. The goal of this project is to develop methods that re-calibrate sequence-based species designations to produce groupings that correspond with ecologically distinct communities.

Motivation. In order to fully capture the genetic and functional diversity of the bacterial world, it is necessary to develop a systematics in which the species unit describes a commonality of ecological and physiological function among its members. Species based on universal molecular markers do not necessarily fit this criterion. Ecotypes, defined as ecologically homogeneous populations bounded by the competitive domain of recurrent adaptive mutations within their ecological niche, may meet this criterion, because they are expected to possess the dynamic properties attributed to species, such as a cohesive force limiting divergence within a population (periodic selection), a shared evolutionary history, and prognosis for continued coexistence (ecological distinctness). New methods of bacterial species demarcation, grounded in ecological and evolutionary theory, are necessary if we are to learn how (and perhaps whether or not) bacterial biodiversity is intrinsically organized into cohesive species-like lineages.

Approach. Ecotype Simulation is an algorithm capable of predicting the number and identity of ecotypes in a natural community based on DNA sequences alone, without the need for cultivation of the organisms. In any environment where high levels of bacterial diversity exist with little available ecological information, Ecotype Simulation can give a starting point from which further investigations into specific ecological differences can proceed. Due to the present uncultivability of the majority of bacterial taxa on earth, tools such as Ecotype Simulation are vital if we are to improve our understanding of bacterial biodiversity. Past studies have shown good correspondence between ecotypes predicted by Ecotype Simulation and ecologically distinct bacterial populations. Such ecotypes have been independently confirmed as ecologically distinct populations in numerous systems, including isolates from natural communities of Bacillus licheniformis and B. subtilis sampled from desert canyons in Israel, communities of Synechococcus from hot springs in Yellowstone, and clinical and environmental isolates of Legionella pneumophila. In each of these cases, Ecotype Simulation revealed numerous putative ecotypes contained within species named using the current conventions of bacterial systematics.

Outcomes. By applying the Ecotype Simulation algorithm to much larger data sets, including whole-genome and metagenomic data, we hope to detect the most recent products of speciation in bacterial communities. If we are able to successfully demarcate the fundamental units of diversity in a community based on DNA sequence alone, then the scientific community will gain a very powerful tool for characterizing the vast diversity of the bacterial world.

## What are they doing? (Jospin, Langille, Sharpton)

Overview: A key step in the analysis of any metagenomic data set involves the classification and prediction of function of proteins encoded by the metagenomes being sampled. Such classifications are typically done using similarity search methods (e.g., BLAST) and leveraging the "top hit" to any previously characterized protein. In studies of complete genomes, it has been found that such top hit methods are not ideal, and that phylogenetic classification of new sequences is a more accurate way to characterize them and predict functions. Such "phylogenomic" analyses are also clearly of value in metagenomic studies, yet they are challenging due to the fragmentary nature and massive volume of data being produced. We propose to expand upon our gene family work and develop a "phylogenomic" classifier and database to meet this need.

Motivation. Metagenomic reads are difficult to functionally annotate because they typically cover only part of an open reading frame. Hidden Markov models built from protein domain alignments (e.g., Pfam models) have been used to identify reads encoding various domains. While useful, this approach is constrained in that domains provide only limited resolution regarding which gene family a read belongs to and the particulars of its function.

Approach. The functional annotation of metagenomic reads can be improved by building similar models from full-length protein sequences. We demonstrated through a trial analysis that reads can be accurately classified into protein families using a prototype annotation workflow we have devised. Recently, we constructed a database of over 340,000 full-length protein family models. We will adapt the current software prototype to interface with this database of families. We will then engage in a statistical assessment of the classifier, using simulated data, to ensure that reads are not spuriously annotated as belonging to particular families. Finally, we will use the software we have developed to annotate publicly available metagenomic libraries.

Outcomes. This work enables researchers to determine which specific protein families are present in their metagenomic libraries. This in turn facilitates the characterization of the functional capabilities of the microbial community under investigation. This functional annotation software will be made freely available to the public, as will access to all database data, including our annotation results.

# New theory to link biodiversity to function

## Modeling Taxa Distributions (Ladau, O'Dwyer)

Overview: Empirical studies in ecology are uncovering vastly more information about the natural world than ever before, across global spatial scales and in unprecedented genomic detail. With this increase in the volume and resolution of data, the concurrent development of theory in ecology has become increasingly important. Theoretical frameworks in community ecology are always guided by empirical data, but they also extrapolate beyond our current knowledge, offering us predictive power in a changing world. Reaching back in the other direction, theory guides and informs the structure of future empirical studies (e.g., sampling strategies), allowing us to refine our predictions and potentially preempt the loss of biodiversity. The overall aim of these projects is to build and apply models for how microbial taxa, identified by PHYLOTU or other methods, are distributed in terms of both geography and ecologically relevant variables.

### Characterizing the ranges of marine microbes

Motivation. Despite the ubiquity of marine microbes, little is known about their distributions and community ecology. Are most taxa of marine microbes cosmopolitan, or do they have high rates of endemism? How many taxa of marine microbes are there on regional and global scales? At large scales, do communities of marine microbes assemble primarily through random, diffusive processes, or is community assembly affected by interspecific interactions? To answer these questions, specialized models and statistical tools are needed because of the sparse nature of metagenomic data.

Approach. Our general approach is to use a framework based on geometric probability theory. In the last year, we have begun to apply geometric probability theory towards understanding distance-decay relationships. Distance-decay relationships quantify how the taxonomic similarity between samples decays as a function of the distance between them. As such, they are a measure of beta diversity, one of the three main components of ecological diversity. By applying ideas from geometric probability, we have successfully predicted attributes of distance-decay relationships, and gained insight into the processes generating them. We believe that it will be possible to extend the geometric probability based approach to understanding other ubiquitous patterns of biodiversity, for instance species-area relationships and endemics-area relationships.
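To make the distance-decay relationship concrete, the sketch below computes pairwise similarity against pairwise distance for a set of samples and fits a simple decay curve. It rests on several assumptions that are not part of the geometric probability framework itself: similarity is measured with the Jaccard index, sample positions are planar coordinates with Euclidean distance, and the decay is taken to be exponential. All function names are hypothetical.

```python
import math
from itertools import combinations


def jaccard(taxa_a, taxa_b):
    """Jaccard similarity between two collections of taxon identifiers."""
    a, b = set(taxa_a), set(taxa_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def distance_decay_pairs(samples):
    """Return (distance, similarity) for every pair of samples.

    `samples` maps a sample name to ((x, y), taxa); coordinates are
    assumed planar, so Euclidean distance is used.
    """
    pairs = []
    for (_, (pos1, taxa1)), (_, (pos2, taxa2)) in combinations(
        sorted(samples.items()), 2
    ):
        pairs.append((math.dist(pos1, pos2), jaccard(taxa1, taxa2)))
    return pairs


def fit_exponential_decay(pairs):
    """Least-squares fit of ln(similarity) = ln(s0) - rate * distance.

    Pairs with zero similarity are dropped because their logarithm is
    undefined. Returns (s0, rate).
    """
    points = [(d, math.log(s)) for d, s in pairs if s > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two pairs with nonzero similarity")
    mean_d = sum(d for d, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in points)
    if sxx == 0:
        raise ValueError("all distances are identical")
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_d
    return math.exp(intercept), -slope
```

For ocean samples, great-circle distance would be a more natural choice than planar distance, and dropping zero-similarity pairs biases the fit when samples are very sparse, so the fitted rate from this sketch should be read as descriptive only.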

The geometric probability based approach is particularly useful for understanding marine microbial diversity because it yields estimators of biodiversity that are ideally suited for use with metagenomic data. For instance, it is essentially impossible to conduct a census of all microbes in an ecosystem, but our distance-decay models enable robust estimates of the shapes of microbial taxon ranges from sparse metagenomic samples at random locations in an ecosystem. This approach makes it possible to infer attributes of microbial ranges at large spatial scales despite the impossibility of measuring these quantities directly. By extending this theory and methodology, we will aim to estimate microbial distributions in three dimensions, for instance in volumes of ocean. Moreover, the extension of our framework to other patterns of biodiversity may yield estimators of different measures of biodiversity. For example, we hope to derive estimators of taxonomic richness at regional and global scales from samples collected at much smaller scales.

One reason that the development of these estimators is particularly germane and exciting is the number of new metagenomic studies currently underway. We believe that because of improved sampling schemes in these studies, the data sets that they generate will be highly useful for inferring attributes of microbial distributions and diversity. We intend to apply our estimators to these new data sets.

The geometric probability based approach can also yield important insights into the processes of microbial community assembly. Using dynamical modeling, we expect that we can generate predictions of attributes of range shapes under different scenarios of dispersal and community assembly. By comparing these predictions to estimates from metagenomic data made using our estimators, we hope to be able to gain insight into the ecological processes determining microbe distributions. Thus, for instance, we will be able to investigate whether the marine microbes appear to diffuse through the ocean, or are driven by specific currents and environmental conditions.

Outcomes. The development of the geometric probability approach to biodiversity will yield novel estimators of biodiversity, ideally suited to assessing fundamental components of the distributions of marine microbes. By assessing these distributions, it will be possible to infer large-scale processes of microbial community assembly. We plan to write user-friendly software to make these methods available to applied researchers. More generally, developing the geometric probability based approach will illuminate the processes generating ubiquitous patterns of biodiversity, and the links between those patterns.

Motivation. We have adapted a set of tools from theoretical and statistical physics known as 'field theory', and developed these tools into a new theoretical framework for community ecology. So far, we have explored two distinct applications of these methods. We began with communities structured by body-size (O'Dwyer et al. (2009) PNAS), where the size of an organism is taken to be the dominant factor in determining its demographics. This follows an existing body of work in ecology known as metabolic theory, a collection of theoretical derivations meshed with empirical data, telling us that body-size is a key determinant of metabolic rates, from unicellular organisms up to the largest mammals on the tree of life. Our work allowed us to combine these ideas from metabolic theory with stochastic population dynamics, and we used this combination to make novel community-level predictions.

The second strand of this framework introduces explicitly spatial processes, involving the dispersal or diffusion of organisms across a spatial environment (O'Dwyer & Green (2010) Ecology Letters). The spatial structure of biodiversity is crucial in determining potential shifts in the structure and function of ecological communities in response to environmental change, and investigating this spatial structure has been a major focal point of empirical ecology from its inception. Our work allows us to predict one of the classic patterns of spatial biodiversity, the Species-Area relationship. This relationship characterizes the non-linear increase in diversity with area sampled, and our relatively simple model predicts the shape of this relationship as a function of dispersal capability and speciation rate.
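For orientation, the species-area relationship is often summarized by the classical power-law form below. This is a textbook approximation shown only to make the term concrete. It is not the prediction of the spatial model described above, and the constants $c$ and $z$ are generic fitted parameters rather than quantities from that model.

```latex
S(A) \;=\; c\,A^{z}, \qquad \log S \;=\; \log c + z \log A
```

On log-log axes this form is a straight line with slope $z$, which gives a simple baseline against which a model that predicts the full shape of the curve can be compared.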

Each of these models offers an insight into the way ecological communities are structured, both across a spatial environment and in terms of a trait like body-size. But so far we are missing two crucial ingredients: the evolutionary relatedness among individuals and the explicit impact of environmental variability across space. Ultimately we would like to predict the effect of phylogenetic signal and environmental heterogeneity on patterns of biodiversity. Our models capture the broad-scale phenomena, but in order to say something more concrete about changes in diversity and function across an environmental gradient, we need to know how this environmental variability feeds into community-level processes. We propose the development of new biodiversity theory to incorporate environmental stochasticity, the impact of random variability in the environment, and environmental gradients.

Approach. A classic example of an ecologically-relevant environmental gradient is temperature. For example, there are substantial variations in temperature across both latitude and altitude, with important consequences for biodiversity. This variation illustrates how an environmental gradient can shape both variation in function, with different types of organisms optimized for life in different ranges of temperature, and also patterns of taxonomic and phylogenetic diversity, with an increase in biodiversity with temperature documented in hundreds of studies of the latitudinal and altitudinal diversity gradients.

As a first step towards developing ecological theory to describe environmental diversity gradients, we propose extending our spatial model of community dynamics to take account of both fluctuations and large-scale variation in temperature. We will again draw from the body of research on metabolic theory, which shows a strong dependence of metabolic and demographic rates on temperature, and also predicts a functional form for the survival rate of organisms around their optimal temperature. The mathematical formulation of this theory will draw on both our spatial and earlier size-structured models, and we anticipate developing predictions for varying species richness with temperature, variation in the species-area relationship with temperature, and turnover in both taxonomic diversity and trait-based diversity (where the functional trait here is the optimal temperature of an organism).
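The temperature dependence invoked by metabolic theory is usually written with a Boltzmann-Arrhenius factor. The expression below is the commonly cited general form, given here as background. The symbols $B_0$ (a normalization constant), $M$ (body mass), $E$ (an activation energy), $k_B$ (Boltzmann's constant) and $T$ (absolute temperature) follow the usual convention in that literature and are not defined in this proposal.

```latex
B \;=\; B_0\, M^{3/4}\, e^{-E/(k_B T)}
```

How this factor would enter the spatial model, and the functional form for survival around an organism's optimal temperature, are not specified here.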

We plan subsequently to generalize this framework in two directions. First, we would like to understand if the example of temperature variation could be leveraged to model more general environmental gradients, and their impact on correspondingly more general traits. Can we go beyond this to hypothesize and test the impact of particular environmental variables on functional genes, or suites of genes? And can we understand the circumstances under which ecosystems are most likely to lose, regain or develop new functions? A model of community assembly which takes into account the impact of environmental variability on taxonomic and functional diversity will be crucial in addressing these questions, and we will use patterns revealed by metagenomic data to guide the development of more general theoretical models.

A more far-reaching and longer-term goal is to develop theory to predict phylogenetic measures of diversity alongside functional and taxonomic diversity. Phylogenetic diversity extends the traditional description of a community in terms of different taxa to include the evolutionary relatedness of the individuals comprising it. It is thought to be a more meaningful measure of biodiversity than taxonomic diversity, with changes in phylogenetic diversity strongly impacting the evolvability and robustness of a community. Patterns of phylogenetic diversity have been studied both across space and along temperature and other environmental gradients, but there is currently no mechanistic theoretical framework to predict and understand the origin of these patterns, or guide future empirical studies. We plan to draw on an approach to spatial population genetics known as coalescent theory, in conjunction with our existing mathematical approach, to develop such a framework. Ultimately, we hope to develop a comprehensive understanding of the interconnection between functional, phylogenetic and taxonomic diversity and their dependence on key environmental factors.

Outcomes. We will develop a theoretical model of community assembly to predict key patterns of diversity across a temperature gradient. We hope to extend this to make predictions for more general patterns of trait-based biodiversity, and ultimately to begin to understand the interconnection between functional, phylogenetic and taxonomic diversity and its dependence on environmental variation.

## Microbial Biogeography: a phylogenetic approach (Kembel, D. Wu, Jospin)

Overview: The goal of this project is to develop microbial community diversity metrics to complement existing measurements that are based on taxonomic assignments or OTUs. We focus instead on phylogenetic diversity, a measure of the evolutionary relatedness among organisms living in ecological communities. We have shown that phylogenetic diversity reveals important, subtle distinctions that taxonomy-based methods miss. For example, total taxonomic richness of surface and intermediate depth ocean samples may be quite similar, while the surface communities tend to contain a cluster of closely related species and the deeper communities cover a much broader phylogenetic range. Insights such as this can vastly change interpretations regarding the processes shaping and maintaining biodiversity.

Motivation. We propose to use phylogenetic relationships among metagenomic reads from a wide variety of habitats to study the evolution of microbial habitat associations, and to provide baseline data that can be used with taxonomic, functional, and population genomic analyses to understand the evolution of microbial biogeography. This will build on existing methods developed to study phylogenetic diversity using metagenomic data, but will allow broader questions about microbial biogeography and the evolution of microbial ecology to be addressed.

Approach. Using models of phylogenetically informative marker genes from multiple taxonomic groups, we will identify phylogenetically informative sequences from metagenomic data sets. We will create phylogenetic trees linking sequences in metagenomic data sets, using the best performing methods for phylogenetic inference identified by S. Riesenfeld. This portion of the planned work has already been applied to several data sets, and will be carried out in future in collaboration with other iSEEM researchers interested in using phylogenetic diversity estimates for metagenomic data.

Once phylogenetic relationships among metagenomic reads have been estimated, we will calculate measures of phylogenetic diversity within and among communities for each gene family. Phylogenetic diversity estimates will be standardized to account for variation in numbers of sequences among samples as well as variation in rates of evolution among gene families.
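As an illustration of what such a calculation might look like, the sketch below computes Faith's phylogenetic diversity (the total branch length connecting a set of tips to the root) and standardizes for sequence count by rarefaction. Both choices are assumptions made for this example: the text above does not name a specific measure or standardization, the tree encoding (a dictionary of parent pointers) and function names are hypothetical, and variation in evolutionary rate among gene families is not addressed.

```python
import random


def faith_pd(tree, tips):
    """Sum of branch lengths on the paths from the given tips to the root.

    `tree` maps each node to (parent, branch_length); the root maps to
    (None, 0.0). Branches shared by several tips are counted once.
    """
    visited = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk towards the root, stopping at the first branch already counted.
        while node is not None and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def rarefied_pd(tree, tips, depth, n_draws=100, seed=0):
    """Mean Faith's PD over random subsamples of `depth` tips.

    Subsampling every community to the same depth makes values comparable
    across samples that contain different numbers of sequences.
    """
    tips = sorted(set(tips))
    if depth > len(tips):
        raise ValueError("depth exceeds the number of available tips")
    rng = random.Random(seed)
    draws = [faith_pd(tree, rng.sample(tips, depth)) for _ in range(n_draws)]
    return sum(draws) / n_draws
```

With trees of hundreds of thousands of tips, the repeated root-ward walks in this sketch would become a bottleneck, which is consistent with the need for specialized software noted below.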

We will apply these analyses to as many metagenomic data sets as possible, in order to extend this approach to a variety of different habitats, in contrast to the current analyses, which have focused on oceanic microbial communities. This will allow us to study how phylogenetic diversity varies across a broader range of habitat types and environmental conditions than previously possible. We will also identify nodes in the reference phylogeny constructed from fully sequenced microbial genomes whose descendants are over- or under-represented in different habitats, and identify the taxonomic identity of these nodes using an AMPHORA-like approach to map taxonomy onto the reference phylogeny.

Extending this approach to larger and multiple data sets will require the development of software that can handle phylogenetic trees with hundreds of thousands of tips. It will also require tools to link the phylogenetic placement of metagenomic reads with taxonomic and environmental information about environmental samples. While there are existing software tools that can perform these kinds of analyses, the nature of metagenomic data will likely require the development of new tools that can deal with the extremely large phylogenetic trees that will be generated.

Outcomes. This project will result in a comprehensive study of how microbial diversity varies along major spatial and environmental gradients, and will greatly increase our understanding of the nature and magnitude of phylogenetic diversity in microbial ecosystems, and the processes responsible for patterns of microbial diversity. Novel, open-source software tools and reference data sets will be generated by this project, which will allow our methods to be applied by other researchers to metagenomic data sets as they are generated in the future. Finally, several other collaborative projects among iSEEM researchers will make use of, or contribute to, the phylogenetic diversity analyses generated by this project.

# Novelty: Discovering new taxa, proteins, and functions in metagenomes

## Adaptation in Microbial Communities (Riesenfeld, Kembel, Sharpton)

Overview: One of the exciting promises of marine metagenomic data is the opportunity to study how microbial communities evolve in response to environmental pressures. By developing methods to quantify divergent and convergent evolution in gene families, and then modeling the associations of these metrics with ecological meta-data, we aim to uncover the genes, taxa, and environmental variables driving major adaptations in marine microbial communities.

### Detecting selection in metagenomic data

Motivation. In comparative genomics, traditional genome-based or gene-based studies have employed sophisticated statistical techniques in combination with phylogenetic analyses of complete sequences to detect genetic regions that are particularly slow- or fast-evolving in a specific taxonomic group, relative to the same genetic regions in other taxa or relative to other regions of the genome. We are in the process of developing related techniques that can be applied to metagenomic data, in order to, for example, use metagenomic sequences from many organisms, and possibly many environmental samples, to find signals of adaptive evolution in specific gene families. Using information about the function of the gene families, the taxonomic groups sampled, and the environments sampled, we may be able to then infer possible sources of evolutionary pressure.

Approach. A fundamental step in this type of analysis is the inference of evolutionary relationships from sequence data. Phylogenetic trees enable powerful analyses of taxonomy, community diversity, and evolutionary patterns. It is not obvious, however, how to quantify evolutionary distance between fragmentary, non-overlapping metagenomic sequence reads. A major iSEEM project led by Sam Riesenfeld has been to use comprehensive simulations to establish whether metagenomic phylogenies can be reliably constructed for individual gene families. Within iSEEM, this work has implications for the PHYLOTU pipeline built by Tom Sharpton, the analyses of phylogenetic diversity of communities led by Steve Kembel, and the search for novel protein subfamilies led by Tom Sharpton and Morgan Langille, but the impact extends well beyond iSEEM projects.

To obtain the first complete series of simulation results, we built a well-parameterized software pipeline, generated many gigabytes of simulated metagenomic data, built alignments from this data, inferred phylogenies from the alignments using multiple algorithms, and then statistically analyzed these results. Our results show that many factors, including the choice of phylogenetic inference method, sequence length, and gene family, all have a significant effect on the accuracy of constructed phylogenies. Different gene families appear to suffer from different, yet consistent, biases in branch-length estimation, and our simulations have been shown to be useful both in identifying and helping to correct for such biases. (These findings were recently presented in a talk at a major genomics conference, Biology of Genomes 2010).

Using parameter settings informed by the first series of results, we are currently in the middle of producing a refined, second series of simulation results for journal submission. The simulation software pipeline is complete and has been tested during both series of simulations. We are preparing to release a production-level version of it along with an applications note for journal submission.

Next, we will build on this groundwork by applying what we learned from the simulations to real metagenomic data. We will build phylogenies from metagenomic samples for a wide range of gene families using the best available methodology, as determined from the simulations. This step will leverage the protein family alignments generated by Guillaume Jospin. To improve our ability to detect accurate evolutionary signals from these noisy data, we will aggregate information across gene families and environmental samples. While our simulation work may enable us to overcome some of the major hurdles in studying the evolutionary dynamics of microbial communities, many fundamental issues remain to be addressed. Using these gene-family phylogenies of metagenomic sequence reads, we plan to:

• Use simulations to quantify branch-length biases for each gene family and then use these bias scores to design adjustments to potentially enable comparisons across gene families.
• Adapt statistical models from complete-sequence phylogenetic analysis to develop methods for identifying gene families or subfamilies that display unusual evolutionary patterns for certain taxonomic groups or environmental samples. This aim may require development of novel statistics for summarizing signals of selection, since traditional metrics may be sensitive to systematic errors introduced by short reads.
• Apply these statistical methods to gene families that are potentially providing adaptive functions in specific environments.
• Examine possible correlations among sequences that present unusual patterns of evolution, the function of the corresponding gene families, and metadata about the environment in order to infer possible sources of evolutionary pressure and interpret the evolutionary patterns.
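The first aim in the list above (quantifying branch-length bias per gene family and adjusting for it) can be sketched as follows. This is a minimal illustration, not the iSEEM pipeline: it assumes each simulation yields paired true and estimated branch lengths for a family, and the function names, the mean log2-ratio bias score, and the toy numbers are all hypothetical.

```python
import math
from statistics import mean

def bias_score(true_lengths, estimated_lengths):
    """Mean log2 ratio of estimated to true branch lengths (0 means unbiased)."""
    ratios = [
        math.log2(est / true)
        for true, est in zip(true_lengths, estimated_lengths)
        if true > 0 and est > 0  # skip zero-length branches, where the ratio is undefined
    ]
    return mean(ratios)

def correct_lengths(estimated_lengths, score):
    """Rescale estimated branch lengths by the family-specific bias factor."""
    factor = 2 ** score
    return [est / factor for est in estimated_lengths]

# Toy simulated data (hypothetical): (true lengths, estimated lengths) per family.
simulated = {
    "familyA": ([0.10, 0.20, 0.40], [0.20, 0.40, 0.80]),  # consistently overestimated
    "familyB": ([0.10, 0.20, 0.40], [0.05, 0.10, 0.20]),  # consistently underestimated
}

scores = {fam: bias_score(t, e) for fam, (t, e) in simulated.items()}
corrected_a = correct_lengths(simulated["familyA"][1], scores["familyA"])
```

A single multiplicative factor per family is the simplest possible adjustment; the simulations may well show that bias depends on branch depth or read length, in which case a richer model would be needed before comparing across families.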

Outcomes. Developing rigorous phylogenetic tests for selection in metagenomics data sets will enable entirely new types of questions to be asked from environmental sequencing data sets. These studies will go beyond describing taxa distributions and will begin to uncover what environmental and cell biological variables drive community assembly in different contexts.

### Protein family phylogenetic diversity and convergent evolution

Motivation. Previous work from our group and others has shown that evaluations of protein families based on phylogenetic diversity (PD) may reveal biological functions that play critical roles in the ecosystem under investigation. For example, it has been shown that comparisons of PD for "phylogenetic marker" genes versus "functional genes" can help identify possible cases of convergent evolution or lateral gene transfer. Most of the work in this area has focused on data from complete genome sequences. We propose here to adapt and expand such approaches to metagenomic data, which should greatly improve our ability to understand both the role biological functions play in shaping the ecosystem and the role the environment plays in shaping the evolution of biological function.

Approach. Metagenomic reads will be classified into protein families using our functional annotation software. Phylogenies will be built from these reads across multiple environmental samples using our metagenomic phylogenetic approaches. Since PD-based analyses require accurate estimates of phylogenetic branch length, we will apply the family-specific statistical corrections for branch-length bias described in the previous section on detecting selection. These phylogenies will be used to characterize the PD for each family in each sample, which in turn will be used to determine PD similarity between samples (e.g., how many lineages each pair of samples shares for a family). The PD sample similarity for each protein family will be compared to the PD sample similarity for the 16S family to identify those samples that are taxonomically disparate but functionally similar. We will then develop metadata correlation statistics to identify environmental parameters that may be associated with any identified functional convergence across samples.

To boost our ability to detect signals of functional convergence, we can check for correlations in the PD sample similarity between protein families that are in the same pathway. We may also employ other measures, in addition to the PD sample similarity, such as tree-based similarity metrics, adjusted for different family-specific mutation rates.
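As a rough illustration of the PD sample similarity described above, the sketch below computes Faith's PD for each sample on a small read phylogeny and then the fraction of branch length the two samples share (equivalent to one minus unweighted UniFrac). The tree encoding, the function names, and the toy tree are assumptions made for this example; the proposal does not commit to this particular similarity measure.

```python
def branches_to_root(tip, parent):
    """Nodes whose parent edge lies on the path from a tip to the root."""
    path = set()
    node = tip
    while node in parent:  # the root has no entry in `parent`
        path.add(node)
        node = parent[node][0]
    return path

def sample_branches(tips, parent):
    """Union of root-ward paths for all reads (tips) observed in one sample."""
    covered = set()
    for tip in tips:
        covered |= branches_to_root(tip, parent)
    return covered

def faith_pd(branches, parent):
    """Sum of branch lengths over a set of edges (each keyed by its child node)."""
    return sum(parent[node][1] for node in branches)

def pd_similarity(tips_a, tips_b, parent):
    """Shared branch length divided by total branch length spanned by both samples."""
    a = sample_branches(tips_a, parent)
    b = sample_branches(tips_b, parent)
    total = faith_pd(a | b, parent)
    return faith_pd(a & b, parent) / total if total else 0.0

# Toy tree (hypothetical): child -> (parent, branch length).
tree = {
    "X": ("root", 1.0), "Y": ("root", 1.0),
    "r1": ("X", 0.5), "r2": ("X", 0.5), "r3": ("Y", 0.5),
}
sample1 = {"r1", "r3"}
sample2 = {"r2", "r3"}

pd1 = faith_pd(sample_branches(sample1, tree), tree)
similarity = pd_similarity(sample1, sample2, tree)
```

Running the same calculation on a 16S phylogeny for the same pair of samples would give the taxonomic counterpart; pairs with low 16S similarity but high protein-family similarity are the candidates for functional convergence.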

Outcomes. This work will result in a fundamentally expanded view of natural functional diversity. In addition to enabling PD-based characterization of functional variation, researchers will be able to use this approach to identify functions that may have arisen in response to environmental stimuli. The software will be freely available to the public.

## Discovering Novel Genes and Functions in Environmental Samples (Sharpton, Langille, D Wu)

Overview: The identification of novel biological functions in the microbial world is important for both ecological understanding and industrial development. Metagenomic analysis has already revealed novel protein families, though it is difficult to identify specific function from sequence data alone. The goal of these projects is to leverage protein family profiles developed using whole genomes (see above) to mine metagenomic data sets for groups of related sequences that represent either functionally divergent subfamilies of known gene families or totally novel gene families that have not been identified through genome sequencing. These will be rich sources of candidate genes for discovering novel functions in marine ecosystems.

### Describing novel biological functions in nature via the identification of novel protein subfamilies

Motivation. Most of the work on functional novelty in genomes and metagenomes has focused on analyzing protein families as a unit. Yet this focus misses an important component of novelty: the functional diversity that exists within protein families. We propose here to develop methods for quantifying and characterizing within-family diversity in metagenomic data by focusing on protein subfamilies.

Approach. We will develop software that enables the automated detection of novel subfamilies from metagenomic data. Previous work using whole genome sequence data revealed that subfamilies are most accurately identified through phylogenetic approaches. We will use our phylogenetic metagenomic applications and our metagenomic read annotation workflow to cluster reads into protein families and construct metagenomic read phylogenies. (These phylogenies will be adjusted as needed to correct for family-based biases in branch-length estimates, as described in the previous section on detecting selection.) We will then adapt the aforementioned subfamily-identification tools developed using whole genome data to our phylogenies. The approach will be statistically evaluated using simulation data. Upon completion of the software, we will process publicly available data for the signatures of protein subfamily novelty.

Outcomes. Our software will facilitate the identification of putatively novel biological functions. It will also improve the characterization of functional diversity (e.g., how many subfamilies are present in a sample). The software will be freely available to the public as will the results of the subfamily scans in public data. By combining our discoveries with environmental metadata, we anticipate that our novel approach will enable us to identify functional innovations underlying microbial adaptations to specific environments.

### Characterization and prediction of unknown genes

Motivation. Perhaps one of the most frustrating aspects of genome and metagenome analysis is that for many protein families we cannot make any predictions of function using similarity search methods. Such "hypothetical" or "unknown" proteins represent a significant fraction of the proteins in most genomes or metagenomes (sometimes up to 50%). The percentage of "unknown genes" will probably continue to increase as sequencing technology continues to outpace the lab experiments that can shed light on these genes. This severely limits our ability to use metagenome data to understand communities. We propose here to extend some of the work from the initial iSEEM project to develop new computational approaches that will improve the ability to use and interpret unknown families in metagenomic data.

Approach.

• First, we will identify families that contain only proteins with unknown function. Proteins with unknown function in completed genomes are identified by a combination of searching for key words in their description ("hypothetical protein", "unknown function", etc.) and searching for those that do not contain hits to the PFAM database. Using the protein families that have already been constructed by the iSEEM project, those families that contain only unknown genes are listed as unknown protein families.
• The second step is to characterize and rank these unknown protein families using various measurements such as family size, family universality (percentage of species that contain this family), and phylogenetic diversity. The families can also be ranked by other metrics of scientific interest, such as being present only in pathogens or in certain environments (aquatic, terrestrial, host-associated, etc.). At first these rankings will be performed using completed genomes, but they will later be refined and better evaluated by extending the families to metagenomic datasets.
• The third step is to search for the presence of unknown gene families across various metagenomic datasets. Using metadata from the metagenomic samples will allow us to identify families that are present only in certain communities, which could possibly provide clues to the function of some families. Additionally, by clustering all protein families across many metagenomic samples, we hope to identify clusters of families that all have the same or similar function. If so, unknown families that cluster with known families could be annotated. If successful, this is a powerful method that does not require sequence similarity (the primary means by which genes are currently annotated) and will improve as the number of metagenomic datasets increases.
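The first two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (`family`, `species`, `description`, `pfam_hits`), the function names, and the toy data are hypothetical, and the sketch interprets "a combination" of the two criteria as requiring both a keyword match and the absence of PFAM hits.

```python
from collections import defaultdict

# Key words taken from the text; a real list would be longer.
UNKNOWN_KEYWORDS = ("hypothetical protein", "unknown function")

def is_unknown(protein):
    """Assumed rule: description matches a key word AND there are no PFAM hits."""
    description = protein["description"].lower()
    has_keyword = any(k in description for k in UNKNOWN_KEYWORDS)
    return has_keyword and not protein["pfam_hits"]

def unknown_families(proteins):
    """Families in which every member protein is of unknown function."""
    members = defaultdict(list)
    for protein in proteins:
        members[protein["family"]].append(protein)
    return {fam: ps for fam, ps in members.items() if all(is_unknown(p) for p in ps)}

def rank_families(families, n_species):
    """Rank by universality (fraction of species with the family), then by size."""
    rows = []
    for fam, ps in families.items():
        species = {p["species"] for p in ps}
        rows.append((fam, len(ps), len(species) / n_species))
    return sorted(rows, key=lambda row: (row[2], row[1]), reverse=True)

# Toy records (hypothetical); "PFxxxxx" is a placeholder, not a real accession.
proteins = [
    {"family": "F1", "species": "sp1", "description": "hypothetical protein", "pfam_hits": []},
    {"family": "F1", "species": "sp2", "description": "protein of unknown function", "pfam_hits": []},
    {"family": "F2", "species": "sp1", "description": "hypothetical protein", "pfam_hits": []},
    {"family": "F2", "species": "sp2", "description": "DNA polymerase subunit", "pfam_hits": ["PFxxxxx"]},
    {"family": "F3", "species": "sp3", "description": "Hypothetical protein", "pfam_hits": []},
]

unknown = unknown_families(proteins)
ranked = rank_families(unknown, n_species=3)
```

Phylogenetic diversity and environment-specific presence could be added as further columns in the ranking once the families are extended to metagenomic datasets.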

Outcomes. This project will result in two major outcomes. The first is a resource that will allow researchers to identify particular genes of unknown function that are of high interest due to their presence across the tree of life, their possible role in pathogenesis, or their contribution to species in particular environments. These families of high interest could be targeted for analysis using more traditional lab experiments to determine their function. The second is a completely novel method for predicting gene function that does not use sequence similarity and would, in theory, improve as more metagenomic datasets become available over time. This method would help annotate the vast number of proteins that we currently cannot annotate, which will otherwise continue to be a growing problem for biology.

### Protein family diversity

Motivation. Every new microbial genome or metagenome that is sequenced usually reveals a large number of new protein families. This is in essence a reflection of the still poor sampling we have of global microbial diversity. The same could be said of microbial taxa, yet we are gaining an understanding of taxa diversity (e.g., richness, beta diversity) through various metrics and estimators that measure taxa variation across space and time. We propose here to develop similar metrics for total protein family diversity to ask questions like "what environments have the highest protein family diversity" and "how many protein families are there".

Approach. We will treat protein families in much the same way that others have treated OTUs and will develop metrics around assays of these protein families as the unit of measure. Among the metrics we propose to test and develop:

• Richness estimators: rather than estimating the number of species in a sample, we will estimate the number of protein families based upon measures such as rarefaction curves of families vs. number of sequences
• Beta diversity measures: protein family presence and absence across environments can be treated much like OTU presence and absence, so environments can be compared and clustered based on ecological measures of beta diversity and other parameters
• Comparisons between different types of protein families
• Analysis of subfamilies based on Sharpton's subfamily divisions
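The first two metrics in the list above can be illustrated with standard ecological formulas applied to protein families instead of OTUs. The sketch below uses the analytical rarefaction formula (expected number of families in a random subsample of n sequences) and the Jaccard distance on presence/absence; these are offered as familiar starting points, not as the metrics the project will settle on, and the function names and toy counts are hypothetical.

```python
from math import comb

def rarefied_richness(family_counts, n):
    """Expected number of families seen in a random subsample of n sequences."""
    total = sum(family_counts.values())
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of sequences")
    denominator = comb(total, n)
    # For each family: 1 minus the probability that none of its sequences is drawn.
    return sum(1 - comb(total - count, n) / denominator
               for count in family_counts.values())

def jaccard_distance(families_a, families_b):
    """Beta diversity from presence/absence: 1 - |shared| / |union|."""
    union = families_a | families_b
    if not union:
        return 0.0
    return 1 - len(families_a & families_b) / len(union)

# Toy data (hypothetical): sequences assigned to each protein family in one sample.
counts = {"F1": 5, "F2": 3, "F3": 2}
curve = [rarefied_richness(counts, n) for n in range(sum(counts.values()) + 1)]

# Presence/absence of families in two environments.
env_a = {"F1", "F2", "F3"}
env_b = {"F2", "F3", "F4"}
distance = jaccard_distance(env_a, env_b)
```

Plotting `curve` against the subsample size gives the rarefaction curve of families vs. number of sequences; a curve that has not levelled off suggests that the sample's protein family richness is still undersampled.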

Outcomes. Analogs of species diversity measures for global protein family diversity.

### Combined metrics for assessing novelty in metagenomic samples

Motivation. A key aspect of analyzing metagenomic data involves assessing the degree of novelty of the samples: "Is there anything unusual in the sample?", "Is there anything in the sample we would not have expected based on the metadata or the phylogenetic types present?", "Are there organisms here that we would not expect?", or "How unusual is this sample?" In essence, these questions relate to measures of novelty in metagenomic samples. We propose here to build on, and further develop, quantitative metrics that measure different aspects of the novelty of metagenomic samples.

Approach.

• Use metrics for novelty such as those proposed above, including phylogenetic novelty (in essence, unique PD), novel subfamilies, and novel families.
• Add other metrics as published or developed.
• Compare measures of novelty vs. number of sequences (in essence, a rarefaction curve) and vs. biogeographical measures (e.g., distance, metadata).

## Developing iSEEM software and integrating with existing tools

Motivation. CAMERA did not work out, so we need to further develop our own software and also find better ways to release it.

Approach.

• Software engineering for scripts and iSEEM workflows
• Integration with QIIME, MOTHUR, and similar tools

Outcomes.