User:Morgan G. I. Langille/Notebook/Unknown Genes


Motivation
One of the most frustrating aspects of genome and metagenome analysis is that, for many protein families, we cannot predict function using similarity search methods. Such "hypothetical" or "unknown" proteins represent a significant fraction of the proteins in most genomes or metagenomes (sometimes up to 50%). The percentage of "unknown genes" will probably continue to increase as sequencing technology outpaces the lab experiments that can shed light on these genes. This severely limits our ability to use metagenome data to understand communities. We propose here to extend some of the work from the initial iSEEM project to develop new computational approaches that will improve the ability to use and interpret unknown families in metagenomic data.

Approach

 * First, we will identify families that contain only proteins with unknown function. Proteins with unknown function in completed genomes are identified by a combination of searching for key words in their description ("hypothetical protein", "unknown function", etc.) and searching for those that have no hits to the Pfam database. Among the protein families already constructed by the iSEEM project, those that contain only unknown genes are then listed as unknown protein families.
 * The second step is to characterize and rank these unknown protein families using measurements such as family size, family universality (the percentage of species that contain the family), and phylogenetic diversity. The families can also be ranked by other metrics of scientific interest, such as being present only in pathogens or only in certain environments (aquatic, terrestrial, host-associated, etc.). At first these rankings will be computed from completed genomes; they will then be extended and better evaluated by mapping the families onto metagenomic datasets.
 * The third step is to search for the presence of unknown gene families across various metagenomic datasets. Using the metadata from the metagenomic samples will allow us to identify families that are present only in certain communities, which could possibly provide clues to the function of some families. Additionally, by clustering all protein families across many metagenomic samples, we hope to identify clusters of families that share the same or a similar function. If such clusters exist, unknown families that cluster with known families could be annotated. If successful, this would be a powerful method that does not require sequence similarity (the primary method by which genes are currently annotated) and that will improve as the number of metagenomic datasets increases.
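
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's actual pipeline: the record layout, the function names, and the keyword list are hypothetical; "combination" is read here as requiring both a keyword match and no Pfam hit; and the clustering of step three is replaced by a much simpler nearest-neighbour search using Pearson correlation of per-sample abundance profiles.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical keyword list; the text names these two phrases and "etc.".
UNKNOWN_KEYWORDS = ("hypothetical protein", "unknown function")


def is_unknown(description, has_pfam_hit):
    """Assumption: unknown = keyword in description AND no Pfam hit."""
    text = description.lower()
    return (not has_pfam_hit) and any(k in text for k in UNKNOWN_KEYWORDS)


def unknown_families(proteins):
    """Step 1: return the IDs of families whose members are all unknown.

    `proteins` is a list of (family_id, species, description, has_pfam_hit).
    """
    all_unknown = {}
    for family, _species, description, has_pfam in proteins:
        flag = is_unknown(description, has_pfam)
        all_unknown[family] = all_unknown.get(family, True) and flag
    return {family for family, ok in all_unknown.items() if ok}


def rank_families(proteins, families):
    """Step 2: rank families by universality, then by size.

    Universality is the fraction of all species that contain the family.
    Returns a list of (family_id, size, universality), best first.
    """
    all_species = {species for _f, species, _d, _p in proteins}
    size = defaultdict(int)
    species_seen = defaultdict(set)
    for family, species, _d, _p in proteins:
        if family in families:
            size[family] += 1
            species_seen[family].add(species)
    rows = [
        (fam, size[fam], len(species_seen[fam]) / len(all_species))
        for fam in families
    ]
    return sorted(rows, key=lambda r: (-r[2], -r[1], r[0]))


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def closest_known_family(unknown_profile, known_profiles):
    """Step 3 (simplified): find the known family whose abundance across
    metagenomic samples correlates best with an unknown family.

    `known_profiles` maps family_id -> list of per-sample abundances.
    Returns (family_id, correlation).
    """
    return max(
        ((fam, pearson(unknown_profile, prof))
         for fam, prof in known_profiles.items()),
        key=lambda pair: pair[1],
    )
```

Phylogenetic diversity and environment-specific rankings are left out because they need a species tree and sample metadata. A real analysis would also likely cluster all families jointly rather than compare one unknown family against each known family in turn.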

Outcomes
This project will result in two major outcomes. The first is a resource that allows researchers to identify genes of unknown function that are of high interest because of their presence across the tree of life, their possible role in pathogenesis, or their contribution to species in particular environments. These high-interest families could be targeted with more traditional lab experiments to determine their function. The second is a novel method for predicting gene function that does not use sequence similarity and that should, in theory, improve as more metagenomic datasets become available. This method would help annotate the vast number of proteins that we currently cannot annotate, a problem for biology that will otherwise continue to grow.