Information Technology Applied to Bioenergy Genomics: Probabilistic Annotation using Artificial Intelligence
There is no doubt that one of the greatest challenges to mankind in this century is energy production. For geopolitical, economic and, most pressingly, environmental reasons, no conventional form of energy production is the solution, making renewable and environmentally friendly options mandatory (Metz et al., I.P.C.C. 2007).
All major global economies are organizing themselves to tackle this issue, and Brazil is no exception. Over the years, our country has become a key player in energy production from alternative sources, in particular fuel derived from the fermentation of biomass, also known as biofuel (Goldemberg, Biotechnol. Biofuels 2008). The pillar of this achievement is sugarcane (Saccharum sp.), an economically important crop.
Since it is recognized that biofuels may be part of the solution to such a pressing problem (Tilman et al., Science 2009), São Paulo State, the main biofuel producer in Brazil, launched an aggressive research program called [BIOEN - FAPESP Program for Research on Bioenergy http://www.fapesp.br/english/materia/472]. The program has several scientific and technological goals, covering environmental impacts, social impacts, next-generation fuels, production technology and more. Nevertheless, one of its most fundamental scientific goals is to sequence the sugarcane genome.
No challenge of this size in Molecular Genetics can be met without the aid of Information Technology (IT). The present proposal aims to develop new IT tools and methods, in the scope of Artificial Intelligence (AI) and Machine Learning (ML), to address some of the Bioinformatics questions raised by research on the Molecular Genetics of sugarcane. This proposal will be an explicit attempt to ignite the dialog between two major FAPESP programs running in parallel: the BIOEN Program and the Microsoft Research-FAPESP Institute for IT Research Program.
Specifically, the present proposal aims to develop new probabilistic annotation tools based on the AI/ML methodology known as Bayesian Networks (BN) to automatically assign putative functions to sugarcane genes identified by transcriptome projects (SUCEST and others previously funded by FAPESP) and by the genome project (an ongoing effort carried out by the BIOEN consortium). If successful, the developed IT tools may point out new biological functions for several sugarcane genes, revealing potential targets for crop enhancement through traditional Genetic Engineering or modern Synthetic Biology.
Briefly, the main scientific challenge in Automatic Probabilistic Annotation is to define the probability that a given gene belongs to a given functional category, instead of just claiming that it has a given function when it meets some arbitrary criteria, which is the most common approach nowadays.
A natural starting point for a research program in this field is the phylogenomics-based SIFTER method (Engelhardt et al., PLoS Comput Biol 2005; Engelhardt et al., ACM Int. Conf. Proc. 2006). Although currently recognized as one of the most powerful approaches, this method is still not used in a genome-wide fashion, especially for plants (Conte et al., BMC Genomics 2008; Jöcker et al., Bioinformatics 2008). The AI/ML methodology upon which SIFTER was built is known as Bayesian Networks (BN).
The innovation introduced by SIFTER was to use phylogenetic information to create a BN model. The BN topology for each gene is built to mimic a phylogenetic tree, even if this tree can only be built at a quality level below what a regular evolutionary analysis would require. Any information available for evolutionarily related proteins is then propagated through the tree by means of classical BN algorithms, using the phylogenetic tree as support. This procedure is an advance over the simple transfer of function motivated by naïve sequence similarity (BLAST-like approaches), which is still a common approach.
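The propagation idea can be illustrated with a minimal Python sketch. It is not SIFTER's actual model: the tree, the gene names (`query`, `geneA`, `geneC`), the two candidate functions and the conservation probability `STAY` are all hypothetical, and the function-change model is deliberately simplistic. It only shows how annotations on related proteins can yield a probability for an unannotated gene when the network follows the shape of a phylogenetic tree.

```python
# Illustrative only: a toy tree-shaped network, not SIFTER's real model.
FUNCTIONS = ("kinase", "phosphatase")  # hypothetical candidate functions
STAY = 0.9  # assumed probability that a function is conserved along one branch

# Hypothetical tree: internal node -> children. Leaves do not appear as keys.
TREE = {"root": ["anc1", "geneC"], "anc1": ["query", "geneA"]}


def transition(parent_fn, child_fn):
    """P(child has child_fn | parent has parent_fn) under the toy model."""
    if parent_fn == child_fn:
        return STAY
    return (1.0 - STAY) / (len(FUNCTIONS) - 1)


def subtree_likelihood(node, state, evidence):
    """P(evidence at or below node | node has function `state`)."""
    children = TREE.get(node)
    if children is None:  # leaf: compare with its annotation, if it has one
        observed = evidence.get(node)
        return 1.0 if observed is None or observed == state else 0.0
    result = 1.0
    for child in children:
        result *= sum(
            transition(state, s) * subtree_likelihood(child, s, evidence)
            for s in FUNCTIONS
        )
    return result


def posterior(query, evidence):
    """Posterior over the functions of leaf `query`, given annotated leaves."""
    prior = 1.0 / len(FUNCTIONS)  # assumed uniform prior at the root
    scores = {}
    for fn in FUNCTIONS:
        clamped = dict(evidence)
        clamped[query] = fn  # tentatively fix the query gene's function
        scores[fn] = sum(
            prior * subtree_likelihood("root", r, clamped) for r in FUNCTIONS
        )
    total = sum(scores.values())
    return {fn: score / total for fn, score in scores.items()}


if __name__ == "__main__":
    annotations = {"geneA": "kinase", "geneC": "phosphatase"}
    print(posterior("query", annotations))
```

In this toy setting the closer relative (`geneA`) weighs more heavily on the result than the more distant one (`geneC`), because its annotation reaches the query gene through fewer branches.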
BNs are knowledge networks represented by directed acyclic graphs, where the nodes represent variables with associated uncertainties and the edges represent dependencies among these variables. Using these networks, it is possible to calculate the probability of one event conditioned on others. They represent uncertainties using the framework of Probability Theory. Each node stores a conditional probability distribution over the values it can take, given the values taken by its parents, i.e., the nodes with edges pointing directly to it.
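As a minimal sketch of these definitions, the Python example below encodes a hypothetical three-node network (A is the parent of both B and C, all binary) with made-up probability tables, and computes a conditional probability by summing the factorized joint distribution over all assignments. Brute-force enumeration is used only for clarity; it is not the inference algorithm used by SIFTER or by practical BN software.

```python
from itertools import product

# Hypothetical network A -> B, A -> C; every number below is illustrative.
NAMES = ("A", "B", "C")
P_A = {True: 0.3, False: 0.7}  # A has no parents: a plain prior
P_B_GIVEN_A = {True: {True: 0.9, False: 0.1}, False: {True: 0.2, False: 0.8}}
P_C_GIVEN_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}


def joint(a, b, c):
    """Joint probability as the product of each node's table given its parents."""
    return P_A[a] * P_B_GIVEN_A[a][b] * P_C_GIVEN_A[a][c]


def conditional(query, evidence):
    """P(query | evidence); both arguments map variable names to values."""
    numerator = 0.0
    denominator = 0.0
    for values in product((True, False), repeat=len(NAMES)):
        assignment = dict(zip(NAMES, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # assignment contradicts the evidence
        p = joint(*values)
        denominator += p
        if all(assignment[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    # Observing C changes the belief about B through their shared parent A.
    print(conditional({"B": True}, {}))
    print(conditional({"B": True}, {"C": True}))
```

B and C are not directly connected, yet observing C still updates the probability of B, because both depend on A. This is the mechanism that lets evidence on one node inform another.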
Results up to March 2011
We have completed scripts that fully automate the SIFTER pipeline, with an average performance gain of about 72.5% (quad-core machine) and 67.7% (dual-core machine) relative to the original scripts supplied with the software. To achieve this, we changed the originally proposed pipeline and added new functions to the scripts, aiming at more user-friendly software and better detection performance; the latter is still under evaluation. The new pipeline is designed to enable the analysis proposed by the SIFTER methodology in a high-throughput setting.