Utilizing shared interacting domain patterns and Gene Ontology information to improve protein-protein interaction predictions
Protein-protein interactions (PPI) play a significant role in many crucial cellular operations such as metabolism, signaling and regulation. Computational prediction methods for PPI have grown tremendously in recent years, but problems such as very high false positive rates have contributed to the lack of solid PPI information. We aimed to enhance the overlap between computational predictions and experimental results by partially removing false positive pairs from computationally predicted PPI datasets. This study introduces PFP(), a protein function prediction method based on shared interacting domain patterns, to supplement the Gene Ontology Annotation (GOA). We used GOA and PFP() as agents in the filtration process to reduce the false positives in computationally predicted PPI pairs. The functions predicted by PFP(), expressed as Gene Ontology (GO) IDs extracted from cross-species PPI data, were used to assign novel functional annotations to uncharacterized proteins and additional functions to proteins already characterized by GO. Because GOA is an ongoing process and a protein normally executes a variety of functions in different processes, implementing PFP() increased the chances of finding a matching function annotation for the first rule in the filtration process by as much as 20%. After filtration, a large number of false positive pairs had been removed from the predicted datasets. We used the signal-to-noise ratio as a measure of the improvement made by the proposed filtration process, while strength values were used to evaluate the applicability of the whole proposed computational framework to all the different computational PPI prediction methods.
PPI play critical roles in the control of most cellular processes and are central to biology, since they mediate the assembly of macromolecular complexes and the sequential transfer of information along signaling pathways. Many proteins involved in signal transduction, gene regulation, cell-cell contact and cell cycle control require interaction with other proteins or cofactors to activate those processes (Papin and Subramaniam, 2004; Tucker et al., 2001; Wang, 2002; Reš et al., 2005). In recent years, high-throughput technologies such as yeast two-hybrid (Y2H) and mass spectrometry of coimmunoprecipitated complexes (Co-IP; von Mering et al., 2002) have provided experimental methods to identify PPI on a large scale, generating a tremendous amount of PPI data. Several methods have previously been used to identify true interactions in high-throughput experimental data, such as paralogous verification methods (Deane et al., 2002), structurally known interactions (Edwards et al., 2002) and an interaction generality measure (Saito et al., 2003). Advances in experimental methods are paralleled by the rapid development of computational methods designed to detect vast numbers of protein pairs on a genome-wide scale. The major limitation of both the computational and experimental approaches is the low confidence in the PPI they identify, with high false positive and false negative rates (von Mering et al., 2002; Qi et al., 2006). Most efforts in computational approaches have focused on predicting more PPI by various means of identifying true positives. The result is larger predicted PPI datasets that contain not only more true positive predictions but also numerous false positive predictions.
Experimental PPI detection methods attempt to discover direct physical interactions between proteins, while computational PPI predictions often refer to functional interactions (Valencia and Pazos, 2002). Enhancing the true positive fraction of computationally predicted PPI datasets has not been adequately investigated. Many researchers have instead focused on improving the accuracy of the predicted datasets, meaning fewer false positives, either by refining a particular computational method (Wu et al., 2003; Sun et al., 2005; Huang et al., 2007) or by integrating several types of computational methods. One example of integration is the joint observation method (JOM; Marcotte et al., 1999; Chen and Xu, 2003), which calculates the accuracy and coverage of the PPI predicted by at least one, two, three or four methods using three positive datasets (KEGG, EcoCyc and DIP). Those four methods are Phylogenetic Profiles (PP), Gene Cluster (GC), Gene Fusion (GF) and Gene Neighbourhood (GN). Other examples are STRING (von Mering et al., 2003), which integrates combined scores for each pair of proteins, and InPrePPI (Sun et al., 2007), which integrates the scores of each protein pair obtained by the four methods. Whereas these researchers focused on improving the computational methods themselves, Mahdavi and Lin (2007) proposed a filtering algorithm that uses GOA alone (Camon et al., 2005). The removal of false positives depends on whether the predicted pairs satisfy heuristic rules that were developed from observations of how PPI occur in cellular systems. The result after filtration differs among the types of computational methods that were implemented. GOAs, which were used as a common ground for the filtration process, have been a popular and reliable source for studies concerned with validating or evaluating results, such as Patil and Nakamura (2005). 
In that work, GOAs were used as one of the means of assigning reliability to the PPI in yeast determined by high-throughput experiments. GO (Ashburner et al., 2000) has been utilized in several other studies concerning PPI. Rhodes et al. (2005) used GO terms to assess associations between the proteins in a pair, while Wu et al. (2006) constructed a PPI network for yeast by measuring the similarity between two GO terms with a relative specificity semantic relation. Hsing et al. (2008) used GO terms to predict highly connected 'hub' nodes in PPI networks, and Dyer et al. (2008) used GO to provide functional data for protein interactome sets, which also revealed interactions of human proteins with viral pathogens. Their GO analysis indicated that many different pathogens target the same processes in the human cell, such as regulation of apoptosis, even if they interact with different proteins. The GO structural hierarchy was also used by Lord et al. (2003) to evaluate functional associations.
Although GO is widely used in recent studies, it suffers from inconsistency within and between genomes. Because ontology annotation is an ongoing process, it is considered incomplete. One problem that arises from this limitation is that one protein may be assigned a term representing a broad type of activity while its interacting partner is assigned a more specific term. In some cases proteins have not been assigned terms in all three ontologies, which makes interaction assessment more difficult. A substantial portion of most genomes, such as those of D. melanogaster and H. sapiens, may also still be unannotated, and some proteins are still uncharacterized. Chen et al. (2008) stated that only about 54% of the D. melanogaster genes in the list downloaded from FlyBase (Crosby et al., 2007) as of November 2006 were annotated with molecular function terms in GO.
In this research, we aimed to enhance the overlap between computational predictions and experimental results through a confidence level that reflects how well a link is supported by both the experimental results and the computational predictions. We therefore proposed a computational framework that filters false positives out of the predicted PPI pairs, increasing the true positive fraction of the computationally predicted PPI dataset. Using GO as a common ground in the filtering process, we also implemented protein function assignment based on the shared interacting domain patterns extracted from cross-species PPI data, both to assign novel functional annotations to uncharacterized proteins and to predict extra functions for proteins that are already annotated in GO. The species whose PPI data were used to infer the uncharacterized or incomplete functions are S. cerevisiae, C. elegans, D. melanogaster and H. sapiens. To evaluate the improvement made by the proposed filtration process, the Signal-to-Noise Ratio (SNR; Fujimori et al., 1974) was employed, while the strength value (Mahdavi and Lin, 2007) was calculated to show the effect of the rules applied. The framework refines the computationally predicted datasets in a series of steps. First, a set of high-confidence S. cerevisiae PPI datasets was prepared as the experimental dataset, together with one newly updated PPI dataset for each of four species (C. elegans, D. melanogaster, H. sapiens and S. cerevisiae). Second, GOAs, aided by the GO functions predicted from the shared interacting domain patterns extracted from cross-species PPI data, were used to identify keywords that represent the general function categories of the proteins. Third, interaction rules that the predicted interacting proteins must satisfy were established. Next, four computational PPI prediction methods were selected for use in this study. 
Those methods are the conventional Phylogenetic Profiles (PP; Pellegrini et al., 1999), Gene co-Expression (GE; van Noort et al., 2003), Mutual Information (MI; Date et al., 2003) and Maximum Likelihood Estimation (MLE; Deng et al., 2002). A predicted PPI dataset was obtained for each of these computational methods. The false positive pairs in the predicted datasets were then removed by applying the interaction rules. If a predicted interacting pair satisfies the rules, it is considered a true positive pair; otherwise the pair is assumed to be a false positive and is removed from the dataset. The filtered datasets were then statistically evaluated and compared.
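The filtration step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the single rule used here (a pair is kept when the two proteins share at least one function keyword) is an assumed stand-in for the interaction rules, the protein names and keywords are toy data, and every function name is hypothetical. The overlap fraction at the end is a simple illustrative measure, not the SNR or strength value used in the study.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]
Annotations = Dict[str, Set[str]]


def merge_annotations(goa: Annotations, pfp: Annotations) -> Annotations:
    """Union GOA keywords with PFP()-style predicted keywords per protein."""
    merged = {protein: set(keywords) for protein, keywords in goa.items()}
    for protein, keywords in pfp.items():
        merged.setdefault(protein, set()).update(keywords)
    return merged


def satisfies_rule(pair: Pair, annotations: Annotations) -> bool:
    """Assumed rule: both proteins share at least one function keyword."""
    a, b = pair
    return bool(annotations.get(a, set()) & annotations.get(b, set()))


def filter_pairs(pairs: Iterable[Pair], annotations: Annotations) -> List[Pair]:
    """Keep pairs that satisfy the rule; the rest are treated as false positives."""
    return [pair for pair in pairs if satisfies_rule(pair, annotations)]


def overlap_fraction(pairs: List[Pair], experimental: Iterable[Pair]) -> float:
    """Fraction of predicted pairs also present in the experimental set."""
    known = {frozenset(pair) for pair in experimental}  # order-insensitive
    if not pairs:
        return 0.0
    return sum(frozenset(pair) in known for pair in pairs) / len(pairs)


# Toy data (hypothetical proteins and keywords).
goa = {"P1": {"signaling"}, "P2": {"signaling"}, "P3": {"metabolism"}}
pfp = {"P1": {"transport"}, "P4": {"metabolism"}}  # P4 is uncharacterized in GOA
predicted = [("P1", "P2"), ("P3", "P4"), ("P2", "P3"), ("P1", "P5")]
experimental = [("P2", "P1"), ("P4", "P3")]

kept_goa_only = filter_pairs(predicted, goa)
kept_with_pfp = filter_pairs(predicted, merge_annotations(goa, pfp))

print("GOA only:", kept_goa_only)
print("GOA + predicted functions:", kept_with_pfp)
print("Overlap before filtering:", overlap_fraction(predicted, experimental))
print("Overlap after filtering:", overlap_fraction(kept_with_pfp, experimental))
```

In this toy case the pair (P3, P4) survives only when the predicted annotations are merged in, because P4 has no GOA keywords of its own. This mirrors the motivation for supplementing GOA with predicted functions.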
- Research in Bioinformatics: Protein-Protein Interactions (PPI).
- PPI prediction.
- Gene Ontology.
- Interacting protein homolog.
- Interacting protein PFAM domain.
- Reducing false positive in PPI prediction.
- Increasing true positive in PPI prediction.
- 2010, MSc (Computer Science), Universiti Teknologi Malaysia
- 2008, BSc (Computer Science), Universiti Teknologi Malaysia
- 2005, Dip Computer Science (Information Technology), Universiti Teknologi Malaysia
- Roslan R., Othman R.M., Shah Z.A., Kasim S., Asmuni H., Taliba J., Hassan R., and Zakaria Z. (2010). Incorporating Multiple Genomics Features with the Utilization of Interacting Domain Patterns to Improve Protein-Protein Interactions Prediction. Information Sciences [Elsevier; impact factor 2009: 3.291] DOI: 10.1016/j.ins.2010.06.041.
- Roslan R., Othman R.M., Shah Z.A., Kasim S., Asmuni H., Taliba J., Hassan R., and Zakaria Z. (2010). Utilizing Shared Interacting Domain Patterns and Gene Ontology Information to Improve Protein–Protein Interaction Prediction. Computers in Biology & Medicine [Elsevier; impact factor 2009: 1.269] 40(6): 555-564.
Laboratory of Computational Intelligence and Biotechnology (LCIB)
No. 204, Level 2
Industry Centre, Technovation Park
Universiti Teknologi Malaysia (UTM)
Jalan Pontian Lama
81300 Skudai, Johor, MALAYSIA