Umie Kalsum

PREDICT PROTEIN DOMAIN FROM SECONDARY STRUCTURE INFORMATION BASED ON SUPPORT VECTOR MACHINES
ABSTRACT

Background: Protein domains are units of protein structure. Predicting them is important for several reasons, including predicting the function of a protein and determining protein domain boundaries and regions so that new proteins with new functions can be manufactured.

Methods: We propose a method named SSI-SVM that predicts protein domains from a protein sequence and protein secondary structure information, using SCOP 1.73 to generate the datasets. In SSI-SVM, a BRNN is applied to predict the protein secondary structure. Measures of entropy, correlation, protein sequence termination, contact profile, protein secondary structure, physio-chemical properties and intron-exon boundaries are defined to predict sequence information from this secondary structure. An SVM is trained on the scores obtained from these measures and is used to classify the protein domain into single-domain, two-domain and multiple-domain classes.

Results: The performance of SSI-SVM is measured in terms of sensitivity and specificity. SSI-SVM is evaluated by comparing it with existing methods such as DOMpro, GlobPlot, Dompred-DPS, Mateo, Biozon, Armadillo, HMMPfam, HMMSMART and AutoSCOP. The analysis shows that SSI-SVM performs better than the other protein domain prediction methods.

Conclusions: SSI-SVM predicts protein sequence information from the protein secondary structure produced by the BRNN in order to detect protein domains, and classifies the protein domains with an SVM. SSI-SVM shows outstanding performance on single-domain, two-domain and multiple-domain prediction.

Background

A protein domain is the basic unit of protein structure: it has its own shape and function and exists independently of the rest of the protein sequence.
Each protein domain forms a compact, folded structure that is independently stable, even though the domain is only part of the protein sequence. Because protein domains are modular, the same domain can often be found in different proteins, or in proteins with the same domain content but in a different order. A protein domain is delimited by protein domain boundaries, which are defined over the amino acid residues: each residue in the protein chain is assigned a domain position. Knowledge of protein domain boundaries is important for analyzing the different functions of protein sequences. However, there is no signal that indicates where a protein domain starts and ends. It is now important not only to detect protein domains accurately in large numbers of protein sequences of unknown structure, but also to detect the protein domain boundaries within each sequence. The main problem in protein domain prediction is predicting the protein domain boundaries from the protein sequence alone, since structural information is available for only a small portion of the protein space. For protein domain classification, many earlier methods, such as DomNet (Paul et al. 2007), Armadillo (Dumontier et al. 2005) and DOMpro (Cheng et al. 2006), apply neural network algorithms. The drawback of neural networks is that they follow a heuristic path, with applications and extensive experimentation preceding theory. They also depend on the dimensionality of the input space and use empirical risk minimization in the prediction. These limitations can lead to inaccurate prediction of protein domains.

In the last few years, ab initio methods that use machine learning algorithms have been widely used to predict protein domains. These methods are based on an understanding of how a given protein sequence attains its 3D structure.
One drawback of this approach is that it is computationally intensive (Cheng et al. 2006). Most previous ab initio methods produce good results for single-domain prediction, and they are based on similarity of multiple sequence alignments, protein sequence structure, dimensional structure or a model. Methods based on similarity of multiple sequence alignments, such as SVM-Fold (Melvin et al. 2007), EVEREST (Portugaly et al. 2006) and Biozon (Nagarajan and Yona, 2004), depend on BLAST (Altschul et al. 1997) or Clustal (Chenna et al. 2003) to detect protein domains. Methods based on protein sequence structure, such as AutoSCOP (Gewehr et al. 2007) and CATH (Orengo et al. 1997), depend on known protein structures to identify protein domains. Methods that use dimensional structure to infer protein domains include PROMALS (Pei and Grishin, 2007), GlobPlot (Linding et al. 2003), Dompred-DPS (Marsden et al. 2002) and Mateo (Lexa and Valle, 2003). Model-based methods such as HMMPfam (Bateman et al. 2004) and HMMSMART (Ponting et al. 1999) use Hidden Markov Models (HMMs) to identify other members of protein domain families.

To solve the problem, we propose a method that involves three phases: pre-processing, feature extraction and post-processing. Pre-processing has two parts: generating the dataset and predicting the secondary structure of each protein sequence. The predicted secondary structure is important because it strengthens the signal of the protein domain boundaries. Generating the dataset takes three steps: searching for seed protein sequences, finding protein data and performing multiple alignments. In the feature extraction phase, the information of the protein sequences is predicted based on the protein secondary structures. Lastly, the post-processing phase has two steps: classification by Support Vector Machines (SVM) and evaluation of the results. The SVM is used to classify the protein domains and identify the protein domain boundaries.
The performance of each classifier is evaluated in terms of sensitivity and specificity for single-domain, two-domain and multiple-domain prediction, and is compared with other methods such as DOMpro, GlobPlot, Dompred-DPS, Mateo, Biozon, Armadillo, HMMPfam, HMMSMART and AutoSCOP.
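
The sensitivity and specificity evaluation mentioned above can be illustrated with a minimal sketch. It assumes, since the text does not give the exact definitions, that each protein has one true and one predicted class label and that both measures are computed one-vs-rest per class, with sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function name `sensitivity_specificity`, the label strings and the toy labels are hypothetical and are not taken from SSI-SVM or its datasets.

```python
from collections import namedtuple

# Hypothetical container for the two measures of one class.
Score = namedtuple("Score", ["sensitivity", "specificity"])

# Assumed label names for the three classes named in the text.
CLASSES = ("single", "two", "multiple")


def sensitivity_specificity(true_labels, predicted_labels, classes=CLASSES):
    """Return one-vs-rest sensitivity and specificity for each class.

    Assumption: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    The original evaluation may use different definitions.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    scores = {}
    for cls in classes:
        tp = fn = fp = tn = 0
        for truth, guess in zip(true_labels, predicted_labels):
            if truth == cls and guess == cls:
                tp += 1  # class present and predicted
            elif truth == cls:
                fn += 1  # class present but missed
            elif guess == cls:
                fp += 1  # class predicted but absent
            else:
                tn += 1  # class absent and not predicted
        # Undefined ratios (no positives or no negatives) are reported as NaN.
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        scores[cls] = Score(sens, spec)
    return scores


# Toy labels made up for illustration only; these are not experimental results.
toy_true = ["single", "single", "two", "two", "multiple", "multiple"]
toy_pred = ["single", "two", "two", "two", "multiple", "single"]
toy_scores = sensitivity_specificity(toy_true, toy_pred)

for cls in CLASSES:
    print(cls, toy_scores[cls])
```

Scoring each class against the rest keeps the single-domain, two-domain and multiple-domain results separate, which matches how the comparison with the other methods is described. A different convention, for example specificity defined as TP / (TP + FP), would require changing only the two ratio lines.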

RESOURCES
 * 1) Dataset:
    * Testing Dataset
    * Training Dataset
 * 2) Supplementary Materials:
    * Hydrophobicity and molecular weight scores for the physio-chemical properties measure

Education

 * 1) Expected 2009, MSc in Computer Science (Major in Bioinformatics), Universiti Teknologi Malaysia
 * 2) 2006, BSc in Computer Science (Major in Information System), Universiti Teknologi Malaysia
 * 3) 2003, Diploma in Computer Science (Major in Information Technology), Universiti Teknologi Malaysia

Research Interests

 * 1) Protein domain prediction
 * 2) Support Vector Machines
 * 3) Neural Networks

Group Members

 * 1) Razib M. Othman
 * 2) Zuraini A. Shah
 * 3) Rohayanti Hassan

Contact Info
Kalsum U Hassan
Laboratory of Computational Intelligence and Biology (LCIB)
No. 204, Level 2 Industry Centre, Technovation Park
Universiti Teknologi Malaysia
81310 UTM Skudai, Johor Bahru, Malaysia
Email: ukalsum8@siswa.utm.my
Tel: +60123674711