User:Hassan Uk

SplitSSI-SVM: An Algorithm to Reduce the Misleading and Increase the Strength of Domain Signal
ABSTRACT

Protein domains carry information for predicting protein structure, function, evolution and design, since a protein sequence may contain several domains, either different domains or repeated copies of the same domain. In this study, we propose a method named SplitSSI-SVM that works in the following steps. Firstly, training and testing datasets are generated to test SplitSSI-SVM. Secondly, the protein sequence is split into subsequences based on ordered and disordered regions. Protein sequences longer than 600 residues are split into subsequences to investigate the effectiveness of protein domain prediction based on subsequences. Thirdly, multiple sequence alignment is performed and used to predict the secondary structure with Bidirectional Recurrent Neural Networks (BRNN), which take the interactions between amino acids into account. The secondary structure information is used to strengthen the protein domain boundary signal. Lastly, Support Vector Machines (SVM) are used to classify the proteins into single-domain, two-domain and multiple-domain. SplitSSI-SVM is developed to reduce the misleading signal and the weak protein domain signal caused by the primary structure of the protein sequence, and to provide accurate classification of protein domains. The performance of SplitSSI-SVM is evaluated using sensitivity and specificity on single-domain, two-domain and multiple-domain proteins. The evaluation shows that SplitSSI-SVM achieves better results than other protein domain predictors such as DOMpro, GlobPlot, Dompred-DPS, Mateo, Biozon, Armadillo, KemaDom, SBASE, HMMPfam and HMMSMART, especially for two-domain and multiple-domain proteins.

INTRODUCTION

Protein domains can be seen as distinct functional or structural units of a protein. A protein domain can exist independently and carries its own specific function. Protein domains provide some of the most valuable information for the prediction of protein structure, function, evolution and design. Protein domains are detected from protein structure, which is in turn predicted from the amino acid sequence of the protein.

Several methods have been used to detect protein domains. A protein domain can be defined based on geometry, kinetics, physics or genetics. In geometry-based prediction, a protein domain is predicted as a group of residues with high contact density, where the number of contacts within a domain is higher than the number of contacts between domains, as in the work of Barenboim et al. (2008) and Han et al. (2006). In kinetics-based prediction, a protein domain is predicted as an independently folding unit, as in the work of Lei et al. (2008) and Natalya et al. (2008). HMMPfam (Bateman et al., 2004), HMMSMART (Ponting et al., 1999) and AutoSCOP (Gewer et al., 2007) have been applied in kinetics-based prediction to detect protein domains. In physics-based prediction, a protein domain is predicted from the flexible links between domains, as in the work of Katagiri et al. (2008) and Oldziej et al. (2005). Lastly, in genetics-based prediction, a protein domain is predicted based on pieces of a gene that encode a specific protein function, as in the work of Boa et al. (2006) and Garsia et al. (2008).

These methods produce good results for single-domain proteins. To improve multi-domain prediction, the protein sequence should be split into subsequences. In this paper, long protein sequences are split into subsequences based on the ordered and disordered structure of the protein sequence. Previous methods that split alignments at domain boundaries include ADDA (Heger et al., 2005) and the work of Walliser et al. (2008). Other methods that use the genetic definition to split protein domains are GlobPlot (Linding et al., 2003) and the work of Kurpiers and Mootz (2008). These methods showed that splitting the protein is effective for predicting multi-domain proteins.
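The splitting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SplitSSI-SVM implementation: the per-residue disorder flags are assumed to come from some external disorder predictor, the function names `disordered_runs` and `split_sequence` and the `min_run` parameter are hypothetical, and dropping the disordered residues between segments is a choice made here for simplicity. Only the 600-residue threshold comes from the text.

```python
from typing import List, Tuple

MAX_LEN = 600  # length threshold taken from the text


def disordered_runs(flags: List[bool], min_run: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index pairs of consecutive disordered residues."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, is_disordered in enumerate(flags):
        if is_disordered and start is None:
            start = i                      # a disordered run begins
        elif not is_disordered and start is not None:
            if i - start >= min_run:       # keep only runs that are long enough
                runs.append((start, i))
            start = None
    if start is not None and len(flags) - start >= min_run:
        runs.append((start, len(flags)))   # run extends to the end of the sequence
    return runs


def split_sequence(seq: str, flags: List[bool],
                   max_len: int = MAX_LEN, min_run: int = 1) -> List[str]:
    """Split a long sequence into ordered segments at disordered runs.

    Assumption: disordered runs are treated as linkers and left out of the
    returned segments. Sequences of at most max_len residues are not split.
    """
    if len(seq) != len(flags):
        raise ValueError("seq and flags must have the same length")
    if len(seq) <= max_len:
        return [seq]
    segments: List[str] = []
    prev = 0
    for start, end in disordered_runs(flags, min_run):
        if start > prev:
            segments.append(seq[prev:start])
        prev = end
    if prev < len(seq):
        segments.append(seq[prev:])
    return segments or [seq]               # fall back to the whole sequence


# Toy example with a lowered threshold so the split is visible.
toy_seq = "AAAAAGGGCCCCC"
toy_flags = [False] * 5 + [True] * 3 + [False] * 5
print(split_sequence(toy_seq, toy_flags, max_len=10))  # ['AAAAA', 'CCCCC']
```

Raising `min_run` would ignore very short disordered stretches, which may or may not be desirable depending on how noisy the disorder prediction is.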

Neural Networks (NN) have previously been used in protein domain detection, for example in Dompred-DPS (Marsden et al., 2002), Biozon (Nagaranjan and Yona, 2004), DOMpro (Cheng et al., 2006), Armadillo (Dumontier et al., 2005) and Mateo (Lexa and Valle, 2003). However, the Support Vector Machine (SVM) is an algorithm that can replace NN. The development of NN followed a heuristic path, with application and extensive experimentation preceding theory. The development of SVM, on the other hand, started from theory and was followed by implementation and experiments. Unlike NN, the SVM does not depend on the dimensionality of the input space. NN uses empirical risk minimization, whereas SVM uses structural risk minimization. SBASE (Kristian et al., 2005) and KemaDom (Lusheng et al., 2006) are examples of methods that apply SVM to protein domain prediction. The results from these methods are more accurate than those of the NN methods.

The method used in this research focuses on multi-domain prediction based on subsequences, secondary structure information and classification by SVM. The protein sequences are split based on disordered regions. The disordered regions of a protein sequence create inter-domain space and reflect the global arrangement of the protein domains in three dimensions. The disordered regions at which the protein sequences are separated are assumed to be linkers between protein domains. The secondary structure information is extracted using several measures: entropy, correlation, protein sequence termination, contact profile, protein secondary structure, physicochemical properties and intron-exon boundaries. SVM is used in this study to achieve accurate classification, since SVM is a learning machine that can perform pattern recognition and real-valued function approximation.
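As an illustration of one of the measures listed above, the following sketch computes a per-column entropy over a multiple sequence alignment. The text does not state which entropy definition SplitSSI-SVM uses, so this assumes the common Shannon entropy in bits with gap characters ignored; the function names `column_entropy` and `alignment_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of one alignment column, ignoring '-' gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0                      # an all-gap column carries no signal
    n = len(residues)
    counts = Counter(residues)
    # sum of p * log2(1/p) over the residue frequencies p = k / n
    return sum((k / n) * math.log2(n / k) for k in counts.values())


def alignment_entropy(alignment: List[str]) -> List[float]:
    """Entropy of every column of an alignment given as equal-length rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy("".join(col)) for col in zip(*alignment)]


# Toy alignment: a conserved column, a two-residue column, and an all-gap column.
toy_alignment = ["AC-", "AD-", "AC-", "AD-"]
print(alignment_entropy(toy_alignment))  # [0.0, 1.0, 0.0]
```

Low entropy marks conserved positions and high entropy marks variable ones; how such a profile is combined with the other measures into a feature vector is not described in this section.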

The proposed method, named SplitSSI-SVM, works in the following steps: (i) the training and testing datasets are prepared; (ii) protein sequences with more than 600 amino acids are split into segments; (iii) multiple sequence alignment is performed on the split protein sequences to obtain higher similarity, and is used to predict the secondary structure; (iv) the secondary structure is predicted using Bidirectional Recurrent Neural Networks (BRNN) [20] to increase the signal of the protein domain boundaries; (v) the features of each protein sequence are extracted in order to increase the domain signal; (vi) SVM is used to process the information and classify the proteins into single-domain, two-domain and multiple-domain; (vii) lastly, the performance of the protein domain prediction is evaluated based on sensitivity, specificity and accuracy. The results of the proposed method are compared with those of DOMpro, GlobPlot, Dompred-DPS, Mateo, Biozon, Armadillo, KemaDom, SBASE, HMMPfam, HMMSMART and AutoSCOP.
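The evaluation in step (vii) can be sketched as follows. The text names the measures but not their formulas, so this assumes the standard one-vs-rest definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP) and accuracy = (TP + TN) / N, computed separately for each of the three classes. The function name `per_class_metrics` and the label strings are hypothetical.

```python
from typing import Dict, Sequence

CLASSES = ("single-domain", "two-domain", "multiple-domain")


def per_class_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                      classes: Sequence[str] = CLASSES) -> Dict[str, Dict[str, float]]:
    """One-vs-rest sensitivity, specificity and accuracy for each class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    nan = float("nan")
    total = len(y_true)
    results: Dict[str, Dict[str, float]] = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        tn = sum(1 for t, p in pairs if t != c and p != c)
        results[c] = {
            # undefined ratios (no positives or no negatives) are reported as NaN
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "accuracy": (tp + tn) / total if total else nan,
        }
    return results


# Toy labels, not results from the paper.
truth = ["single-domain", "single-domain", "two-domain",
         "two-domain", "multiple-domain", "multiple-domain"]
preds = ["single-domain", "two-domain", "two-domain",
         "two-domain", "multiple-domain", "single-domain"]
for name, scores in per_class_metrics(truth, preds).items():
    print(name, scores)
```

Some domain prediction papers define specificity as TP / (TP + FP) instead; if that definition is intended, only the `specificity` line needs to change.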

RESOURCES
 * 1) Dataset:
 ** Testing Dataset
 ** Training Dataset
 * 2) Supplementary Materials:
 ** The score of hydrophobicity and molecular weight for the physicochemical properties measure

Education

 * 1) Expected 2009, MSc in Computer Science (Major in Bioinformatics), Universiti Teknologi Malaysia
 * 2) 2006, BSc in Computer Science (Major in Information System), Universiti Teknologi Malaysia
 * 3) 2003, Diploma in Computer Science (Major in Information Technology), Universiti Teknologi Malaysia

Research Interests

 * 1) Protein Domain Prediction
 * 2) Support Vector Machines
 * 3) Neural Networks
 * 4) Protein Subsequences

Group Members

 * 1) Zuraini A. Shah, MSc
 * 2) Razib M. Othman, PhD
 * 3) Rohayanti Hassan, MSc
 * 4) Shafry M. Rahim, PhD
 * 5) Hishammuddin Asmuni, PhD
 * 6) Jumail Taliba, MSc
 * 7) Zalmiyah Zakaria, MSc

Contact Info
Kalsum U. Hassan
Laboratory of Computational Intelligence and Biology (LCIB)
No. 204, Level 2 Industry Centre, Technovation Park
Universiti Teknologi Malaysia (UTM)
Jalan Pontian Lama
81300 Skudai, Johor, MALAYSIA
Mobile: +6012-769-7349
Tel/Fax: +607-559-9230
E-mail: umiekalsum@gmail.com