M Hilmi Muda

Two-Layer Multi-Class Classifiers for Remote Protein Homology Detection and Fold Recognition
ABSTRACT

Remote protein homology detection refers to the detection of structural homology in proteins where there is little or no sequence similarity. To detect protein structural classes from primary sequence information, homology-based methods have been developed, which can be divided into three types: discriminative classifiers, generative models for protein families, and pairwise sequence comparisons. Support Vector Machines (SVM) and Neural Networks (NN) are two popular discriminative methods. Recent studies have shown that SVM trains faster and is more accurate and efficient than NN. We present a comprehensive method based on two-layer multiclass classifiers. The first layer detects up to the superfamily and family levels of the SCOP hierarchy using binary SVM classification rules optimized directly for ROC area. The second layer uses a discriminative SVM algorithm with a state-of-the-art string kernel based on PSI-BLAST profiles, which leverages unlabeled data, and detects up to the fold level of the SCOP hierarchy. We evaluated the results using mean ROC and mean MRFP. Experimental results show that our approach significantly improves the performance of remote protein homology detection on all three datasets (SCOP 1.53, 1.67 and 1.73). Compared with the results produced by state-of-the-art methods, we achieved improvements in mean ROC of 0.03% on SCOP 1.53, 1.17% on SCOP 1.67 and 0.33% on SCOP 1.73.

INTRODUCTION

Advances in molecular biology in recent years, such as large-scale sequencing and the Human Genome Project, have yielded an unprecedented number of new protein sequences. The resulting sequences describe a protein in terms of the amino acids that constitute it; no structural or functional information is available at this stage. To a degree, this information can be inferred by finding a relationship (or homology) between new sequences and proteins for which structural properties are already known. Traditional laboratory methods of protein homology detection depend on lengthy and expensive procedures such as X-ray crystallography and Nuclear Magnetic Resonance (NMR). Since these procedures are impractical for the amount of data available, researchers increasingly rely on computational techniques to automate the process.

Accurately detecting homologs at low levels of sequence similarity (remote protein homology detection) remains a challenge for biologists. Remote protein homology detection refers to the detection of structural homology in proteins where there is little or no sequence similarity. To detect protein structural classes from primary sequence information, homology-based methods have been developed, which can be divided into three types: discriminative classifiers (Ben-Hur and Brutlag, 2003; Jaakkola et al., 2000; Leslie et al., 2004; Liao and Noble, 2002; Saigo et al., 2004), generative models for protein families (Krogh et al., 1994; Park et al., 1998), and pairwise sequence comparisons (Altschul et al., 1990). Discriminative classifiers show superior performance compared with the other methods (Leslie et al., 2004; Rangwala and Karypis, 2005).

Support Vector Machines (SVM) and Neural Networks (NN) are two popular discriminative methods. Recent studies have shown that SVM trains faster and is more accurate and efficient than NN (Ding and Dubchak, 2001). SVM-based classification differs from generative models and pairwise sequence comparisons in that it removes the amino acid sequence from the prediction step. The protein sequences are transformed into feature vectors, which are then used to train an SVM to identify protein families. Feature vectors map the sequences into a multivariate representation and do not depend on a single pairwise score. The performance of remote protein homology detection has been further improved through methods that explicitly model the differences between the various protein families (classes) and build discriminative models. In particular, a number of methods that build these discriminative models on SVM have been shown, provided there are sufficient training data, to produce results that are in general superior to those produced by pairwise sequence comparisons or methods based on generative models (Ben-Hur and Brutlag, 2003; Jaakkola et al., 2000; Leslie et al., 2004; Liao and Noble, 2002; Saigo et al., 2004; Kuang et al., 2005; Hou et al., 2003).

Motivated by the ideas and work of Rangwala and Karypis (2006) and Ie et al. (2004), we further study the problem of building SVM-based multiclass classification models for remote protein homology detection in the context of the Structural Classification of Proteins (SCOP; Murzin et al., 1995) classification scheme. We present a comprehensive method based on two-layer multiclass classifiers. The first layer detects up to the superfamily and family levels of the SCOP hierarchy using binary SVM classification rules optimized directly for ROC area. The second layer uses a discriminative SVM algorithm with a state-of-the-art string kernel based on PSI-BLAST profiles to leverage unlabeled data, and detects up to the fold level of the SCOP hierarchy. Details are explained in the methods section. We evaluated our results using mean ROC and mean MRFP. Experimental results show that our approach significantly improves the performance of remote protein homology detection.
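The two ideas above that are easiest to make concrete are the mapping of a sequence to a feature vector and the scoring of one binary classifier by ROC area and MRFP. The following is a minimal sketch only, not the code used in this work: it uses a plain k-mer count (spectrum) representation as a stand-in for the profile-based string kernel, and it assumes the usual definition of the median rate of false positives (the fraction of negative test sequences scoring at least as high as the median-scoring positive). The function names `kmer_spectrum`, `roc_area` and `median_rfp` are hypothetical.

```python
import statistics
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_spectrum(sequence, k=2):
    """Map a protein sequence to a fixed-length vector of k-mer counts.

    Assumption: a simple spectrum representation, used here only to
    illustrate the sequence-to-feature-vector step.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(p)] for p in product(AMINO_ACIDS, repeat=k)]


def roc_area(scores, labels):
    """Area under the ROC curve for one binary classifier.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example scores higher; ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def median_rfp(scores, labels):
    """Fraction of negatives scoring at least as high as the median positive."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    median_pos = statistics.median(pos)
    return sum(1 for n in neg if n >= median_pos) / len(neg)


# Toy example: SVM-style decision values for four test sequences,
# labelled +1 (member of the target class) or -1 (non-member).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, -1, 1, -1]
print(roc_area(scores, labels), median_rfp(scores, labels))
```

Mean ROC and mean MRFP are then the averages of these per-classifier values over all target classes in a dataset. A higher ROC area and a lower MRFP both indicate better separation of positives from negatives.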

Datasets

 * SCOP 1.53
 * SCOP 1.67
 * SCOP 1.73

Source Code

 * Binary Base Classifier: SVM Struct
 * SVM 2 Layer Codes: SVM 2 Layer

Contact Info


Mohd Hilmi Muda
Laboratory of Computational Intelligence and Biology (LCIB)
No. 204, Level 2 Industry Centre, Technovation Park
Universiti Teknologi Malaysia (UTM)
Jalan Pontian Lama, 81300 Skudai, Johor, MALAYSIA
Mobile: +6012-626-0678
Tel/Fax: +607-559-9230
E-mail: mrhilmi@gmail.com

Education

 * 2008, BSc (Computer Science), Universiti Teknologi Malaysia.
 * 2005, Dip (Computer Science), Universiti Teknologi Malaysia.

Research Interests

 * 1) Remote Protein Homology Detection
 * 2) Protein Fold Recognition
 * 3) Support Vector Machine (SVM)
 * 4) Multi-class Classification

Group Members - Bioinformatics

 * 1) Razib M. Othman, Supervisor.
 * 2) Jumail Taliba, PhD Student (Protein-Protein Interaction Prediction).
 * 3) Rohayanti Hassan, PhD Student (Protein Tertiary Structure Prediction).
 * 4) Shahreen Kasim, PhD Student (Gene Expression Analysis).
 * 5) Zuraini Ali Shah, PhD Student (Gene Expression Analysis).
 * 6) Mohamad Firdaus Abdullah, MSc Student (Remote Protein Homology Detection).
 * 7) Rosfuzah Roslan, MSc Student (Protein-Protein Interaction Prediction).
 * 8) Surayati Ismail, MSc Student (Remote Protein Homology Detection).
 * 9) Umi Kalsum Hassan, MSc Student (Protein Domain Detection).