Surayati Ismail

About the Current Research
High Sensitive Similarity Algorithm with Latent Semantic Analysis to Detect Remote Protein Homology
Abstract
In remote protein homology detection, Support Vector Machines (SVMs) are currently the most effective method for the problem of family recognition. However, researchers are still working to improve remote protein homology detection and produce better methods, so further study of SVMs in this setting is needed. In this study, two problems have been identified: protein sequences that are hard to align, and noisy data that can prevent the SVM from performing well. To address them, a new method known as SVM-LSAA is introduced. SVM-LSAA handles both problems with two main techniques: a high sensitive similarity algorithm and Latent Semantic Analysis. The performance of SVM-LSAA is tested using ROC, MRFP, and a family-by-family comparison of ROC measures on 54 families of the SCOP 1.53 dataset. SVM-LSAA shows better performance than the basic formalisms it is compared against.

Keywords: High Sensitive Similarity Algorithm, Latent Semantic Analysis, Protein Substring, Remote Protein Homology, Support Vector Machines.

Introduction
In remote protein homology detection, researchers have developed various methods and techniques to classify protein sequences into families according to their structure and function. The purpose of classification is to reduce the problem of overloaded protein sequences in biological databanks. Several kinds of model are used to classify protein sequences: sequence comparison models such as PSI-BLAST (Altschul et al., 1990) and FASTA (Pearson, 1990), generative models such as HMMER (Eddy, 1998) and SAM (Hughey and Krogh, 1996), and discriminative classifier models such as SVM-Fisher (Jaakkola et al., 2000) and SVM-Fold (Melvin et al., 2007). Each model is applied either alone or in combination with others: sequence comparison models have been combined with generative models, as in HMMStruct (Bernardes et al., 2007) and HMMstr (Hou et al., 2004), or with discriminative classifier models, as in SVM-Pairwise (Liao and Noble, 2003) and SVM-BALSA (Webb-Robertson et al., 2005). Some methods, such as SVM-SS (Zaki and Deris, 2007), apply all three kinds of model in order to get the best result. Within the discriminative classifier model, various techniques have been applied to produce the best method. The Support Vector Machine (SVM) is one such technique, and it has become very popular for classification problems in Bioinformatics (Golub et al., 1999; Brown et al., 2000; Furey et al., 2000). SVMs have been shown to perform well on binary classification. Examples of methods that apply SVMs are SVM-Fisher and SVM-Kernel (Saigo et al., 2004). However, some problems still exist in using discriminative classifier models to detect protein homology.
One of these problems is high dimensionality combined with noisy data: for datasets of moderate-length proteins, the feature vectors form a high-dimensional matrix containing thousands of noisy entries, which prevents the SVM from performing well. To handle this problem, our method includes a generative model as one of its steps. Generative models are probabilistic models utilized in pattern recognition problems (Chen and Chen, 2006). In this study, Latent Semantic Analysis (LSA) is applied to produce feature vectors with less noise. In LSA, Singular Value Decomposition (SVD) is used to filter the noise and reduce the dimensionality of the data: only the top dimensions, those whose elements are greater than a threshold, are kept for further processing. A second technique was added to handle hard-to-align protein sequences. Such sequences arise from multi-domain proteins, in which several amino acids from different domains may belong to one family. Recent methods such as SVM-Pairwise and SVM-LA (Henikoff and Wallace, 1998) consider only the frequencies and lengths of similar regions within proteins and do not take into account the biological relationships that exist between amino acids. Therefore, a high sensitive similarity algorithm is proposed. The algorithm consists of three steps: pairwise protein substring alignment, guide tree construction, and multiple protein substring alignment. The main step for tackling hard-to-align sequences is the pairwise protein substring alignment, which is augmented with an alignment-free approach. Its result is used to build the guide tree, and the multiple protein substring alignment is then carried out by following the guide tree. Basing the multiple protein substring alignment on the guide tree improves its sensitivity when highly diverged sequences are involved.
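To make the LSA step more concrete, the following is a minimal sketch of SVD-based dimension reduction on a word-document matrix. It is not the SVM-LSAA implementation. It assumes that "words" are overlapping 3-character substrings and that "documents" are whole sequences, and it keeps a fixed number of top singular dimensions (`rank`) instead of the threshold rule described above. The function names and the toy sequences are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def word_document_matrix(sequences, k=3):
    """Count k-length substrings: rows are 'words', columns are sequences."""
    vocab = sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})
    row = {word: r for r, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(sequences)))
    for col, s in enumerate(sequences):
        for i in range(len(s) - k + 1):
            matrix[row[s[i:i + k]], col] += 1
    return vocab, matrix

def lsa_vectors(matrix, rank):
    """Project each sequence (column) onto the top `rank` singular dimensions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    rank = min(rank, len(s))
    # Columns of diag(s_k) @ Vt_k are the sequences in the reduced latent space.
    return (np.diag(s[:rank]) @ vt[:rank]).T

def cosine(a, b):
    """Cosine similarity between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy input (hypothetical): two pairs of near-identical sequences.
sequences = ["MKVLAAGIVK", "MKVLSAGIVR", "GSHMTTPEQW", "GSHMTSPEQW"]
vocab, m = word_document_matrix(sequences)
vectors = lsa_vectors(m, rank=2)

print(vectors.shape)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy case the reduced vectors of the two similar sequences point in nearly the same direction, while vectors of unrelated sequences are close to orthogonal. In a pipeline like the one described, such reduced vectors would be the input to the SVM.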
In this paper, we introduce SVM-LSAA, a method capable of handling both noisy, high-dimensional feature vectors and hard-to-align protein sequences. SVM-LSAA combines four components: protein substrings, which are used to obtain a score based on sensitive and non-sensitive regions; the high sensitive similarity algorithm, which handles hard-to-align protein sequences; LSA, which extracts and represents the feature vectors while reducing noise and dimensionality; and the SVM, which classifies the vectors into positive and negative members. To measure the quality of the results, the Receiver Operating Characteristics (ROC), the Median Rate of False Positives (MRFP), and a family-by-family comparison of ROC scores are used. Experimental results show that SVM-LSAA produces better results than other methods such as PSI-BLAST, SVM-Pairwise, HMMER, SAM, SVM-Fisher, and SVM-I-Sites.
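The two score-based measures can be illustrated with a short sketch. It assumes the definitions commonly used in the remote homology literature: the ROC score is the area under the ROC curve, computed here as the probability that a positive test sequence outscores a negative one, and the MRFP is the fraction of negative test sequences that score as high as or higher than the median-scoring positive sequence. The function names and the discriminant scores are made up for illustration.

```python
from statistics import median

def roc_score(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def median_rfp(pos_scores, neg_scores):
    """Fraction of negatives scoring at or above the median positive score."""
    threshold = median(pos_scores)
    return sum(n >= threshold for n in neg_scores) / len(neg_scores)

# Hypothetical SVM discriminant scores for one family's test set.
positives = [2.1, 1.4, 0.3]
negatives = [1.6, -0.2, -1.0, -1.5]

print(roc_score(positives, negatives))   # 10 of 12 pairs ranked correctly -> 0.8333...
print(median_rfp(positives, negatives))  # 1 of 4 negatives >= 1.4 -> 0.25
```

A higher ROC score and a lower MRFP both indicate better separation of family members from non-members. Computing the two values per family gives the family-by-family comparison.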

Resources

 * 1) Dataset - SCOP 1.53.

Education

 * 1) 2007, BSc (Computer Science), Universiti Teknologi Malaysia.
 * 2) 2003, Dip (Computer Science), Institut Teknologi Perak.

Research interests

 * 1) Research in Bioinformatics: Remote Protein Homology Detection.
    * Noise removal and dimension reduction.
    * Redundant information reduction.
    * Latent semantic analysis.
    * Support vector machines.
    * Protein structural similarity.

Group members

 * 1) Razib M. Othman - Laboratory of Computational Intelligence and Biology (LCIB)
 * 2) Nazar M. Zaki - College of Information Technology, United Arab Emirates University
 * 3) Safie M. Yatim - Information Technology Centre, Universiti Teknologi Malaysia

Contact information
Surayati Ismail
Laboratory of Computational Intelligence and Biology (LCIB)
No. 204, Level 2
Industry Centre, Technovation Park
Universiti Teknologi Malaysia (UTM)
Jalan Pontian Lama
81300 Skudai, Johor, MALAYSIA
Tel/Fax: +607-559-9230
E-mail: surayatiismail@gmail.com