User:Shahreen Kasim

From OpenWetWare
Shahreen Kasim

About The Current Research

CluFA: Incorporating Multiple Functional Annotation Databases with the Utilization of Gene Ontology in Fuzzy c-Means Clustering to Improve Gene Function Prediction

Background: Clustering gene expression data together with biological knowledge has become a standard technique for interpreting the data and its underlying biology. However, common clustering algorithms do not provide a comprehensive approach that covers all three categories of annotation (biological process, molecular function, and cellular component), and they have not been tested with different functional annotation database formats. Furthermore, traditional clustering algorithms use random initialization, which causes inconsistent cluster generation, and they are unable to determine the number of clusters involved.

Results: In this paper, we present a novel computational framework called CluFA (Clustering Functional Annotation) for semi-supervised clustering of gene expression data. The framework consists of three stages: (i) preparation of Gene Ontology (GO) datasets, functional annotation databases, and testing datasets; (ii) fuzzy c-means clustering to find the optimal clusters; and (iii) computational evaluation and biological validation of the results obtained. By combining the three GO term categories (biological process, molecular function, and cellular component) with functional annotation databases (the Saccharomyces Genome Database (SGD), the Yeast Database at the Munich Information Centre for Protein Sequences (MIPS), and Entrez), CluFA is able to determine the number of clusters and reduce random initialization. In addition, CluFA is more comprehensive in its capability to predict the functions of unknown genes.
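As an illustration of the clustering stage, the following is a minimal sketch of standard fuzzy c-means started from caller-supplied centers instead of random ones. It is not the CluFA implementation: the function name fuzzy_c_means, the toy expression profiles, and the two seed centers are assumptions made for this example, and how CluFA actually derives its seeds from GO and the annotation databases is not shown here.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, max_iter=100, tol=1e-6):
    """Standard fuzzy c-means started from given centers (no random start).

    Returns the final centers and the membership rows computed in the
    last iteration (one row per point, one column per cluster).
    """
    centers = [list(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(max_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point lies exactly on a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                memberships.append([h / sum(hits) for h in hits])
            else:
                memberships.append([
                    1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
                    for d_k in dists
                ])
        # Center update: mean of the points weighted by u_ik^m.
        new_centers = []
        for k in range(len(centers)):
            weights = [u[k] ** m for u in memberships]
            total = sum(weights)
            new_centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(len(points[0]))
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Toy expression profiles (hypothetical values) and hypothetical seed centers.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
seeds = [[0.0, 0.0], [5.0, 5.0]]
final_centers, memberships = fuzzy_c_means(profiles, seeds)
```

Because the starting centers are fixed, repeated runs on the same data give the same clusters, which is the property that seeding from annotation is meant to provide.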

Conclusions: We tested our new computational framework for semi-supervised clustering of yeast gene expression data based on multiple functional annotation databases. Experimental results show that 76 clusters were identified via the GO slim dataset. Applying the SGD, Entrez, and MIPS functional annotation databases to reduce random initialization improved performance in both computational evaluation and biological validation. Using the comprehensive GO term categories achieved the lowest compactness and separation values. We therefore conclude from this experiment that CluFA improves gene function prediction by using GO and gene expression values in the fuzzy c-means clustering algorithm and cross-referencing the results with the latest SGD annotation.


  1. Application: CluFA
  2. User's Guide: User's Guide
  3. Testing Datasets: Eisen's and Gasch's
  4. Functional Annotation Databases: SGD, Entrez, MIPS

Gene Ontology Based Biclustering In Gene Expression Data

One purpose of analyzing gene expression data is to support cancer classification and prognosis. Clustering has been introduced as a computational method to assist this analysis. However, existing clustering algorithms focus only on statistical similarity and visual presentation, neglecting biological similarity and the consistency of the annotation within each cluster. Furthermore, there are still complexity issues and difficulty in finding the optimal clusters. In this study, we propose a clustering algorithm named BTreeBicluster to overcome these problems. BTreeBicluster starts by building a GO tree and enriching it with expression similarity from the Saccharomyces genes. The BTreeBicluster algorithm is then applied to the enriched GO tree during the clustering process, taking a subset of the conditions of the gene expression dataset in discretized form. The annotation in the GO tree is therefore determined before the clustering process starts, which strongly influences the output clusters. The results of this study show that BTreeBicluster produces clusters with more consistent annotation.

Gene expression data has been widely used in bioinformatics analysis. Gene expression profiles are used, for example, in cancer classification: Sotiriou et al. (2003) studied breast cancer classification and prognosis, and Antonov et al. (2004) classified tumor samples based on microarray data with a procedure that detects groups of genes and constructs models (features) that strongly correlate with particular tumor types. Meanwhile, Xiong and Chen (2005) used an optimized kernel to increase the performance of classifiers on gene expression data.

Apart from classification, clustering is also a useful data-mining tool for discovering similar patterns in gene expression datasets, which may lead to insight into significant connections in gene regulatory networks. Cheng and Church (2000) introduced the concept of the bicluster, which captures the similarity of both genes and conditions. Meanwhile, Getz et al. (2000) introduced Coupled Two-Way Clustering (CTWC) analysis on colon cancer and leukemia datasets. Lazzeroni and Owen (2002) introduced plaid models, which are similar to cluster analysis and incorporate additive two-way ANOVA models within the two-sided clusters of yeast gene expression datasets. However, all of these works focus on the mathematical similarity of genes and conditions and do not pay attention to the biological process of each cluster. More recently, several biclustering methods have been introduced. A bicluster is a subset of genes that behave similarly in a subset of conditions, so the advantage of biclustering is that the genes in one cluster do not have to behave similarly across all conditions. Related work includes Samba (Tanay et al., 2002), which presents a graph-theoretic approach to biclustering in combination with a statistical data model. The Iterative Signature Algorithm (ISA) (Ihmels et al., 2004) considers a bicluster to be a transcription module, that is, a set of co-regulated genes together with the associated set of regulating conditions. In the Order Preserving Submatrix Algorithm (OPSM) (Ben-Dor et al., 2003), a bicluster is defined as a submatrix that preserves the order of the selected columns for all of the selected rows. In xMotif (Murali and Kasif, 2003), biclusters are sought in which the included genes are nearly constantly expressed across the selection of samples. All of these methods are computationally complex, since their optimization problems are NP-hard, and they do not yield optimal clusters.
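To make the bicluster idea concrete, the sketch below computes the mean squared residue, the coherence score used by Cheng and Church (2000), for a chosen subset of genes (rows) and conditions (columns). It illustrates that published score only and is not part of BTreeBicluster; the function name mean_squared_residue and the small expression matrix are assumptions made for this example.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix given by rows and cols.

    Each residue is a_ij - (row mean) - (column mean) + (overall mean),
    so a perfectly additive submatrix scores 0.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Hypothetical expression matrix: genes 0-2 follow the same trend over
# conditions 0-2 (shifted by a constant) but diverge on condition 3.
expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [4.0, 5.0, 6.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]
coherent_score = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2])
full_score = mean_squared_residue(expression, [0, 1, 2, 3], [0, 1, 2, 3])
```

The coherent subset scores essentially zero while the full matrix scores much higher, showing how a bicluster can capture genes that agree on only some of the conditions.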

Fang et al. (2006) developed a biclustering method that incorporates the Gene Ontology (GO) (Gene Ontology Consortium, 2000) into the expression data. GO has been applied in many works: for example, Liu et al. (2004) incorporated GO information into Smart Hierarchical Tendency Preserving clustering (SHTP-clustering), and Hvidsten et al. (2001) induced predictive rule models, taken from the GO, for the functional classification of gene expression. There are also GO-based software tools for the statistical determination, interpretation, and visualization of function profiles, such as GOMiner (Zeeberg et al., 2003), GOTree Machine (Bing et al., 2004), Onto-Tools (Sorin et al., 2003), GO::TermFinder (Boyle et al., 2004), and FunSpec (Robinson et al., 2002). However, all of these tools work only with the knowledge in GO.

Therefore, in order to address the complexity problems and to achieve optimal clusters, we developed a new clustering method named BTreeBicluster, which applies the fundamental biclustering method while integrating GO into the analysis of gene expression data. Unlike conventional clustering techniques such as hierarchical clustering (HCL) (Eisen et al., 1998), BTreeBicluster does not require genes in the same cluster to respond similarly across all experimental conditions. Instead, a cluster is defined as a subset of genes that shows similar expression patterns over a subset of conditions. This is useful for finding processes that are active in some but not all samples. BTreeBicluster also uses discretized data, which yields a more comprehensive result. Furthermore, BTreeBicluster avoids the random interference caused by masking biclusters in Cheng and Church (2000). More importantly, BTreeBicluster is based on similarity measures in which expression profiles and biological functions are considered from the start. This is a major difference from other clustering methods, in which annotation is done after the clustering process.

The details of our method are explained in the following sections. Section 2 illustrates the clustering algorithm. Section 3 presents the results of BTreeBicluster on two real-world datasets (Eisen et al., 1998; Tavazoie et al., 1999), along with comparisons against other published methods. Finally, Section 4 discusses BTreeBicluster and draws some conclusions. The paper ends with perspectives on other potential applications and suggestions for further improvements.


  1. Datasets: Eisen's and Tavazoie's datasets and SGD data
  2. Application: BTreeBicluster

Guidelines to Select Gene Ontology Data Formats and Traversal Methods

Motivation: Gene Ontology (GO) is evolving rapidly and has been used as a guide in the clustering process. To construct a GO tree, both GO data and a traversal method are needed. There are many GO data formats and traversal methods, but little guidance is available to assist in selecting which formats and methods to use. A performance study of GO data formats and traversal methods has therefore become an important issue in the bioinformatics field.

Result: In this paper, we used six GO data formats and three traversal methods to construct the GO tree and map it to genes. The results of this experiment show that the GO flat file is the most efficient of the GO data formats, while GO OBO-XML is the least efficient. Among the traversal methods, level-order traversal performed more efficiently than pre-order and post-order traversal, with post-order traversal being the least efficient.

Researchers currently do not have adequate guidelines to help them choose which Gene Ontology (GO) (The Gene Ontology Consortium, 2000) data format is most suitable for their research. They either simply choose the current version of the data or choose whichever format is easiest to use in their application. As the amount of GO data grows, guidelines will play a major role in helping researchers achieve the best results in their studies. The GO has now reached 25,231 terms, which form the controlled vocabulary used to describe gene and gene product attributes in any organism. Each term belongs to exactly one of three ontologies: cellular component, biological process, or molecular function. The terms in each ontology are structured as a Directed Acyclic Graph (DAG). There are many GO data formats, such as OBO-XML, RDF-XML, OBO version 1.2, MySQL, OWL, and flat file. Examples of their use are as follows:

  1. Mungall (2004) used GO OBO-XML in Obol software to automatically generate cross-product definitions from the names of terms or classes in OBO.
  2. Razib et al. (2008) used GO RDF-XML to annotate protein sequences.
  3. Conesa et al. (2005) used GO OBO version 1.2 for functional annotation of sequences and the analysis of annotation data.
  4. AmiGO (2000) browser used GO MySQL to search and browse the ontology and annotation data provided by the GO consortium.
  5. Aitken et al. (2005) used GO OWL in their COBrA ontology browser and editor to create links between ontologies.
  6. Fang et al. (2006) used GO flat file to capture gene expression pattern and function similarity.
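
As a rough illustration of what reading one of these formats involves, the sketch below parses [Term] stanzas written in the tag: value layout of OBO version 1.2 and collects each term's name and is_a parents. It is a simplified reader written for this page, not the GOTraversal source code; the function name parse_obo_terms, the EX: identifiers, and the term names are hypothetical, and real GO files contain many more tags than are handled here.

```python
# Hand-written sample in OBO 1.2 stanza layout; IDs and names are made up.
OBO_SAMPLE = """\
format-version: 1.2

[Term]
id: EX:0000001
name: root process

[Term]
id: EX:0000002
name: first child process
is_a: EX:0000001 ! root process

[Term]
id: EX:0000003
name: second child process
is_a: EX:0000001 ! root process

[Term]
id: EX:0000004
name: shared grandchild process
is_a: EX:0000002 ! first child process
is_a: EX:0000003 ! second child process
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": str, "is_a": [parent_id, ...]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or ignored stanzas
        tag, _, value = line.partition(":")
        value = value.split("!", 1)[0].strip()  # drop the "! comment" part
        if tag == "id":
            terms[value] = current
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "name":
            current["name"] = value
    return terms


terms = parse_obo_terms(OBO_SAMPLE)
```

The last term has two is_a parents, which is the DAG property that any GO tree construction has to deal with.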

Although GO terms form a hierarchy, the structure is a DAG rather than a strict tree: a more specialized term, known as a ‘child’, may have many ‘parents’, which are less specialized terms. Unlike arrays and linked lists, which model linear data structures, a ‘tree’ models a hierarchical organization, which best suits the GO structure. In a tree, each element may have several children, and every element except the root has a unique parent. The GO tree refers to the structure between terms. A term is referred to as a ‘parent’ term if it has other terms below it, and those terms are referred to as its ‘child’ terms. We use three tree traversal methods, namely pre-order, level-order, and post-order, to search for terms in our GO tree. Examples using these tree traversal methods are as follows:

1.1 Pre-order Traversal Method

  1. Rezaei and Bai (2005) used a pre-order traversal to calculate the minimal spanning tree for a merged graph in their precedence based alignment approach in gene order sequence corresponding to the merged matrix.
  2. Chauve et al. (2007) presented a pre-order traversal to identify the duplications with respect to the specialization of events in the genome species tree.
  3. Bhutkar et al. (2007) applied a two-stage tree traversal. In the first stage, a pre-order traversal was implemented to infer phylogenetic relationships by maximizing gene pair similarity. In the second stage, a level-order traversal was used to infer rearrangements in their Drosophila genomes.

1.2 Level-order Traversal Method

  1. Fang et al. (2006) used a level-order traversal to construct a hierarchical tree and map genes of interest onto their GO tree.
  2. Gempeler (2006) used a level-order traversal to evaluate expression trees.
  3. Jacobson (1989) used a level-order traversal to construct a binary prefix code known as the Level Order Unary Degree Sequence (LOUDS) representation.

1.3 Post-order Traversal Method

  1. Sankoff et al. (2000) used a post-order traversal to construct a binary tree in the first version. Then, they applied a post-order traversal followed by a pre-order traversal at each iteration. This was done to implement a fast priority insertion heuristic for the median problem induced breakpoints on unequal genomes.
  2. Lott et al. (2006) applied a post-order traversal to develop a monophyletic compression rule for TreeSimplifier, a simplification tool for gene trees.
  3. Luo et al. (2004) applied a post-order traversal in their Dynamically Growing Self-Organizing Tree (DGSOT) algorithm for hierarchical clustering to determine the topological relationship in the neighborhood.

In this paper, we compare different combinations of GO data formats and traversal methods for producing a GO tree, with the aim of identifying the combination that gives the best results. We tested each combination by constructing the GO tree and mapping it to gene expression data in order to determine the biological similarity used in the clustering process. The combinations are evaluated on time and space complexity: time complexity is the number of steps the algorithm takes when the program is executed, whereas space complexity measures the memory required to run the algorithm.
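
The three traversal orders compared here can be sketched on a small tree as follows. This is a generic illustration and not the GOTraversal source code; the dictionary-based tree, the node labels, and the function names are assumptions made for this example.

```python
from collections import deque

# Hypothetical tree stored as parent -> ordered list of children.
go_tree = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
}


def pre_order(tree, root):
    """Visit a node before its children (iterative, explicit stack)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(tree.get(node, [])))
    return order


def post_order(tree, root):
    """Visit all children of a node before the node itself."""
    order = []

    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


def level_order(tree, root):
    """Visit nodes one depth level at a time (breadth-first queue)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))
    return order


pre = pre_order(go_tree, "root")      # root, A, A1, A2, B, B1
level = level_order(go_tree, "root")  # root, A, B, A1, A2, B1
post = post_order(go_tree, "root")    # A1, A2, A, B1, B, root
```

All three functions visit every node exactly once, so any difference in measured running time or memory comes from the bookkeeping (stack, recursion, or queue) and from how the visiting order interacts with the work done at each term.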

  1. Datasets: GO data and Gene Expression data
  2. Source Code: GOTraversal
  3. Supplementary Materials:

About Author


  1. Expected 2010, PhD in Computer Science (Research in Bioinformatics), Universiti Teknologi Malaysia.
  2. 2005, MSc in Information Technology (Management), Universiti Teknologi Malaysia.
  3. 2003, BSc in Computer Science (Graphics), Universiti Teknologi Malaysia.

Research Interests

  1. Gene expression analysis
  2. Knowledge guided clustering
  3. Gene ontology
  4. Data filtering and dimension reduction

Group Members

  1. Prof Dr Safaai Deris, Artificial Intelligence and Bioinformatics Research Group (AIBIG)
  2. Dr Razib M. Othman, Laboratory of Computational Intelligence and Biotechnology (LCIB)
  3. Zuraini S. Ali, Laboratory of Computational Intelligence and Biology (LCIB)

Contact Info

Shahreen Kasim
Laboratory of Computational Intelligence and Biology (LCIB)
No. 204, Level 2
Industry Centre, Technovation Park
Universiti Teknologi Malaysia (UTM)
81310 UTM Skudai, MALAYSIA
Mobile : +6012-769-7349
Tel/Fax: +607-559-9230