BioSysBio:abstracts/2007/Alok Mishra

=Effect of microarray data heterogeneity on gene regulatory module discovery=

Author(s): Alok Mishra, Duncan Gillies
Affiliations: Imperial College, London
Contact: email: am2203@doc.ic.ac.uk
Keywords: 'microarrays' 'cluster similarity' 'regulatory gene module'

Abstract
An integrative genomics approach, in which data from different microarray experiments are merged to study regulatory networks, has been adopted in several recent studies. However, we propose that blind use of this approach can be misleading. As microarray data from different experiments are merged, local patterns of activity, for example the cell cycle, can be masked by more dominant patterns such as stress reactions. We have carried out a systematic study in which data of increasing heterogeneity are clustered to determine groups of co-regulated genes. The clusters are then tested for similarity using a statistical measure, the adjusted Rand index. The results demonstrate that patterns associated with stress, where gene expression levels are high, do indeed mask those of other gene activity. The same measure of cluster similarity can also be used to assess the data quality of individual experiments.

Introduction
Microarrays allow us to study the expression of a large proportion of the genome simultaneously. These expression data have been used to build models of gene regulatory networks. In some recent work TSKS05, SFKR04, `integrative genomics`, in which data from experiments on different conditions or even different organisms are merged, has been suggested as a way to discover these regulatory networks. One important reason is that gene regulatory networks have a large number of variables (genes), while there are not enough microarray experiments to model the networks reliably. Another reason is that, because of evolution, many genes are believed to play similar roles in different organisms, so collective analysis might help compensate for experimental noise in individual experiments.

The motivation behind the research presented in this paper is that, in our view, blind integrative genomics might not be the right way forward for this specific problem. It should be used only when we are prepared to lose signals that are visible in individual experiments and are concerned only with the dominant signal that suppresses the others. The global regulatory network is the sum of individual regulatory networks, and in our view the integrative approach loses information from the individual networks.

Our hypothesis is that as microarrays from different experimental conditions but the same experiment type (e.g. stress) are merged, we should be able to readily identify the stress-specific regulatory network. The clusters of co-regulated genes obtained should reinforce the local (stress-specific) regulatory network while suppressing noise. By contrast, when microarrays from several different experiment types are merged, the local regulatory networks of the individual types are masked by the dominant ones. The individual signals fade: only the sets of genes that are overexpressed across all the mixed conditions (the dominant network) behave in a consistent manner, while the rest are expressed unpredictably.

In order to validate our hypothesis, the primary requirement is to obtain the regulatory networks from various datasets and their mixtures, and then to measure their similarity to each other. In this way we can show that progressively mixing data of one type with other types degrades the local regulatory networks in favour of the more dominant ones. A number of researchers have worked on the problem of finding regulatory networks; some of the most important contributions are SSR03, BJGL03 and TSKS04, which incorporate prior knowledge in the form of known transcription factors or DNA-binding data to guide the clustering process. These works have shown that the resulting clusters of transcription factor (TF) modules, as well as the regulated gene modules, are biologically meaningful. We have used the Module Networks algorithm SSR03, a well-established approach that has had success in finding biologically relevant modules. For measuring the similarity among the sets of regulated gene clusters produced by this algorithm, we chose the adjusted Rand index HA85, which has been shown to be a very stable index of partition similarity.

Previous work
Learning the structure of genetic regulatory networks has attracted a great deal of attention in recent years. A slightly dated review of these techniques is found in dJ02. Some of the newer methods that have proved quite successful are Module Networks SSR03 and SAMBA TSKS04. Unlike earlier algorithms, both of these use prior biological knowledge to arrive at a clustering of similar sets of genes. Module Networks, given a list of potential regulators and microarray expression data, uses an iterative procedure that searches for a regulation program for each module (set of genes) and then reassigns each gene to the module whose program best predicts its behaviour. SAMBA uses a diverse set of prior knowledge in the form of protein interactions, TF binding and phenotype sensitivity. All this information is represented as nodes of a graph, and the algorithm tries to find statistically significant subgraphs. Recently, some researchers have focused on integrated approaches, analysing large compendia of microarrays gathered from various sources; TSKS05 used data from 2000 experiments in 60 different publications to find regulatory programs.

Methodology
In order to validate our hypothesis we chose to work with two very diverse datasets. One is from experiments studying gene expression when yeast is exposed to stress conditions. The other is from a study of cell-cycle-related genes, which is very different in character: gene expression under stress conditions changes much more drastically (genes are both repressed and induced) than in cell-cycle experiments, where optimal growth conditions are maintained. We want to see how mixing these two very different datasets affects the resulting clusters. Do the stress experiments, with their much higher expression values, dominate, producing clusters much more similar to the original stress dataset clusters?

All the data used in our analysis were taken from the Stanford Microarray Database (SMD), which hosts cDNA microarray datasets from various experimenters. We decided to focus our study on yeast, since the regulatory mechanisms of more complex organisms are more involved and yeast has been studied extensively in recent years. We started by analysing data from individual researchers for experiments related to stress: GSK00, referred to in this paper as DS-STRESS1 (76 microarrays); SBB04, called DS-STRESS2 (49 microarrays); and GSK00, called DS-STRESS3 (41 microarrays). In the next stage we merged all 166 microarrays from these three sources to create and analyse the dataset we call DS-STRESS. To compare these clusterings against an entirely different category, we took 93 microarrays from cell-cycle experiments GSK00, referred to in this article as DS-CCYCLE. A further mix of both the stress and cell-cycle data was named DS-STRESS-CCYCLE. Finally, we extracted all available yeast data (1082 microarrays, not only stress/cell-cycle), named DS-ALL, and compared the earlier results against it. To provide a statistical baseline for our results, we also generated a random microarray dataset for all the genes by drawing random numbers from a Gaussian distribution with zero mean and unit standard deviation. This dataset was named DS-RANDOM.

To analyse the data we applied standard pre-processing. We used $$log_{2}$$ of the ratio of the mean of Channel 2 to the mean of Channel 1, and included only those features not designated as unreliable by the scanning software. For normalization we used total intensity, which is based on the (slightly naive) assumption that the average log ratio on each array should be zero. We then filtered the genes, keeping those whose $$log_{2}$$ R/G ratio exceeded 2 in magnitude for at least one experiment; this retains only genes that have shown a significant change in expression.
We did not do any scale normalization across slides to account for different experimental conditions or different datasets.
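As an illustration, the pre-processing steps above can be sketched as follows. This is a minimal sketch under assumptions of our own (channel means arriving as genes x arrays matrices, a boolean reliability mask from the scanner, and the function name itself); it is not the actual SMD or Genomica pipeline:

```python
import numpy as np

def preprocess(ch1_mean, ch2_mean, reliable, cutoff=2.0):
    """Sketch of the pre-processing described above.

    ch1_mean, ch2_mean : genes x arrays matrices of channel means (assumed shapes)
    reliable           : boolean mask of features the scanner trusted
    cutoff             : |log2 R/G| threshold for the final gene filter
    """
    # log2 of the Channel 2 / Channel 1 ratio; unreliable spots become NaN
    logratio = np.where(reliable, np.log2(ch2_mean / ch1_mean), np.nan)
    # total-intensity normalization: force each array's mean log-ratio to zero
    logratio = logratio - np.nanmean(logratio, axis=0, keepdims=True)
    # keep genes whose |log2 ratio| exceeds the cutoff in at least one array
    keep = np.nanmax(np.abs(logratio), axis=1) > cutoff
    return logratio[keep], keep
```

The returned mask makes it easy to carry the gene filter over to other datasets built from the same arrays.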

We also needed a list of transcription factors (TFs) as prior knowledge on which to base the clustering. Our TFs were taken from the Yeastract website (http://yeastract.com/), a curated database listing 145 TFs. Once we had the required data, we analysed it using the software package Genomica (refer SSR03), provided by the authors of the Module Networks algorithm. We chose Module Networks because it has been shown in the literature to identify biologically meaningful clusters, as the clustering process is driven with the help of known TFs. Only a few other algorithms use prior biological knowledge TSKS04 to guide the clustering process, and in comparison to them Module Networks has produced more significant biological findings. The algorithm works as follows: given a gene expression dataset and a precompiled set of candidate regulatory genes, it simultaneously searches for a partition of the genes into modules and for a regulation program for each module that explains the behaviour of the genes in that module. It uses an iterative Expectation Maximization (EM) procedure: for each module it searches for the regulation program that best predicts the expression profiles of the module's genes as a function of the expression of a subset of genes from the candidate regulator set, then reassigns each gene to the module whose program best predicts its behaviour, refining both the regulation programs and the gene modules in each iteration until convergence SSR03.
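The iteration just described can be sketched in simplified form. This is a toy analogue, not Genomica's implementation: plain linear regression stands in for the real regression-tree regulation programs, and all names and defaults are our own:

```python
import numpy as np

def module_networks_sketch(X, regulators, k=3, iters=20, seed=0):
    """Toy analogue of the Module Networks iteration.

    X          : genes x arrays expression matrix
    regulators : row indices of the candidate regulator genes
    Returns a dict mapping each non-regulator gene to a module id.
    """
    rng = np.random.default_rng(seed)
    genes = [g for g in range(X.shape[0]) if g not in set(regulators)]
    assign = {g: int(rng.integers(k)) for g in genes}   # random initial modules
    R = X[regulators].T                                 # arrays x regulators
    for _ in range(iters):
        # "M-step": fit a (here linear) regulation program per module,
        # predicting the module's mean profile from the regulators
        programs = {}
        for m in range(k):
            members = [g for g in genes if assign[g] == m]
            if members:
                target = X[members].mean(axis=0)
                programs[m], *_ = np.linalg.lstsq(R, target, rcond=None)
        # "E-step": reassign each gene to the best-predicting program
        new_assign = {g: min(programs,
                             key=lambda m: np.sum((X[g] - R @ programs[m]) ** 2))
                      for g in genes}
        if new_assign == assign:                        # converged
            break
        assign = new_assign
    return assign
```

Genes with identical expression profiles necessarily end up in the same module, since the reassignment step depends only on a gene's own profile and the fitted programs.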

Since we are comparing results on different datasets, our goal is to check how close the resulting clusterings are to each other. This closeness was assessed using cluster similarity, as described in the next section.

Cluster similarity
Cluster similarity measures how close the clusterings obtained from different datasets are to each other. We have used a well-established measure of cluster similarity, the adjusted Rand index, proposed by HA85. Based on an extensive empirical comparison of several such measures, Milligan and Cooper (1986) recommended this index as the measure of agreement even when comparing partitions with different numbers of clusters.

The Rand index works on pair-wise matching between the two clusterings being compared. Given a set of $$n$$ objects $$S = \{s_1, \ldots, s_n\}$$, suppose we have two clusterings C1 and C2 of these objects, $$C1 = \{c1_1, \ldots, c1_k\}$$ and $$C2 = \{c2_1, \ldots, c2_k\}$$, such that $$\bigcup_{i=1}^{k}c1_{i} = S = \bigcup_{j=1}^{k}c2_{j}$$. Let,
 * $$N_{11}$$ be the number of pairs of objects in the same cluster in both C1 and C2
 * $$N_{00}$$ be the number of pairs of objects in different clusters in both C1 and C2
 * $$N_{01}$$ be the number of pairs of objects in different clusters in C1 but same cluster in C2
 * $$N_{10}$$ be the number of pairs of objects in the same cluster in C1 but different clusters in C2

The Rand index is simply the fraction of agreements over the total, i.e. $$\frac{N_{11} + N_{00}}{N_{11} + N_{00} + N_{01} + N_{10}}$$, and it lies between 0 and 1. When the two partitions are identical the Rand index is 1, and it reaches 0 when they have nothing in common. A problem with the Rand index is that its expected value for two random partitions is not constant. The adjusted Rand index (HA85) corrects for this by taking the general form $$\frac{index - expected\ index}{maximum\ index - expected\ index}$$. Its maximum value is 1, and its expected value for random clusterings is 0. We use this index in all our analysis, as it was found to be stable even when the numbers of clusters differ, and its performance was better than that of other indices (HA85).
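For concreteness, the adjusted Rand index can be computed directly from the contingency table of two labelings. A minimal sketch (function and variable names are ours; it assumes the two partitions are not both the single trivial cluster, which would make the denominator zero):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(c1, c2):
    """Adjusted Rand index of two partitions, given as per-object labels."""
    assert len(c1) == len(c2)
    n = len(c1)
    nij = Counter(zip(c1, c2))   # contingency table: joint cluster counts
    a = Counter(c1)              # row sums (cluster sizes in c1)
    b = Counter(c2)              # column sums (cluster sizes in c2)
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

The index is symmetric in its arguments and invariant to relabelling of the clusters, which is what makes it suitable for comparing clusterings produced by independent runs.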

Results
To reiterate, the clustering algorithm groups together functionally similar sets of genes. Using the adjusted Rand index, we measure the similarity among the resulting sets of clusters. A higher index value, on our interpretation, suggests that two datasets are functionally similar, while a lower value suggests dissimilarity. The index values can range from 0 (completely dissimilar) to 1 (fully similar).

In order to verify that the algorithm produces consistent results, we ran it several times over the same datasets and compared the resulting clusterings. For each of the stress datasets we did four clustering runs and measured the similarities among them. From the results in Table-1 we can see that different runs on the same dataset produce similar clusterings, as the variance among the indices is small. The results are presented as an upper triangular matrix, with each run compared against every other run. For any good clustering algorithm this is to be expected if the algorithm behaves predictably. An interesting side effect of this analysis is that we can comment not only on the stability of the algorithm but also on the quality of a dataset. The similarity values are higher for DS-STRESS1, DS-STRESS3 and DS-STRESS than for DS-STRESS2, which might indicate that the DS-STRESS2 dataset contained too much noise for its clusters to be identified consistently across runs, hence the lower similarity values.
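The run-vs-run consistency check behind Table-1 can be organised generically: collect the upper-triangular pairwise similarities among repeated runs and summarise their spread. A sketch with names of our own (any partition-similarity function, such as the adjusted Rand index, can be supplied as `sim`; the test below uses a trivial label-agreement function only for illustration):

```python
from itertools import combinations
from statistics import mean, pvariance

def run_consistency(runs, sim):
    """Pairwise similarity of repeated clustering runs.

    runs : list of labelings, one per run
    sim  : similarity function taking two labelings
    Returns the upper-triangular entries plus their mean and variance.
    """
    entries = {(i, j): sim(runs[i], runs[j])
               for i, j in combinations(range(len(runs)), 2)}
    vals = list(entries.values())
    return entries, mean(vals), pvariance(vals)
```

A small variance across runs is what indicates that the algorithm, and the dataset, are behaving predictably.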

We compared each of the stress datasets against DS-STRESS-CCYCLE, DS-ALL and DS-CCYCLE, which are increasingly distant from the stress datasets, as described in the methodology. As reference points, we also compared them against the two extremes of similarity: DS-STRESS, the mixture of all the stress datasets, and DS-RANDOM, the random dataset. As seen in Table-2, the datasets differ in their similarity even to DS-STRESS. This suggests that DS-STRESS1 and DS-STRESS3 are more similar to each other than to DS-STRESS2; the reason, we think, is that they came from experiments related to common research. The similarity of all the stress datasets to DS-CCYCLE is very low, as we expected given the very different nature of expression in these diverse experiments. The similarity values in the DS-STRESS and DS-STRESS-CCYCLE columns are not very different, which indicates that the data with the larger expression changes (stress) dominated the final clusters when mixed with data where the expression changes were smaller (cell cycle). Another interesting observation is that even though many other types of datasets are mixed into DS-ALL, its results are not significantly different from those for DS-STRESS, leading us to conclude that the stress dataset is very dominant compared to the others. As expected, the similarity values for the random dataset are near zero in all cases. The visible trend of similarity values gradually falling from left to right indicates that similar data keep the cluster similarity high while dissimilar data bring it down. This reinforces our hypothesis: if datasets are similar, the similarity index is high because the resulting functional clusters are similar, but as different datasets are mixed in, the similarity goes down.

We also did a comparison at the level of combined datasets rather than the individual datasets used earlier. Here we compared the cell-cycle and stress datasets with each other and with DS-STRESS-CCYCLE, DS-ALL and DS-RANDOM. The results in Table-3 generalise and substantiate our earlier observations, and the trends are even more robust. DS-CCYCLE's similarity to DS-STRESS is very low, and since DS-STRESS has dominant expression values, DS-CCYCLE's similarity to the mixed datasets (stress-cellcycle and all) is also very low. DS-STRESS, on the other hand, is maximally similar to itself, with similarity gradually declining as mixing progressively attenuates its dominance. Again, the similarity values for the random dataset are negligible.

Conclusion and Future Work
As argued in OR02, cellular regulatory mechanisms are very local in nature, and a blind integrative approach is most likely to prove futile if we are interested in meaningful results. We have tried to establish this from a different point of view: as more diverse datasets are merged, the similarity to the individual datasets (which carry the more local patterns) is reduced, and the dominant signals shadow the weaker ones. One source of error in our results is that our similarity index is based on pairwise matches of genes in each set. It is a good measure for cluster comparison, but a better approach would be to calculate the functional similarity of clusters. Since gene ontology databases provide extra information about each gene, we could use this to calculate how functionally similar the resulting clusters are. As no such index exists, we are working on one and in future will extend our work by using it.

Even though we are not proponents of blindly integrating diverse microarray datasets, we do believe that an integrated approach, analysing different families of data to learn specific regulatory models, is the right one. We now have four major families of data: microarray expression, ChIP-chip DNA binding, sequence, and protein-protein interaction data. We can integrate these to arrive at a better picture of the regulatory networks, as each complements the partial picture painted by the others. Kernel methods LDC04, LDBC04 give us a principled way to merge these seemingly disparate data and then apply kernelised versions of algorithms (e.g. clustering) for further analysis.
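The kernel-fusion idea can be illustrated in a few lines: a convex combination of valid kernel (Gram) matrices is itself a valid kernel, so heterogeneous views of the same genes can be merged before running any kernelised algorithm. A minimal sketch under assumptions of our own (uniform weights and plain linear kernels; the cited kernel methods learn the weights rather than fixing them):

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """Weighted sum of Gram matrices from different data views."""
    K = np.asarray(kernels, dtype=float)
    w = (np.full(len(K), 1.0 / len(K)) if weights is None
         else np.asarray(weights, dtype=float))
    return np.tensordot(w, K, axes=1)    # sum_v w[v] * K[v]

# two hypothetical views of the same 4 genes, e.g. expression and interaction profiles
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(4, 6)), rng.normal(size=(4, 8))
K = combine_kernels([X1 @ X1.T, X2 @ X2.T])
```

The combined matrix `K` can then be fed to a kernelised clustering algorithm in place of either single-view kernel.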

There is huge scope for further work in this field. If we can find a way of mixing datasets so that the final mixture acts as a background model, then we can remove the consistently enriched genes from other analyses, since we know they are always enriched. In this way the more significant genes can be studied in the individual datasets, without having to deal with information that is not specific to those experiments. Currently, experiments with wider fluctuations dominate the final mixture. One step towards a good background model might be scale normalization of the data across experiments, bringing expression values from different types of experiments onto the same level. Once we have such a background model, we can use it to stabilise the covariance matrix by shrinking it towards the background model when the individual datasets contain too few arrays to estimate the covariance matrix reliably (because of singularity). Pooled covariance matrices are used in Graphical Gaussian Model analysis to recover genetic regulatory networks SS05.
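The shrinkage step can be sketched as a convex combination of the sample covariance and the background covariance. A minimal sketch under assumptions of our own (a fixed shrinkage intensity `lam`; in practice it would be estimated from the data, e.g. in the Ledoit-Wolf style):

```python
import numpy as np

def shrink_covariance(X, background, lam=0.3):
    """Shrink the sample covariance of X (arrays x genes) toward a
    background model; lam in [0, 1] is the shrinkage intensity."""
    S = np.cov(X, rowvar=False)
    return (1.0 - lam) * S + lam * background
```

With far fewer arrays than genes the sample covariance is singular, but shrinking toward a full-rank background (here the identity, as an illustrative stand-in) restores invertibility, which is exactly what the Graphical Gaussian Model analysis requires.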