From OpenWetWare

Data Mining Techniques in aCGH based Breast Cancer Subtype Profiling: an Immune Perspective with Comparative Study

Authors: F. Menolascina, S. Tommasi, P. Chiarappa, A. Mangia, V. Bevilacqua, G. Mastronardi, A. Paradiso.
Contact: f.menolascina at gmail dot com
Keywords: 'aCGH' 'AIRS' 'Artificial Immune System' 'Breast Cancer' 'Clonalg' 'Immunos'


Figure 1. Segmented image of a microarray scan (SPOT output).

Array Comparative Genomic Hybridization (aCGH) has been successfully used in post-genomic cancer research [1][2]. The technology was developed to monitor gene copy number changes across the whole genome. Like similar screening techniques, it returns high-dimensional microarray data (Fig. 1), whose complexity calls for computational analysis tools able to extract reliable knowledge. Such knowledge can in turn ease the translation of complex raw data into relevant and clinically useful diagnostic or prognostic rules. The investigation of gene copy number changes is a critical aspect of cancer research. Human cells normally carry two copies of each gene; however, aberrations can occur as a result of disruptive biophysical events (Fig. 2). As a consequence, the activity of genes potentially involved in cellular self-repair processes can be altered, raising the probability of tumorigenesis. In this context, gene copy number screening plays a key role in understanding the biological pathways involved in tumorigenesis, and many research groups are currently working to identify the main actors of these pathways.

Figure 2. Breakage-fusion-bridge process underlying gene copy number alteration.

We applied data mining techniques and novel immune-inspired Artificial Intelligence algorithms to analyse a dataset of 119 breast cancer samples divided into ER+ and ER- sets. The main objective of the analysis was to find genes involved in the activation of the Estrogen Receptor. Several approaches were applied and their results compared; both the predictive power of the classifiers and the derived biological interpretation are reported and discussed. The C4.5-derived classifier and the immune-based approaches showed promising results, which motivates further research on the use of similar systems in this field.


Several classifiers were developed in order to compare the ability of different approaches to classify the data according to the binary ER+/ER- discrimination. The dataset under investigation was composed as follows:

  • Class 'ER-' = 33 cases
  • Class 'ER+' = 86 cases

Interesting rules were extracted from the dataset under investigation. These and other comments on the results are discussed in the final section of this abstract. Accuracy and Kappa statistic are reported in Table 1, and a graphical representation of the same results is given in Fig. 3. Each system was trained on 66% of the 119 cases of the dataset, leaving 41 cases for validation. The results reported refer to the validation set.
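The Kappa statistic reported alongside accuracy compares the observed agreement between predicted and actual classes with the agreement expected by chance. A minimal sketch in Python, assuming a square confusion matrix with actual classes on the rows and predicted classes on the columns; the function name and the counts are hypothetical and are not the confusion matrices obtained in this study:

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] = number of cases of actual class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)

    # Observed agreement: fraction of cases on the diagonal (plain accuracy).
    observed = sum(confusion[i][i] for i in range(n_classes)) / total

    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / total ** 2

    return (observed - expected) / (1 - expected)


# Hypothetical validation set of 41 cases (rows: actual ER-, ER+).
example = [[7, 4],
           [2, 28]]
accuracy = (example[0][0] + example[1][1]) / 41
kappa = cohen_kappa(example)
print(f"accuracy = {accuracy:.4f}, kappa = {kappa:.3f}")
```

With unbalanced classes such as the ones used here, a classifier can reach a high accuracy while agreeing with the true labels only slightly better than chance, which is why both figures are worth reading together.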


                 Bagging  AdaBoostM1  Logit   MultiBoost  J48     JRip    AIRS    Immunos  CSCA
Accuracy         82.37%   82.93%      85.37%  85.37%      90.24%  87.80%  87.80%  87.80%   82.93%
Kappa statistic  0.502    0.393       0.541   0.502       0.694   0.602   0.602   0.632    0.393

Table 1. Results returned by each system

Figure 3. Accuracy of all of the systems plotted with Kappa statistic attached. The notation of the left vertical axis refers to accuracy (in % points), the one on the right refers to the Kappa statistic.

All of the systems under investigation showed fairly high accuracy. However, as can be seen in Fig. 3, there is an interesting separation between the Boosting- and Bagging-based systems and the other approaches. In particular, J48 showed the best absolute accuracy (90.24%), while JRip, AIRS and Immunos returned comparable results (87.80% each). Although the expressive power of J48 is a well-established characteristic of this kind of approach, the potential of immune-based systems is still an open field of research. These results therefore seem to confirm the promise of immune-inspired paradigms as a basis for accurate classifiers for data mining tasks in bioinformatics.


Figure 4. Immune cell stimulation and differentiation into plasma and memory cells.

Samples were acquired and collected as described elsewhere [3]. The raw dataset comprised 119 cases, each described by 2424 features. Data pre-processing techniques, in particular gene filtering and raw value normalization, were applied in order to reduce the impact of noise and artifacts derived from data acquisition. The resulting dataset was split into two subsets (ER+ and ER- classes), and class separability was assessed using Student's t-test and entropy criteria. A comprehensive ranking of the genes that best represent discriminant features was obtained by computing a consensus estimate of each gene's position in the two previous rankings. The first 100 genes in this new ranking were used for further analyses, yielding a new dataset of 119 cases with 100 observations each.

Several classifiers with different strengths were built with the sole objective of creating a common platform on which a coherent comparative study could be set. For this reason, common Bagging and Boosting approaches were used together with typical tree classifiers and immune-inspired ones. The last category includes all those systems that use paradigms imported from human immunology in order to reproduce the adaptive behaviors that allow our immune system to reject the attacks of pathogens and potentially harmful molecules. The ability of the human immune system to defend our body against foreign threats relies mainly on two different systems: the innate one and the adaptive one. The adaptive system is able to create, in a fully autonomous way, an internal image of the antigens (the molecules attacking the system) and to gain a specificity and generalization capability that allows it to recognize both the antigens and elements similar to them.

In this way the system is trained on antigens (cases) so that it learns how to associate them with antibodies (just as one would do with points and centroids). Each antigen stimulates a cell that carries antibodies; the cell responds to the stimulation by replicating itself and originating plasma cells and memory cells (Fig. 4). The antibodies improve their specificity each time they are stimulated, so that the system learns to "classify" a stimulus better and better each time. Inspired by these dynamics, some interesting paradigms have been proposed, such as the Artificial Immune Recognition System (AIRS) [4], Clonalg and CSCA [5], and Immunos [6]. In this study we compared the performance of all of these systems with well-established data mining tools such as tree classifiers (J48 and JRip) and metalearners (Bagging, AdaBoostM1 (Boosting with Decision Stump), Logit (performing logistic regression), and MultiBoost (an extension of AdaBoostM1)). In addition, we used the Kappa statistic as a measure of the agreement between predicted and observed categorization of the dataset under investigation, corrected for the agreement that occurs by chance.
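The consensus gene ranking used in the pre-processing step can be illustrated with a short sketch. The exact scoring formulas and consensus rule are not given in this abstract, so the code below makes several assumptions: the t-test criterion is taken as the absolute Welch t-statistic, the entropy criterion as a symmetric Kullback-Leibler divergence between Gaussian fits of the two classes, and the consensus as the sum of the positions in the two rankings. All function names and the toy data are hypothetical.

```python
from statistics import mean, variance


def t_score(a, b):
    """Absolute Welch t-statistic between two samples (assumed criterion)."""
    return abs(mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5


def entropy_score(a, b):
    """Symmetric Kullback-Leibler divergence between Gaussian fits (assumed criterion)."""
    va, vb = variance(a), variance(b)
    return 0.5 * ((va / vb + vb / va - 2) + (mean(a) - mean(b)) ** 2 * (1 / va + 1 / vb))


def positions(scores):
    """Map feature index -> position in the ranking (0 = most discriminant)."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return {feature: pos for pos, feature in enumerate(order)}


def consensus_ranking(X, y, top=100):
    """Rank features by the sum of their positions in the two rankings."""
    er_pos = [row for row, label in zip(X, y) if label == "ER+"]
    er_neg = [row for row, label in zip(X, y) if label == "ER-"]
    n_features = len(X[0])

    t_scores, e_scores = [], []
    for j in range(n_features):
        a = [row[j] for row in er_pos]
        b = [row[j] for row in er_neg]
        t_scores.append(t_score(a, b))
        e_scores.append(entropy_score(a, b))

    pt, pe = positions(t_scores), positions(e_scores)
    # Ties in the summed position are broken by feature index.
    ranked = sorted(range(n_features), key=lambda j: (pt[j] + pe[j], j))
    return ranked[:top]


# Toy data: 6 cases, 3 features; only feature 1 separates the classes clearly.
X = [[1.0, 5.0, 0.1], [1.2, 5.2, 0.3], [0.9, 4.9, 0.2],
     [1.1, 1.0, 0.2], [1.0, 1.1, 0.1], [1.2, 0.9, 0.3]]
y = ["ER+", "ER+", "ER+", "ER-", "ER-", "ER-"]
selected = consensus_ranking(X, y, top=2)
print(selected)
```

In the study the same kind of selection would keep the first 100 of the 2424 features; summing positions is only one of several reasonable ways to combine two rankings.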


Figure 5. Tree extracted by J48 classifier.

In this study the performance of different data mining and AI-based systems for high-throughput data classification has been compared. The results highlight an interesting trend: the tree classifier J48, an implementation of the C4.5 system, showed the best accuracy among all of the systems considered. Besides its high absolute accuracy, this kind of classifier also retains good expressive power, since it returns trees that can be easily translated into rules. This is the case for Fig. 5, a tree showing 3 rules (3rd-level rules are omitted for the sake of readability). These rules can be further interpreted by a human expert or fed into a knowledge-driven validation pipeline that takes advantage of tools such as Gene Ontology [7]. The classifiers based on Artificial Immune Systems (AIS), in turn, returned results that, in terms of accuracy and Kappa statistic, are quite comparable with those of the best-performing tree classifiers. These results encourage further studies on the use of such systems in this context; AIS seem to be at ease in settings characterized by high-dimensional data and complex information distribution. For these reasons our laboratory is now repeating the same analysis, this time in order to classify familial and sporadic breast cancers.


[1] Albertson DG. Profiling breast cancer by array CGH. Breast Cancer Res Treat. 2003;78:289–298. doi: 10.1023/A:1023025506386.

[2] Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci USA. 2002;99:12963–12968. doi: 10.1073/pnas.162471999.

[3] F. Menolascina, S. Tommasi, V. Fedele, A. Paradiso, G. Mastronardi, V. Bevilacqua. Hybrid Intelligent Data Mining Techniques and Array CGH in Breast Cancer Profiling, in press.

[4] J. Timmis and J. Boggess. Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Learning Algorithm (2004).

[5] L.N. de Castro and F.J. Von Zuben. The clonal selection algorithm with engineering applications (2000).

[6] Carter. The Immune System as a Model for Pattern Recognition and Classification (2000).

[7] Gene Ontology