Integration and analysis of heterogeneous microarray data sources for supporting drug target identification in atherosclerosis
A Camargo*1, F Azuaje1,2
1 University of Ulster at Jordanstown, School of Computing and Mathematics, Shore Road, Newtownabbey, Co. Antrim, BT37 0QB, Northern Ireland, UK. 2 Systems Biology Research Group, University of Ulster.
* Corresponding author
Email addresses: AC: email@example.com FA: firstname.lastname@example.org
Background
Atherosclerosis is one of the major causes of morbidity and mortality in industrially developed countries. Despite the introduction of new pharmacological drugs, this trend continues to grow as dietary habits change to fit people's lifestyles. Atherosclerosis is considered an inflammatory disease because high concentrations of cholesterol around the walls of blood vessels are one of the main risk factors for the disease [1]. Large-scale in vitro and in silico research is fundamental to discovering significant patterns and pathways involved in disease progression. This research integrates and analyzes two heterogeneous microarray data sets. The study led to the identification of genes, biological processes, and pathways that provide evidence about the progression of coronary artery disease (CAD) in humans.

Results
Two heterogeneous data sets obtained from the GEO (Gene Expression Omnibus) were analyzed: an aortic stiffness (AS) study and a human coronary artery disease (CAD) study. After normalisation, scaling and harmonisation, the data were analyzed using two different approaches. The first approach focused on uncommon genes, i.e. those included in AS but not in CAD. The second approach focused on the expression patterns of common genes shared by both data sets. These analyses yielded a list of significantly differentially expressed genes. To verify the potential biological significance of the results, the genes were further assessed based on their involvement in different biological processes, as defined by annotation databases and published papers. The lists of significant genes from each approach were ranked by their relevance as encoded in functional databases. Additionally, text mining allowed the identification of a list of documents relating these significant genes to the disease. Many of the genes identified in this study proved to have strong relations with atherosclerosis. Some genes were indicative of disease control, severity and progression. For instance, the study identified genes and pathways linked with the expression of defensins, a family of antimicrobial peptides, which in atherosclerosis may be associated with inflammation and lipid accumulation. Similarly, the study identified key biological patterns and genes related to “programmed cell death” and “apoptosis”, which describe disease state and degree of degeneration.

Conclusion
This investigation generated a list of genes and biological processes that can be strongly associated with atherosclerosis. Some of the genes highlighted were directly related to disease progression and control. This study shows how the large-scale computational integration of heterogeneous microarray data sets, functional annotation databases and published literature may support the identification and assessment of potential therapeutic targets. It also demonstrates how integrative data mining may allow scientists to recover essential patterns and unknown relationships that may have been overlooked when the individual studies were originally carried out. In this particular case, a set of representative disease-related genes was detected; these genes are suggested as testable hypotheses in relation to their roles in CAD progression.
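The two approaches described in the Results (uncommon genes found only in AS, and common genes shared by AS and CAD) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gene identifiers, the toy expression values, the function names and the use of a per-gene z-score as the scaling step are all assumptions made for illustration, since the abstract does not specify which normalisation method was used.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale one gene's expression values to mean 0 and unit variance.

    Assumption: a per-gene z-score stands in for the unspecified
    'normalisation, scaling and harmonisation' step.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A gene with constant expression carries no variation to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def partition_genes(as_data, cad_data):
    """Split gene identifiers into 'uncommon' (AS only) and 'common' (AS and CAD)."""
    as_ids = set(as_data)
    cad_ids = set(cad_data)
    uncommon = sorted(as_ids - cad_ids)
    common = sorted(as_ids & cad_ids)
    return uncommon, common


# Hypothetical toy data: gene identifier -> expression values across samples.
as_data = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [4.0, 4.0, 4.0],
    "GENE_C": [2.0, 6.0, 4.0],
}
cad_data = {
    "GENE_B": [10.0, 12.0, 14.0],
    "GENE_C": [5.0, 5.0, 8.0],
    "GENE_D": [1.0, 1.0, 2.0],
}

uncommon, common = partition_genes(as_data, cad_data)

# Scale each data set separately so values are comparable across platforms.
as_scaled = {gene: zscore(vals) for gene, vals in as_data.items()}
cad_scaled = {gene: zscore(vals) for gene, vals in cad_data.items()}

print("uncommon (AS only):", uncommon)
print("common (AS and CAD):", common)
```

In this sketch, the first approach would continue with the `uncommon` list and the AS values only, while the second would compare `as_scaled` and `cad_scaled` for each gene in `common`.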
Keywords: data mining, systems biology, data unification, Atherosclerosis, CAD, gene expression data.
1. Ross R. Atherosclerosis - an inflammatory disease, N Engl J Med. 340(2):115-26, 1999.