User:Matthew Whiteside/Notebook/Malaria Microarray/2009/07/15

From OpenWetWare

Task 4.3: MAID meta-analysis of Human Malaria Microarray Datasets

MAID description

MAID paper:

  1. Borozan I, Chen L, Paeper B, Heathcote JE, Edwards AM, Katze M, Zhang Z, and McGilvray ID. MAID: an effect size based model for microarray data integration across laboratories and platforms. BMC Bioinformatics. 2008 Jul 10;9:305. DOI:10.1186/1471-2105-9-305 | PubMed ID:18616827 [MAID]

Meta-analysis: integrating results from different microarray datasets that address similar questions, in order to increase the sensitivity & reliability of gene expression measurements.

Effect sizes: MAID uses effect sizes. A typical p-value is the probability of seeing a relationship at least as strong as the observed one by chance alone (i.e. if there were no real effect). An effect size instead measures the strength of the relationship (direction & magnitude of an effect). The effect size referred to here is the standardized difference in means between treatment & control (see MAID paper).
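To make the idea of a standardized mean difference concrete, here is a minimal Python sketch for a single gene. It assumes a pooled-SD estimate with the usual small-sample bias correction (as used in GeneMeta-style analyses); the function name and the toy numbers are hypothetical, and the exact estimator in the MAID paper may differ.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treatment, control):
    """Return (g, var_g) for one gene: a bias-corrected standardized
    mean difference and its approximate sampling variance.
    Illustrative only; not taken from the maid package."""
    n_t, n_c = len(treatment), len(control)
    # Pooled variance across the two groups.
    pooled_var = ((n_t - 1) * variance(treatment)
                  + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    d = (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
    # Small-sample bias correction factor.
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    g = correction * d
    # Approximate variance of the corrected effect size.
    var_g = (n_t + n_c) / (n_t * n_c) + g ** 2 / (2 * (n_t + n_c))
    return g, var_g

# Toy log-expression values for one gene (made up for illustration).
g, var_g = standardized_mean_difference([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

The sign of g gives the direction (up or down in treatment) and its size gives the magnitude in units of standard deviations, which is what makes it comparable across platforms.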

The benefit of MAID over GeneMeta is that it accommodates direct microarray designs (2-color arrays).

Performing MAID meta-analysis

After preparing the microarray data as an array-level expression matrix (link), I used the R script ~/.../meta_analysis/maid_de_workflow.R.

I used the maid 1.0.1 Bioconductor package provided by the authors. It relies on an old version of the exprSet objects, so I had to modify the code.

Results

Arrays from contrasts: 1.1, 2.3, 3.1, 4.1, 5.1 were used in 1st meta-analysis. These microarray datasets come from very different tissues.

DE genes

916 genes were identified as DE (FDR < 0.05) by the MAID meta-analysis: 150 up-regulated and 766 down-regulated. All results are in dir: .../meta_analysis/de_genes/maid/.

In MAID, either a FEM (fixed effects model), which assumes no between-study variation, or a REM (random effects model), which allows between-study variation, is used (see MAID paper). To assess which model is appropriate, the hypothesis H0: τ² = 0 is tested (τ² is the between-study variance).

The Q-statistic (eqn 10 in paper) ignores between-study variability and will follow a Chi-sq distribution (with degrees of freedom = number of studies - 1) if H0 is true.

Q = Σ_j w_j · (γ_j - μ_FEM)²

γ_j is the observed effect size in study j, w_j is the weight given to study j (see MAID paper), and μ_FEM is the weighted least-squares mean:

μ_FEM = (Σ_j w_j · γ_j) ÷ (Σ_j w_j)
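As a worked illustration of these two quantities for a single gene, here is a short Python sketch. It assumes the weights are the inverse within-study variances of the effect sizes, which is the usual choice but is an assumption here; the function name and inputs are hypothetical.

```python
def fixed_effects_q(effects, variances):
    """Return (mu_fem, q) for one gene across studies.
    effects[j] is the effect size in study j, variances[j] its variance.
    Weights are assumed to be inverse variances (illustrative choice)."""
    weights = [1.0 / v for v in variances]
    # Weighted least-squares mean of the per-study effect sizes.
    mu_fem = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    # Weighted sum of squared deviations from that mean.
    q = sum(w * (g - mu_fem) ** 2 for w, g in zip(weights, effects))
    return mu_fem, q

# Three studies with equal precision (made-up numbers).
mu_fem, q = fixed_effects_q([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

A large Q relative to the number of studies minus one suggests the per-study effect sizes disagree more than within-study noise alone would explain.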

Here is a histogram of the Q statistic values. Does it look like a Chi-sq?

Here is a Q-Q plot of the observed Q-values & the expected Q-values from Chi-sq distribution. Are the observed close to the expected? Doesn't look like it.

Based on those results I went with REM.
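For reference, a rough Python sketch of what a random effects summary looks like for one gene. It assumes the DerSimonian-Laird moment estimate of the between-study variance and inverse-variance weights; this is a common choice, not necessarily the exact procedure in the maid package, and all names are hypothetical.

```python
import math

def random_effects_summary(effects, variances):
    """Return (tau2, mu_rem, z) for one gene across studies.
    Uses a DerSimonian-Laird style estimate of tau2 (an assumption)."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    mu_fem = sum(wi * g for wi, g in zip(w, effects)) / sum_w
    q = sum(wi * (g - mu_fem) ** 2 for wi, g in zip(w, effects))
    # Moment estimate of between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight each study by its total (within + between) variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mu_rem = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    z = mu_rem / math.sqrt(1.0 / sum(w_star))
    return tau2, mu_rem, z

# Two studies that disagree strongly (made-up numbers).
tau2, mu_rem, z = random_effects_summary([0.0, 2.0], [0.5, 0.5])
```

When tau2 comes out as zero the weights reduce to the fixed effects weights, so the REM result matches the FEM result for that gene.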

One of the assumptions is that the transformed data (the effect size z-scores) are approximately normal. To examine this I created a Q-Q normality plot with the REM model z-scores. Here is that figure. Are the data normal?
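The points behind a Q-Q normality plot can be computed as in the following Python sketch. The plotting positions (i + 0.5) / n are one common convention and are an assumption here; the function name is hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(z_scores):
    """Return a list of (expected, observed) pairs for a normal Q-Q plot."""
    n = len(z_scores)
    observed = sorted(z_scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions stay strictly inside (0, 1), so inv_cdf is defined.
    expected = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))

# Made-up z-scores for illustration.
points = normal_qq_points([0.5, -1.0, 2.0])
```

If the z-scores are approximately standard normal, the pairs fall close to the line observed = expected; systematic curvature in the tails indicates a departure from normality.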

MAID / GeneMeta also suggest plots to show the improvement in sensitivity & recall provided by the meta-analysis.

The IDR (integration-driven discovery rate) plot shows the proportion of additional z-scores above a threshold in the combined studies vs the individual studies (i.e. what proportion of predictions by the meta-analysis are unique to the meta-analysis and would not be found by analysing each study individually). There is one line for positive effect sizes and one for negative effect sizes. Here is the IDR plot:
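A minimal Python sketch of the IDR at a single threshold, following the description above. The data layout (one list of z-scores per study, aligned by gene with the combined list) and the function name are assumptions made for illustration.

```python
def idr(combined_z, study_z, threshold):
    """Return (idr_positive, idr_negative) at one z-score threshold.
    combined_z: combined z-score per gene.
    study_z: list of per-study z-score lists, aligned with combined_z."""
    def rate(sign):
        # Genes called by the meta-analysis in this direction.
        found = [i for i, z in enumerate(combined_z) if sign * z > threshold]
        if not found:
            return 0.0
        # Of those, genes that no individual study calls.
        unique = [i for i in found
                  if all(sign * s[i] <= threshold for s in study_z)]
        return len(unique) / len(found)
    return rate(+1), rate(-1)

# Four genes, two studies (made-up numbers).
combined = [3.0, 2.5, -3.0, 0.1]
studies = [[2.5, 1.0, -1.0, 0.0], [1.0, 1.5, -1.5, 0.2]]
idr_pos, idr_neg = idr(combined, studies, threshold=2.0)
```

Evaluating this over a range of thresholds gives the two curves in the plot.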

Here is a figure showing the combined and individual z-scores for contrast 2. You can see some changes (reversals of effect sizes), however there does appear to be an improvement in combined z-score for a select few of the genes - these will be the reinforced / highly-expressed genes found in other datasets.

Lastly, this plot shows the number of genes that are above an FDR cutoff in an individual study or in the combined meta-analysis. You can see that h1 & h4 each generate more results than the combined analysis.

TAKE HOME: The IDR is low and the number of genes identified is also low relative to some of the individual datasets. The only benefit of meta-analysis here is generalization - common malaria genes that will be active in many tissues. There may also be problems with including such diverse datasets, as the normality assumption may be violated.

References:

  1. GeneMeta vignette - http://www.bioconductor.org/packages/2.2/bioc/html/GeneMeta.html
  2. MAID paper.

ORA

I ran GO term and pathway ORA (over-representation analysis, InnateDB) using the 916 genes identified from the meta-analysis. Results are in .../malaria_data/meta_analysis/ora/maid/.

764 pathways were found associated with genes. 0 were statistically significant after Benjamini-Hochberg multiple hypothesis correction (FDR < 0.05). One reason for this - the genes identified by the meta-analysis were core, downstream genes; TNF, IFN-gamma, STAT etc. These are in many pathways - so result in large numbers of predictions. The BH correction probably assumes that all of these predictions were independent - this is not true, since many of the predictions were dependent.
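For reference, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a generic illustration of the step-up procedure, not InnateDB's own code; the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Four made-up pathway p-values.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [a < 0.05 for a in adj]
```

Because each raw p-value is scaled by m / rank, a large number of tested pathways pushes up the adjusted values of all but the very smallest p-values.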

This is similar for the 1761 GO terms.

Another note: many of the pathways and GO terms associated with down-regulated genes are core processes like transcription, translation etc. This is not surprising since infected cells will go into a state of cellular senescence.