User:Matthew Whiteside/Notebook/Malaria Microarray/2009/09/24

Malaria Microarray

Normalization of Mouse Malaria Datasets

Normalization procedure

All datasets were Affymetrix microarrays. CEL intensity files were downloaded from GEO. The Bioconductor method GCRMA was used for normalization. GCRMA is a suite of data processing steps that includes:

Convolution background subtraction with sequence-based non-specific binding adjustment.
- The global distribution of probe intensities is modeled using a combined Gaussian and exponential model.
- This model is used to transform the data on each array.
- GCRMA additionally corrects for differences in non-specific binding between probes using probe sequence information, whereas RMA performs a strictly global transformation.
Quantile normalization.
- Imposes the same empirical distribution on each array (based on quantile plots).
Summarization using a multi-array median polish robust regression approach.
- Summarizes probe intensities into a single gene value using a robust regression procedure that is fit across all arrays. The procedure used by GCRMA is called median polish.
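
The quantile normalization and median polish steps listed above can be sketched in a few lines. This is a minimal, language-neutral illustration in plain Python, not the Bioconductor code used for the analysis; the function names, the toy data, and the simple tie handling are assumptions made for the example.

```python
from statistics import mean, median

def quantile_normalize(columns):
    """Give every array the same empirical distribution.

    columns: one list of probe values per array, all of equal length.
    Ties are broken by probe index (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Target distribution: mean of the k-th smallest value across arrays.
    target = [mean(col[k] for col in sorted_cols) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = target[rank]
        normalized.append(out)
    return normalized

def median_polish(rows, n_iter=10):
    """Fit log(Y) = theta_i + phi_j + residual for one probeset.

    rows[j][i] is the log intensity of probe j on array i.
    Returns (theta, phi, residuals).
    """
    resid = [list(r) for r in rows]
    n_probes, n_arrays = len(resid), len(resid[0])
    probe_eff = [0.0] * n_probes
    array_eff = [0.0] * n_arrays
    for _ in range(n_iter):
        # Sweep out probe (row) medians.
        for j in range(n_probes):
            m = median(resid[j])
            probe_eff[j] += m
            resid[j] = [v - m for v in resid[j]]
        # Sweep out array (column) medians.
        for i in range(n_arrays):
            m = median(resid[j][i] for j in range(n_probes))
            array_eff[i] += m
            for j in range(n_probes):
                resid[j][i] -= m
    # Centre the probe effects so the overall level sits in theta.
    shift = median(probe_eff)
    theta = [a + shift for a in array_eff]
    phi = [p - shift for p in probe_eff]
    return theta, phi, resid

# Toy data: two arrays of three probes each.
example_arrays = [[5, 2, 3], [4, 1, 6]]
normalized = quantile_normalize(example_arrays)

# Toy probeset (3 probes x 3 arrays) that is additive except for one
# outlier of +5 on probe 0, array 0.
example_probeset = [[12, 8, 9], [8, 9, 10], [9, 10, 11]]
theta, phi, resid = median_polish(example_probeset)
```

In the toy probeset the outlying value ends up in the residuals rather than in the expression estimate, which is the property that makes the residuals and weights useful for quality assessment.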

Quality Assessment

Multi-array Approach

To identify poor-quality arrays that should be removed from further analysis, I first used a multi-array approach. This finds arrays that are of poor quality relative to other arrays in the same dataset (so, implicitly, I am assuming that some of the arrays are of suitable quality).

I first perform the RMA normalization procedure (via affyPLM), since it provides many QA diagnostic tools. Note that I later use GCRMA, since it provides better precision by taking into account non-specific binding for each probe.

The summarization step for RMA and GCRMA is as follows: robustly fit the probe-level model (PLM) equation

log(Y_gij) = θ_gi + φ_gj + ε_gij

where Y_gij is the background-corrected, normalized probe intensity, θ_gi is the expression level of gene g on array i, φ_gj is the effect of probe j, and ε_gij is the measurement error (gene g, array i, probe j). The robust fit gives standard errors SE(θ_gi) and fitting weights (outliers are weighted less), both of which can be used to assess array quality. For each array, I produce:

Array intensity distribution metric (all arrays should have similar distributions) - file called: array_intensity_histogram.png
Relative Log Expression (RLE) metric (how far does the estimated gene expression value θ_gi deviate from the median across all arrays? Arrays with large RLE may be suspect) - file called: rlm_plot.png
Normalized Unscaled Standard Error (NUSE) (does the array contain large standard errors SE(θ_gi)? Arrays with large NUSE may be suspect. Because genes have different variability, the SE of each gene is normalized so that its median NUSE across all arrays is 1) - file called: nuse_plot.png

Arrays that are suspect based on large RLE or NUSE values are then examined further.
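
As a rough illustration of how the RLE and NUSE metrics are derived from the fitted values, here is a minimal sketch in plain Python. It is not the affyPLM implementation; the function names and the toy matrices (3 genes by 4 arrays, with the last array deliberately made worse) are assumptions for the example.

```python
from statistics import median

def rle(theta):
    """theta[g][i]: expression estimate of gene g on array i.

    Each value is expressed relative to that gene's median across arrays.
    """
    return [[v - median(row) for v in row] for row in theta]

def nuse(se):
    """se[g][i]: standard error of theta[g][i].

    Each SE is divided by that gene's median SE across arrays, so a
    typical array scores about 1 for every gene.
    """
    return [[v / median(row) for v in row] for row in se]

def per_array_median(matrix):
    """Summarize a genes x arrays matrix by the median of each array column."""
    n_arrays = len(matrix[0])
    return [median(row[i] for row in matrix) for i in range(n_arrays)]

# Toy data: the last array has shifted expression and inflated errors.
theta = [
    [8.0, 8.1, 7.9, 9.5],
    [10.0, 10.2, 9.8, 11.0],
    [6.0, 6.1, 5.9, 7.2],
]
se = [
    [0.10, 0.10, 0.12, 0.30],
    [0.20, 0.22, 0.20, 0.50],
    [0.05, 0.05, 0.05, 0.20],
]

rle_medians = per_array_median(rle(theta))
nuse_medians = per_array_median(nuse(se))

# The array with the largest RLE and NUSE medians is the one to inspect.
most_suspect_rle = max(range(len(rle_medians)), key=lambda i: abs(rle_medians[i]))
most_suspect_nuse = max(range(len(nuse_medians)), key=lambda i: nuse_medians[i])
```

In practice the full per-array distributions are compared as boxplots rather than reduced to a single median, so the summary above is only a stand-in for reading the plots.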

Single Array exploratory analysis

To examine suspect arrays closely, I first generate an MA plot. This is another multi-array approach. MA plots show M, the log fold-change of the array relative to a pseudo-median reference computed across all arrays, against A, the average log intensity. The plot also contains a loess-smoothed line. If the average fold-change varies with intensity, this is a problem.
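
The M and A values behind such a plot can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions: the reference is taken as the per-gene median across arrays, the function names are hypothetical, and a crude binned median replaces the loess curve used in the real plot.

```python
from statistics import mean, median

def ma_values(log_expr, which):
    """Return (A, M) pairs for one array against a pseudo-median reference.

    log_expr[g][i]: log-scale value of gene (or probe) g on array i.
    which: index of the array being examined.
    """
    pairs = []
    for row in log_expr:
        ref = median(row)             # pseudo-median reference value
        m = row[which] - ref          # log fold-change vs. reference
        a = (row[which] + ref) / 2    # average log intensity
        pairs.append((a, m))
    return pairs

def binned_trend(pairs, n_bins=3):
    """Crude stand-in for a loess line: median M in roughly n_bins bins of A."""
    ordered = sorted(pairs)
    size = max(1, len(ordered) // n_bins)
    chunks = [ordered[k:k + size] for k in range(0, len(ordered), size)]
    return [(mean(a for a, _ in c), median(m for _, m in c)) for c in chunks]

# Toy data: array 0 reads 10% high on the log scale, so its bias grows
# with intensity; arrays 1 and 2 agree with each other.
base = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
log_expr = [[b * 1.1, b, b] for b in base]

pairs_bad = ma_values(log_expr, which=0)
pairs_ok = ma_values(log_expr, which=1)
trend_bad = binned_trend(pairs_bad)
trend_ok = binned_trend(pairs_ok)
```

A flat trend near M = 0, as for the second array here, is the expected behaviour; a trend that drifts with A, as for the first array, is the pattern described above as a problem.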

I then also generate pseudo-images of the chip to look for spatial artifacts. I generate four color images:

raw log-transformed probe intensities - should show obvious defects in the chip
weights from the robust regression - areas that contain high concentrations of outliers (low weights) will be obvious
residuals from the robust regression - areas that contain high concentrations of probes that vary a lot (based on the robust regression gene/probe model)
signed residuals from the robust regression - areas that contain high concentrations of probes that vary a lot in one direction (based on the robust regression gene/probe model), e.g. a region where all probes are lower than the probes elsewhere on the chip

Results

Using this strategy, I discarded only one array in study m1: array 32, GSM189627. There are 4 biological replicates and 2 technical replicates, so this should be fine. Here are some of the diagnostic images from that array:

Array 32 MA plot

Array 32 Pseudo-images

NUSE plot for dataset m1

Implementation

Scripts for normalization and quality assessment are called normalization_workflow_m#.R and are in the mouse_bioconductor folder. The normalized data and output plots are in the celfiles/GSE####/ directories.
