User:Matthew Whiteside/Notebook/Malaria Microarray/2009/09/24

From OpenWetWare

Normalization of Mouse Malaria Datasets

Normalization procedure

All datasets were Affymetrix microarrays. CEL intensity files were downloaded from GEO. The Bioconductor method GCRMA was used for normalization. GCRMA is a suite of data processing steps that includes:

  1. Convolution background subtraction with sequence-based non-specific binding adjustment.
    • The global distribution of probe intensities is modeled as a combination of a Gaussian component and an exponential component.
    • This model is used to transform the data on each array.
    • GCRMA additionally corrects for differences in non-specific binding between probes using probe sequence information, whereas RMA applies a strictly global transformation.
  2. Quantile normalization.
    • Imposes the same empirical distribution on each array (based on quantile plots).
  3. Summary method using a multi-array median polish robust regression approach.
    • Summarizes probe intensities into a single gene value using a robust regression procedure that is fit across all arrays. The procedure used by GCRMA is called median polish.
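
Steps 2 and 3 above can be illustrated with a minimal sketch. This is not the gcrma/affy implementation and not the R code used in this notebook; it is plain Python written only to show the idea, and the function names quantile_normalize and median_polish are hypothetical. It assumes the intensities passed to median_polish are already background-corrected, normalized and log-transformed, and that there are no missing values.

```python
from statistics import mean, median


def quantile_normalize(columns):
    """Give every array (column) the same empirical distribution.

    `columns` is a list of arrays, each a list of probe intensities of
    equal length. Ties are broken by position (an assumption; real
    implementations may average tied ranks).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [mean(col[k] for col in sorted_cols) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda idx: col[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized


def median_polish(rows, n_iter=10):
    """Fit rows[p][i] ~ overall + probe_effect[p] + array_effect[i].

    `rows` holds the log intensities of one probe set (probes x arrays).
    Returns (expression per array, residual matrix).
    """
    resid = [list(r) for r in rows]
    n_probes, n_arrays = len(resid), len(resid[0])
    overall = 0.0
    probe_eff = [0.0] * n_probes
    array_eff = [0.0] * n_arrays
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for p in range(n_probes):
            m = median(resid[p])
            probe_eff[p] += m
            resid[p] = [v - m for v in resid[p]]
        m = median(array_eff)
        overall += m
        array_eff = [v - m for v in array_eff]
        # Sweep out column (array) medians.
        for i in range(n_arrays):
            m = median(resid[p][i] for p in range(n_probes))
            array_eff[i] += m
            for p in range(n_probes):
                resid[p][i] -= m
        m = median(probe_eff)
        overall += m
        probe_eff = [v - m for v in probe_eff]
    return [overall + a for a in array_eff], resid
```

Because medians are used instead of means, a single outlying probe on one array ends up in the residuals rather than shifting the gene-level estimate for that array.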

Quality Assessment

Multi-array Approach

To identify poor-quality arrays that should be removed from further analysis, I first used a multi-array approach. This finds arrays that are of poor quality relative to other arrays in the same dataset (so, implicitly, I am assuming that some of the arrays are of suitable quality).

I first perform the RMA normalization procedure (via affyPLM), since it provides many QA diagnostic tools. Note that I later use GCRMA, since it provides better precision by taking into account non-specific binding for each probe.

The summarization step for RMA and GCRMA is as follows: robustly fit the probe-level model (PLM) equation:

log(Ygij) = θgi + φgj + εgij

Here Ygij is the background-corrected, normalized probe intensity, θgi is the expression level of gene g on array i, φgj is the effect of probe j, and εgij is the measurement error (gene g, array i, probe j). The robust fit yields standard errors SE(θgi) and fitting weights (outliers are weighted less), both of which can be used to assess array quality. For each array, I produce:

  1. Array intensity distribution metric (all arrays should have similar distributions) - file called: array_intensity_histogram.png
  2. Relative Log Expression (RLE) metric (how far does the estimated gene expression value θgi deviate from the median across all arrays? Arrays with large RLE values may be suspect) - file called: rlm_plot.png
  3. Normalized Unscaled Standard Error (NUSE) (does the array contain large standard errors SE(θgi)? Arrays with large NUSE may be suspect. Because genes have different variability, the SE is normalized per gene so that the median NUSE across all arrays is 1) - file called: nuse_plot.png

Arrays that are suspect based on large RLE or NUSE values are then examined further.
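
The two per-gene metrics above can be sketched as follows. This is an illustrative plain-Python version, not the affyPLM code used for the plots; the function names rle, nuse and array_medians are hypothetical, and the inputs are assumed to be complete gene-by-array matrices of θgi estimates and SE(θgi) values.

```python
from statistics import median


def rle(expr):
    """Relative Log Expression.

    expr[g][i] is the log expression estimate of gene g on array i.
    Returns one list of RLE values per array: each value is the gene's
    estimate minus that gene's median across all arrays.
    """
    per_array = [[] for _ in expr[0]]
    for gene_row in expr:
        ref = median(gene_row)
        for i, value in enumerate(gene_row):
            per_array[i].append(value - ref)
    return per_array


def nuse(se):
    """Normalized Unscaled Standard Error.

    se[g][i] is the standard error of the estimate for gene g on array i.
    Each SE is divided by that gene's median SE across arrays, so a
    typical array has values near 1.
    """
    per_array = [[] for _ in se[0]]
    for gene_row in se:
        ref = median(gene_row)
        for i, value in enumerate(gene_row):
            per_array[i].append(value / ref)
    return per_array


def array_medians(per_array):
    """Collapse the per-array value lists to one median per array."""
    return [median(values) for values in per_array]
```

An array whose median RLE sits far from 0, or whose median NUSE sits well above 1, would be the kind of array flagged for closer inspection; any cutoff used to flag it is a judgment call and is not specified here.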

Single Array exploratory analysis

To examine suspect arrays more closely, I first generate an MA plot. This is another multi-array approach. The MA plot shows M, the log fold-change of the array relative to a pseudo-median reference computed across all arrays, against the intensity. It also contains a loess-smoothed line. If the average fold-change varies with intensity, this indicates a problem.
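
The quantities behind such a plot can be sketched in a few lines. This is a hedged illustration in plain Python rather than the Bioconductor plotting code: the function names ma_values and binned_trend are hypothetical, A is assumed to be the average of the array and the reference log intensity, and a binned median stands in for the loess line because it needs no extra libraries.

```python
from statistics import median


def ma_values(log_intensities, array_index):
    """Return (A, M) pairs for one array against a pseudo-median reference.

    log_intensities[p][i] is the log intensity of probe p on array i.
    The reference value for each probe is its median across all arrays.
    """
    points = []
    for row in log_intensities:
        ref = median(row)
        m = row[array_index] - ref        # log fold-change vs reference
        a = (row[array_index] + ref) / 2  # average log intensity
        points.append((a, m))
    return points


def binned_trend(points, n_bins=10):
    """Crude substitute for a loess line: median M within bins of A."""
    pts = sorted(points)
    size = max(1, len(pts) // n_bins)
    trend = []
    for start in range(0, len(pts), size):
        chunk = pts[start:start + size]
        trend.append((median(a for a, _ in chunk),
                      median(m for _, m in chunk)))
    return trend
```

For a well-behaved array the trend stays close to M = 0 over the whole intensity range; a trend that drifts away from 0 at low or high A is the intensity-dependent bias described above.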

I then generate pseudo-images of the chip to look for spatial artifacts. I generate four color images:

  1. raw log-transformed probe intensities - should show obvious defects in the chip
  2. weights from the robust regression - areas that contain high concentrations of outliers (low weights) will be obvious
  3. residuals from the robust regression - areas that contain high concentrations of probes that vary a lot (based on the robust regression gene/probe model)
  4. signed residuals from the robust regression - areas that contain high concentrations of probes that vary a lot in one direction (based on the robust regression gene/probe model), e.g. a region where all probes are lower than the probes elsewhere on the chip

Results

Using this strategy, I discarded only one array in study m1: array 32 (GSM189627). There are 4 biological replicates and 2 technical replicates, so losing this array should be fine. Here are some of the diagnostic images from that array:

Array 32 MA plot

Array 32 Pseudo-images

NUSE plot for dataset m1

Implementation

Scripts for normalization and quality assessment are called normalization_workflow_m#.R and are in the mouse_bioconductor folder. The normalized data and output plots are in the celfiles/GSE####/ directories.