User:Matthew Whiteside/Notebook/Malaria Microarray/2009/09/24
Normalization of Mouse Malaria Datasets
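All datasets were Affymetrix microarrays. CEL intensity files were downloaded from GEO. The Bioconductor method GCRMA was used for normalization. GCRMA is a suite of data processing steps that includes background adjustment, normalization across arrays, and summarization of probe intensities into gene-level expression values.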
To identify poor-quality arrays that should be removed from further analysis, I first used a multi-array approach. This finds arrays that are of poor quality relative to the other arrays in the same dataset (so I am implicitly assuming that at least some of the arrays are of suitable quality).
I first perform the RMA normalization procedure (via affyPLM) because it provides many QA diagnostic tools. Note that I later use GCRMA, since it provides better precision by taking into account non-specific binding for each probe.
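RMA and GCRMA both use quantile normalization to make the intensity distributions of all arrays in a dataset identical. The following is a minimal Python sketch of that idea only, not the Bioconductor implementation; the function name `quantile_normalize` and the toy values are made up for illustration, and ties are handled more crudely than in a production implementation.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize arrays given as equal-length lists of log intensities.

    Each inner list is one array (chip). Ties are broken by probe position,
    which is a simplification; real implementations treat ties more carefully.
    """
    n_probes = len(columns[0])
    if any(len(col) != n_probes for col in columns):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each array, then average across arrays at each rank to obtain
    # the shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[r] for col in sorted_cols) for r in range(n_probes)]

    normalized = []
    for col in columns:
        # order[r] is the index of the probe that holds rank r on this array.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: 3 arrays x 4 probes (hypothetical log intensities).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for column in normalized:
    print([round(v, 3) for v in column])
```

After normalization every array contains the same set of values, and only the assignment of values to probes differs between arrays.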
The summarization step for RMA and GCRMA is as follows: robustly fit the probe-level model (PLM) equation
$\log(Y_{gij}) = \theta_{gi} + \phi_{gj} + \varepsilon_{gij}$
where $Y_{gij}$ is the background-corrected, normalized probe intensity, $\theta_{gi}$ is the expression level of gene g on array i, $\phi_{gj}$ is the effect of probe j, and $\varepsilon_{gij}$ is the measurement error (gene g, array i, probe j). The robust fit yields the standard errors $SE(\theta_{gi})$ and the fitting weights (outliers are down-weighted), both of which can be used to assess array quality. For each array, I produce:
Arrays that are suspect because of large RLE or NUSE values are then examined further.
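As a rough illustration of how such multi-array diagnostics are derived from a fitted PLM, the sketch below computes Relative Log Expression (RLE) and Normalized Unscaled Standard Error (NUSE) values from a matrix of expression estimates and a matrix of their standard errors. This is a plain Python sketch with invented toy numbers, not the affyPLM code; the function names and the cutoff `NUSE_CUTOFF` are assumptions chosen only for the example.

```python
from statistics import median


def rle(expr):
    """expr[g][i] is the log expression of gene g on array i.

    Returns one list per array holding that array's RLE values:
    each estimate minus the gene's median across all arrays.
    """
    per_array = [[] for _ in expr[0]]
    for row in expr:
        gene_median = median(row)
        for i, value in enumerate(row):
            per_array[i].append(value - gene_median)
    return per_array


def nuse(se):
    """se[g][i] is the standard error of the estimate for gene g on array i.

    Returns one list per array holding that array's NUSE values:
    each standard error divided by the gene's median SE across arrays.
    """
    per_array = [[] for _ in se[0]]
    for row in se:
        gene_median = median(row)
        for i, value in enumerate(row):
            per_array[i].append(value / gene_median)
    return per_array


# Toy data: 4 genes x 3 arrays; the third array is deliberately poor.
expr = [
    [8.0, 8.1, 9.0],
    [6.0, 5.9, 7.2],
    [10.0, 10.2, 11.0],
    [7.0, 7.1, 8.3],
]
se = [
    [0.10, 0.11, 0.30],
    [0.20, 0.18, 0.50],
    [0.05, 0.05, 0.12],
    [0.15, 0.16, 0.40],
]

median_rle = [median(values) for values in rle(expr)]
median_nuse = [median(values) for values in nuse(se)]

# Hypothetical cutoff for this example only; not a recommended threshold.
NUSE_CUTOFF = 1.1
flagged = [i for i, value in enumerate(median_nuse) if value > NUSE_CUTOFF]

print("median RLE per array:", [round(v, 3) for v in median_rle])
print("median NUSE per array:", [round(v, 3) for v in median_nuse])
print("flagged array indices:", flagged)
```

A good array has RLE values centered near 0 and NUSE values centered near 1; an array whose distributions sit well away from those values, like the third one here, is a candidate for closer inspection.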
Single Array exploratory analysis
To examine suspect arrays more closely, I first generate an MA plot. This is another multi-array approach. An MA plot shows M, the log fold-change of the array relative to a pseudo-median reference computed across all arrays, against A, the average log intensity. It also contains a loess-smoothed line. If the average fold-change varies with intensity, this indicates a problem with the array.
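The M and A values behind such a plot can be sketched as follows. This is an illustrative Python example on made-up log intensities, not the Bioconductor plotting code; the function names are hypothetical, and a simple binned median stands in for the loess smoother.

```python
from statistics import median


def ma_values(log_intensity, which):
    """log_intensity[p][i] is the log intensity of probe p on array i.

    Compares array `which` to a pseudo-median reference array (the
    probe-wise median across all arrays). Returns (M values, A values).
    """
    m_vals, a_vals = [], []
    for row in log_intensity:
        reference = median(row)
        m_vals.append(row[which] - reference)        # log fold-change
        a_vals.append((row[which] + reference) / 2)  # average log intensity
    return m_vals, a_vals


def binned_median_trend(m_vals, a_vals, n_bins=3):
    """Crude stand-in for a loess curve: median M within bins ordered by A.

    Bins hold len(m_vals) // n_bins probes each; a shorter final bin may
    remain when the probe count is not divisible by n_bins.
    """
    order = sorted(range(len(a_vals)), key=lambda k: a_vals[k])
    size = max(1, len(order) // n_bins)
    trend = []
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        trend.append((median(a_vals[k] for k in idx),
                      median(m_vals[k] for k in idx)))
    return trend


# Toy data: 6 probes x 3 arrays; the third array drifts upward at high intensity.
data = [
    [4.0, 4.1, 4.0],
    [5.0, 5.1, 5.2],
    [6.0, 5.9, 6.4],
    [8.0, 8.1, 8.6],
    [10.0, 10.1, 11.0],
    [12.0, 11.9, 13.2],
]
m_vals, a_vals = ma_values(data, which=2)
trend = binned_median_trend(m_vals, a_vals)
for a, m in trend:
    print(f"A = {a:.2f}  median M = {m:.2f}")
```

For a well-behaved array the trend stays close to M = 0 at every intensity; a trend that rises or falls with A, as in this toy example, is the intensity-dependent bias described above.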
I also generate pseudo-images of the chip to look for spatial artifacts. I generate four color images:
Using this strategy, I discarded only one array in study m1: array 32 (GSM189627). There are 4 biological replicates and 2 technical replicates, so losing this array should be fine. Here are some of the diagnostic images from that array:
Array 32 MA plot
Array 32 Pseudo-images
NUSE plot for dataset m1
Scripts for normalization and quality assessment are called normalization_workflow_m#.R and are in the mouse_bioconductor folder. The normalized data and output plots are in the celfiles/GSE####/ directories.