User:Matthew Whiteside/Notebook/Malaria Microarray/2009/05/29

From OpenWetWare

Jump to: navigation, search
Project name Main project page
Previous entry      Next entry

Task 4.1: Prepare gene expression data matrix for meta-analysis



There are 5 experiments that i will be using:

Error fetching PMID 16988231:
Error fetching PMID 16738667:
Error fetching PMID 17579077:
Error fetching PMID 17383656:
Error fetching PMID 16714553:
  1. Error fetching PMID 16988231: [ocken2006]
    PBMC study from US & African volunteers

  2. Error fetching PMID 16738667: [boldt2006]
    African children whole blood study

  3. Error fetching PMID 17579077: [muehl2007]
    Placental malaria study

  4. Error fetching PMID 17383656: [chakra2007]
    Human Umbilical Endothelial cell-line study

  5. Error fetching PMID 16714553: [tripath2006]
    Brain Endothelial cell-line study

All Medline abstracts: PubMed HubMed

Normalizing Data

Normalized data was provided from GEO for all datasets (GDSXXXX). Datasets 1-3 were GEO datasets. Dataset 4 was from ArrayExpress and was also normalized. Dataset 5 was not normalized. Karsten helped with the normalization and data checks. Data quality checks were already performed and provided on ArrayWiki ( The first array showed some artifacts & low quality score (88.63%) - these were cleaned: the spots with high variance have been replaced with the median value of this probe from other chips in the dataset (see here for more details: Karsten generated two normalized versions of the data:

I have placed files with normalized values into /tmp on koch:
khokamp@koch:/tmp> l GSE9861_*
-rw-r--r-- 1 khokamp wg-users 3257047 2009-05-13 13:56 GSE9861_clean_gcrma.txt
-rw-r--r-- 1 khokamp wg-users 3256320 2009-05-13 13:56 GSE9861_gcrma.txt
But if you are going to analyse them with limma, you could easily re-do 
this yourself in R:
1. Change into the directory containing the (compressed) CEL files, e.g. 
cd /tmp/GSE9861_cleanCEL
2. start R
3. read data and normalize:
data = ReadAffy()
norm_data = gcrma(data)
norm_data is an expressionSet that should be readily usable with limma
The last step that I applied for extracting the data was
write.exprs(norm_data, file='GSE9861_clean_gcrma.txt')
It might be best to run both the original and the clean data set through 
limma to see if the resulting gene lists differ greatly.

I found no differentially-expressed genes when using the cleaned data. I will proceed with the original data - but may decide to drop this dataset altogether.

See scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R for scripts that were used to obtain the data.

Data preparation

Broadly speaking there are two types of meta-analysis techniques that i plan to use:

  1. accounts for variation and sample size (MAID). This would use array-level data.
  2. does not account for sample size (RankProd). This would use single summary expression column / study (so studies with more arrays are not more represented).

There are a number of arrays (1) and contrasts (2) i could use from each dataset. My first attempt where i considered a large number of contrasts (bioc_workflow_h*.R) can be found in the ~.../de_genes directory. This data contained a few problems (dataset 4 was labelled incorrectly). So always reference the ~.../de_genes/2 directory, even though its less extensive.

Probe Annotation

Refer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. Basic strategy for probe annotation:

  • I used the annotation packages from Bioconductor to load the probe to gene mappings for each affy platform (bioconductor version 2.4) to create a uniform source (geo also provides this info, but not all datasets are in geo)
  • I mapped affy probeset IDs to NCBI Unigene IDs (Entrez Genes)
  • I removed probes that mapped to multiple genes (none actually - affy probes are built off of Unigenes i believe)
  • I removed probes that mapped to no genes.
  • I selected the probe that had the smallest p-value from the LIMMA analysis, when there were multiple probes that mapped to the same gene. (this was the smallest in any of the contrasts that i considered for that dataset)

Data Checks

Refer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. I produced a heatmap showing how each array clustered within a dataset. I used pearson correlation of DE genes that were found across multiple arrays (between 100-1000 genes) and hierarchial clustering. I checked that arrays of uninfected and infected clustered together. Only for dataset 1.1 & 1.2, did i find any problems - i removed the arrays that did not cluster as expected (see h1.*.before_arrays_removed.heatmap.png for the before version).

I also generated volcano plots to visualize the microarray data for particular contrasts of interest. These files are in ~/data/projects/b_malaria/de_genes/2


I considered the following contrasts and output files are named as follows:

Microarray datasets & associated filenames
Microarray Set: h1
 PBMC Study, pmid:16988231
 US volunteer uninfected vs. US volunteer infected
 US volunteer uninfected vs. African volunteer infected
 African volunteer treated vs. African volunteer infected
 US volunteer infected vs. African volunteer infected

Microarray Set: h2
 Whole Blood Study, pmid:16738667
 Gabonese children infected w/ uncomplicated malaria vs. uninfected Gabonese children
 Gabonese children infected w/ complicated malaria vs. uninfected Gabonese children
 Complicated malaria + uncomplicated malaria vs. uninfected

Microarray Set: h3
 Placenta Study, pmid:17579077
 Chronic, inflamed malaria infected Placenta vs. healthy control placenta

Microarray Set: h4
 Plasmodium falciparum infected RBC (pRBC) & HUVEC endothelial study, pmid:17383656
Subsets (HUVEC monolayers incubated with):
 Plasmodium falciparum infected Red Blood Cells (PRBC) vs. Red Blood Cells (RBC)
 PRBC + Tumor Necrosis Factor (PRBCTNF) vs. RBC + TNF (RBCTNF)

Microarray Set: h5
 Plasmodium falciparum infected RBC (pRBC) & brain endothelial study, pmid:16714553
 (original version not the 'clean'ed version with artifacts removed)
 high-binding PRBC + low-binding PRBC vs. RBC control + Medium control

I decided to go with the following contrasts: 1.1, 2.3, 3.1, 4.1, 5.1.

Summarized Expression Matrix

I used LIMMA to compute p-values and produce a summarized logFC value for a particular contrast. A inter-study comparison of this summarized logFC values was also performed (see data in ~/.../meta_analysis/sample_selection/2/ - parent directory contains some false data but is more extensive and therefore was left in). A heatmap of the hierarchical clustering of the pearson correlation was performed for all contrasts considered (*all*) and for the final subset (*final*) of contrasts that was decided upon to go forward with. Overlapping DE genes for different combinations of contrasts in the *all* and *final* groups was also computed. The script ~/.../meta_analysis/de_gene_set_workflow.R was used to compute these figures.

Some notes on these figures. The heatmaps show that no two study are tightly clustered (approx equi-distant) and interestingly, the cell-line studies did not separate from the in vivo tissue studies. Dataset 5 has the smallest overlap with the other studies, in terms of DE genes identified. Only 2 genes were common to all studies (as being DE).

Array-Level Expression Matrix

The arrays for a particular contrast were also output as a large combined matrix. There were common 12769 Entrez Genes that had recorded expression values in all studies (The contrasts 1.1, 2.3, 3.1, 4.1, 5.1).

Personal tools