User:Matthew Whiteside/Notebook/Malaria Microarray/2009/05/29

{| width="800"
 * style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]] Project name
 * style="background-color: #F2F2F2" align="center"|  |Main project page
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|
 * colspan="2"|

Datasets
There are 5 experiments that i will be using: //PBMC study from US & African volunteers //African children whole blood study //Placental malaria study //Human Umbilical Endothelial cell-line study //Brain Endothelial cell-line study
 * 1) ocken2006 pmid=16988231
 * 1) boldt2006 pmid=16738667
 * 1) muehl2007 pmid=17579077
 * 1) chakra2007 pmid=17383656
 * 1) tripath2006 pmid=16714553

Normalizing Data
Normalized data was provided from GEO for all datasets (GDSXXXX). Datasets 1-3 were GEO datasets. Dataset 4 was from ArrayExpress and was also normalized. Dataset 5 was not normalized. Karsten helped with the normalization and data checks. Data quality checks were already performed and provided on ArrayWiki (http://arraywiki.bme.gatech.edu/index.php/GSE9861). The first array showed some artifacts & low quality score (88.63%) - these were cleaned: the spots with high variance have been replaced with the median value of this probe from other chips in the dataset (see here for more details: http://arraywiki.bme.gatech.edu/index.php/BioPNG_format). Karsten generated two normalized versions of the data:

I have placed files with normalized values into /tmp on koch:

khokamp@koch:/tmp> l GSE9861_* -rw-r--r-- 1 khokamp wg-users 3257047 2009-05-13 13:56 GSE9861_clean_gcrma.txt -rw-r--r-- 1 khokamp wg-users 3256320 2009-05-13 13:56 GSE9861_gcrma.txt

But if you are going to analyse them with limma, you could easily re-do this yourself in R:

1. Change into the directory containing the (compressed) CEL files, e.g. cd /tmp/GSE9861_cleanCEL

2. start R

3. read data and normalize: library(gcrma) data = ReadAffy norm_data = gcrma(data)

norm_data is an expressionSet that should be readily usable with limma

The last step that I applied for extracting the data was write.exprs(norm_data, file='GSE9861_clean_gcrma.txt')

It might be best to run both the original and the clean data set through limma to see if the resulting gene lists differ greatly.

I found no differentially-expressed genes when using the cleaned data. I will proceed with the original data - but may decide to drop this dataset altogether.

See scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R for scripts that were used to obtain the data.

Data preparation
Broadly speaking there are two types of meta-analysis techniques that i plan to use:
 * 1) accounts for variation and sample size (MAID). This would use array-level data.
 * 2) does not account for sample size (RankProd). This would use single summary expression column / study (so studies with more arrays are not more represented).

There are a number of arrays (1) and contrasts (2) i could use from each dataset. My first attempt where i considered a large number of contrasts (bioc_workflow_h*.R) can be found in the ~.../de_genes directory. This data contained a few problems (dataset 4 was labelled incorrectly). So always reference the ~.../de_genes/2 directory, even though its less extensive.

Probe Annotation
Refer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. Basic strategy for probe annotation:
 * I used the annotation packages from Bioconductor to load the probe to gene mappings for each affy platform (bioconductor version 2.4) to create a uniform source (geo also provides this info, but not all datasets are in geo)
 * I mapped affy probeset IDs to NCBI Unigene IDs (Entrez Genes)
 * I removed probes that mapped to multiple genes (none actually - affy probes are built off of Unigenes i believe)
 * I removed probes that mapped to no genes.
 * I selected the probe that had the smallest p-value from the LIMMA analysis, when there were multiple probes that mapped to the same gene. (this was the smallest in any of the contrasts that i considered for that dataset)

Data Checks
Refer to R scripts ~/workspace/other_projects/b_malaria/bioconductor/meta_workflow_h*.R. I produced a heatmap showing how each array clustered within a dataset. I used pearson correlation of DE genes that were found across multiple arrays (between 100-1000 genes) and hierarchial clustering. I checked that arrays of uninfected and infected clustered together. Only for dataset 1.1 & 1.2, did i find any problems - i removed the arrays that did not cluster as expected (see h1.*.before_arrays_removed.heatmap.png for the before version).

I also generated volcano plots to visualize the microarray data for particular contrasts of interest. These files are in ~/data/projects/b_malaria/de_genes/2

Contrasts
I considered the following contrasts and output files are named as follows:

Microarray datasets & associated filenames -- Microarray Set: h1 PBMC Study, pmid:16988231 Subsets: h1.1: US volunteer uninfected vs. US volunteer infected h1.2 US volunteer uninfected vs. African volunteer infected h1.3 African volunteer treated vs. African volunteer infected h1.4 US volunteer infected vs. African volunteer infected Microarray Set: h2 Whole Blood Study, pmid:16738667 Subsets: h2.1: Gabonese children infected w/ uncomplicated malaria vs. uninfected Gabonese children h2.2: Gabonese children infected w/ complicated malaria vs. uninfected Gabonese children h2.3 Complicated malaria + uncomplicated malaria vs. uninfected Microarray Set: h3 Placenta Study, pmid:17579077 Subsets: h3.1: Chronic, inflamed malaria infected Placenta vs. healthy control placenta Microarray Set: h4 Plasmodium falciparum infected RBC (pRBC) & HUVEC endothelial study, pmid:17383656 Subsets (HUVEC monolayers incubated with): h4.1: Plasmodium falciparum infected Red Blood Cells (PRBC) vs. Red Blood Cells (RBC) h4.2: PRBC + Tumor Necrosis Factor (PRBCTNF) vs. RBC + TNF (RBCTNF) h4.3 PRBCTNF + PRBC vs. RBCTNF + RBC Microarray Set: h5 Plasmodium falciparum infected RBC (pRBC) & brain endothelial study, pmid:16714553 (original version not the 'clean'ed version with artifacts removed) Subsets: h5.1: high-binding PRBC + low-binding PRBC vs. RBC control + Medium control

I decided to go with the following contrasts: 1.1, 2.3, 3.1, 4.1, 5.1.

Summarized Expression Matrix
I used LIMMA to compute p-values and produce a summarized logFC value for a particular contrast. A inter-study comparison of this summarized logFC values was also performed (see data in ~/.../meta_analysis/sample_selection/2/ - parent directory contains some false data but is more extensive and therefore was left in). A heatmap of the hierarchical clustering of the pearson correlation was performed for all contrasts considered (*all*) and for the final subset (*final*) of contrasts that was decided upon to go forward with. Overlapping DE genes for different combinations of contrasts in the *all* and *final* groups was also computed. The script ~/.../meta_analysis/de_gene_set_workflow.R was used to compute these figures.

Some notes on these figures. The heatmaps show that no two study are tightly clustered (approx equi-distant) and interestingly, the cell-line studies did not separate from the in vivo tissue studies. Dataset 5 has the smallest overlap with the other studies, in terms of DE genes identified. Only 2 genes were common to all studies (as being DE).

Array-Level Expression Matrix
The arrays for a particular contrast were also output as a large combined matrix. There were common 12769 Entrez Genes that had recorded expression values in all studies (The contrasts 1.1, 2.3, 3.1, 4.1, 5.1).


 * }