Dahlquist:Microarray Data Processing in R

Normalizing the Data
First, target files must be created to input all of the GPR files into R. In order to do so, open up Microsoft Notebook. The first line should say "FileName". Starting from the second line and using as many lines as necessary, copy and paste the names of all of the GPR files ordering them first by strain (wildtype and then the deletion strains in alphabetical order), then by time point, then by flask. Make sure that each line has only one GPR file name. Three of these target files must be created, one for the Ontario chip GPR files and two for the GCAT chip GPR files. One of the target files for the GCAT chips must contain the GPR files names corresponding to the tops of the GCAT chps and the other target file must contain the GPR file names corresponding to the bottom of the GCAT chips.

Make sure that the target file and the GPR files for the Ontario chips are in one folder and the the target file and the GPR files for the GCAT chips are in another folder.

Open up R. Change the directory (File > Change dir...) to the folder containing the target file and the GPR file for the Ontario chips.

Into the R Console window, type in "library(limma)".

Import the target file containing all of the GPR file names for the Ontario chips.

targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RGO<-read.maimages(targets, source="genepix.median", wt.fun=f)

Perform Within Array Normalization.

MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")

Create a blank matrix which has as many rows as there are unique genes and as many columns as there are GPR files. Average all of the replicate spots by using the tapply function. Write all of the log fold changes after averaging replicate spots into the blank matrix previously created.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean) n0<-length(MAO$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}

A .csv file must be created to denote which chips were dyeswapped. To do so, open up Microsoft Excel. Label the first column in row 1 with "dyeswap" or another label for your choosing. Each cell in the column corresponds to each GPR file in the targets file in the order that the GPR files appear in the targets file. If a chip has not been dyeswapped, type in a "1" into the appropriate cell within the column in the Excel file. If a chip has been dyeswapped, type in a "-1" into the appropriate cell within the column in the Excel file. Save this excel file as a .csv file.

Import the .csv file denoting which chips are dyeswapped into R. ds<-read.csv("dyeswap_matrix.csv",header=TRUE,sep=",")

Create a new matrix so that the number of rows corresponds to the number of unique genes and the number of column corresponds to the number of GPR files.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean) n0<-length(MAO$M[1,]) n1<-length(M1) MN<-matrix(nrow=n1,ncol=n0) MN[,1]=M1 for(i in 1:94) {MN[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}

Multiply the values in the matrix containing the log fold changes after the replicates have been averaged by the dyeswap list created.

for (i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}

Write the data to a table.

write.table(MN,"ont_WANorm_dyeswapped_Matrix.csv",sep=",")

The .csv file generated does not have the gene ID's listed. In order to obtain the list of gene ID's from R, run the tapply function with only one GPR file and write it to a table.

write.table(tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean),"Ontario_ID.csv",sep=",")

Open up the the .csv file with the within normalized data in Excel. Delete the first two rows of values corresponding to the controls.

Change the directory in R to the folder containing the target file and the GPR files for the GCAT chips.

First, import the target file containing all of the GPR file names for the top of the GCAT chips.

targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f)

Then, import the target file containing all of the GPR file names for the bottom of the GCAT chips.

targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f)

Merge the top and bottom of each of the chips into one column using the rbind function.

To within array normalize the GCAT chips,

RGG<-rbind(RT,RB)

Perform Within Array Normalization.

MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")

Create a new matrix so that the number of rows corresponds to the number of unique genes and the number of column corresponds to the number of GPR files.

m1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) g0<-length(MAG$M[1,]) g1<-length(m1) M<-matrix(nrow=g1,ncol=g0) M[,1]=m1

Use the tapply function to average duplicate spots so that only unique spots remain in the data. for(i in 1:9) {M[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

Write the table.

write.table(M,"GCAT_WANorm.csv",sep=",")

The .csv file generated does not have the gene ID's listed. In order to obtain the list of gene ID's from R, run the tapply function with only one GPR file and write it to a table.

write.table(tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean),"GCAT_ID.csv",sep=",")

Before between array normalization is performed, the exported table with the within array normalized GCAT chips must be put into Microsoft Access to eliminate any genes that are not within the Ontario chips.


 * 1) Open up Excel and create one column of the names of all of the genes on the Ontario chips. Label the column "Ontario_ID". Save as a tab delimited file.
 * 2) Save the .csv file with the within array normalized data for the GCAT chips as a tab delimited file.
 * 3) Open Microsoft Access.
 * 4) Create a new database in Access.
 * 5) Import the data (File->Get External Data->Import)
 * 6) Go through the import Wizard: specify the data as delimited, keep the delimiter as tab and the text qualifier as none and indicate that the first row contains field names, and choose my primary key as the ID names (the genes). The import wizard is gone through twice once with the GCAT data and once with the Ontario data.
 * 7) In the window for the current database, go to queries and select "Create query in Design View".
 * 8) Add both imported tables (the GCAT and Ontario).
 * 9) In the "Select Query" window, join GCAT ID and Ontario ID with a line. Right click on the line and press the "Join Properties" option and then third option (to include all records from the Ontario data and only those of the GCAT data that are also within the Ontario data).
 * 10) Select all of the fields in the GCAT query window and drag into the first box in the "Field" row in the table below.
 * 11) Select all of the fields in the Ontario query window and drag into the next free box in the "Field" row in the table below.
 * 12) Create a new table for this joined data (Query->Make-Table Query).
 * 13) Copy and pastet the new table into a new Excel spreadsheet.

Once it has been edited in Access, the within array normalized data for the GCAT chips must be copy and pasted into the .csv file with the within array normalized data for the Ontario chips. Make sure that the data for the GCAT chips is pasted in such a way that all of the data is still organized by strain, then by timepoint, then by flask. Make sure that the there is not a colum on of gene IDs. Rename this file so that it is clear that it contains the within array normalized data for all chips. Save this file in the same folder containing the target file and the GPR files for the GCAT chips so as not to change the directory already set in R. This file can be saved somewhere else but make sure to change the directory in R to the location where this file has been saved.

Read the .csv file containing the within array normalized data for all of the chips into R.

MA2<-read.csv("ALL_WANorm_dyeswapped_Matrix.csv",header=TRUE,sep=",")

Force the .csv file into a matrix.

MA3<-as.matrix(MA2)

Perform between array normalization.

MAB<-normalizeBetweenArrays(MA3,method="scale",targets=NULL)

Write the data to a table.

write.table(MAB,"ALL_BANorm_dyeswapped_Matrix.csv",sep=",")

Generating MA Plots and Box Plots
Make sure to retain the R window in which you have been working thus far.

The following instructions show how to generate MA plots for the wildtype and each of the deletion strains using the normalized data for the Ontario chips.

Set the window so that you can fit as many graphs into it as there are GPR files for the strain. In the case of the wildtype, 4 rows and 4 columns can be set. In the case of the deletion strains, 4 rows and 5 columns should be set. It is best to begin with the strain that corresponds to the first set of GPR files imported into R, which should be the wildtype.

par(mfrow=c(4,5))

To simplify some of the code later on when replicates are averaged, set the gene names to a variable.

GeneList<-RGO$genes$Name

To create MA plots for the data before any normalization has occurred, the log fold change of the spots must be calculated before they have been normalized. These values must be multiplied by the dyeswap list. The dyeswap list should still be within the memory of R. The range within the for loop corresponds to the total number of GPR files that were imported into R.

for(i in 1:94) {lr<-ds[i,]*(log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb)))}

Any replicate spots must be averaged. In the next lines of code, the range of the for loop must be adjusted to indicate which GPR files within the entire set of GPR files that were imported correspond to which strain being graphed. Refer to target file listing GPR files for the Ontario chips.

r0<-length(lr[1,]) RX<-tapply(lr[,1],as.factor(GeneList),mean) r1<-length(RX) MO<-matrix(nrow=r1,ncol=r0) MO[,1]=RX for(i in 1:14) {MO[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}

Then, the intensity values of the spots must be calculated before any of the spots have been normalized. As in the previous set of code, the range of the for loop must be adjusted to indicate which GPR files within the entire set of GPR files that were imported correspond to which strain being graphed. Refer to target file of the GPR files for the Ontario chips.

la<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb))) r3<-length(la[1,]) RQ<-tapply(la[,1],as.factor(GeneList),mean) r4<-length(RQ) AO<-matrix(nrow=r4,ncol=r3) AO[,1]=RQ for(i in 1:14) {AO[,i]<-tapply(la[,i],as.factor(GeneList),mean)}

Plot the log fold change values (M) versus the intensity values (A). Again the range of the for loop must be adjusted to indicate which GPR files within the entire set of GPR files that were imported correspond to which strain being graphed. The x and y limits of the graph, as denoted by x lim and y lim, may need to be adjusted to fit all of the data points as closely as possible.

for(i in 1:14) {plot(AO[,i],MO[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}

Expand the window that now has all of the MA plots. Then, save all of them as one JPEG file (File > Save As > JPEG > 100% quality...).

Close the window with all of the MA plots.

Designate the dimensions of the new window in which MA plots will be generated for normalized data.

par(mfrow=c(4,4))

The log fold change values for the normalized data after spots have been averaged have already been calculated earlier. They are denoted by MN[,i]. However, the intensity values of each unique spot must still be calculated for the normalized data. The range of the for loop must be adjusted to indicate which GPR files within the entire set of GPR files that were imported correspond to which strain being graphed. Refer to the GPR files as they are listed in the .cvs file with the within normalized data.

X1<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean) y0<-length(MAO$A[1,]) y1<-length(X1) AA<-matrix(nrow=y1,ncol=y0) AA[,1]=X1 for(i in 1:14) {AA[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}

Plot the log fold change values (M) versus the intensity values (A) for the normalized data. Again the range of the for loops must be adjusted to indicate which GPR files within the entire set of GPR files that were imported correspond to which strain currently being graphed. Retain the x and y limits from the MA plots depicting the data before it has been normalized.

for(i in 1:14) {plot(AA[,i],MN[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}

Expand the window that now has all of the MA plots. Then, save all of them as one JPEG file (File > Save As > JPEG > 100% quality...).

Close the window with all of the MA plots.

Before moving on to graph MA plots for the other strains, it is easiest to first generate box plots for the strain for which MA plots have already been generated.

Three box plots of the log fold change of each of the genes will be generated for each strain: before normalization has occurred, after within array normalization, and after between array normalization. Therefore, designate the dimensions of a new window in which all three box plots can be fit into.

par(mfrow=c(1,3))

First, generate the box plot depicting the data that has not been normalized. Each of the numbers in the brackets correspond to a GPR file. These numbers must be adjusted to reflect what GPR files correspond to the strain currently being graphed. Refer to the order of the GPR files as they are listed in the target file. The y limit of the graph, as denoted by y lim, may need to be adjusted depending on the spread of each plot. It is acceptable to cut off some of the extreme outliers. boxplot(MO[,1],MO[,2],MO[,3],MO[,4],MO[,5],MO[,6],MO[,7],MO[,8],MO[,9],MO[,10],MO[,1],MO[,12],MO[,13],MO[,14],ylim=c(-5,5))

Next, generate the box plot depicting the data after within array normalization. The numbers in the brackets must be adjusted to reflect what GPR files correspond to the strain currently being graphed. Refer to the order of the GPR files in the .cvs file with the within array normalized data. Retain the y limit from the box plots depicting the data before it has been normalized.

boxplot(MN[,1],MN[,2],MN[,3],MN[,4],MN[,5],MN[,6],MN[,7],MN[,8],MN[,9],MN[,10],MN[,1],MN[,12],MN[,13],MN[,14],ylim=c(-5,5))

Then, generate the box plot depicting the data after between array normalization. The numbers in the brackets must be adjusted to reflect what GPR files correspond to the strain currently being graphed. Refer to the order of the GPR files in the .cvs file with the between array normalized data. Retain the y limit from the box plots depicting the data before it has been normalized.

boxplot(MAB[,2],MAB[,3],MAB[,4],MAB[,7],MAB[,8],MAB[,9],MAB[,12],MAB[,13],MAB[,16],MAB[,17],MAB[,18],MAB[,21],MAB[,22],MAB[,23],ylim=c(-5,5))

Expand the window that now has all of the box plots. Then, save all of them as one JPEG file (File > Save As > JPEG > 100% quality...).

Close the window with all of the box plots.

To generate MA plots for the other strains, the same lines of code can be reinputted adjusting the range of the for loops to indicate which GPR files within the entire set of GPR files that were imported correspond to which strain currently being graphed. However it is not necessary to repeat the following lines of code:

GeneList<-RGO$genes$Name for(i in 1:94) {lr<-ds[i,]*(log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb)))}

Once again, before going on to generate MA plots for other strains, generate box plots for the strain for which you have just generated MA plots. The same lines of code above for the box plots can be used adjusting the numbers in the brackets to reflect what GPR files correspond to the strain currently being graphed.

Similar lines of code can be used to generate MA plots and box plots for the GCAT chip data.

Set the window so that you can fit as many graphs into it as there are GPR files for the strain.

par(mfrow=c(3,3))

To simplify the code to average replicate spots, set the gene IDs to a variable.

GeneList<-RGG$genes$ID

To create MA plots for the data before any normalization has occurred, the log fold change of the spots must be calculated before they have been normalized.

lr<-(log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb)))

Average all of the replicate spots so that that each of the log fold changes correspond to a unique spot. The range of the for loop correspond to the number of GPR files for the GCAT chips. Refer to the target file listing GPR files for the GCAT chips.

r0<-length(lr[1,]) RX<-tapply(lr[,1],as.factor(GeneList),mean) r1<-length(RX) MG<-matrix(nrow=r1,ncol=r0) MG[,1]=RX for(i in 1:9) {MG[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}

Then, the intensity values of the spots must be calculated before any of the spots have been normalized. As in the previous set of code, the range of the for loop corresponds to the number of GPR files for the GCAT chips.

la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb))) r3<-length(la[1,]) RQ<-tapply(la[,1],as.factor(GeneList),mean) r4<-length(RQ) AG<-matrix(nrow=r4,ncol=r3) AG[,1]=RQ for(i in 1:9) {AG[,i]<-tapply(la[,i],as.factor(GeneList),mean)}

Plot the log fold change values (M) versus the intensity values (A). Again the range for the for loop corresponds to the number of GPR files for the GCAT chips. It is preferable to retain the x and y limits of the MA plots generated for the Ontario chips so that the MA plots for both types of chips can be looked at side by side.

for(i in 1:9) {plot(AG[,i],MG[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}

Expand the window that now has all of the MA plots. Then, save all of them as one JPEG file (File > Save As > JPEG > 100% quality...).

Close the window with all of the MA plots.

Designate the dimensions of the new window in which MA plots will be generated for normalized data.

par(mfrow=c(3,3))

The log fold change values for the normalized data after spots have been averaged have already been calculated earlier. They are denoted by M[,i]. However, the intensity values of each unique spot must still be calculated for the normalized data. The range of the for loop corresponds to the number of GPR files for the GCAT chips.

X1<-tapply(MAG$A[,1],as.factor(MAG$genes[,4]),mean) y0<-length(MAG$A[1,]) y1<-length(X1) AAG<-matrix(nrow=y1,ncol=y0) AAG[,1]=X1 for(i in 1:9) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes[,4]),mean)}

Plot the log fold change values (M) versus the intensity values (A) for the normalized data. As before, the range of the for loop corresponds to the number of GPR files. Retain the x and y limits from the previous MA plots.

for(i in 1:9) {plot(AAG[,i],M[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}

Expand the window that now has all of the MA plots. Then, save all of them as one JPEG file (File > Save As > JPEG > 100% quality...).

Close the window with all of the MA plots.

As for the Ontario chips, three box plots will be generated for the GCAT chips: before normalization, after within array normalization, and after between array normalization. Accordingly, designate the dimensions of a new window in which all three box plots can be fit into.

par(mfrow=c(1,3))

First, generate the box plot depicting the data that has not been normalized. Each of the numbers in the brackets correspond to a GPR file. Make sure that there are as many repetitions of MG[,number] as there are GPR files. It is preferable to retain the y limit of the boc plots generated for the Ontario chips so that the box plots for both types of chips can be looked at side by side.

boxplot(MG[,1],MG[,2],MG[,3],MG[,4],MG[,5],MG[,6],MG[,7],MG[,8],MG[,9],ylim=c(-5,5))

Next, generate the box plot depicting the data once it has gone through within array normalization. Each of the numbers in the brackets correspond to a GPR file. Make sure that there are as many repetitions of M[,number] as there are GPR files. Retain the limits from the previous box plot.

boxplot(M[,1],M[,2],M[,3],M[,4],M[,5],M[,6],M[,7],M[,8],M[,9],ylim=c(-5,5))

Then, generate the box plot depicting the data once it has gone through between array normalization. Each of the numbers in the brackets correspond to a GPR file. Make sure that there are as many repetitions of MAB[,number] as there are GPR files. Retain the limit from the previous box plots.

boxplot(MAB[,1],MAB[,5],MAB[,6],MAB[,10],MAB[,11],MAB[,14],MAB[,15],MAB[,19],MAB[,20],ylim=c(-5,5))