Dahlquist:Microarray Data Processing in R: Difference between revisions

Revision as of 17:06, 21 May 2013


The following protocol was used to normalize both Ontario and GCAT microarray chips using R Statistical Software 2.7.2 and the limma package.

Input Files for R

For each chip that you want to normalize in R, you must import the .gpr file created for that chip after scanning it with GenePix Pro Software. If you need to normalize both Ontario and GCAT chips, make sure that you have separate folders with the .gpr files for each of the types of chips. Otherwise, make sure that all of the .gpr files you will need for normalization are in one folder.

Create a targets file with information about the .gpr files

One targets file must be created for the Ontario chips and a separate targets file must be created for the GCAT chips. To create the targets file for the Ontario chips, create the following spreadsheet in Excel, where each row corresponds to a .gpr file:

The first column (labeled "FileName") contains the file names of all the .gpr files. Make sure that you include the ".gpr" extension at the end of each file name.
The second column (labeled "Header") describes the strain, the time point, and the flask from which the data in the .gpr file were taken during the cold shock experiment. Create this designation for each .gpr file in the form <strain>_LogFC_t<time point>-<flask number>. For example, Wt_LogFC_t15-2, where Wt is the wild type, 15 is the time point, and 2 is the flask number.
The third column (labeled "Strain") contains the name of the strain to which the .gpr file corresponds.
The fourth column (labeled "TimePoint") contains the experimental time point to which the .gpr file corresponds.
The fifth column (labeled "Flask") contains the flask number to which the .gpr file corresponds.
The sixth column (labeled "Dyeswap") indicates the orientation of the dyes on the chip. If the control time point (t0) was labeled with Cy3 dye and the other time points were labeled with Cy5 dye on a particular chip, then put a "1" in this column in the row corresponding to the chip. If the orientation of the dyes was reversed for a chip, then put a "-1" in this column in the row corresponding to that chip.
The seventh column (labeled "MasterList") establishes the numerical order in which you would like to see the data for each chip appear in the final spreadsheet of normalized data. In this column, put a "1" in the row corresponding to the first chip that you want to appear in the final spreadsheet, a "2" in the row corresponding to the second chip that you want to appear in the spreadsheet, and so forth until you have gone through all the chips.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the Ontario chips.
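The same spreadsheet can also be assembled in R instead of Excel. The following is a minimal sketch, not part of the original protocol: the file names, strain, time points, and flask numbers are hypothetical placeholders, and the file is written to a temporary folder rather than to the folder with the .gpr files.

```r
# Hypothetical example values; replace with your own chips and design.
strain    <- c("Wt", "Wt", "Wt")
timepoint <- c(15, 15, 30)
flask     <- c(1, 2, 1)

Otargets_example <- data.frame(
  FileName   = c("chip01.gpr", "chip02.gpr", "chip03.gpr"),  # hypothetical file names
  Header     = paste0(strain, "_LogFC_t", timepoint, "-", flask),
  Strain     = strain,
  TimePoint  = timepoint,
  Flask      = flask,
  Dyeswap    = c(1, -1, 1),  # 1 = t0 labeled with Cy3; -1 = dyes reversed
  MasterList = 1:3,          # desired order in the final spreadsheet
  stringsAsFactors = FALSE
)

# Save as a .csv file (here in a temporary folder for illustration).
targets_path <- file.path(tempdir(), "Ontario_Targets_example.csv")
write.csv(Otargets_example, targets_path, row.names = FALSE)
```

Building the Header column with paste0 keeps it consistent with the Strain, TimePoint, and Flask columns.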

The targets file for the GCAT chips should contain the same columns in the same order. However, one additional column (labeled "Location") should precede them; it designates which half of the GCAT chip was scanned to create the .gpr file. If the top half was scanned for a particular chip, then put "Top" in this column in the row corresponding to that chip. If the bottom half was scanned for a particular chip, then put "Bottom" in this column in the row corresponding to that chip. Order the rows of the spreadsheet so that the .gpr files from the bottom of the chip are listed after those from the top of the chip.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the GCAT chips.
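If the rows were entered in a different order, the Top-before-Bottom ordering can be produced in R. This is a small sketch with hypothetical file names; only the Location and FileName columns are shown.

```r
# Hypothetical GCAT targets rows, entered in mixed order.
gcat_targets <- data.frame(
  Location = c("Bottom", "Top", "Bottom", "Top"),
  FileName = c("g1_bottom.gpr", "g1_top.gpr", "g2_bottom.gpr", "g2_top.gpr"),
  stringsAsFactors = FALSE
)

# Treat Location as a factor whose levels are in the desired order,
# then sort; rows with the same Location keep their original relative order.
loc <- factor(gcat_targets$Location, levels = c("Top", "Bottom"))
gcat_sorted <- gcat_targets[order(loc), ]
```

Because ties keep their original order, the chips appear in the same sequence within the Top block and within the Bottom block, which matters when the two halves are later combined chip by chip.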

Within Array Normalization of the Ontario Chip Data

Open up R. Change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips. Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method.

Load the limma library.

library(limma)

When prompted after entering the line below, select the targets file with information about the .gpr files corresponding to the Ontario chips.

Otargets<-read.csv(file.choose())

Read in the .gpr files by designating the column within the targets file (imported as Otargets) that lists all the .gpr file names.

f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(Otargets[,1], source="genepix.median", wt.fun=f)
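To see what the weight function f does, it can be applied to a toy table of flags. This sketch assumes the usual GenePix flag convention, in which -100 marks a spot flagged as bad; the flag values below are made up for illustration.

```r
# Same weight function as in the protocol: weight 1 if Flags > -99, else 0.
f <- function(x) as.numeric(x$Flags > -99)

# Toy stand-in for the spot table of one .gpr file (hypothetical flags).
spots <- data.frame(Flags = c(0, -50, -100, 100, -75))

w <- f(spots)
w  # only the spot flagged -100 receives weight 0
```

Spots with weight 0 are ignored when the Loess curve is fitted, while all other spots contribute fully.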

Extract the column of values from the targets file that indicates for which chips the orientation of the dyes was swapped.

ds<-Otargets[,6]

Perform Loess normalization. This step may take some time depending upon the number of chips that need to be normalized.

MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")

Create a list of the names of the genes on the Ontario chips.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean)
ontID<-rownames(M1)

Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t<time point>-<flask number>.

headers<-Otargets[,2]

Create a matrix MO, which will store the normalized data after replicate spots on the chip have been averaged.

n0<-length(MAO$M[1,])
n1<-length(M1)
MO<-matrix(nrow=n1,ncol=n0)

Average all duplicate spots on the chip.

for(i in 1:n0) {MO[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}
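The averaging step can be checked on a toy example. The sketch below uses a made-up matrix of M values and made-up gene IDs, with each gene spotted twice.

```r
# Toy M values: 4 spots (rows) on 2 chips (columns).
M_toy <- matrix(c(1, 3, 5, 7,
                  2, 4, 6, 8), ncol = 2)

# Each ID appears twice, i.e. duplicate spots.
ids <- c("YAL001C", "YAL002W", "YAL001C", "YAL002W")

# Average the duplicate spots chip by chip, as in the loop above.
avg <- sapply(seq_len(ncol(M_toy)),
              function(i) tapply(M_toy[, i], as.factor(ids), mean))
avg
```

The result has one row per unique ID, sorted by ID, which is why the row names must be taken from the tapply output rather than from the original spot order.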

Create a new matrix MD.

MD<-matrix(nrow=n1,ncol=n0)

Multiply each column of the data matrix "MO" by the corresponding entry of the vector "ds", which contains the values from the targets file that indicate for which chips the dye orientation was swapped. Store the output of the calculation in the matrix MD.

for(i in 1:n0) {MD[,i]<-ds[i]*MO[,i]}
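The dye-swap correction can also be written without an explicit loop. The sketch below compares the loop with base R's sweep on made-up values; MO_toy and ds_toy are illustrative stand-ins for MO and ds.

```r
# Toy data: 3 genes on 2 chips; the second chip is a dye swap.
MO_toy <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
ds_toy <- c(1, -1)

# Loop version, as in the protocol.
MD_loop <- matrix(nrow = 3, ncol = 2)
for (i in 1:2) MD_loop[, i] <- ds_toy[i] * MO_toy[, i]

# Vectorized version: multiply each column (margin 2) by its dye-swap factor.
MD_sweep <- sweep(MO_toy, 2, ds_toy, "*")
```

Both versions flip the sign of the log fold changes on the swapped chip and leave the other chip unchanged.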

Convert matrix MD to a data frame.

MD2<-as.data.frame.matrix(MD)

Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.

colnames(MD2)<-headers

Set the row names of the data frame to the names of the genes on the Ontario chips.

rownames(MD2)<-ontID

Remove the control spots on the chip (i.e. Arabidopsis and 3XSSC).

MD3<-subset(MD2,row.names(MD2)!="Arabidopsis")
MD4<-subset(MD3,row.names(MD3)!="3XSSC")
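Both controls can be removed in a single step with %in%. This is a sketch on a made-up one-chip data frame; MD_toy and controls are illustrative names.

```r
# Toy data frame with gene IDs and control names as row names.
MD_toy <- data.frame(chip1 = c(0.5, -1.2, 0.1, 2.0),
                     row.names = c("YAL001C", "Arabidopsis", "3XSSC", "YAL002W"))

# Names of the control spots to drop.
controls <- c("Arabidopsis", "3XSSC")

# Keep every row whose name is not a control; drop = FALSE keeps a data frame
# even when there is only one chip column.
MD_clean <- MD_toy[!(rownames(MD_toy) %in% controls), , drop = FALSE]
```

Listing the controls in one vector makes it easy to add further control names later.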

If you are also normalizing GCAT chips, proceed to the next section, Within Array Normalization of GCAT Chip Data. If you are normalizing only Ontario chips, then proceed to

Within Array Normalization of GCAT Chip Data

Switch to the GCAT directory (File > Change dir...) and start on the GCAT chips.

Read the GCAT target file into R.

targets<-read.csv("GCAT_Targets.csv",sep=",")
f<-function(x) as.numeric(x$Flags > -99)

Separate the Top and Bottom chips into their own locations in R.

RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f)
RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f)

Combine the Top GPRs with the Bottom GPRs so that there are only 9 chips left.

RGG<-rbind(RT,RB)

Loess normalize the GCAT data.

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp")

Tell R how many rows the target matrix should have.

R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
r1<-length(R1)

Tell R how many columns the target matrix should have.

r0<-length(MAG$M[1,])

Create the target matrix.

RR<-matrix(nrow=r1,ncol=r0)

Average any duplicate spots in the GCAT data so that only unique spots remain.

for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

Read the GCAT IDs file into R.

GNames<-read.csv("GCAT_ID.csv",sep=",")

Separate the column Headers into their own location.

Gcol<-targets[1:9,2]

Separate the row ID names into their own location.

Grow<-GNames[,1]

Tell R the target matrix should be a data frame instead.

GD<-as.data.frame.matrix(RR)

Assign column names to the data frame.

colnames(GD)<-Gcol

Assign row names to the data frame.

rownames(GD)<-Grow

Between Array Normalization of Merged GCAT and Ontario Chip Data

Between Array Normalization of Only Ontario Chip Data

Merge the GCAT and Ontario data together into a single data frame. Any IDs that appear on one type of chip and not the other will appear as NAs in the data frame.

Q<-merge(MD4,GD,by="row.names",all=TRUE)

Tell R to get rid of the spots that are only in the GCAT chips, and keep all the spots that are in the Ontario chips, for the entire data set.

Z<-subset(Q,Q[,1] %in% Names[,1])
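The effect of the merge and the filter can be seen on two small data frames. In this sketch all values are made up, and the Ontario row names stand in for the list of Ontario IDs; that substitution is an assumption made for illustration.

```r
# Toy Ontario and GCAT data frames with partly overlapping IDs.
ont  <- data.frame(o1 = c(1, 2, 3),    row.names = c("A", "B", "C"))
gcat <- data.frame(g1 = c(10, 20, 30), row.names = c("B", "C", "D"))

# Full outer merge on the row names: IDs missing from one side become NA.
Q_toy <- merge(ont, gcat, by = "row.names", all = TRUE)

# Keep only the IDs present on the Ontario chips (drops "D").
Z_toy <- subset(Q_toy, Q_toy[, 1] %in% rownames(ont))
```

After the merge, the IDs are no longer row names; they sit in the first column (named Row.names), which is why the later steps treat column 1 separately from the data columns.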

Specify the number of columns in the target matrix.

x0<-length(Z[1,])

Specify the number of rows in the target matrix.

x1<-length(Z[,1])

Create the target matrix.

XX<-matrix(nrow=x1,ncol=x0)

Tell R the row names from the merged data are the same as the row names for the new target matrix.

XX[,1]=Z[,1]

Tell R that the column Headers from the merged data are the same as the Headers for the new target matrix.

colnames(XX)=colnames(Z)

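Divide each chip by its own median absolute deviation (MAD) to scale the data. Column 1 holds the IDs, so the loop starts at column 2; alter the upper limit of the range (104 here) to match the number of columns in Z.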

for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}
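A quick check of the scaling step: after each chip is divided by its own MAD, every chip has a MAD of 1. The sketch below uses simulated values with different spreads and one missing value; Z_num is an illustrative stand-in for the numeric columns of Z.

```r
set.seed(1)

# Two simulated chips with different spreads, plus one missing value.
Z_num <- cbind(chipA = rnorm(200, sd = 1), chipB = rnorm(200, sd = 3))
Z_num[5, "chipB"] <- NA

# Scale each chip (column) by its own MAD, ignoring missing values.
scaled <- Z_num
for (i in seq_len(ncol(Z_num))) {
  scaled[, i] <- Z_num[, i] / mad(Z_num[, i], na.rm = TRUE)
}

# The MAD of every scaled chip is now 1.
mads <- apply(scaled, 2, mad, na.rm = TRUE)
```

Note that this step only rescales the spread of each chip; it does not shift the center of the distribution.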

Make sure you pick the correct directory where your ordered header CSV file is located.

Read the correctly ordered Headers into R.

XZ<-read.csv("Ordered_headers.csv",sep=",")

Tell R that the chip data should be a data frame instead of a matrix.

XV<-as.data.frame.matrix(XX)

Sort the columns from the data frame into a new data frame using the ordered headers as the sorting criteria.

XY<-XV[,match(XZ[,1],colnames(XV))]
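The reordering relies on match returning, for each ordered header, the position of the column with that name. A small sketch with made-up column names:

```r
# Toy data frame whose columns are in the wrong order.
XV_toy <- data.frame(b = 1:2, c = 3:4, a = 5:6)

# Desired column order, as it would be read from the ordered headers file.
ordered_headers <- c("a", "b", "c")

# Position of each desired header among the current column names.
idx <- match(ordered_headers, colnames(XV_toy))

# A header that is missing from the data would give NA here, so check first.
stopifnot(!anyNA(idx))

XY_toy <- XV_toy[, idx]
```

Checking for NA before indexing catches headers that are misspelled in the ordered headers file.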

Write the final data set to a .csv file. It should contain all of the data, Loess normalized, scaled by the MAD after the controls were removed, and sorted into the correct order.

write.table(XY,"Master_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

Open the .csv file with the R output in Excel. Replace every "NA" entry with a blank. To do so, select the menu option Edit > Replace... (or select any cell and press Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.
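As an alternative to the find-and-replace step in Excel, write.table can write missing values as empty cells through its na argument. The sketch below uses a made-up data frame and a temporary file; it is a suggestion, not part of the original protocol.

```r
# Toy output with missing values.
out <- data.frame(chip1 = c(0.5, NA), chip2 = c(NA, -1.2),
                  row.names = c("YAL001C", "YAL002W"))

# Write to a temporary .csv file; na = "" writes missing values as empty cells.
out_path <- file.path(tempdir(), "example_normalized.csv")
write.table(out, out_path, sep = ",", col.names = NA, row.names = TRUE, na = "")

# Inspect the raw lines that were written.
lines_written <- readLines(out_path)
```

This also avoids accidentally altering gene IDs that happen to contain the letters "NA" during a manual Replace All.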

Generating MA Plots and Boxplots

Use the following lines of code to create MA plots and boxplots for the GCAT chips.

First, you will create MA plots for the data before the normalization has occurred.

Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. Originally, there were 9 GCAT chips so in the line of code below there are 3 columns and 3 rows of graphs.

par(mfrow=c(3,3))
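The grid dimensions can be computed from the number of graphs rather than chosen by hand. The helper below, grid_dims, is a hypothetical convenience function and not part of the protocol.

```r
# Hypothetical helper: smallest near-square grid that holds n graphs.
grid_dims <- function(n) {
  n_col <- ceiling(sqrt(n))
  n_row <- ceiling(n / n_col)
  c(n_row, n_col)
}

grid_dims(9)   # 3 rows, 3 columns
grid_dims(14)  # 4 rows, 4 columns
```

It could then be used as par(mfrow=grid_dims(n)), where n is the number of GPR files to be plotted.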

Set a variable (GeneList) for all of the GCAT gene IDs before the controls have been taken out and before replicates have been averaged.

GeneList<-RGG$genes$ID

Calculate the log fold changes (M values) for each spot on each chip before normalization has occurred.

lr<-(log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb)))

Create a blank matrix with as many columns as there are GPR files and as many rows as there are genes after replicates have been averaged.

r0<-length(lr[1,])
RX<-tapply(lr[,1],as.factor(GeneList),mean)
r1<-length(RX)
MG<-matrix(nrow=r1,ncol=r0)

Calculate the log fold changes (M values) for each spot on each chip after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files.

for(i in 1:9) {MG[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}

Calculate the intensity values (A values) for each spot on each chip before normalization has occurred.

la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))
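The M and A formulas can be verified by hand on two made-up spots. In this sketch R, G, Rb, and Gb are plain vectors of illustrative foreground and background intensities, not values from a real chip.

```r
# Made-up foreground (R, G) and background (Rb, Gb) intensities for 2 spots.
R  <- c(306, 164); Rb <- c(50, 100)
G  <- c(114, 356); Gb <- c(50, 100)

# Background-corrected intensities: red = 256, 64; green = 64, 256.
M <- log2((R - Rb) / (G - Gb))        # log fold change, red over green
A <- 1/2 * log2((R - Rb) * (G - Gb))  # average log intensity
```

The two spots have opposite log fold changes (2 and -2) but the same average intensity (7). If the background is greater than or equal to the foreground in either channel, the logarithm is not a finite number, and such spots drop out of the plots.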

Create a blank matrix with as many columns as there are GPR files and as many rows as there are genes after replicates have been averaged.

r3<-length(la[1,])
RQ<-tapply(la[,1],as.factor(GeneList),mean)
r4<-length(RQ)
AG<-matrix(nrow=r4,ncol=r3)

Calculate the intensity values (A values) after averaging duplicate genes. In the for loop, make sure that the range reflects the number of GPR files.

for(i in 1:9) {AG[,i]<-tapply(la[,i],as.factor(GeneList),mean)}

Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files.

for(i in 1:9) {plot(AG[,i],MG[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Next, you will create MA plots for the data after within array normalization has been performed.

Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window.

par(mfrow=c(3,3))

The log fold changes after normalization are saved in R's memory under the variable RR. Therefore, only the intensity values have to be calculated after within array normalization has occurred.

Create a blank matrix with as many columns as there are GPR files and as many rows as there are genes after replicates have been averaged.

X1<-tapply(MAG$A[,1],as.factor(MAG$genes[,4]),mean)
y0<-length(MAG$A[1,])
y1<-length(X1)
AAG<-matrix(nrow=y1,ncol=y0)

Calculate the intensity values (A) after normalization has occurred and after duplicate genes have been averaged. In the for loop, make sure that the range reflects the number of GPR files.

for(i in 1:9) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes[,4]),mean)}

Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files.

for(i in 1:9) {plot(AAG[,i],RR[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after scale normalization (dividing each chip by its MAD) has occurred.

Change the dimensions of the window in which the graphs will appear to reflect how many graphs need to be fit into the window. Since you will be generating three graphs, one for each stage in the normalization process, you can set the dimensions to one row with three columns.

par(mfrow=c(1,3))

Create a boxplot of the log fold changes before normalization has occurred. The number within the brackets next to the variable designating the matrix of nonnormalized log fold changes denotes a GPR file. Also, set the range of the y-axis (ylim) so that the range of the boxplot for each GPR file is visible.

boxplot(MG[,1],MG[,2],MG[,3],MG[,4],MG[,5],MG[,6],MG[,7],MG[,8],MG[,9],ylim=c(-5,5))
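When every column of a matrix should get its own box, the matrix can be passed to boxplot directly instead of listing each column. The sketch below uses simulated values in place of MG and draws to a null PDF device so that it runs without opening a window.

```r
set.seed(2)

# Simulated stand-in for MG: 50 genes on 9 chips.
MG_toy <- matrix(rnorm(9 * 50), ncol = 9)

pdf(NULL)  # null device: nothing is written to disk
bp <- boxplot(MG_toy, ylim = c(-5, 5))  # one box per column
invisible(dev.off())

ncol(bp$stats)  # number of boxes drawn
```

Listing columns individually remains necessary when only some columns are wanted; in that case a column subset such as MG_toy[, c(1, 5, 6)] can be passed instead.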

Create a boxplot of the log fold changes after within array normalization has occurred. The number within the brackets next to the variable designating the matrix of within array normalized log fold changes denotes a GPR file. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots of the nonnormalized data.

boxplot(RR[,1],RR[,2],RR[,3],RR[,4],RR[,5],RR[,6],RR[,7],RR[,8],RR[,9],ylim=c(-5,5))

Create a boxplot of the log fold changes after scale normalization has occurred. The number within the brackets next to the variable designating the matrix of scale normalized log fold changes denotes a GCAT GPR file within the matrix of all of the scale normalized data for all of the chips (both Ontario and GCAT). Therefore, it is important to make sure that you have the right order of GCAT GPR files. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots.

boxplot(XY[,1],XY[,5],XY[,6],XY[,10],XY[,11],XY[,14],XY[,15],XY[,19],XY[,20],ylim=c(-5,5))

If you used the alternative way to filter the GCAT chips, then use the following code:

boxplot(merged.MAD[,1],merged.MAD[,5],merged.MAD[,6],merged.MAD[,10],merged.MAD[,11],merged.MAD[,14],merged.MAD[,15],
merged.MAD[,19],merged.MAD[,20],ylim=c(-5,5))

Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Use the following lines of code to create MA plots and boxplots for the Ontario chips.

First, you will create MA plots for the wildtype data before the normalization has occurred.

Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. There will be one graph for each GPR file. Since there were originally 14 GPR files for the wildtype, the code below creates a window to fit four rows and four columns of graphs.

par(mfrow=c(4,4))

Set a variable (genelist) for all of the Ontario gene IDs before the controls have been taken out and before replicates have been averaged.

genelist<-RGO$genes$Name

Calculate the log fold changes (M values) for each spot on each chip before normalization has occurred. The log fold changes should also be multiplied by the list of dye swaps taken from the targets file previously imported into R. In the for loop, alter the range to reflect the number of GPR files in RGO for all strains.

lfm<-log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb))
for(i in 1:94) {lfm[,i]<-ds[i]*lfm[,i]}

Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.

z0<-length(lfm[1,])
ZX<-tapply(lfm[,1],as.factor(genelist),mean)
z1<-length(ZX)
MZ<-matrix(nrow=z1,ncol=z0)

Calculate the log fold changes (M values) for each spot on each chip for the wildtype after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files for the wildtype.

for(i in 1:14) {MZ[,i]<-tapply(lfm[,i],as.factor(genelist),mean)}

Calculate the intensity values (A values) for each spot for each chip for the wildtype before normalization has occurred.

lfa<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb)))

Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.

z3<-length(lfa[1,])
ZQ<-tapply(lfa[,1],as.factor(genelist),mean)
z4<-length(ZQ)
AZ<-matrix(nrow=z4,ncol=z3)

Calculate the intensity values (A values) for each spot on each chip for the wildtype after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files for the wildtype.

for(i in 1:14) {AZ[,i]<-tapply(lfa[,i],as.factor(genelist),mean)}

Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files for the wildtype.

for(i in 1:14) {plot(AZ[,i],MZ[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Next, you will create MA plots for the wildtype data after within array normalization has been performed.

Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. There will be one graph for each GPR file.

par(mfrow=c(4,4))

The within array normalized log fold changes are already in R's memory under the variable MN. Therefore, just the intensity values have to be calculated after within array normalization has occurred.

Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.

v1<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean)
w0<-length(MAO$A[1,])
w1<-length(v1)
AAO<-matrix(nrow=w1,ncol=w0)

Calculate the intensity values (A) after normalization has occurred and after duplicate genes have been averaged. In the for loop, make sure that the range reflects the number of GPR files for the wildtype.

for(i in 1:14) {AAO[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}

Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files for the wildtype.

for(i in 1:14) {plot(AAO[,i],MN[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Use the following code to generate boxplots of the log fold changes for the wildtype chips before normalization has occurred, after within array normalization has been performed, and after scale normalization (dividing each chip by its MAD) has occurred.

Create a boxplot of the log fold changes before normalization has occurred. The number within the brackets next to the variable designating the matrix of nonnormalized log fold changes denotes a GPR file. Also, set the range of the y-axis (ylim) so that the range of the boxplot for each GPR file is visible.

boxplot(MZ[,1],MZ[,2],MZ[,3],MZ[,4],MZ[,5],MZ[,6],MZ[,7],MZ[,8],MZ[,9],MZ[,10],MZ[,11],MZ[,12],MZ[,13],MZ[,14],ylim=c(-5,5))

Create a boxplot of the log fold changes after within array normalization has occurred. The number within the brackets next to the variable designating the matrix of within array normalized log fold changes denotes a GPR file. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots of the nonnormalized data.

boxplot(MN[,1],MN[,2],MN[,3],MN[,4],MN[,5],MN[,6],MN[,7],MN[,8],MN[,9],MN[,10],MN[,11],MN[,12],MN[,13],MN[,14],ylim=c(-5,5))

Create a boxplot of the log fold changes after scale normalization has occurred. The number within the brackets next to the variable designating the matrix of scale normalized log fold changes denotes an Ontario GPR file within the matrix of all of the scale normalized data for all of the chips (both Ontario and GCAT). Therefore, it is important to make sure that you have the right order of Ontario GPR files. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots.

boxplot(XY[,2],XY[,3],XY[,4],XY[,7],XY[,8],XY[,9],XY[,12],XY[,13],XY[,16],XY[,17],XY[,18],XY[,21],XY[,22],XY[,23],ylim=c(-5,5))

If you used the alternative way to filter the GCAT chips, then use the following code:

boxplot(merged.MAD[,2],merged.MAD[,3],merged.MAD[,4],merged.MAD[,7],merged.MAD[,8],merged.MAD[,9],merged.MAD[,12],
merged.MAD[,13],merged.MAD[,16],merged.MAD[,17],merged.MAD[,18],merged.MAD[,21],merged.MAD[,22],merged.MAD[,23],ylim=c(-5,5))

After MA plots and boxplots for the wildtype have been generated, you should make the same types of plots for the deletion strains. Work with one strain at a time, creating the MA plots and the three different boxplots for that strain before moving on to another strain. The same code as shown above for the Ontario chips can be used for the deletion strains with some modifications. When designating the dimensions of the window in which the plots will appear, make sure that there are enough rows and columns to fit a graph for each GPR file for the strain. You do not have to re-enter the code that assigns the Ontario gene IDs to a variable, the code that calculates the log fold changes before normalization, or the code that calculates the intensities before normalization.

For the MA plots, the range of the for loop must match the GPR files for the strain you are working on. For the boxplots, the number in the brackets next to the variable must correspond to the correct GPR file for the strain you are working on. When generating the boxplot for the nonnormalized data, refer to the targets file for the correct order of the GPR files for the strain. When generating the boxplot for the within array normalized data, refer to the data frame with the within array normalized data (MN) for the correct order of the GPR files for the strain. When generating the boxplot for the scale normalized data, refer to the final R output with the scale normalized data for all the chips for the correct order of the GPR files for the strain. When generating MA plots and boxplots for different strains, keep the x and y limits of the MA plots and the y limits of the boxplots the same for all the strains.
