Dahlquist:Microarray Data Processing in R: Difference between revisions

Revision as of 23:07, 28 May 2013

Home Research Protocols Notebook People Publications Courses Contact

The following protocol was used to normalize GCAT and Ontario chip data using R Statistical Software 2.7.2 and the limma package.

Input Files for R

For each chip that you want to normalize in R, you must import the .gpr file created for that chip after scanning it with GenePix Pro Software. If you need to normalize both Ontario and GCAT chips, make sure that you have separate folders with the .gpr files for each of the types of chips. Otherwise, make sure that all of the .gpr files you will need for normalization are in one folder.

Create a targets file with information about the .gpr files

One targets file must be created for the Ontario chips and a separate targets file must be created for the GCAT chips. To create the targets file for the Ontario chips, create the following spreadsheet in Excel, where each row corresponds to a .gpr file:

The first column (labeled "FileName") contains the file names of all the .gpr files. Make sure that you include the ".gpr" extension at the end of each file name.
The second column (labeled "Header") describes the strain, the time point, and flask from which the data represented in the .gpr file was taken during the cold shock experiment. Create this designation for each .gpr file in the form <strain>_LogFC_t-<flask number>. For example, Wt_LogFC_t15-2, where Wt is the wild type.
The third column (labeled "Strain") contains the name of the strain to which the .gpr file corresponds to.
The fourth column (labeled "TimePoint") contains the experimental time point to which the .gpr file corresponds to.
The fifth column (labeled "Flask") contains the flask number to which the .gpr file corresponds to assuming that replicates were performed for each time point for each strain.
The sixth column (labeled "Dyeswap") indicates the orientation of the dyes on the chip. If the control time point (t0) was labeled with Cy3 dye and the other time points where labeled with Cy5 dye on a particular chip, then put a "1" in this column in the row corresponding to the chip. If the orientation of the dyes was reversed for a chip, then put a "-1" in this column in the row corresponding to that chip.
The seventh column (labeled "MasterList") orders the chips by strain (wild type strain first and the deletion strains afterwards), then time point, and then flask. In this column, put a "1" in the row corresponding to the chip from the wild type, the first time point, and the first flask of that time point. Then, put a "2" in the row corresponding to the chip from the wild type strain, the first time point, and the second flask of that time point. Continue to number the chips in this order until all the chips have been numbered.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the Ontario chips.

The targets file for the GCAT chips should contain the aforementioned columns in the order in which they were mentioned. However, there should be a column that precedes this aforementioned set of columns (labeled "Location"), which designates which half of the GCAT chip was scanned to create the .gpr file. If the top half was scanned for a particular chip, then put "Top" in this column in the row corresponding to that chip. If the bottom half was scanned for a particular chip, then put "Bottom" in this column in the row corresponding to that chip. Order the rows of spreadsheet so that the .gpr files from the bottom of the chip are listed after those from the top of the chip. Make sure that the sequence of GCAT chips in the list of .gpr files corresponding to the top of the chips is the same as the sequence of GCAT chips in the list of the .gpr files corresponding to the bottom of the chips.

Make sure to save the targets file as a .csv file in the folder that contains the .gpr files for the GCAT chips.

Data Normalization

Microarray data for S. cerevisiae wild type and mutant strains was initially collected using GCAT chips before switching to Ontario chips. The within array normalization procedure for data from the GCAT chips has some differences in comparison to the procedure for data from the Ontario chips. To normalize data from both the Ontario and GCAT chips, begin with the section "Within Array Normalization for the Ontario Chips". Begin with the same section to normalize data collected with solely Ontario chips.

Within Array Normalization for the Ontario Chips

Open up R. Change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the Ontario chips. Enter the code below (in the boxes) into the R command prompt and hit enter to run the within array normalization procedure that uses the Loess method.

Load the limma library.

library(limma)

When prompted after entering the line below, select the targets file with information about the GPR files corresponding to the Ontario chips.

Otargets<-read.csv(file.choose())

Read in the .gpr files by designating the column within the targets file (imported as Otargets) that lists all the .gpr file names.

f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(Otargets[,1], source="genepix.median", wt.fun=f)

Extract the column of values from the targets file that indicates for which chips the orientation of the dyes was swapped.

ds<-Otargets$Dyeswap

Perform Loess normalization. This step may take some time depending upon the number of chips that need to be normalized.

MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")

Create a list of the names of the genes on the Ontario chips.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes$Name),mean)
ontID<-rownames(M1)

Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t-<flask number>.

headers<-Otargets$Header

Create a matrix MO, which will store the normalized data after replicate spots on the chips have been averaged. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.

n0<-length(MAO$M[1,])
n1<-length(M1)
MO<-matrix(nrow=n1,ncol=n0)

Average all duplicate spots on the chip.

for(i in 1:n0) {MO[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes$Name),mean)}

Create a new matrix MD to store the data. It should have as many rows as there are unique genes on the Ontario chip and as many columns as there are GPR files for the Ontario chips.

MD<-matrix(nrow=n1,ncol=n0)

Multiply each row of the data matrix MO by the vector ds, which contains the values from the targets file that indicate for which chips the dye orientation was swapped. Store the output of the calculation in the matrix MD.

for(i in 1:n0) {MD[,i]<-ds[i]*MO[,i]}

Convert matrix MD to a data frame.

MD2<-as.data.frame.matrix(MD)

Set the row names of the data frame to the names of the genes on the Ontario chips.

colnames(MD2)<-headers

Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.

rownames(MD2)<-ontID

Remove the control spots on the chip (i.e. Arabidopsis and 3XSSC).

MD3<-subset(MD2,row.names(MD2)!="Arabidopsis")
MD4<-subset(MD3,row.names(MD3)!="3XSSC")

If you are also normalizing data from GCAT chips, proceed to the next section "Within Array Normalization for the GCAT Chips". If you are normalizing only data from Ontario chips, then proceed to the section "Between Array Normalization of Only Ontario Chip Data".

Within Array Normalization for the GCAT Chips

Change the directory (File > Change dir...) to the folder containing the targets file and the GPR files for the GCAT chips.

Read the targets file for the GCAT chips.

Gtargets<-read.csv(file.choose())

Create a subset of the targets file that contains only the GPR files created after scanning the top of the GCAT chips. Read those GPR files into R.

Gtop<-subset(Gtargets,Gtargets$Location %in% 'Top')
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(Gtop, source="genepix.median", wt.fun=f)

Create a subset of the targets file that contains only the GPR files created after scanning the bottom of the GCAT chips. Read those GPR files into R.

Gbottom<-subset(Gtargets,Gtargets$Location %in% 'Bottom')
RB<-read.maimages(Gbottom, source="genepix.median", wt.fun=f)

Bind the data from the top and bottom of the GCAT chip for each chip.

RGG<-rbind(RT,RB)

Extract the column of values from the subset of targets file corresponding to the top of the GCAT chips that indicates which chips were dye-swapped.

dw<-Gtop$Dyeswap

Perform Loess normalization.

MAG<-normalizeWithinArrays(RGG,method="loess", bc.method="normexp")

Create a list of the names of the genes on the GCAT chip.

M0<-tapply(MAG$M[,1],as.factor(MAG$genes$ID),mean)
gcatID<-rownames(M0)

Designate which column of the targets file contains the name of each chip in the form <strain>_LogFC_t-<flask number>.

chips<-Gtop$Header

Create a matrix MG, which will store the normalized data after replicate spots on the chips have been averaged. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.

m0<-length(MAG$M[1,])
m1<-length(M0)
MR<-matrix(nrow=m1,ncol=m0)

Average the duplicates spots on the GCAT chips.

for(i in 1:length(chips)) {MR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes$ID),mean)}

Create a new matrix MG to store the data. It should be as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.

MG<-matrix(nrow=m1,ncol=m0)

Set up a for loop to multiply each row of the data matrix MR by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MG.

for(i in 1:m0) {MG[,i]<-dw[i]*MR[,i]}

Convert matrix MG to a data frame.

MG2<-as.data.frame.matrix(MG)

Set the row names of the data frame to the names of the genes on the GCAT chips.

colnames(MG2)<-chips

Set the column names of the data frame to the names of each chip that include the strain, time point and flask number.

rownames(MG2)<-gcatID

Merge the GCAT and Ontario data into a single data frame. Any gene names that only appear on the GCAT chip will have an NA in each column of the row corresponding to that gene in the new data frame.

GO<-merge(MD4,MG2,by="row.names",all=T)

Set the row names of the new data frame to the names of the genes on both the Ontario and GCAT chips.

rownames(GO)<-GO$Row.names

Get rid of the genes on the GCAT chips that are not on the Ontario chips.

GO2<-subset(GO,rownames(GO) %in% ontID)

Create a list to order the merged GCAT and Ontario chip data by strain, then time point, and then flask.

MasterList<-rbind(Otargets[,c('Header','MasterList')],Gtop[,c('Header','MasterList')])
OrderedML<-MasterList[sort.list(MasterList$MasterList),]

Rearrange the columns so that they are ordered by strain (wild type then deletion strains in alphabetical order), then time point, and then flask.

GO3<-GO2[,match(OrderedML$Header,colnames(GO2))]

Write the normalized Ontario and GCAT data to a CSV file.

write.table(GO3,"GCAT_and_Ontario_Within_Array_Normalization.csv",sep=",",col.names=NA,row.names=TRUE)

At this point, the normalization within each GCAT and Ontario array is complete. To merge the normalized data from both chips and to perform between array normalization, proceed to the section "Between Array Normalization of Merged GCAT and Ontario Chip Data".

Between Array Normalization of Merged GCAT and Ontario Chip Data

Create a new matrix to store the merged data. It should have as many rows as there are genes common to both the GCAT and Ontario chips and as many columns as there are GPR files that have been normalized.

g0<-length(GO3[1,])
g1<-length(GO3[,1])
MADM<-matrix(nrow=g1,ncol=g0)

Scale each chip by its median absolute deviation.

for (i in 1:g0) {MADM[,i]<-GO3[,i]/mad(GO3[,i],na.rm=TRUE)}

Convert matrix MADM to a data frame.

MAD<-as.data.frame.matrix(MADM)

Set the row names of the data frame to names of the genes common to the Ontario and GCAT chips.

rownames(MAD)<-rownames(GO3)

Set the column names of the data frame to the list of the names of the chips in the format <strain>_LogFC_t-<flask number> ordered by strain, then time point, and then flask number.

colnames(MAD)<-colnames(GO3)

Write the final data set to a CSV file.

write.table(MAD,"GCAT_and_Ontario_Final_Normalized_Data.csv",sep=",",col.names=NA,row.names=TRUE)

Minimize the R window (do not close the program). Open the .csv file with the R output in Excel. Replace every entry with an "NA" by a space. To do so, select the menu option Edit > Replace... (or select any cell and click Ctrl+H). Type "NA" into the "Find what:" box and leave the "Replace with:" box blank. Then click "Replace All". Save the file.

Between Array Normalization of Only Ontario Chip Data

Visualize the Normalized Data

There are two types of plots that can be used to visualize microarray data before and after normalization: MA plots and box plots. MA plots (M for log fold change and A for intensity) display the ratio of the quantity of the red dye to that of the green dye (log2(red/green)), which will be referred to as the log fold change, for each spot versus the intensity ((1/2)*log2(red*green)) of each spot.

Create MA plots and Box Plots for the GCAT Chips

First, you will create MA plots for the data before the normalization has occurred. Input the following code in the same window in R that was used to normalize the data.

Create a variable to store the names of the genes on the GCAT chip before the controls have been removed and before the replicates have been averaged.

GCAT.GeneList<-RGG$genes$ID

Calculate the log fold change of each gene before normalization has been performed.

lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))

Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.

r0<-length(lg[1,])
rx<-tapply(lg[,1],as.factor(GeneList),mean)
r1<-length(rx)
MM<-matrix(nrow=r1,ncol=r0)

Calculate the log fold changes after averaging duplicate genes.

for(i in 1:9) {MG[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}

Create a new matrix MC to store the data. It should have as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.

MC<-matrix(nrow=r1,ncol=r0)

Set up a for loop to multiply each row of the matrix MM by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MC.

for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}

Convert the matrix MC to a data frame and set the column names of the data frame to the names of the GCAT chips in the form <strain>_LogFC_t-<flask number>.

MCD<-as.data.frame(MC)
colnames(MCD)<-chips
rownames(MCD)<-gcatID

Calculate the intensity values before normalization has occurred.

la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))

Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.

r2<-length(la[1,])
ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean)
r3<-length(ri)
AG<-matrix(nrow=r3,ncol=r2)

Calculate the intensity values after averaging duplicate genes.

for(i in 1:r0) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}

Set how many rows and columns the graphs will be partitioned in to. In the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.

par(mfrow=c(3,3))

Plot the log fold changes (labeled "M" on the y axis) against the intensity values (labeled "A" on the x axis) before each chip has been normalized. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.

for(i in 1:r0) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Next, create MA plots of the data after within array normalization.

The log fold changes after within array normalization were calculated earlier and stored in the data frame MG2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.

Create a blank matrix with as many columns as there are GPR files and as many rows as there are unique genes.

x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean)
y0<-length(MAG$A[1,])
x1<-length(x0)
AAG<-matrix(nrow=x1,ncol=y0)

Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.

for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}

Set how many rows and columns the graphs will be partitioned in to.

par(mfrow=c(3,3))

Plot the log fold changes (labeled "M" on the y axis) versus the intensity values (labeled "A" on the x axis) for each chip after within array normalization has been performed performed. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.

for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after between array normalization has occurred.

Set how many rows and columns the graphs will be partitioned in to.

par(mfrow=c(1,3))

Create a box plot of the log fold changes before normalization has occurred.

boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')

Specify where the tick marks will appear on the x axis of the plot.

axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)

Place the tick mark labels along the x axis of the plot.

text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Create a box plot of the log fold changes after within array normalization has been performed.

boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')

Place the tick mark labels along the x axis of the plot.

axis(1,at=xy.coords(chips)$x,labels=FALSE)

Place the tick mark labels along the x axis of the plot.

text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Create a box plot of the log fold changes after between array normalization has been performed.

boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')

Place the tick mark labels along the x axis of the plot.

axis(1, at=xy.coords(chips)$x,labels=FALSE)

Place the tick mark labels along the x axis of the plot.

text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)

Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Create MA plots and Box Plots for the Ontario Chips

First, you will create MA plots for the wildtype data before the normalization has occurred.

Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. There will be one graph for each GPR file. Since there were originally 14 GPR files for the wildtype the code below creates a a window to fit four rows and four columns of graphs.

par(mfrow=c(4,4))

Set a variable (genelist) for all of the Ontario gene IDs before the controls have been taken out and before replicates have been averaged.

genelist<-RG$genes$Name

Calculate the log fold changes (M values) for each spot on each chip before normalization has occurred. The log fold changes should also by multiplied by the list of dyeswaps taken from the targets file previously imported into R. In the for loop, alter the range to reflect the number of GPR files in RG for all strains.

for(i in 1:94) {lfm<-ds[i]*(log2((RG$R-RG$Rb)/(RG$G-RG$Gb)))}

Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.

z0<-length(lfm[1,])
ZX<-tapply(lfm[,1],as.factor(genelist),mean)
z1<-length(ZX)
MZ<-matrix(nrow=z1,ncol=z0)

Calculate the log fold changes (M values) for each spot on each chip for the wildtype after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files for the wildtype.

for(i in 1:14) {MZ[,i]<-tapply(lf[,i],as.factor(genelist),mean)}

Calculate the intensity values (A values) for each spot for each chip for the wildtype before normalization has occurred.

lfa<-(1/2*log2((RG$R-RG$Rb)*(RG$G-RG$Gb)))

Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.

z3<-length(lfa[1,])
ZQ<-tapply(lfa[,1],as.factor(genelist),mean)
z4<-length(ZQ)
AZ<-matrix(nrow=z4,ncol=z3)

Calculate the intensity values (M values) for each spot on each chip for the wildtype after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files for the wildtype.

for(i in 1:14) {AZ[,i]<-tapply(lfa[,i],as.factor(genelist),mean)}

Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files for the wildtype.

for(i in 1:14) {plot(AZ[,i],MZ[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Next, you will create MA plots for the wildtype data after within array normalization has been performed.

Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. There will be one graph for each GPR file.

par(mfrow=c(4,4))

The within array normalized log fold changes are already in R's memory under the variable MN. Therefore, just the intensity values have to be calculated after within array normalization has occurred.

Create a blank matrix with as many columns as there are GPR files for the wildtype and as many rows as there are genes after replicates have been averaged.

v1<-tapply(MA$A[,1],as.factor(MA$genes[,5]),mean)
w0<-length(MA$A[1,])
w1<-length(v1)
AAO<-matrix(nrow=w1,ncol=w0)

Calculate the intensity values (A) after normalization has occurred and after duplicate genes have been averaged. In the for loop, make sure that the range reflects the number of GPR files for the wildtype.

for(i in 1:14) {AAO[,i]<-tapply(MA$A[,i],as.factor(MA$genes[,5]),mean)}

Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files for the wildtype.

for(i in 1:14) {plot(AAO[,i],MN[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}

Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Use the following code to generate boxplots of the log fold changes for the wildtype chips before normalization has occurred, after within array normalization has been performed, and after scale normalization (dividing each chip by its MAD) has occurred.

Create a boxplot of the log fold changes before normalization has occurred. The number within the brackets next to the variable designating the matrix of nonnormalized log fold changes denotes a GPR file. Also, set the range of the y-axis (ylim) so that the range of the boxplot for each GPR file is visible.

boxplot(MZ[,1],MZ[,2],MZ[,3],MZ[,4],MZ[,5],MZ[,6],MZ[,7],MZ[,8],MZ[,9],MZ[,10],MZ[,11],MZ[,12],MZ[,13],MZ[,14],ylim=c(-5,5))

Create a boxplot of the log fold changes after within array normalization has occurred. The number within the brackets next to the variable designating the matrix of within array normalized log fold changes denotes a GPR file. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots of the nonnormalized data.

boxplot(MN[,1],MN[,2],MN[,3],MN[,4],MN[,5],MN[,6],MN[,7],MN[,8],MN[,9],MN[,10],MN[,11],MN[,12],MN[,13],MN[,14],ylim=c(-5,5))

Create a boxplot of the log fold changes after scale normalization has occurred. The number within the brackets next to the variable designating the matrix of scale normalized log fold changes denotes a Ontario GPR file within the matrix of all of the scale normalized data for all of the chips (both Ontario and GCAT). Therefore, it is important to make sure that you have the right order of Ontario GPR files. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots.

boxplot(XY[,2],XY[,3],XY[,4],XY[,7],XY[,8],XY[,9],XY[,12],XY[,13],XY[,16],XY[,17],XY[,18],XY[,21],XY[,22],XY[,23],ylim=c(-5,5))

If you used the alternative way to filter the GCAT chips, then use the following code:

boxplot(merged.MAD[,2],merged.MAD[,3],merged.MAD[,4],merged.MAD[,7],merged.MAD[,8],merged.MAD[,9],merged.MAD[,12],
merged.MAD[,13],merged.MAD[,16],merged.MAD[,17],merged.MAD[,18],merged.MAD[,21],merged.MAD[,22],merged.MAD[,23],ylim=c(-5,5))

After MA plots and boxplots for the wildtype have been generated, you should make the same types of plots for the deletion strains. Work with one strain first creating the MA Plots and the three different boxplots for that strain before moving on to another strain. The same code as depicted above for the Ontario chips can be used for the deletion strains with some modifications. When designating the dimensions of the window in which the plots will appear, make sure that there are enough rows and columns to fit a graph for each GPR file for the strain. You do not have to reinput the code assigning the Ontario gene ID's to a variable nor the code that calculates the log fold changes before normalization nor the code that calculates intensities before normalization. For the MA plots, the range of the for loop must match the number of GPR files for the strain you are working on. For the boxplots, the number in the bracket next to the variable must correspond to the correct GPR for the strain you are working on. When generating the boxplot for the nonnormalized data, refer to the target file for the correct order of the GPR files for the strain you are working on. When generating the boxplot for the within array normalized data, refer to the data frame with the within array normalized data (MN) for the correct order of the GPR files for the strain you are working on. When generating the boxplot for the scale normalized data, refer to the final R output with the scale normalized data for all the chips for the correct order of the GPR files for the strain you are working on. When generating MA plots and boxplots for different strain, keep the x and y limits of the MA plot and the y limits of the boxplot the same for all the strains.

@@ Line 183: / Line 183: @@
 ==Visualize the Normalized Data==
-There are two types of plots that can be used to visualize microarray data before and after normalization: MA plots and box plots. MA plots display the ratio of the quantity of the red dye to that of the green dye (log2(red/green))for each spot versus the intensity ((1/2)*log2(red*green)) of each spot.
+There are two types of plots that can be used to visualize microarray data before and after normalization: MA plots and box plots. MA plots (M for log fold change and A for intensity) display the ratio of the quantity of the red dye to that of the green dye (log2(red/green)), which will be referred to as the log fold change, for each spot versus the intensity ((1/2)*log2(red*green)) of each spot.
 ===Create MA plots and Box Plots for the GCAT Chips===
@@ Line 189: / Line 189: @@
 First, you will create MA plots for the data before the normalization has occurred. Input the following code in the same window in R that was used to normalize the data.
-*Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window. Originally, there were 9 GCAT chips so in the line of code below there are 3 columns and 3 rows of graphs.
+*Create a variable to store the names of the genes on the GCAT chip before the controls have been removed and before the replicates have been averaged.
-  par(mfrow=c(3,3))
+  GCAT.GeneList<-RGG$genes$ID
-*Set a variable (GeneList) for all of the GCAT gene IDs before the controls have been taken out and before replicates have been averaged.
+*Calculate the log fold change of each gene before normalization has been performed.
-  GeneList<-RGG$genes$ID
+  lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))
-*Calculate the log fold changes (M values) for each spot on each chip before normalization has occurred.
+*Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.
-  lr<-(log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb)))
+  r0<-length(lg[1,])
+ rx<-tapply(lg[,1],as.factor(GeneList),mean)
+ r1<-length(rx)
+ MM<-matrix(nrow=r1,ncol=r0)
-*Create a blank matrix with as many columns as there are GPR files and as rows as there are genes after replicates have been averaged.
+*Calculate the log fold changes after averaging duplicate genes.
-  r0<-length(lr[1,])
+  for(i in 1:9) {MG[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}
- RX<-tapply(lr[,1],as.factor(GeneList),mean)
- r1<-length(RX)
+*Create a new matrix MC to store the data. It should have as many rows as there are unique genes on the GCAT chip and as many columns as there are GPR files for the GCAT chips.
-  MG<-matrix(nrow=r1,ncol=r0)
+  MC<-matrix(nrow=r1,ncol=r0)
+*Set up a for loop to multiply each row of the matrix MM by the vector of values dw indicating the dye-swapped chips. Store the output to the matrix MC.
+ for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}
-*Calculate the log fold changes (M values) for each spot on each chip after averaging duplicate genes. In the for loop, alter the range to reflect the number of GPR files.
+*Convert the matrix MC to a data frame and set the column names of the data frame to the names of the GCAT chips in the form <strain>_LogFC_t<time point>-<flask number>.
-  for(i in 1:9) {MG[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}
+  MCD<-as.data.frame(MC)
+ colnames(MCD)<-chips
+ rownames(MCD)<-gcatID
-*Calculate the intensity values (A values) for each spot on each chip before normalization has occurred.
+#Calculate the intensity values before normalization has occurred.
   la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))
-*Create a blank matrix with as many columns as there are GPR files and as many rows as there are genes after replicates have been averaged.
+*Create a blank matrix with as many columns as there are GPR files for the GCAT chips and as many rows as there are unique genes on the GCAT chip.
-  r3<-length(la[1,])
+  r2<-length(la[1,])
-  RQ<-tapply(la[,1],as.factor(GeneList),mean)
+  ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean)
-  r4<-length(RQ)
+  r3<-length(ri)
-  AG<-matrix(nrow=r4,ncol=r3)
+  AG<-matrix(nrow=r3,ncol=r2)
-*Calculate the intensity values (A values) after averaging duplicate genes. In the for loop, make sure that the range reflects the number of GPR files.
+*Calculate the intensity values after averaging duplicate genes.
-  for(i in 1:9) {AG[,i]<-tapply(la[,i],as.factor(GeneList),mean)}
+  for(i in 1:r0) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}
-*Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files.
+*Set how many rows and columns the graphs will be partitioned in to. In the call par(mfrow=c(#,#)), the first # sets the number of rows and the second # sets the number of columns.
-  for(i in 1:9) {plot(AG[,i],MG[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}
+par(mfrow=c(3,3))
+*Plot the log fold changes (labeled "M" on the y axis) against the intensity values (labeled "A" on the x axis) before each chip  has been normalized. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
+  for(i in 1:r0) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
 Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
-Next, you will create MA plots for the data after within array normalization has been performed.
+Next, create MA plots of the data after within array normalization.
-*Set the dimensions of the window in which the graphs will appear to reflect the number of graphs that need to be fit into the window.
+The log fold changes after within array normalization were calculated earlier and stored in the data frame MG2. Therefore, it is only necessary to calculate the intensity values after within array normalization has been performed.
- par(mfrow=c(3,3))
-The log fold changes after normalization is saved in R's memory under the variable RR. Therefore, just the intensity values have to be calculated after within array normalization has occurred.
+*Create a blank matrix with as many columns as there are GPR files and as many rows as there are unique genes.
+ x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean)
+ y0<-length(MAG$A[1,])
+ x1<-length(x0)
+ AAG<-matrix(nrow=x1,ncol=y0)
-*Create a blank matrix with as many columns as there are columns in GPR files and as many rows as there are averaged duplicate genes.
+*Calculate the intensity values after normalization has occurred and after duplicate genes have been averaged.
-  X1<-tapply(MAG$A[,1],as.factor(MAG$genes[,4]),mean)
+  for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}
- y0<-length(MAG$A[1,])
- y1<-length(X1)
- AAG<-matrix(nrow=y1,ncol=y0)
-*Calculate the intensity values (A) after normalization has occurred and after duplicate genes have been averaged. In the for loop, make sure that the range reflects the number of GPR files.
+*Set how many rows and columns the graphs will be partitioned in to.
- for(i in 1:9) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes[,4]),mean)}
+par(mfrow=c(3,3))
-*Plot the log fold changes (M) against the intensities (A). In the for loop, make sure that the range reflects the number of GPR files.
+*Plot the log fold changes (labeled "M" on the y axis) versus the intensity values (labeled "A" on the x axis) for each chip after within array normalization has been performed performed. The y axis limits (ylim) and x axis limits (xlim) may need to be altered depending on the magnitude of outliers.
-  for(i in 1:9) {plot(AAG[,i],RR[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}
+  for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))}
 Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
-Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after scale normalization (dividing each chip by its MAD) has occurred.
+Use the following code to generate boxplots of the log fold changes for the GCAT chips before normalization has occurred, after within array normalization has been performed, and after between array normalization has occurred.
-*Change the dimensions of the window in which the graphs will appear to reflect how many graphs need to be fit into the window. Since you will be generating three graphs, one for each stage in the normalization process, you can set the dimensions to one row with three columns.
+*Set how many rows and columns the graphs will be partitioned in to.
   par(mfrow=c(1,3))
-*Create a boxplot of the log fold changes before normalization has occurred. The number within the brackets next to the variable designating the matrix of nonnormalized log fold changes denotes a GPR file. Also, set the range of the y-axis (ylim) so that the range of the boxplot for each GPR file is visible.
+*Create a box plot of the log fold changes before normalization has occurred.
-  boxplot(MG[,1],MG[,2],MG[,3],MG[,4],MG[,5],MG[,6],MG[,7],MG[,8],MG[,9],ylim=c(-5,5))
+ boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
+*Specify where the tick marks will appear on the x axis of the plot.
+ axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)
+*Place the tick mark labels along the x axis of the plot.
+  text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
+*Create a box plot of the log fold changes after within array normalization has been performed.
+ boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
+*Place the tick mark labels along the x axis of the plot.
+ axis(1,at=xy.coords(chips)$x,labels=FALSE)
+*Place the tick mark labels along the x axis of the plot.
+ text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
-*Create a boxplot of the log fold changes after within array normalization has occurred. The number within the brackets next to the variable designating the matrix of within array normalized log fold changes denotes a GPR file. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots of the nonnormalized data.
+*Create a box plot of the log fold changes after between array normalization has been performed.
-  boxplot(RR[,1],RR[,2],RR[,3],RR[,4],RR[,5],RR[,6],RR[,7],RR[,8],RR[,9],ylim=c(-5,5))
+  boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
-*Create a boxplot of the log fold changes after scale normalization has occurred. The number within the brackets next to the variable designating the matrix of scale normalized log fold changes denotes a GCAT GPR file within the matrix of all of the scale normalized data for all of the chips (both Ontario and GCAT). Therefore, it is important to make sure that you have the right order of GCAT GPR files. Also, make sure that the range of the y axis (ylim) is the same as in the previous set of boxplots.
+*Place the tick mark labels along the x axis of the plot.
-  boxplot(XY[,1],XY[,5],XY[,6],XY[,10],XY[,11],XY[,14],XY[,15],XY[,19],XY[,20],ylim=c(-5,5))
+  axis(1, at=xy.coords(chips)$x,labels=FALSE)
-If you used the alternative way to filter the GCAT chips, then use the following code:
+*Place the tick mark labels along the x axis of the plot.
-  boxplot(merged.MAD[,1],merged.MAD[,5],merged.MAD[,6],merged.MAD[,10],merged.MAD[,11],merged.MAD[,14],merged.MAD[,15],
+  text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
- merged.MAD[,19],XY[,20],ylim=c(-5,5))
 Maximize the window in which the plots have appeared. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.

Dahlquist:Microarray Data Processing in R: Difference between revisions

Revision as of 23:07, 28 May 2013

Contents

Input Files for R

Create a targets file with information about the .gpr files

Data Normalization

Within Array Normalization for the Ontario Chips

Within Array Normalization for the GCAT Chips

Between Array Normalization of Merged GCAT and Ontario Chip Data

Between Array Normalization of Only Ontario Chip Data

Visualize the Normalized Data

Create MA plots and Box Plots for the GCAT Chips

Create MA plots and Box Plots for the Ontario Chips

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools