# Katrina Sherbina: Week 2

## 05/25/2011

Scatter plots were generated of normalized log fold change ratios versus nonnormalized log fold change ratios for the Dahlquist wildtype flask 3 t15/t0 chip, the Dahlquist wildtype flask 3 t90/t0 chip, delGLN3 flask 1 t60/t0, delGLN3 flask 1 t30/t0, and delGLN3 flask 2 t15/t0, as indicated in the GPR files. While the scatter plot for the wildtype flask showed a linear relationship between nonnormalized and normalized log fold change ratios with an R^2 value of 1, the scatter plots for the delGLN3 flasks did not display a very linear relationship. In the scatter plots for the deletion strain, the data points were widely dispersed around the origin. The plots for delGLN3 flask 1 showed much more spread and had lower R^2 values than the plot for delGLN3 flask 2.

After observing the nonlinear nature of the scatter plots for the deletion strain, scatter plots were generated of the normalized versus nonnormalized values for the red channel, and of the normalized versus nonnormalized values for the green channel, for delGLN3 flask 1 t60/t0, delGLN3 flask 1 t30/t0, and delGLN3 flask 2 t15/t0. In all of these scatter plots, the data points formed something like a diagonal arrow pointing down toward the origin: they were widely dispersed close to the origin but began to display a linear relationship as the nonnormalized value for the channel increased.
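The degree of linearity described above can also be computed directly in R instead of being read off a chart. The following is a minimal sketch on made-up data: the vectors `raw`, `norm_linear`, and `norm_noisy` are hypothetical stand-ins, not values from the GPR files. It relies on the fact that for a simple linear regression the R^2 value equals the squared Pearson correlation.

```r
# Hypothetical data standing in for nonnormalized and normalized log fold changes.
set.seed(1)
raw <- rnorm(200)
norm_linear <- 0.9 * raw + 0.1               # exactly linear in raw
norm_noisy  <- 0.9 * raw + rnorm(200, sd = 1) # linear trend plus scatter

# For a simple linear regression, R^2 is the squared Pearson correlation.
r_squared <- function(x, y) cor(x, y)^2

r2_linear <- r_squared(raw, norm_linear)
r2_noisy  <- r_squared(raw, norm_noisy)

# Scatter plot with the fitted regression line for the noisy case.
plot(raw, norm_noisy,
     xlab = "Nonnormalized log fold change",
     ylab = "Normalized log fold change")
abline(lm(norm_noisy ~ raw))
```

An R^2 near 1 corresponds to the tight line seen for the wildtype chips, while a lower value corresponds to the spread seen for the deletion strain.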

In addition, data from the GPR files for the wildtype strain were normalized using the R statistical package. A script was written to perform within array normalization and between array normalization, to create MA plots before and after within array normalization, and to create a box plot of the log fold change ratios (M) after within array normalization. While both normalizations were performed and the MA plots were generated, the box plot could not be generated. In addition, we were not able to write the results of the normalizations to a .csv file using the write.table command.

Katrina Sherbina 20:42, 25 May 2011 (EDT)

## 05/26/2011

Using the data generated by the within array normalization yesterday, the M values for replicate genes were averaged using the tapply code in R [tapply(MA$M,as.factor(MA$genes[,5]),mean)], which Dr. Fitzpatrick supplied. In writing these data to a table, it was realized that the code as written told R to average the replicate genes for only one microarray chip (flask 3 t30/t0). A matrix of the M values was therefore created, and a for loop was applied to it so that the M values for the replicate genes were averaged for every microarray chip. A scatter plot was then created in Excel comparing these normalized values to the nonnormalized values provided in an Excel sheet by Dr. Dahlquist. There was no linear relationship between the nonnormalized and normalized values.
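The per-chip averaging described above can be illustrated on a small example. This is only a sketch under stated assumptions: the matrix `M`, the vector `genes`, and the helper `avg_replicates` are hypothetical and stand in for MA$M and the gene ID column. The loop runs over however many columns the matrix has and keeps the gene IDs as row names.

```r
# Hypothetical toy data: 6 spots on 3 chips.
# Gene "A" is spotted three times, "B" twice, and "C" once.
genes <- c("A", "B", "A", "C", "B", "A")
M <- matrix(c(1, 2, 3, 4, 6, 5,    # chip1
              0, 1, 0, 2, 3, 0,    # chip2
              2, 2, 2, 2, 2, 2),   # chip3
            nrow = 6,
            dimnames = list(NULL, c("chip1", "chip2", "chip3")))

# Average replicate spots gene by gene, one chip (column) at a time.
avg_replicates <- function(M, genes) {
  g <- as.factor(genes)
  out <- matrix(NA_real_, nrow = nlevels(g), ncol = ncol(M),
                dimnames = list(levels(g), colnames(M)))
  for (i in seq_len(ncol(M))) {
    out[, i] <- tapply(M[, i], g, mean)
  }
  out
}

avg <- avg_replicates(M, genes)
```

Here `avg` has one row per unique gene and one column per chip, so no chip is silently left out of the averaging.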

Next, only one GPR file, corresponding to one chip (flask 4 t30/t0), was examined. This GPR file was read into R to create an RG object. Code was then written to remove the data for the Arabidopsis and 3XSSC control spots. The log fold changes for all of the other genes were calculated in R as the log base 2 of the ratio of the background-subtracted green intensity (green minus green background) to the background-subtracted red intensity (red minus red background). The green to red ratio was used instead of the red to green ratio because flask 4 t30/t0 is a dye swap. The log ratios for the replicate genes were then averaged. Using a scatter plot, these log ratios were compared to the replicate-averaged log fold change ratios generated by the within array normalization. This scatter plot showed a strong linear relationship between the nonnormalized data (the log fold changes calculated in R) and the within array normalized data.

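    library(limma)  # provides readTargets and read.maimages
    targets <- readTargets(file.choose())
    # Weight 1 for spots with Flags > -99, weight 0 otherwise
    f <- function(x) as.numeric(x$Flags > -99)
    RG <- read.maimages(targets, source="genepix.median", wt.fun=f)
    # Remove the Arabidopsis and 3XSSC control spots
    RG2 <- RG[RG$genes$Name!="Arabidopsis",]
    RG3 <- RG2[RG2$genes$Name!="3XSSC",]
    GeneList <- RG3$genes$Name
    # Green over red (dye swap), each channel background-subtracted
    lr <- log2((RG3$G-RG3$Gb)/(RG3$R-RG3$Rb))
    # Average the log ratios of replicate genes
    lravg <- tapply(lr, as.factor(GeneList), mean)
    write.table(lravg, "Log_Ratios_RG_Avg_Rep_Minus_Controls.csv", sep=",")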

Writing the within array normalized data using the write.table function was still unsuccessful.

Katrina Sherbina 20:24, 26 May 2011 (EDT)

## 05/27/2011

The tapply function in R was applied to the log fold change ratios from the nonnormalized GPR files for Flask 3 t15/t0 and Flask 4 t15/t30 to average all of the replicates. These log fold change ratios were all multiplied by -1 because both Flask 3 and Flask 4 were dye swapped. A scatter plot was created comparing the corrected log fold change ratios to the log fold change ratios in a Master Data sheet provided by Dr. Dahlquist. The scatter plot appeared roughly linear, except that the data points were more dispersed around 0.

The same procedure (averaging the replicates with tapply, then multiplying by -1 to account for the dye swap) was applied to the log fold change ratios from the normalized GPR files for Flask 3 t15/t0 and Flask 4 t15/t30. Due to the way in which the Master Data was compiled, it was predicted that these corrected log fold changes would have the same values as the log fold changes in the Master Data. Comparing the corrected log fold changes to the log fold changes in the Master Data showed this to be the case.
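The reason a dye swap can be corrected by multiplying by -1 is that log2(R/G) = -log2(G/R). The short sketch below checks this on made-up intensities; the vectors `R` and `G` are hypothetical background-corrected values, not data from the chips.

```r
# Hypothetical background-corrected intensities for four spots.
R <- c(200, 800, 400, 100)  # red channel
G <- c(100, 400, 400, 400)  # green channel

lr_red_over_green <- log2(R / G)
lr_green_over_red <- log2(G / R)

# For a dye-swapped chip, flipping the sign of log2(R/G)
# gives the same numbers as computing log2(G/R) directly.
lr_corrected <- -1 * lr_red_over_green
```

Both routes give the same corrected log fold changes, so either one can be used as long as it is applied consistently to every dye-swapped chip.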

Also, the code developed yesterday to analyze the GPR files for the wildtype strain was built upon. The matrix generated by the for loop over the within array normalized data was successfully saved to a .csv file and viewed in Excel. A matrix was also created for the between array normalized data using a for loop. This matrix was also successfully written to a .csv file.

    library(limma)  # provides the read, normalization, and plotMA functions
    targets <- readTargets(file.choose())
    f <- function(x) as.numeric(x$Flags > -99)
    RG <- read.maimages(targets, source="genepix.median", wt.fun=f)
    # MA plots before and after within array (loess) normalization
    par(mfrow=c(1,2))
    plotMA(RG, main="Before Within Array Normalization")
    MA <- normalizeWithinArrays(RG, method="loess", bc.method="normexp")
    plotMA(MA, main="After Within Array Normalization")
    # Average replicate genes on each chip after within array normalization
    M1 <- tapply(MA$M[,1], as.factor(MA$genes[,5]), mean)
    n1 <- length(M1)       # number of unique gene IDs
    n0 <- length(MA$M[1,]) # number of chips
    MM <- matrix(nrow=n1, ncol=n0)
    MM[,1] = M1
    for (i in 1:14) {MM[,i] <- tapply(MA$M[,i], as.factor(MA$genes[,5]), mean)}
    write.table(MM, "WANorm_Avg_Reps_Log_Fold_Change_Matrix.csv", sep=",")
    # Between array (scale) normalization, then average replicate genes again
    MAScale <- normalizeBetweenArrays(MA, method="scale")
    M2 <- tapply(MAScale$M[,1], as.factor(MAScale$genes[,5]), mean)
    n2 <- length(M2)
    n0 <- length(MAScale$M[1,])
    MM2 <- matrix(nrow=n1, ncol=n0)
    MM2[,1] = M2
    for (i in 1:14) {MM2[,i] <- tapply(MAScale$M[,i], as.factor(MAScale$genes[,5]), mean)}
    write.table(MM2, "BANorm_Avg_Reps_Log_Fold_Change_Matrix.csv", sep=",")
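Because the matrices above are created with matrix(nrow=..., ncol=...), they carry no row names, so the written files appear not to record which gene each row belongs to. The sketch below shows one way to keep gene IDs in the output. It assumes a small made-up matrix: `MM_named` and `out_file` are hypothetical names, and the file is written to a temporary path only for illustration.

```r
# Hypothetical averaged matrix: rows are genes, columns are chips.
MM_named <- matrix(c(0.5, -1.2, 2.0, 0.1, 0.0, -0.7), nrow = 3,
                   dimnames = list(c("GENE_A", "GENE_B", "GENE_C"),
                                   c("chip1", "chip2")))

# Temporary path, used only for this illustration.
out_file <- tempfile(fileext = ".csv")

# col.names = NA writes an empty header cell above the row names,
# so the header lines up with the data columns when opened in Excel.
write.table(MM_named, out_file, sep = ",", col.names = NA)

# Read the file back, using the first column as row names.
round_trip <- as.matrix(read.csv(out_file, row.names = 1, check.names = FALSE))
```

Reading the file back gives a matrix with the same gene IDs and values, which is a quick way to confirm that the export worked.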

In addition, box plots were generated for the data before between array normalization (that is, after within array normalization) and for the data after between array normalization.

    par(mfrow=c(3,6))
    # Before between array normalization (after within array normalization)
    for(i in 1:14){boxplot(MM[,i], ylim=c(-4,10))}
    # After between array normalization
    for(i in 1:14){boxplot(MM2[,i], ylim=c(-5,12))}
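As an alternative to drawing one panel per chip, base R's boxplot accepts a matrix and draws one box per column on a shared axis, which can make it easier to compare chips side by side. This is a sketch on simulated data; `mat` is a hypothetical stand-in for a matrix such as MM or MM2.

```r
set.seed(2)
# Hypothetical stand-in for averaged log fold changes (100 genes on 4 chips).
mat <- matrix(rnorm(400), nrow = 100,
              dimnames = list(NULL, paste0("chip", 1:4)))

# plot = FALSE returns the box plot statistics without drawing anything.
box_stats <- boxplot(mat, plot = FALSE)

# One call draws one box per column on a shared y-axis.
boxplot(mat, ylab = "Log fold change (M)", main = "Per-chip distributions")
```

The `stats` component of `box_stats` holds the five box plot statistics for each chip (lower whisker, lower hinge, median, upper hinge, upper whisker), so the distributions can also be compared numerically.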

Then this code involving within array normalization, between array normalization, for loops for both types of normalization, MA plots, and box plots was modified to analyze the data from GPR files for the dGLN3 strain and the dCIN5 strain.

So far, only the GPR files from Ontario microarray chips have been normalized. The next step is to find a way in R to normalize the GPR files from the GCAT microarray chips.

Katrina Sherbina 19:19, 27 May 2011 (EDT)