My Computational Journal

May 20, 2011
The R statistical package was used to read the nonnormalized data (GPR files) for all of the microarray chips prepared in the lab. It was found that the script used yesterday to read the nonnormalized data and then normalize it was incorrect: without clearly specifying whether the mean or median values for each labelling dye should be used, R used mean values in some parts of the script and median values in others. In going back to the data today, we specified that the median values should be used in calculating log ratios. Specifically, R was used to calculate the log base 2 ratio of red to green dye for the Dahlquist wildtype flask 3 t15/t0 microarray chip. The ratio was calculated by subtracting the median value for the red background from the median value of the red foreground and dividing this difference by the difference between the median value for the green foreground and the median value for the green background. The negative log ratio was calculated because the chip being analyzed was dye swapped. The first five log ratios calculated by R (two of which were duplicates of two of the three genes looked at) were compared to the first five log ratios in the GPR file corresponding to the chip being analyzed. The log ratios from R were the negative values of the log ratios in the GPR file, which was expected because of the dye swap done on the chip. However, we should ask Dr. Dahlquist why she used a red to green ratio instead of a green to red ratio for the dye swaps. The log ratios generated by R were also compared to the log ratios within an Excel file containing the log ratios for all of the chips. There were no duplicate genes in the Excel file. Therefore, the log ratios calculated in R for each of the two duplicated genes were averaged and compared to the log ratios within the Excel file.
It was observed that the averaged log ratios did not correspond to the log ratios for the three genes on the Dahlquist wildtype flask 3 t15/t0 chip. We need to talk to Dr. Dahlquist about how she did her calculations, which may shed some light on why the log ratios calculated by R did not match the log ratios provided in Excel.
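The per-spot calculation described above can be sketched in base R. The numbers below are made up for illustration, not values from the actual chip:

```r
# Toy sketch of the median-based log ratio and dye swap sign described above.
# Rf/Rb: red foreground/background medians; Gf/Gb: green foreground/background medians.
Rf <- c(500, 820, 300); Rb <- c(100, 120, 100)
Gf <- c(450, 400, 900); Gb <- c(50, 100, 100)
lr <- log2((Rf - Rb) / (Gf - Gb))  # log base 2 red:green ratio per spot
lr_dyeswap <- -lr                  # negate because the chip was dye swapped
```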

Katrina Sherbina 16:38, 20 May 2011 (EDT)

May 25, 2011
Scatter plots were generated of normalized versus nonnormalized log fold change ratios for the Dahlquist wildtype flask 3 t15/t0 chip, wildtype flask 3 t90/t0 chip, delGLN3 flask 1 t60/t0, delGLN3 flask 1 t30/t0, and delGLN3 flask 2 t15/t0, as indicated in the GPR files. While the scatter plot for the wildtype flask showed a linear relationship between nonnormalized and normalized log fold change ratios with an R^2 value of 1, the scatter plots for each of the delGLN3 flasks did not display a very linear relationship. In the scatter plots for the deletion strain, the data points were very dispersed around the origin. The plot for delGLN3 flask 1 showed much more spread and had a lower R^2 value than the plot for delGLN3 flask 2. After observing the nonlinear nature of the scatter plots for the deletion strain, scatter plots were generated of the normalized versus nonnormalized values for the red channel and of the normalized versus nonnormalized values for the green channel for delGLN3 flask 1 t60/t0, delGLN3 flask 1 t30/t0, and delGLN3 flask 2 t15/t0. In all of these scatter plots, the data points resembled a diagonal arrow pointing down toward the origin: the points were widely dispersed close to the origin but began to display a linear relationship as the nonnormalized value for the channel increased.

In addition, data from the GPR files for the wildtype strain was normalized using the R statistical package. A script was written that performed within array normalization and between array normalization and also created MA plots before and after within array normalization and a box plot of log fold change ratios (M) after within array normalization. While both normalizations were performed and the MA plots were generated, the box plot could not be generated. In addition, we were not able to write the results of the normalizations to a .csv file using the write.table command.

Katrina Sherbina 20:42, 25 May 2011 (EDT)

May 26, 2011
Using the data generated by completing the within array normalization yesterday, all of the M values for replicate genes were averaged using the tapply function in R [tapply(MA$M,as.factor(MA$genes[,5]),mean)], which Dr. Fitzpatrick supplied. In writing this data to a table, it was realized that the way in which the code was written told R to average the replicate genes for only one microarray chip (flask 3 t30/t0). Then, a matrix of the M values was created and a for loop applied to it so that the M values for the replicate genes were averaged for all microarray chips. A scatter plot was then created in Excel comparing these normalized values to the nonnormalized values provided in an Excel sheet by Dr. Dahlquist. There was no linear relationship between the nonnormalized and normalized values.

Then, only one GPR file corresponding to one chip (flask 4 t30/t0) was looked at. This GPR file was read by R to create an RG object. Code was then written to remove the data for the Arabidopsis and 3XSSC controls. The log fold changes for all of the other genes were then calculated in R as the log base 2 ratio of the difference of the green and green background to the difference of the red and red background. The green to red ratio was found instead of the red to green ratio because flask 4 t30/t0 is a dye swap. The log ratios for the replicate genes were then averaged. Using a scatter plot, these log ratios were compared to the log fold change ratios averaged over replicates generated by the within array normalization. This scatter plot showed a strong linear relationship between the nonnormalized data (the log fold changes generated in R) and the within array normalized data.

targets<-readTargets(file.choose())
f <- function(x) as.numeric(x$Flags > -99)  # weight function to down-weight flagged spots
RG <- read.maimages(targets, source="genepix.median", wt.fun=f)
RG2<-RG[RG$genes$Name!="Arabidopsis",]      # remove Arabidopsis controls
RG3<-RG2[RG2$genes$Name!="3XSSC",]          # remove 3XSSC controls
GeneList<-RG3$genes$Name
lr<-log2((RG3$G-RG3$Gb)/(RG3$R-RG3$Rb))     # green:red because this chip is a dye swap
lravg<-tapply(lr,as.factor(GeneList),mean)  # average replicate spots per gene
write.table(lravg,"Log_Ratios_RG_Avg_Rep_Minus_Controls.csv",sep=",")

Writing the within array normalized data using the write.table function was still unsuccessful.

Katrina Sherbina 20:24, 26 May 2011 (EDT)

May 27, 2011
The tapply function in R was applied to the log fold change ratios from the nonnormalized GPR files for Flask 3 t15/t0 and Flask 4 t15/t30 to average all of the replicates. These log fold change ratios were all multiplied by -1 because both Flask 3 and Flask 4 were dye swapped. The corrected log fold change ratios were compared to the log fold change ratios in a Master Data sheet provided by Dr. Dahlquist. A scatter plot comparing the two displayed what looked like a linear relationship, except that the data points were more dispersed around 0. The tapply function was also applied to the log fold change ratios from the normalized GPR files for Flask 3 t15/t0 and Flask 4 t15/t30 to average all of the replicates, and these log fold change ratios were likewise multiplied by -1 because both flasks were dye swapped. Due to the way in which the Master Data was compiled, it was predicted that these corrected log fold changes would be of the same value and magnitude as the log fold changes in the Master Data. In comparing the corrected log fold changes to the log fold changes in the Master Data, this was found to be the case.
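The replicate averaging and dye-swap correction just described amount to something like the following sketch (gene names and ratios below are invented):

```r
# Average replicate spots per gene with tapply, then flip the sign
# because the chip was dye swapped.
genes <- c("YAL001C", "YAL001C", "YBR020W")
lr    <- c(1.0, 1.2, -0.5)                  # log fold change ratios per spot
lravg <- tapply(lr, as.factor(genes), mean)
corrected <- -1 * lravg                     # dye-swap correction
```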

Also, the code developed yesterday to analyze the GPR files for the wild type strains was built upon. The matrix generated by creating a for loop for the within array normalization data was successfully saved to a .csv file and viewed within Excel. A matrix was also created for the between array normalization data using a for loop; this matrix was also successfully written to a .csv file.

targets<-readTargets(file.choose())
f <- function(x) as.numeric(x$Flags > -99)
RG <- read.maimages(targets, source="genepix.median", wt.fun=f)
par(mfrow=c(1,2))
plotMA(RG, main="Before Within Array Normalization")
MA<-normalizeWithinArrays(RG,method="loess",bc.method="normexp")
plotMA(MA, main="After Within Array Normalization")
M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)  # average replicate spots, first chip
n1<-length(M1)            # number of unique genes
n0<-length(MA$M[1,])      # number of chips
MM<-matrix(nrow=n1,ncol=n0)
for (i in 1:14) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}
write.table(MM,"WANorm_Avg_Reps_Log_Fold_Change_Matrix.csv",sep=",")
MAScale<-normalizeBetweenArrays(MA, method="scale")
M2<-tapply(MAScale$M[,1],as.factor(MAScale$genes[,5]),mean)
n2<-length(M2)
n0<-length(MAScale$M[1,])
MM2<-matrix(nrow=n2,ncol=n0)
for (i in 1:14) {MM2[,i]<-tapply(MAScale$M[,i],as.factor(MAScale$genes[,5]),mean)}
write.table(MM2,"BANorm_Avg_Reps_Log_Fold_Change_Matrix.csv",sep=",")

In addition, box plots were generated for the data before between array normalization (which is also the data after within array normalization) and for the data after between array normalization.

par(mfrow=c(3,6))
for(i in 1:14){boxplot(MM[,i], ylim=c(-4,10))}   # before between array normalization
for(i in 1:14){boxplot(MM2[,i], ylim=c(-5,12))}  # after between array normalization

Then this code involving within array normalization, between array normalization, for loops for both types of normalization, MA plots, and box plots was modified to analyze the data from GPR files for the dGLN3 strain and the dCIN5 strain.

So far, only the GPR files from Ontario microarray chips have been normalized. The next step is to find a way in R to normalize the GPR files from the GCAT microarray chips.

Katrina Sherbina 19:19, 27 May 2011 (EDT)

June 1, 2011
One master Excel workbook was created with a spreadsheet of the GCAT chip normalizations that a coworker completed today, a spreadsheet of all of the within array normalizations already done last week for all of the Ontario chips, a spreadsheet with the between array normalizations for the Ontario chips, and a spreadsheet that integrated both the GCAT chip normalizations and the within array normalizations for the Ontario chips.

To create the integrated spreadsheet mentioned above, Microsoft Access was used to merge the normalized data from the GCAT and the Ontario chips, eliminating any genes in the GCAT chips that were not also in the Ontario chips. The steps are as follows.
 * 1) Save the Excel files of the data for the GCAT and Ontario chips as tab delimited files.
 * 2) Create a new database on Access.
 * 3) Import the data (File->Get External Data->Import)
 * 4) Go through the Import Wizard: specify the data as delimited, keep the delimiter as tab and the text qualifier as none, indicate that the first row contains field names, and choose the ID names (the genes) as the primary key. The Import Wizard is gone through twice, once with the GCAT data and once with the Ontario data.
 * 5) In the window for the current database, go to queries and select "Create query in Design View".
 * 6) Add both imported tables (the GCAT and Ontario).
 * 7) In the "Select Query" window, join GCAT ID and Ontario ID with a line. Right click on the line and press the "Join Properties" option and then third option (to include all records from the Ontario data and only those of the GCAT data that are also within the Ontario data).
 * 8) Select all of the fields in the GCAT query window and drag into the first box in the "Field" row in the table below.
 * 9) Select all of the fields in the Ontario query window and drag into the next free box in the "Field" row in the table below.
 * 10) Create a new table for this joined data (Query->Make-Table Query).
 * 11) Copy and paste the new table into a new Excel spreadsheet.
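For reference, the same left join can also be done directly in R with merge(); the data frames and column names below are invented for illustration:

```r
# Keep all Ontario genes and only the GCAT genes that match them,
# mirroring the Access "Join Properties" third option above.
ontario <- data.frame(ID = c("YAL001C", "YBR020W", "YCL030C"),
                      ont_lfc = c(0.5, -1.2, 2.0))
gcat <- data.frame(ID = c("YAL001C", "YCL030C"),
                   gcat_lfc = c(0.4, 1.8))
joined <- merge(ontario, gcat, by = "ID", all.x = TRUE)  # left join on gene ID
```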

Katrina Sherbina 20:27, 1 June 2011 (EDT)

June 2, 2011
It was found that the MA plots generated before and after within array normalization only corresponded to the first microarray chip (first GPR file) in the targets file imported into R. New code was written to generate MA plots before and after within array normalization for all of the GPR files in the targets file. The number of rows and columns of MA plots, the number of iterations for the for loop, and the limits of the y-axis were altered for each strain.

par(mfrow=c(3,5))
for (i in 1:14) {plotMA(RG[,i],ylim=c(-4,4))}  # before within array normalization
for (i in 1:14) {plotMA(MA[,i])}               # after within array normalization

Originally, individual boxplot graphs were generated side by side for before and after between array normalization for all the GPR files in the targets file. New code was written to generate all of the boxplots for all of the GPR files in one graph. The limits for the y-axis were changed for each strain.

x<-as.matrix(MA$M)
boxplot(x[,1],x[,2],x[,3],x[,4],x[,5],x[,6],x[,7],x[,8],x[,9],x[,10],x[,11],x[,12],x[,13],x[,14],ylim=c(-6,6))
y<-as.matrix(MAScale$M)
boxplot(y[,1],y[,2],y[,3],y[,4],y[,5],y[,6],y[,7],y[,8],y[,9],y[,10],y[,11],y[,12],y[,13],y[,14],ylim=c(-6,6))

In addition, box plots were generated in one graph for all the GPR files in the targets file before any normalization. The code was similar to the boxplot code above, except that in place of the x or y values, the log base 2 ratio of red minus red background to green minus green background was computed for each individual microarray chip. For chips that were dye swapped, the negative log base 2 ratio was used.
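A minimal sketch of that pre-normalization boxplot code, using toy matrices in place of the actual red and green channels (here the third chip is treated as dye swapped):

```r
# Two spots per chip, three chips; chip 3 is dye swapped.
R  <- matrix(c(400, 800, 300, 500, 600, 700), nrow = 2)
Rb <- matrix(50, nrow = 2, ncol = 3)
G  <- matrix(c(350, 450, 250, 450, 900, 500), nrow = 2)
Gb <- matrix(50, nrow = 2, ncol = 3)
dyeswap <- c(1, 1, -1)
lr <- log2((R - Rb) / (G - Gb))                 # per-chip log base 2 ratios
for (i in 1:3) lr[, i] <- dyeswap[i] * lr[, i]  # negate dye-swapped chips
boxplot(lr[, 1], lr[, 2], lr[, 3], ylim = c(-6, 6))
```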

Katrina Sherbina 19:34, 2 June 2011 (EDT)

June 8, 2011
Using the multtest package in R, we began calculating p values to analyze whether genes that appeared differentially expressed in the experiments could have been differentially expressed by chance. Only the between array normalized data from the wildtype strain at time point t15 was used. This data was extracted from its master file, written to a new .csv file, and imported into R. In multtest, the method used was the step-down minP and the test used was the one sample t test. When first running the code, few numerical p values were generated; mostly the output was a series of NaN's. After modifying the code to include standardize=FALSE, multtest successfully produced p values.

data1<-read.csv("wt_t15_normalized.csv",header=TRUE,sep=",")
data2<-as.matrix(data1[,2:4])
data3<-rep(0,length(data1[,2:4]))
seed<-99
exp<-MTP(X=data2,Y=data3,test="t.onesamp",B=100,method="sd.minP",seed=seed,standardize=FALSE)
write.table(as.matrix(exp@rawp),"wt_t15_pvalues.csv",sep=",")

The next step is to compare the raw p-values to p-values generated by hand with the same data in Excel using Dr. Dahlquist's method.

Katrina Sherbina 20:07, 8 June 2011 (EDT)

June 9, 2011
The t statistic and p values for the data collected for the wild type strain for all replicates at t15 were calculated in Excel using the formulas AVERAGE(range of cells)/(STDEV(range of cells)/SQRT(number of replicates)) and TDIST(ABS(cell containing t statistic),degrees of freedom,2), respectively, where degrees of freedom is the number of replicates minus 1. The Bonferroni correction was also calculated for each data point.
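The same calculation can be written in R for one gene with toy replicate log ratios; the 6000-gene multiplier used for the Bonferroni correction below is an assumption about the number of tests, not a value from the notebook:

```r
x <- c(1.2, 0.9, 1.5)                 # toy log fold changes, n = 3 replicates
n <- length(x)
tstat <- mean(x) / (sd(x) / sqrt(n))  # AVERAGE(...)/(STDEV(...)/SQRT(n))
p <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)  # TDIST(ABS(t), df, 2)
p_bonferroni <- min(1, p * 6000)      # Bonferroni: p times assumed number of genes
```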

Using the multtest package in R, the raw p values and the adjusted p values were collected for the same data mentioned above using multiple testing procedures such as the single-step maxT (ss.maxT), single-step minP (ss.minP), step-down maxT (sd.maxT), and step-down minP (sd.minP). Raw and adjusted p values were also calculated by controlling the false discovery rate.

data1<-read.csv("wt_t15_normalized.csv",header=TRUE,sep=",")
data2<-as.matrix(data1[,2:4])
data3<-rep(0,length(data1[,2:4]))
seed<-99
exp<-MTP(X=data2,Y=data3,test="t.onesamp",standardize=FALSE,typeone="fdr",fdr.method="conservative",B=3000,seed=seed)
exp1<-MTP(X=data2,Y=data3,test="t.onesamp",standardize=FALSE,method="sd.minP",B=3000,seed=seed)
exp2<-MTP(X=data2,Y=data3,test="t.onesamp",standardize=FALSE,method="sd.maxT",B=3000,seed=seed)
exp3<-MTP(X=data2,Y=data3,test="t.onesamp",standardize=FALSE,method="ss.minP",B=3000,seed=seed)
exp4<-MTP(X=data2,Y=data3,test="t.onesamp",standardize=FALSE,method="ss.maxT",B=3000,seed=seed)

Also four graphical summaries were created for the results of the ss.minP: number of rejected hypotheses vs. Type I error rate, sorted adjusted p-values vs. number of rejected hypotheses, adjusted p-values vs. test statistics, and adjusted p-values versus index.

The next step will be to compare the raw and adjusted p values calculated by the R multiple testing procedures with the p value and Bonferroni corrections calculated in Excel, respectively.

Katrina Sherbina 19:43, 9 June 2011 (EDT)

June 10, 2011
For the wildtype t15 data, a scatter plot was created comparing the t-statistic derived raw p values with the raw p values obtained from performing a multtest using FDR; the t-statistic derived raw p values were more conservative than the FDR raw p values. Also, for the dCIN5 data for all time points, f-statistic derived raw p values were calculated. These raw p values were compared to the raw p values from multtests using FDR, ss.minP, and sd.minP in three separate scatter plots. Each of these scatter plots showed no correlation between the f-statistic derived raw p values and any of the p values calculated by the different multtests performed. Also, a Benjamini-Hochberg correction was applied to the f-statistic derived p values. These were then compared to the adjusted p values calculated using the aforementioned multtests. Again, it was not possible to discern any relationship between the values.

In addition, Dr. Fitzpatrick suggested comparing t-statistic derived p values for the dCIN5 data to the f-statistic derived p values for the same data. However, if the same t-statistic calculations performed on the wildtype t15 data are performed on the dCIN5 data, this comparison cannot be made: the f-statistic derived p values take into account all time points (at least as calculated today), while the t-statistic derived p values are for each individual time point.

Katrina Sherbina 19:48, 10 June 2011 (EDT)

June 13, 2011
Several errors were found in the master Excel spreadsheet, the compilation of all of the normalization done with all of the GPR files in R. To remedy these errors, within array and between array normalization was performed again on the GCAT chips and the new numbers were entered into the master Excel spreadsheet. In addition, the within array and between array normalized data for dZAP1 was corrected within the master Excel spreadsheet.

In preparation for adjusting the current model of the gene regulatory network, means and standard deviations were calculated from the between array normalized data for fifteen genes in the wildtype, dCIN5, dGLN3, dHMO1, and dZAP1 strains. An original spreadsheet for the modeling was modified to create five separate Excel spreadsheets, one for each of the five strains, with the means entered as the log 2 concentrations and the standard deviations entered as the concentration sigmas.
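The per-gene means and standard deviations across replicates can be sketched as follows (toy matrix; the gene names are placeholders):

```r
# Rows = genes, columns = replicate chips of between array normalized log2 ratios.
m <- matrix(c( 1.0,  1.2,  0.8,
              -0.5, -0.4, -0.6),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("CIN5", "GLN3"), NULL))
gene_means <- apply(m, 1, mean)  # entered as log 2 concentrations
gene_sds   <- apply(m, 1, sd)    # entered as concentration sigmas
```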

Katrina Sherbina 19:40, 13 June 2011 (EDT)

June 14, 2011
Each of the five Excel spreadsheets created yesterday (with the log 2 concentrations and concentration sigmas) was used in running already created estimation codes in MATLAB. For each strain, a graph was created modeling the log fold changes versus time for each of the fifteen yeast genes focused on. While the graphs of the wildtype strain showed marked differences in the log fold changes with time, many of the graphs for the deletion strains did not.

We began to look at changing the alpha values for the wildtype and dGLN3 data to find an alpha value that gives a middle ground between a large least squares error with a small penalty and a small least squares error with a large penalty. We ran the MATLAB estimation codes again for the two strains for seven different alpha values (0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001). An alpha value of 0.005 was used the first time the estimation codes were run for all strains.

Katrina Sherbina 21:13, 14 June 2011 (EDT)

June 15, 2011
In analyzing the least squares error and penalty data computed for six different alpha values yesterday, we decided to try three intermediate alpha values between 0.005 and 0.001, namely 0.002, 0.003, and 0.004, for both the wildtype and dGLN3 data. A scatter plot was generated plotting the least squares error versus the penalty values for all ten alpha values. The optimal alpha value was found to be 0.002.

Next, we computed the relative error for the production rates, network weights, and network thresholds for the aforementioned alpha values for the wildtype and dGLN3 data. This error was calculated by comparing the values generated by an alpha value of 0.002 with the values generated by each of the other alpha values. The error was scaled by the sum of the squared sum of the values generated by an alpha value of 0.002 and the squared sum of the values generated by the other alpha value being compared. There was a general trend among the relative errors calculated for the production rates, network weights, and network thresholds: the relative error decreased as the alpha value initially decreased, but then began to increase with decreasing alpha values beginning at the middle alpha value. The goal of calculating these relative error values is to determine whether the production rates, the network weights, or the network thresholds will be the most difficult to estimate.
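One reading of the relative error just described, sketched with made-up weight vectors (the function name and values are illustrative only):

```r
# Relative error: summed squared differences between two estimates, scaled by
# the sum of their squared sums (our reading of the description above).
rel_error <- function(ref, alt) sum((ref - alt)^2) / (sum(ref^2) + sum(alt^2))
w_ref <- c(0.5, -1.0, 2.0)  # e.g. network weights at alpha = 0.002
w_alt <- c(0.6, -0.9, 1.8)  # network weights at another alpha value
rel_error(w_ref, w_alt)
```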

Katrina Sherbina 20:08, 15 June 2011 (EDT)

June 17, 2011
For the dGLN3 and wildtype data, the production rates, network weights, and network thresholds were graphed versus the ten alpha values that have been used with the MATLAB estimation codes. It was found that the network weights were the most dispersed. In addition, the production rates, network weights, and network thresholds generated by the alpha values of 0.001, 0.002, and 0.005 were each graphed in separate scatter plots against a numerical index. Within each scatter plot, there were similar trends among the data generated by the different alpha values.

Yesterday, it was found that we did not have the GPRs from flask 3 for dGLN3 when performing the normalization. Therefore, within array and between array normalization was performed again in R with the additional data. The between array normalized data was used to compute the average log fold change and standard deviation for the fifteen genes constituting the current predicted regulatory network. The comparisons and plots mentioned in the previous paragraph were not regenerated for the corrected between array normalized data because both the wildtype and previous dGLN3 data showed similar trends in these plots and comparisons. Therefore, it was decided in conjunction with Dr. Fitzpatrick that it was not necessary to generate these plots again.

YEASTRACT was used to generate a new network between the fifteen transcription factors and target genes already being studied. In addition, a new network was generated in YEASTRACT with these transcription factors and genes plus GLN3, HMO1, and ZAP1. New Excel data sheets were created for use with the MATLAB code using the network determined by YEASTRACT. In addition, degradation rates for GLN3, HMO1, and ZAP1 were input from another research paper, and production rates were calculated by doubling the degradation rates. The average log fold change and standard deviation were also calculated for these three genes added to the network for the wildtype and each of the deletion strains.

We began running the estimation codes in MATLAB again with these new Excel sheets. We will be running further simulations to test which alpha value will be the best to use in additional simulations with the new network.

Katrina Sherbina 19:45, 17 June 2011 (EDT)

June 20, 2011
The first thing done today was changing all of the RSC1 labels (RSC1 being a synonym for AFT1) to AFT1. The 15 gene and 18 gene networks previously generated with YEASTRACT used documented regulations from direct evidence only. Today we generated new 15 gene and 18 gene networks in YEASTRACT using documented regulations from both direct and indirect evidence. After changing the RSC1 labels to AFT1, the 15 and 18 gene networks generated using direct evidence were different from those previously generated. Therefore, the MATLAB simulations that were run last Friday have to be redone. In addition, we generated two networks with a set of 21 genes (this network contains all the genes from the 15 gene network plus 6 more genes). Both were generated using YEASTRACT: one with documented regulations from direct evidence only and the other with documented regulations from both direct and indirect evidence.

New Excel spreadsheets were generated for each of the new networks generated today in preparation for running new MATLAB simulations. For the 21 gene networks, the average log fold changes and the standard deviations were calculated for the 6 genes added to the previous 15 genes. For any genes whose transcripts did not have a half life value in the source being used to calculate degradation rates, a half life of 25.5 minutes was used. The degradation rate calculated using a half life of 25.5 minutes was also used to calculate the production rate.
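Assuming first order decay, the half life to rate conversion described above looks like:

```r
# Degradation rate from transcript half life (first order decay: k = ln(2)/t_half),
# with the production rate taken as twice the degradation rate, as described above.
half_life   <- 25.5               # minutes, the default used when no value was found
degradation <- log(2) / half_life # per minute
production  <- 2 * degradation
```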

We also experimented with what changes are made to the network generated in YEASTRACT for the original 15 genes when different specifications are chosen (such as documented regulations or potential regulations and the specifications within them). We are still trying to figure out where exactly YEASTRACT is getting its direct and indirect evidence from.

Katrina Sherbina 20:46, 20 June 2011 (EDT)

June 21, 2011
The objective was to determine the most appropriate alpha value to use for the new 15 gene network. New MATLAB simulations were done with the new 15 gene network obtained by using direct evidence only and the wildtype and dGLN3 data for five alpha values (0.01, 0.005, 0.002, 0.001, and 0.0005). The least squares error was graphed versus the penalty for both the wildtype and dGLN3 strains. Upon looking at the scatter plot for the dGLN3 data, it was decided to also try alpha values of 0.003, 0.0015, and 0.0001. When comparing the scatter plot for the wildtype data with that for the dGLN3 data, the two plots seem to suggest that two different alpha values are most appropriate for running further simulations. It will need to be decided whether further simulations must be run to determine the most appropriate alpha value for the new 15 gene network generated using both direct and indirect evidence.

The next step is to determine appropriate alpha values by running similar simulations with the 18 gene and 21 gene networks.

Katrina Sherbina 20:47, 21 June 2011 (EDT)

June 22, 2011
We ran simulations in MATLAB for all strains for the two new 15 gene networks (direct evidence only, and direct and indirect evidence) and began running simulations for the two new 18 gene networks (direct evidence only, and direct and indirect evidence). All of these simulations were performed using an alpha value of 0.002.

Replicate simulations were performed for wildtype and dGLN3 for the two new 15 gene networks (direct evidence only, and direct and indirect evidence) using the alpha value 0.002 to confirm that the optimized network weights and network thresholds are the same between replicate experiments within the individual networks.

Katrina Sherbina 21:33, 22 June 2011 (EDT)

June 24, 2011
First, we revisited and edited the comparisons we began two days ago for replicate simulations. In contrast to the initial comparisons we made, we made error calculations and other calculations within Excel to compare the optimized network weights and network thresholds between the networks within the same trial simulation.
 * 1) Calculated the absolute value of each entry within each of the four matrices (direct network trial 1, direct & indirect network trial 1, direct network trial 2, direct & indirect network trial 2) for the wildtype and dGLN3. The sum of each of these matrices was found.
 * 2) The sum A was found of corresponding networks within the same trial (i.e. sum of direct network trial 1 and direct & indirect network trial 1).
 * 3) A new matrix consisting of relative errors was calculated for each trial by finding the difference of corresponding network matrix entries (i.e. the difference between the first entry in the direct network trial 1 matrix and the first entry in the direct & indirect network trial 1 matrix) and dividing by the sum A.
 * 4) Cells that corresponded to a network connection that's only in one of the networks (direct vs. direct plus indirect) were colored with a yellow fill. Cells that were common to both networks were colored with an orange fill.
 * 5) The signs of the weights and the thresholds were compared between trials within the same network and also between two networks within the trial simulation. A new matrix was generated in which each cell represented a comparison of the signs using the Excel function =if(sign(cell 1)=sign(cell 2),0,1). The sum of this matrix was found to determine how many sign changes occurred.
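Step 5 can be mirrored in R with toy weight matrices (the values below are invented):

```r
# Count sign changes between two trials' optimized weight matrices:
# the R analogue of summing =IF(SIGN(cell 1)=SIGN(cell 2),0,1) over all cells.
w_trial1 <- matrix(c(0.5, -1.2, 0.3, -0.7), nrow = 2)
w_trial2 <- matrix(c(0.4,  1.1, 0.2, -0.9), nrow = 2)
sign_changes <- sum(sign(w_trial1) != sign(w_trial2))
```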

We ran MATLAB simulations for the wildtype for the 15 gene direct and direct plus indirect networks using alpha values of 0.005 and 0.001. Then, we repeated the same calculations above for this data. Due to time constraints, we could not do the same for dGLN3 today; we plan to run these simulations and perform the calculations for the dGLN3 data next week.

We also redid the normalization in R for both dHMO1 and dZAP1 because we observed sign changes at some time points when looking at regulatory pathways in GenMAPP modeled with Dr. Dahlquist's original data versus those modeled with the data normalized in R. Unfortunately, we came upon a disturbing discovery when comparing the normalization we did today with the normalized data in the Excel spreadsheet. We found that the normalization changed when the controls 3XSSC and Arabidopsis on the microarray were removed before normalization, in comparison to when the controls were kept for the normalization. We also found discrepancies in the sign of the log fold changes between two sets of normalized data, both of which were presumably normalized with the controls. We are currently consulting with Dr. Fitzpatrick and Dr. Dahlquist to determine whether or not the controls should be kept when normalizing the data. We also have to determine why we are getting different results when repeating the normalization. Consequently, within and between array normalization will be redone next week for all strains and compared to the data we currently have to see if there are any inconsistencies. This also means that the MATLAB simulations performed this week may need to be redone. Despite this, the comparisons between replicate experiments mentioned above will be kept because these comparisons look at the output and are thus independent of the input. The same trends should be observed in the comparisons if new data is input into MATLAB to perform replicate simulations.

June 27, 2011
My coworker and I each went through all of the data for all five strains and repeated the normalization process in R individually.

The first set of code was used to create an MA plot of the data for each strain before any normalization was done.

par(mfrow=c(4,4))

This designates how many rows and columns of graphs should be fitted onto one window. The numbers change depending on how many target files are being imported into R.

targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RG<-read.maimages(targets, source="genepix.median", wt.fun=f)
dyeswap<-c(1,-1,-1,1,-1,-1,1,-1,1,-1,-1,1,-1,-1)

The last line of code designates which GPR files correspond to microarray chips that have been dye-swapped. The "-1" represents a dye-swapped chip. This line of code needs to be changed for each strain in order to accurately reflect which GPR files correspond to dye-swapped chips for that deletion strain.
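As a minimal sketch of the idea (with made-up values, not data from the lab's chips), multiplying each dye-swapped chip's column of log ratios by its entry in the dyeswap vector puts all chips in the same dye orientation:

```r
# Toy matrix of log2 ratios: one column per chip, one row per spot.
lr <- matrix(c(1.2, -0.4,    # chip 1 (not dye-swapped)
               -1.2, 0.4),   # chip 2 (dye-swapped replicate of chip 1)
             nrow = 2)
dyeswap <- c(1, -1)          # -1 marks the dye-swapped chip
# Multiply each column by its dyeswap flag to undo the swap.
for (i in 1:2) {lr[, i] <- dyeswap[i] * lr[, i]}
# Both columns now agree in sign.
```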

GeneList<-RG$genes$Name
lr<-log2((RG$R-RG$Rb)/(RG$G-RG$Gb))
for(i in 1:14) {lr[,i]<-dyeswap[i]*lr[,i]}
r0<-length(lr[1,])
RX<-tapply(lr[,1],as.factor(GeneList),mean)
r1<-length(RX)
M<-matrix(nrow=r1,ncol=r0)
M[,1]=RX
for(i in 1:14) {M[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}

The "1:14" in the for loop designates how many target files there are. This also has to be altered for the data from each strain.

la<-(1/2*log2((RG$R-RG$Rb)*(RG$G-RG$Gb)))
r3<-length(la[1,])
RQ<-tapply(la[,1],as.factor(GeneList),mean)
r4<-length(RQ)
A<-matrix(nrow=r4,ncol=r3)
A[,1]=RQ
for(i in 1:14) {A[,i]<-tapply(la[,i],as.factor(GeneList),mean)}
for(i in 1:14) {plot(A[,i],M[,i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15))}

The "1:14" in the for loop designates how many target files there are. This also has to be altered for the data from each strain. The y and x limits of the plots have been kept the same for all of the strains, as decided upon after looking at the data. However, these limits may need to be changed with new data. The first line in the above set of code was used to find the average log intensity (A) for each spot. All of the duplicate spots (spots on the microarray corresponding to the same gene) were averaged using the second to last line of code.
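The averaging of duplicate spots can be sketched on made-up values (not the lab's data): tapply groups the values by gene name and returns one mean per unique gene.

```r
# Two genes, each spotted twice on a hypothetical chip.
GeneList <- c("YFL021W", "YER040W", "YFL021W", "YER040W")
lr <- c(1.0, 2.0, 3.0, 4.0)   # made-up log2 ratios for one chip
# Average the duplicate spots for each gene.
avg <- tapply(lr, as.factor(GeneList), mean)
avg[["YFL021W"]]   # (1.0 + 3.0) / 2 = 2
```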

This next set of code was used to create an MA plot of the data for each strain after within array normalization was applied. For within array normalization, the controls were retained.

par(mfrow=c(4,4))
MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
n0<-length(MA$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:14) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}
write.table(MM,"New_wt_WANorm_Matrix.csv",sep=",")
X1<-tapply(MA$A[,1],as.factor(MA$genes[,5]),mean)
y0<-length(MA$A[1,])
y1<-length(X1)
AA<-matrix(nrow=y1,ncol=y0)
AA[,1]=X1
for(i in 1:14) {AA[,i]<-tapply(MA$A[,i],as.factor(MA$genes[,5]),mean)}
for(i in 1:14) {plot(AA[,i],MM[,i],ylab="M",xlab="A",ylim=c(-5,5),xlim=c(0,15))}

Before between array normalization could be applied, the .csv file that was written had to be edited. The first two rows, corresponding to the controls, were deleted. In addition, the Master Index was also deleted.

MA2<-read.csv("New_wt_WANorm_Matrix.csv",header=TRUE,sep=",")
MA3<-as.matrix(MA2)
MAB<-normalizeBetweenArrays(MA3,method="scale",targets=NULL)
MAC<-as.matrix(MAB)
write.table(MAC,"New_wt_BANorm_Matrix.csv",sep=",")

The first line of the above set of code read the edited .csv file (with the controls removed) back into R. The second line of code transformed this data into a matrix so that between array normalization could be performed.

In addition to normalizing the data and creating the MA plots, boxplots of the average log fold change were also created for each strain.

par(mfrow=c(1,3))
targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RG<-read.maimages(targets, source="genepix.median", wt.fun=f)
dyeswap<-c(1,-1,-1,1,-1,-1,1,-1,1,-1,-1,1,-1,-1)
GeneList<-RG$genes$Name
lr<-log2((RG$R-RG$Rb)/(RG$G-RG$Gb))
for(i in 1:14) {lr[,i]<-dyeswap[i]*lr[,i]}
r0<-length(lr[1,])
RX<-tapply(lr[,1],as.factor(GeneList),mean)
r1<-length(RX)
RR<-matrix(nrow=r1,ncol=r0)
RR[,1]=RX
for(i in 1:14) {RR[,i]<-tapply(lr[,i],as.factor(GeneList),mean)}
boxplot(RR[,1],RR[,2],RR[,3],RR[,4],RR[,5],RR[,6],RR[,7],RR[,8],RR[,9],RR[,10],RR[,11],RR[,12],RR[,13],RR[,14],ylim=c(-5,5))
MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
n0<-length(MA$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:14) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}
boxplot(MM[,1],MM[,2],MM[,3],MM[,4],MM[,5],MM[,6],MM[,7],MM[,8],MM[,9],MM[,10],MM[,11],MM[,12],MM[,13],MM[,14],ylim=c(-5,5))
MA2<-read.csv("New_wt_WANorm_Matrix.csv",header=TRUE,sep=",")
MA3<-as.matrix(MA2)
MAB<-normalizeBetweenArrays(MA3,method="scale",targets=NULL)
boxplot(MAB[,1],MAB[,2],MAB[,3],MAB[,4],MAB[,5],MAB[,6],MAB[,7],MAB[,8],MAB[,9],MAB[,10],MAB[,11],MAB[,12],MAB[,13],MAB[,14],ylim=c(-5,5))

Again, before between array normalization could occur, the .csv file containing the within array normalized data minus the controls had to be imported into R.

All of the data generated after within array normalization and between array normalization for all strains was compiled into one Excel file. For each time point for each strain, the average log fold change was found by averaging the log fold change for all of the flasks within the time point. The standard deviation was found for each time point for each strain. The t statistic was calculated for each time point for each strain using the formula

=AVERAGE(range of cells)/(STDEV(range of cells)/SQRT(number of replicates)).

The p value was calculated for each time point for each strain using the formula

=TDIST(ABS(cell containing T statistic),degrees of freedom,2).
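The Excel formulas above have a base R analogue using pt(), the t distribution CDF. As a minimal sketch with made-up replicate values (not the lab's data):

```r
# Hypothetical log2 fold changes for one gene at one time point (4 flasks).
x <- c(0.8, 1.1, 0.9, 1.2)
n <- length(x)
# t statistic: =AVERAGE(range)/(STDEV(range)/SQRT(number of replicates))
t.stat <- mean(x) / (sd(x) / sqrt(n))
# two-tailed p value: =TDIST(ABS(t statistic), degrees of freedom, 2)
p.value <- 2 * pt(-abs(t.stat), df = n - 1)
```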

June 28, 2011
Using all of the between array normalized data generated yesterday along with the statistics calculated yesterday, new Excel spreadsheets were created for the deletion strains to input into MATLAB. These spreadsheets followed the same format as discussed during week 5. While the production rates and degradation rates remained the same, the log fold changes and the concentration sigmas had to be changed in accordance with the new normalized data. The alpha value remained 0.002. These spreadsheets have yet to be created for the wildtype because we are having issues performing between array normalization for the GCAT chips for the wildtype.

MATLAB simulations were begun for dZAP1 and dHMO1 for the 15 gene direct network generated by YEASTRACT. Unexpectedly, the simulation for dHMO1 took much longer than the simulation for dZAP1.

Katrina Sherbina 20:46, 28 June 2011 (EDT)

July 5, 2011
To minimize chances for human error, the procedure for normalizing all of the GPRs was altered. Rather than normalizing the wildtype and the four deletion strains separately, all of the Ontario chip GPRs were normalized at once. In order to do so, a single target file first had to be created listing all of the GPRs corresponding to all of the Ontario chips from the wildtype and deletion strains. The GCAT chips were still normalized separately, and the target files previously created for the top and bottom of each of the chips were used. Also, a .csv file consisting of a single column of 1's and -1's was created along with the targets file for the Ontario chips to indicate which chips were dye-swapped. The dye-swapped chips were multiplied by -1. In addition, after performing within array normalization on the Ontario and GCAT chips separately, a new .csv file was created so that the Ontario and GCAT chips could be normalized between arrays in one step. Also, rather than dyeswapping all of the appropriate chips by hand in Excel, a few lines of code were added so that R could perform the dyeswapping.

The code dealing with the Ontario chips up through within array normalization is as follows.

targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RG<-read.maimages(targets, source="genepix.median", wt.fun=f)
MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")

M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
n0<-length(MA$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}
 * 1) Create a blank matrix which has as many rows as there are unique genes and as many columns as there are GPR files.
 * 2) Average all of the replicate spots by using the tapply function. Fill in the matrix previously created with these values.

ds<-read.csv("dyeswap_matrix.csv",header=TRUE,sep=",")
MN<-matrix(nrow=6191,ncol=94)
for (i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}
 * 1) Import the .csv file denoting which chips are dyeswapped.
 * 2) Create a new matrix.
 * 3) Multiply the values in the matrix containing the averaged replicate log fold changes by the dyeswap list created.

write.table(MN,"ont_WANorm_dyeswapped_Matrix.csv",sep=",")
 * 1) Write the data to a table.

Open up the exported table in Excel and delete the first two rows of values corresponding to the controls.

To within array normalize the GCAT chips,

targets<-readTargets(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(targets, source="genepix.median", wt.fun=f)
targets<-readTargets(file.choose())
RB<-read.maimages(targets, source="genepix.median", wt.fun=f)
RG<-rbind(RT,RB)
 * 1) Read the top and bottom chips into R separately and rbind them to model the data.frame after the Ontario chips, with 14000 spots in each column.

MA<-normalizeWithinArrays(RG,method="loess", bc.method="normexp")
 * 1) Normalize within each array.

M<-matrix(nrow=6403,ncol=9)
for(i in 1:9) {M[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,4]),mean)}
 * 1) Use the tapply function to average duplicate spots so that only unique spots remain in the data.

write.table(M,"GCAT_WANorm_rep.csv",sep=",")
 * 1) Write the table.

Before between array normalization is performed, the exported table with the within array normalized GCAT chips must be put into Microsoft Access to eliminate any entries (genes) that are not within the Ontario chips. Follow the procedure outlined in the June 1, 2011 entry within this computational journal. Rather than importing all of the wildtype Ontario chip data into Microsoft Access, it is sufficient to import just a list of the Ontario chip IDs in the form of a single-column .csv file. Once the GCAT data has been edited in Access, the within array normalized data for the GCAT chips must be combined with the within array normalized data for the Ontario chips into one .csv file. Then, between array normalization is performed as follows.

MA2<-read.csv("ALL_WANorm_dyeswapped_Matrix.csv",header=TRUE,sep=",")
MA3<-as.matrix(MA2)
MAB<-normalizeBetweenArrays(MA3,method="scale",targets=NULL)
MAC<-as.matrix(MAB)
write.table(MAC,"ALL_BANorm_dyeswapped_Matrix.csv",sep=",")
 * 1) Read the .csv file containing the within array normalized data for all of the chips into R.
 * 2) Force the .csv file into a matrix.
 * 3) Perform between array normalization.
 * 4) Write the data to a table.

All of this code was run by both myself and my coworker individually. There were some discrepancies between the data we each exported after within array normalization and between array normalization. However, our values differed by at most a quantity on the order of 1E-10. When we each repeated the process individually, there was no difference between the repeated output and our own initial output.
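Discrepancies on the order of 1E-10 can plausibly arise simply from floating point arithmetic being performed in a different order (double-precision addition is not associative), rather than from an error in the procedure. A minimal illustration:

```r
# The same three numbers summed in two different orders.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)
a == b       # FALSE: the two sums differ in their last bits
abs(a - b)   # a tiny nonzero difference (on the order of 1e-16)
```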

We still need to regenerate all of the MA plots and all of the boxplots by making some additions to the aforementioned lines of code.

Katrina Sherbina 20:55, 5 July 2011 (EDT)

July 11, 2011
A protocol was written up to perform within array normalization and between array normalization as previously, with the exception that all of the Ontario chip GPR files are treated at once rather than separately by deletion strain. This protocol is currently up on the Dahlquist Lab OpenWetWare page. However, in looking further into between array normalization, it was questioned whether or not this was the proper way to scale the data. It is not clear whether R finds the MAD of all of the chips together or the MAD of each individual chip. As a result, we tried to program R to scale each GPR by its own MAD rather than the MAD of all of the GPRs. The following line of code was used.

for (i in 1:15) {MS[,i]<-MX2[,i]/mad(MX2[,i],center=median(MX2[,i]),na.rm=FALSE,low=FALSE,high=FALSE)}

The scaled data was plotted against the data scaled using between array normalization. It was found that the difference between the data scaled with the above line of code and the data scaled by between array normalization is a scale factor of approximately 3.
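The per-chip scaling idea can be sketched on made-up values (not the lab's data). Note that R's mad() by default centers on the median and multiplies by the constant 1.4826 for consistency with the normal distribution; after dividing each column by its own MAD, every column has a MAD of 1.

```r
# Toy data: two "chips" with very different spreads.
set.seed(1)
MX2 <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 4))
MS <- MX2
# Scale each column (chip) by its own MAD.
for (i in 1:2) {MS[, i] <- MX2[, i] / mad(MX2[, i])}
# Both columns now have a MAD of 1 (up to floating point error).
```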

Another problem that we began to address today was how to keep the genes as row names and the strain, flask, and timepoint as column names within R so that the row names and the column names do not have to be added after outputting the normalized data from R. First, we created a .csv file with three columns: the first listing a sample of the GPR files, the second giving the strain, flask, and timepoint corresponding to each GPR, and the third containing a "1" or a "-1" depending on whether the chip was dye-swapped. After importing the file, we used the following code to assign column and row names through within array normalization and averaging of replicates.

targets<-read.csv("GPR_and_column_headers.csv")
f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(targets, source="genepix.median", wt.fun=f)
 * 1) Read in file with GPR names, "user-friendly" names, and dyeswap.

row<-read.csv(file.choose())
row2<-as.matrix(row)
col<-targets[,2]
ds<-targets[,3]
 * 1) Set from what part of the file the row names are taken.
 * 2) Set from what part of the file the column names are taken.
 * 3) Designate where the dyeswap list comes from.

MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")
 * 1) Normalize Within Arrays

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean)
n0<-length(MAO$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:15) {MM[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}
 * 1) Set up a for loop to average replicates.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean)
n0<-length(MAO$M[1,])
n1<-length(M1)
MN<-matrix(nrow=n1,ncol=n0)
MN[,1]=M1
for (i in 1:15) {MN[,i]<-ds[i]*MM[,i]}
 * 1) Set up a for loop to dyeswap the necessary GPRs.

MX<-as.data.frame.matrix(MN)
rownames(MX)<-row2
colnames(MX)<-col
 * 1) Turn the matrix into a data frame.
 * 2) Label the rows and columns.

MX2<-MX[c(3:6191),]
 * 1) Delete the rows with the controls.

Thus, by turning the output of within array normalization into a data frame, we could assign column names and row names within R. However, we cannot, at least for the moment, find a way to use similar code to retain column and row names after the data has been scaled by the MAD. We are concerned that, even using a data frame, we cannot be confident that we can simply reinsert the same column names and row names after scaling has occurred, because we are not sure whether the scaling might rearrange the gene order.
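A small check on this concern (with made-up names and values, not the lab's data) suggests that scaling the columns of a data frame in place leaves the row order and row names untouched, since each column is transformed element-wise:

```r
# Hypothetical two-chip data frame with gene row names.
df <- data.frame(chip1 = c(2, -4, 6), chip2 = c(1, 3, -5),
                 row.names = c("geneA", "geneB", "geneC"))
scaled <- df
# Scale each column by its own MAD, assigning back into the data frame.
for (i in 1:2) {scaled[, i] <- df[, i] / mad(df[, i])}
rownames(scaled)   # still "geneA" "geneB" "geneC", in the original order
```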

Katrina Sherbina 20:17, 11 July 2011 (EDT)

July 12, 2011
The code in the previous entry was further refined today. The controls were taken out earlier, namely right after the replicates were averaged. Then, the dyeswaps were performed using simpler code.

MM2<-MM[c(3:6191),]
 * 1) Delete the rows with the controls.

rows<-length(MM2[,1])
cols<-length(MM2[1,])
MN<-matrix(nrow=rows,ncol=cols)
for (i in 1:15) {MN[,i]<-ds[i]*MM2[,i]/mad(MM2[,i])}
 * 1) Set up a for loop to dyeswap the necessary GPRs.

In addition, the code to turn the resulting matrix after dyeswapping into a data frame was slightly altered. A "row3" was specified using the code

row3<-row2[c(3:6191)]

and this "row3" was used to define the row names of the data frame of dyeswapped, within array normalized data.

Also, we succeeded in preventing the column headers from being shifted over to the left when outputting the normalized data from R as a .csv file. The following code was used.

write.table(MX,"MAD_scaling_rep.csv",sep=",",col.names=NA,row.names=TRUE)

In addition, we began to look into how to get rid of GCAT data that does not correspond to genes on the Ontario chips in R rather than in Access as we have done thus far. First, within array normalization and averaging of replicates were performed as recorded in the journal entry for July 5th, with the exception that the matrix for the tapply function was not hard coded. Rather, the following matrix was created first and then the tapply function was used.

m1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
g0<-length(MAG$M[1,])
g1<-length(m1)
M<-matrix(nrow=g1,ncol=g0)
M[,1]=m1
for(i in 1:9) {M[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

Then the resulting matrix was converted to a data frame and the row and column names were labeled with the GCAT gene names and the GPR file names, respectively.

GCATID<-read.csv(file.choose(),header=TRUE,sep=",")
GCATNames<-GCATID[,1]
r<-as.matrix(GCATNames)
GPR<-as.matrix(RGG$targets)
col<-GPR
M2<-as.data.frame.matrix(M)
rownames(M2)<-r
colnames(M2)<-col
 * 1) Read in the GCAT IDs (GCAT_ID.csv).
 * 2) Define the columns.
 * 3) Convert to a data frame.

Then, the GCAT data was filtered so that the data for the genes on the GCAT chips that were not located on the Ontario chips was taken out. In order to do so, first a list of Ontario gene names was imported and this list was used to filter the data for the GCAT chips.

ontID<-as.data.frame(read.csv(file.choose(),header=TRUE,sep=","))
 * 1) Read ontario ID (Ont_Chip_GeneID.csv)

data<-M2
rownames(data)<-GCATID[,1]
names.to.keep<-ontID[,1]
GO<-subset(data,rownames(data) %in% names.to.keep)
 * 1) Filter GCATs.

The data for the GCAT genes not on the Ontario chips were deleted. However, rows corresponding to Ontario genes for which the GCAT chips did not have data were also deleted from the filtered data. Therefore, instead of producing the desired output with 6189 rows (as has been produced in Access, which keeps rows for the Ontario genes that the GCAT chips do not have data for), the output contained only 6136 rows. As a result, we experimented with having R find the Ontario chip genes that were not on the GCAT chips. We used the following lines of code.

ont1<-subset(ontID,Ontario_ID!="Arabidopsis")
ont2<-subset(ont1,Ontario_ID!="3XSSC")
OntNames<-ont2[c(1:6403),]
ONT<-subset(OntNames,!OntNames %in% GCATNames)
 * 1) Filter Ontario chips.

Then, we tried to bind this list of Ontario gene names not on the GCAT chips ("ONT") to the filtered within array normalized GCAT data ("GO"). We ran into errors using the rbind function to do this. Rather than adding a set of rows for the Ontario genes not on the GCAT chips, a single row of numbers was added to the end of the filtered within array normalized GCAT data. We tried to manipulate "ONT" to have as many rows and columns as "GO" and to label the columns of "ONT" with the names of the GCAT GPR files to see if this would enable "ONT" and "GO" to be merged using rbind. However, this was also not successful.
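One requirement of rbind on data frames is that both arguments share the same column names. A minimal sketch (hypothetical gene and GPR names, not the lab's data) of padding the missing genes with NA rows under matching column names:

```r
# Filtered normalized data for genes the GCAT chips do have (made-up values).
GO <- data.frame(gpr1 = c(0.5, -0.2), gpr2 = c(0.1, 0.3),
                 row.names = c("geneA", "geneB"))
# One Ontario gene missing from the GCAT chips gets a row of NAs.
new2 <- as.data.frame(matrix(nrow = 1, ncol = 2))
rownames(new2) <- "geneC"
colnames(new2) <- colnames(GO)   # column names must match for rbind to work
final <- rbind(GO, new2)
nrow(final)   # 3
```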

Katrina Sherbina 20:22, 12 July 2011 (EDT)

July 13, 2011
The code for the GCAT chips listed in the previous two entries was altered.

First, rather than importing two separate text files (one for the top of the GCAT chips and one for the bottom of the GCAT chips), one .csv file was created and imported with a column indicating whether the GPR file corresponds to the top or bottom of the chip; a column listing all of the GPR files; and a column naming the GPR files with the strain, timepoint, and flask. This file was separated into the tops of the GCAT chips and the bottom of the GCAT chips. Then, the two were combined using the rbind function.

targets<-read.csv(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RT<-read.maimages(targets[1:9,2], source="genepix.median", wt.fun=f)
RB<-read.maimages(targets[10:18,2], source="genepix.median", wt.fun=f)
RGG<-rbind(RT,RB)

Within array normalization was performed using the same code as before. Duplicate spots were averaged using the same lines of code as before, with the exception that the line M[,1]=m1 was taken out.

Then, the GCAT gene names were read in and assigned to a variable to later use to set the row names. Also, the GPR file names were assigned to a variable to later use to set the column names.

GCATID<-read.csv(file.choose(),header=TRUE,sep=",")
GCATNames<-as.matrix(GCATID[,1])
GPR<-as.matrix(targets[1:9,3])

The matrix M (containing the data with averaged replicates) was converted to a data frame, and the GCAT gene names were set as the row names while the GCAT GPR file names were set as the column names.

M2<-as.data.frame.matrix(M)
rownames(M2)<-GCATNames
colnames(M2)<-GPR[,1]

As written in the previous journal entry, the .csv file with the Ontario gene names was read in. This list of names was filtered to remove the controls. Then, a subset of the filtered Ontario gene names was taken, corresponding to the Ontario genes for which the GCAT chips do not have data.

The code to filter the within array normalized GCAT data so that it only contains Ontario gene names was slightly altered to reduce unnecessary lines of code.

rownames(M2)<-GCATNames
names.to.keep<-ontID[,1]
GO<-subset(M2,row.names(M2) %in% names.to.keep)

Then, the aforementioned subset of the filtered Ontario gene names was successfully merged with the filtered within array normalized GCAT data. First a matrix was created with as many rows as the number of Ontario gene names that the GCAT chips did not have data for and with as many columns as GCAT GPR files. This matrix was converted to a data frame and the rows were labeled with the Ontario genes not on the GCAT chips while the columns were labeled with the GPR file names. Then, this data frame was merged to the filtered within array normalized GCAT data frame using rbind.

new<-matrix(nrow=length(ONT[,1]),ncol=length(GPR[,1]))
new2<-as.data.frame.matrix(new)
rownames(new2)<-ONT$Ontario_ID
colnames(new2)<-GPR[,1]
final<-rbind(GO,new2)

The rows in the data frame were sorted so that the gene names would be listed in alphabetical order.

final.sort<-final[order(row.names(final)),]

Then, the data was scaled by the MAD. First, a matrix was created with the appropriate number of rows and columns and filled with the data scaled by each chip's MAD. This matrix was then converted to a data frame, the rows of which were labeled with the row names from the final.sort data frame and the columns of which were labeled with the GPR file names.

r<-length(ont2$Ontario_ID)
col<-length(GPR[,1])
MADM<-matrix(nrow=r,ncol=col)
for (i in 1:9) {MADM[,i]<-final.sort[,i]/mad(final.sort[,i],na.rm=TRUE)}
GMAD<-as.data.frame.matrix(MADM)
rownames(GMAD)<-row.names(final.sort)
colnames(GMAD)<-GPR[,1]

A new .csv file was created for all of the Ontario chips listing the GPR file names in one column; the GPR files designated by strain, timepoint, and flask in another column; and whether or not the chip was dye-swapped in a third column. This .csv file was imported into R. Then, code similar to that in the entries for July 11th and July 12th was used to within array normalize and scale normalize the data.

The following code was used to read in the aforementioned .csv file.

targets<-read.csv(file.choose())
f<-function(x) as.numeric(x$Flags > -99)
RGO<-read.maimages(targets[,1], source="genepix.median", wt.fun=f)

Then, a .csv file with the Ontario gene names was read in to later use for the row names. Also, the portion of the targets file with the strain, timepoint, and flask was assigned to a variable to later use for the column names. The portion of the targets file with the dyeswap list was also assigned its own variable.

row<-read.csv(file.choose())
column<-targets[,2]
ds<-targets[,3]

Then, the data was within array normalized with the same code as before.

MAO<-normalizeWithinArrays(RGO, method="loess", bc.method="normexp")

Then the replicates were averaged using very similar lines of code with the exception that the for loop was designated for 1:94, corresponding to the 94 Ontario GPR files.

M1<-tapply(MAO$M[,1],as.factor(MAO$genes[,5]),mean)
n0<-length(MAO$M[1,])
n1<-length(M1)
MM<-matrix(nrow=n1,ncol=n0)
MM[,1]=M1
for(i in 1:94) {MM[,i]<-tapply(MAO$M[,i],as.factor(MAO$genes[,5]),mean)}

The above matrix (MM) was converted to a data frame, the rows of which were labeled with the Ontario gene names and the columns of which were labeled with the strain, timepoint, and flask designations of the GPR files.

MM2<-as.data.frame.matrix(MM)
rownames(MM2)<-row$Ontario_ID
colnames(MM2)<-column

Then the controls were removed.

MM3<-subset(MM2,row.names(MM2)!="Arabidopsis")
MM4<-subset(MM3,row.names(MM3)!="3XSSC")

Then the GPR files that needed to be dyeswapped were dyeswapped. Simultaneously, each GPR file was scaled by its own MAD.

rows<-length(MM4[,1])
cols<-length(MM4[1,])
MN<-matrix(nrow=rows,ncol=cols)
for (i in 1:94) {MN[,i]<-ds[i]*MM4[,i]/mad(MM4[,i])}

Then this dyeswapped and scaled matrix was turned into a data frame. The rows were labeled using the same row names as in the data frame (MM4) with the controls removed. The columns were labeled with the strain, timepoint, and flask designations of the GPR files.

MX<-as.data.frame.matrix(MN)
rownames(MX)<-row.names(MM4)
colnames(MX)<-column

The resulting data frames from the within array normalization and scale normalization of the Ontario chips (MX) and from the within array normalization and scale normalization of the GCAT chips (GMAD) were then merged.

merged<-cbind(GMAD,MX)

Katrina Sherbina 20:25, 13 July 2011 (EDT)