Computational Journal

May 20, 2011
After attempting to find a match between each .gpr file and the individual LogFC columns in the original raw data some problems occurred. A single .gpr file was chosen to run a few tests on to try and match with one of the columns. The file that was selected in this case was t15/t0 for flask 3 of the Dahlquist wildtype strain. Flask 3 was chosen for a dye swap of Red/Green to Green/Red, however, when computing the LogFC for each spot on the chip, it was discovered that the LogFC was performed using the Red intensity divided by the Green intensity. This will have to be confirmed with Dr. Dahlquist to see if this should be changed at all or if the Excel sheets obtained from May 19th, 2011, a journal for this day does not exist. However, using Red/Green for the fold change, a column was obtained that matched the LogFC originally performed in the .gpr file. The next step was to try and perform this in the statistics processing software R. The original code given was changed to focus on the median, due to the fact that R was formatted to process the mean foreground intensity over the median background intensity. Once this was done, R successfully created a list of LogFC, of which the first five were focused on to compare to the original .gpr file. Of these five, two were duplicates giving three individual genes to be focused on in the file. The first five data points given by R matched up with the first five genes given in the .gpr file, so we know that this was functioning correctly. The main problem that arose was that when the LogFC given by the .gpr file and R and the LogFC found in the raw data were compared, an inconsistency occurred. These numbers did not match, and after running a seperate test on the .gpr file for t30/t0 for the same culture, the same problem occurred.

May 25, 2011
Spent the morning trying to match up the Log Fold Changes before global normalization and after global normalization from the .gpr files. Ideally the graph should have a linear trend line if global normalization had subtracted the same constant from every spot. However, some of the files output a fuzzy region around the origin, this can be seen in the separate powerpoint with the graphs. The individual red/green intensities were graphed next. The normalized red intensities was graphed against the non-normalized red intensities and the same for the green intensities. These graphs should all be a linear line as well, however some awkward points occurred. The fuzzy region from the LogFC graph is thought to be a resultant from the "error" messages that are scattered throughout the .gpr files. These are a result of the intensity ratio being negative and the Log of any negative number does not exist. The number of "error" messages in each file seemed to correlate with the fuzzy section around the origin, as the number of "error"'s went up, the fuzziness increased to an extent were it almost was not linear anymore.

May 26, 2011
This morning was spent troubleshooting a line of code given by Dr. Fitzpatrick, M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean), meant to average the duplicate spots in the .gpr files after they were locally normalized using loess. We had to come back later fix this code using a for loop so that R would normalize all of the .gpr files instead of the one we had specified, 5 given that that was the first column with gene names when we were working on this yesterday. However, this was a .gpr file so the first four columns were just 1's, were as the MA data frame that was created did not incoporate these columns into the data frame. Once this was all figured out, I went to lunch and to check on the S. cerevisiae cultures (see lab notebook for details). Once we got back, we spent the remainder of the time trying to make sure the loess normalization worked, after talking with Dr. Fitzpatrick for a bit we switched our determined this was best done by graphing the normalized vs. nonnormalized log fold changes, after they had both been averaged using the tapply function. We spent a short amount of time trouble shooting this, mostly clerical errors when typing the source code, after which we were able to graph the relationship, which came out to be a relatively linear line, showing that loess normalization was working correctly. With this we are able to go back and do the loess normalization on the data given by the rest of strains.

May 27, 2011
We ran into a few problems at our first attempts at running the code even though we had had successful previous attempts on the previous day. The main problem we found was for the first for loop, the MM must have the [,i] afterwards, or else the output sheet will only be one column, and not all the columns that were put into the loop. After a few more typos that were taken care of this was the outcome:

par(mfrow=c(2,1)) targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RG<-read.maimages(targets, source="genepix.median", wt.fun=f) plotMA(RG,main="before within array normalization") MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:20) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)} write.table(MM,"wt_M_avg.csv",sep=",") plotMA(MA,main="after within array normalization") MAB<-normalizeBetweenArrays(MA,method="scale",targets=NULL) b0<-length(MAB$M[1,]) N2<-tapply(MAB$M[,1],as.factor(MAB$genes[,5]),mean) b1<-length(M1) MMB<-matrix(nrow=b1,ncol=b0) MMB[,1]=N2 for(i in 1:9) {MMB[,i]<-tapply(MAB$M[,i],as.factor(MAB$genes[,5]),mean)} write.table(MMB,"between_array_norm.csv",sep=",")

The point of all this code is to take the .gpr files that we were given, put them into the R statistics database, create an MA plot for before normalization and then normalize them using the locally weighted scatterplot smoother (Loess), note the e in loess may stand for estimated seeing as how the we in lowess stands for weighted. After this was done the data was placed into a matrix, because R has trouble translating a MAlist to a data frame, this matrix was then averaged for every duplicate gene using the line of code discussed on May 26th. This was all written into an excel table and saved along with the accompanying picture of the MAplots. This data was then put through a between array normalization, placed into a separate matrix, boxplots were done of the before and after between array normalization, these were all saved. Some notes on the code is that all the for loops must be changed for the number of .gpr files that are put into the target text file. Work is still progressing on the GCAT chips, first of all, the flask numbering for the GCAT chip as well as flask 4 must be double checked because of some discrepancies as well as a want to not have to run all the data twelve times. However, I was able to place both targets onto R, normalize them, and merge them into a single MAlist. This can be done for the RG list before, at the moment, I am only testing this code to make sure it will accomplish what I need by testing segments and checking the attributes of the outcomes. Once the flask discrepancies can be sorted out with Dr. Dahlquist, a majority of the "New Master List" can be assembled and the GCAT can be focused on from there.

June 1, 2011
How to use Microsoft Access to match tables:
 * 1) Open microsoft access
 * 2) Make sure all columns are labeled.
 * 3) Save desired files as tab delimited text files.
 * 4) File -> Get External Data -> Import your text files.
 * 5) Choose ID's to be Field Name
 * 6) Check first box for first row is field names
 * 7) ID column can be text and is ok with duplicates, will kick it back out if it doesn't like it.
 * 8) All other columns should be numbers
 * 9) Repeat for other text file.
 * 10) Choose own access key
 * 11) Create a new query and add both text files
 * 12) Click and drag from one ID field name to the other
 * 13) Right click on line and choose join properties
 * 14) Choose option three, to keep all "Ontario" data and add only new "GCAT" data, arrow should appear showing relationship.
 * 15) Select all the field names and click and drag them to Field on the Query, it will automatically assign them.
 * 16) Query -> Make-Table Query and name your table
 * 17) Click run to get output, select all and copy and paste values into an excel file, make sure format is correct

Spent today working through the GCAT chips, once this all got finished, Dr. Dahlquist showed us how to us Microsoft Access to combine the GCAT and Ontario chips, as well as how to ignore any extra genes on the GCAT chips. After this I spent the afternoon working on constructing a powerpoint with all the MA plots and boxplots as well as a process description and the code. The validation graphs still need to be graphed and placed on the spreadsheet, I will try and get this done tonight so it is ready for June 2nd.

June 2, 2011
Spent the day going over our results with Dr. Fitzpatrick to fully understand them and make sure everything was scaled correctly. We discovered that the MA plots that were originally outputted by us through R are in fact only the first .gpr file for each strain. As we went back to fix all of those and get all of the graphs on the same page. This was solved by running a for loop for our MA plot maker:

for(i in 1:20) {plotMA(MA[,1])}

Next we decided to match up all the boxplots next to each other. This was a particularly tough time and upsetting once we made the discovery that we had to write out the whole code:

boxplot(MA$M[,1],MA$M[,2],MA$M[,3],MA$M[,4],MA$M[,5],MA$M[,6],MA$M[,7],MA$M[,8],MA$M[,9],MA$M[,10],MA$M[,11],MA$M[,12],MA$M [,13],MA$M[,14],MA$M[,15],MA$M[,16],MA$M[,17],MA$M[,18],MA$M[,19],MA$M[,20])

We made a correction to this code, changed the MA$M to an MAB$M in my case, to plot out the after within array normalization MA plots. After this we attempted to go through the "multtest" tutorial that Dr. Fitzpatrick had given us but ran into a few problems installing the packages. These problems persisted even after we thought we had solved them all, however, Dr. Dahlquist should have the mean to help us install these packages. The last piece of code that I generated was the boxplots for the before within array normalization data so we could see the centering effects. This was probably the most annoying code I have written, thankfully I only made one error which was easily corrected. The code is:

boxplot(log2((RG$R[,1]-RG$Rb[,1])/(RG$G[,1]-RG$Gb[,1])),log2((RG$R[,2]-RG$Rb[,2])/(RG$G[,2]-RG$Gb[,2])),-log2((RG$R[,3]-RG$Rb[,3])/(RG$G[,3]-RG$Gb[,3])),-log2((RG$R[,4]-RG$Rb[,4])/(RG$G[,4]-RG$Gb[,4])),log2((RG$R[,5]-RG$Rb[,5])/(RG$G[,5]-RG$Gb[,5])),log2((RG$R[,6]-RG$Rb[,6])/(RG$G[,6]-RG$Gb[,6])),-log2((RG$R[,7]-RG$Rb[,7])/(RG$G[,7]-RG$Gb[,7])),-log2((RG$R[,8]-RG$Rb[,8])/(RG$G[,8]-RG$Gb[,8])),log2((RG$R[,5]-RG$Rb[,9])/(RG$G[,9]-RG$Gb[,9])),log2((RG$R[,10]-RG$Rb[,10])/(RG$G[,10]-RG$Gb[,10])),-log2((RG$R[,11]-RG$Rb[,11])/(RG$G[,11]-RG$Gb[,11])),-log2((RG$R[,12]-RG$Rb[,12])/(RG$G[,12]-RG$Gb[,12])),log2((RG$R[,13]-RG$Rb[,13])/(RG$G[,13]-RG$Gb[,13])),log2((RG$R[,14]-RG$Rb[,14])/(RG$G[,14]-RG$Gb[,14])),-log2((RG$R[,15]-RG$Rb[,15])/(RG$G[,15]-RG$Gb[,15])),-log2((RG$R[,16]-RG$Rb[,16])/(RG$G[,16]-RG$Gb[,16])),log2((RG$R[,17]-RG$Rb[,17])/(RG$G[,17]-RG$Gb[,17])),log2((RG$R[,18]-RG$Rb[,18])/(RG$G[,18]-RG$Gb[,18])),-log2((RG$R[,19]-RG$Rb[,19])/(RG$G[,19]-RG$Gb[,19])),-log2((RG$R[,20]-RG$Rb[,20])/(RG$G[,20]-RG$Gb[,20])))

June 8, 2011
Spent the morning lecturing about statistics and the Family-Wise Error Rate(FWER) and the False Discovery Rate(FDR). After a quick intro to MTP we started working through R to get a better understanding of what MTP needed to process through R. This was probably the most tedious point of the day because the output in R consisted of thousands of NaN's(not a number). This was frustrating to say the least. Once Dr. Fitzpatrick looked at what was going wrong we were able to step through the errors and output the pvalues that we wanted, both raw and adjusted, for all the data as well as an average. We used this line of code to do this:

seed<-99 group<-c(rep(0,3)) m1<-MTP(X=matrix,Y=group,test="t.onesamp",standardize=FALSE,B=100,method="?") The method still needs to be changed for the FDR, without specifying it defaults to one we don't want. The matrix object in this line of code is just the GPR files saved as a CSV excel sheet and written into R. The test also needs to be one sample otherwise it defaults to a two sample t test. See the help section in R or type in ?MTP and look at the examples for more help. Tomorrow will be the real test to see if we can get this to work for the rest of the data points, we also need to compare the raw pvalues that are output with the calculated pvalues from our new normalized master spreadsheet to see how they correlate.

June 9, 2011
Spent most of the day waiting for R to output the adjusted p-values that we wanted. This took a good deal of time because there are 6000 different genes to calculate the p-value for. We adjusted the number of iterations so that we could get a more accurate result, because with a B=100, the p-values that come back are only shown up to two decimal places, while a B=1000 gives you three decimal places. Once all of these were calculated by R, we placed them into a spreadsheet to compare them to the p-values we calculated by hand. After this was done, I spent some time trying to graph the results and compare the number of rejects from each type of adjustment, I managed to get a graph but I am still researching to see if there is a way to specifiy our data some more and manage to have a better graph. Another thing that I need to research is the use of the TPPFP and the gFWER adjustments. As to what I need to learn, I need to figure out where they are applied, exactly why and in which situations, and how to implement these into R. Hopefully the materials sent by Dr. Fitzpatrick and the paper I need to read for Journal Club will give me a better understanding of what is going on.

June 10, 2011
Spent the day comparing the different MTP results using different adjustment methods. Attempting to see a correlation of the data however none has been seen yet. This may be due to the methods, however another and possibly more likely cause is that our calculations are not right. We need to make sure we are comparing the same strains, the same time points, and the same methods, otherwise there is a high possibility that no correlation will be seen between the excel calculated data and the R calculated data. After seeing little to no relationship between the data, we started compiling a spreadsheet consisting of the 15 genes/transcription factors that we are focusing on along with all their time points for every strain and the averages and standard deviations for every time point. After going through some of the data some errors were found, one we need to go back and refill our GLN3 data because we started out with only a fraction of the GPR files. We also need to compare the CIN5 values we were using to test our methods today with other CIN5 data, we managed to compare it to wild type data instead and took that as no correlation.

June 13, 2011
Spent the morning going over the Matlab model that I did this previous semester. After this large review we started making the individual input sheets for Matlab from the yeast regulatory sheet we made on Friday. After a few minutes of this we discovered a mistake in GCAT's in the Master data sheet. So all the Matlab files had to be erased and re-pasted with the corrected data. While going back and fixing the GCAT data we discovered a few more errors in the number of replicates for ZAP1 as well as extra rows due to the controls still being present in the spreadsheet. Just to make sure that nothing was wrong with the rest of the data, we went back and re made the master spreadsheet by combining all of the individual files we got from R. Once all this was done we had to re make the yeast regulatory network spreadsheet, with all of the averages and standard deviations. After this we were able to make the Matlab input sheets for all of the deletion strains. We will start on the Matlab tomorrow morning.

June 14, 2011
Spent the day reviewing the yeast regulatory network and the Matlab simulation that I focused on this past six months. After making another powerpoint with all of the expression graphs that Matlab outputs for each gene in the network, we went on to graph the relationship between the lse and the norm values at different alpha levels. The Matlab iterations took up most of the time for the workday so I did not spend a lot of time on the actual code, and the concepts were mostly a review for me.

June 15, 2011
This morning consisted of running more Matlab simulations to try and find the switching point on the lse vs norm graph. The wild type graph produced a smooth curve, however, the dGLN3 graph had a awkward jump at the alpha value of 0.002. We chose this point to focus on for more statistical analysis. The afternoon was spent finding the relative error between each alpha value and 0.002, this was done for the optimized production rates, optimized weights, and the optimized thresholds. Having to do the relative error for every alpha value in all three categories caused this step to take the rest of the day.

June 17, 2011
Spent the morning comparing the estimated production rates, network weights and network thresholds with the seperate p-values that we simulated the other day. After comparing the graphs and seeing that an alpha value of 0.002 was a good choice and that the trends in the graphs are similar, we decided to re-build the regulation networks and run the simulations with them. The afternoon consisted of building the 10 different excel files, one for each strain and 1 for both the old regulation matrix and the new one with the genes in the strain names added in to see the effect. The first simulation took the rest of the day and the rest will be done over the coming week.

June 20, 2011
In the morning we convened with Dr. Dahlquist to work out the details in the Matlab models that we are going through this next week. As we made the gene regulation networks in Yeastract we decided to figure out where Yeastract was getting these references for the relationships between the genes. As we went through the references given on the website, Dr. Dahlquist asked us to figure out what the different options in the gene regulation matrix menu were. The two options were to test with direct or indirect evidence, or to test for the number of binding sites on the promoters, anywhere from 1-9. The latter gave us less and less as the number of binding sites to specify increased, so we knew this wasn't right. We talked with Dr. Dahlquist about the direct and indirect options to decide on which to use, they both compare the given transcription factors and genes to a dataset, and decided that a direct set of matrices and a direct & indirect set of matrices were the best option. However, one thing we are still working is exactly where they are getting this dataset from. We have started searching the SGD, or Saccharomyces Genome Database, with any luck we will find it in here. We spent the afternoon on something different than the reference checking.

We put together more data sheets for the direct and the direct & indirect matrices. The first five matrices for the direct evidence matrix had already been put together, however, Dr. Dahlquist wanted us to check on the AFT1 synonym, RCS1, that we had been using in our list of genes and transcription factors to see if this would provide a change in the original direct matrix. This change was evident so we had to go back and recreate all of the original datasheets we had done by changing RCS1 to AFT1 in the list and reforming the matrix in Yeastract. Once these had all been corrected we spent the rest of the day creating the old 21 gene regulatory network matrix so that we could create the target excel sheets to put in Matlab. These corrections and the formatting for these sheets took the remainder of the day, we should start the simulations tomorrow in the Math PC computer lab.

July 5, 2011
Spent the morning going over the steps very thoroughly with Dr. Dahlquist and Dr. Fitzpatrick, so that we could iron out all the discrepancies in our process and have the best chance of avoiding a typo error in our code. We also started compiling the individual codes and tutorials for the GCAT and Ontario chips as well as the processes to output their graphs, both boxplots and MA plots. After going through and matching up our code so that we were as sure as possible that we would get consistent data we went through R and started outputting all of the data. However, we started finding some problems as we went through the code. Mainly that Katrina and I were getting different values, but only different by values of E-10. While this is not a huge difference, one would expect the same code and inputs to output the same data. The bigger problem came when we tried the process with the GCAT chips, of which Katrina was missing the GPR files and any segments of the code so I copied my own over to her flash drive. The same differences arose when we went through this data and this started to make me feel a bit uneasy, it got worse when I tried outputting my own data to see if R was just not computing each run the same, however the output I got matched up perfectly, in value and sign, to the first run of the data. These differences got even larger after between array normalization where the value of the differences increased to E-04. We took our time going through the code to ensure that no typos were present. We also checked the dye swap csv file and the target files to see if we had them all in the correct order, which we did. Currently we cannot find out where this problem is coming from, hopefully Dr. Fitzpatrick or Dr. Dahlquist will be able to weigh in and help. The code for the GCAT is as follows, the code for the Ontario chips will be located approximately 10 or so spaces below. The process for between array normalization is only located on the GCAT code. I added some descriptions of the steps just to break up the code so it is not overwhelming to look at.

>Read the top and bottom chips into R separately and rbind them >to model the data.frame after ontario chips, 14000 spots in each column targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RG<-rbind(RT,RB)

>normalize within each array MA<-normalizeWithinArrays(RG,method="loess", bc.method="normexp")

>tapply an average function to average duplicate spots so that only unique >spots remain in the data M<-matrix(nrow=6403,ncol=9) for(i in 1:9) {M[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,4]),mean)}

>write the table to put through microsoft access write.table(M,"GCAT_WA.csv",sep=",")

>read csv file back in once all logFC are in a single sheet, model after master data >then normalize between arrays B<-read.csv(file.choose,sep=",") BX<-as.matrix(B) BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

>write out as a table write.table(BA,"Master_data.csv",sep=",")

>Read Target Files into R and Loess Normalize targets<-readTargets(file.choose) f<-function(x) as.numeric(x$Flags > -99) RG<-read.maimages(targets, source="genepix.median", wt.fun=f) MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
 * Ontario Code

>Average Duplicate spots so that each GPR has only unique spots left M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>Read dye swap and perform dye swap on ontario chips ds<-read.csv(file.choose,sep=",") MN<-matrix(nrow=6191,ncol=94) for(i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}

>Write file into excel to delete controls (first two rows) write.table(MN,"Ont_WA_ds.csv",sep=",")

July 11, 2011
Today was spent researching how to create an index in R consisting of the gene names or just a simple numerical index, so that as the data moves through R, the labels are still present after each major step. So far I have managed to add a column of names and a row of headers to a data frame. There are still some problems arising as I move through. One being that a matrix converts every cell to a character instead of keeping the distinction of a character versus a numeric. The next major step is finding a line of code to keep these indices connected to their corresponding cells in R, we have found a few examples but have nothing to post on R for these at the moment, we are still working through getting the correct format down. One thing I might want to check on later is keeping everything as a matrix, because some functions have problems with data frames, however we may have moved past this problem with our change in the scale normalization step. We have changed the line of code from:

BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

to:

for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

More research will need to be done to compile a code for the new format of target files we have, that being a csv file with the corresponding columns:FileName,Header,Strain,TimePoint,Flask,Dyeswap. This is so that everything we need is present in R so that we won't make any human errors by adding typos to our data. Here is the compiled code I have so far:

>Read the target file, and name file, separate the index, dye swap, and row ID's Targets<-read.csv("New_Target.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] Index<-Names[,1] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:15) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M1) MN<-matrix(nrow=n3,ncol=n2) MN[,1]=M2 for(i in 1:15) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),]

bn<-readTargets("matrix_names.txt") b0<-length(MP) b1<-length(MP[,1]) BH<-matrix(nrow=b1,ncol=b0,dimnames=list(bn)) for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

July 12, 2011
Spent the morning working through the Ontario code and fixing a few bugs, we managed to get the names to stick and so we moved onto the GCAT chips. Katrina worked on getting the GCAT chips through R using the new segments of the code worked into the original. After reformatting the entire code a bit we have compiled a mostly finished segment of code however there is still one part to work out. That is getting R to perform what microsoft access did, that is get rid of all the GCAT spots we don't want, without having the actually touch the data yet. Katrina has been working on subsets for the remainder of the afternoon, while I was going through on my own to try and compile subsets I stumbled across a function that may do exactly what we want. That is the function, merge, will combine two data frames and match up the cells based on criteria you provide. However, this is where the hard part comes in, which is figuring out how to tell R the criteria. I have been getting an error, that tells me that the memory required for the outcome cannot be allotted in R. I have tried using less gpr files for the Ontario chips, seeing as how I was using all 94 originally, as well as saving the workspace so that the data is saved under in a separate folder. However, none of these have seemed to work yet, trying to fix the memory or R to allot more memory may be my best bet at this point, this will be explored more tomorrow, for now I will post the compiled code thus far: Note: The last four lines of code are currently being tempered with, they are not permanent and may not be present next time.

>Read the target file, and name file, separate the index, dye swap, and row ID's Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) MN[,1]<-M2*ds[1]/mad(M2) for(i in 2:94) {MN[,i]<-ds[i]*MM[,i]/mad(MM[,i])}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),] write.table(MP,"ONT_07122011.csv",sep=",",col.names=NA,row.names=TRUE)

>switch to GCAT directory and start on the GCAT chips targets<-readTargets("Top.txt") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets("Bottom.txt") f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RGG<-rbind(RT,RB)

>add headers and row names for GCAT Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[,2] Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) RR[,1]=R1 for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

NAM<-as.data.frame(Names[,2]) Q<-merge(P,GD,match(P$row.names,GD$row.names),all.NAM=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),by.MP=row.names(MP),by.GD=row.names(MP),all.MP=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),all.MP=all)