# Difference between revisions of "Computational Journal"

(→December: filled out december 6th journal entry)

m (→=December 6th, 2011)

Line 557:

end

− This compilation of code is currently in the process of being tweaked. The first difference from the original section is that outarray{i,4} should read: outarray{i,4}=A{ind2(i),2);

+ This compilation of code is currently in the process of being tweaked. The first difference from the original section is that outarray{i,4} should read:

+ outarray{i,4}=A{ind2(i),2};

+ The second is that the whole section of outarray{i,#}=A can be condensed into a single line:

+ outarray{i,3:7}=A{ind2(i),1:5};

+ The third piece of code that I am currently tweaking concerns A, which was originally a numerical matrix that would not group together with the rest of the columns. This was a problem, so A is now a cell array made from:

+ A = cellstr(num2str(q(:,1:5)));

+ While this did create a cell array in Matlab, it did not fully separate all of the columns, so several iterations of this code are being tried. At the moment this seems like the correct piece of code to use; however, when trying to split the columns up and place them individually, an error stating incorrect dimensions is produced and no further progress can be made.

[[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 20:34, 6 December 2011 (EST)

## Revision as of 18:35, 6 December 2011


## Week 1

### May 20, 2011

Some problems occurred while attempting to match each .gpr file to the individual LogFC columns in the original raw data. A single .gpr file was chosen for a few tests to try to match it with one of the columns: t15/t0 for flask 3 of the Dahlquist wildtype strain. Flask 3 was chosen for a dye swap of Red/Green to Green/Red; however, when computing the LogFC for each spot on the chip, it was discovered that the LogFC had been calculated using the Red intensity divided by the Green intensity. This will have to be confirmed with Dr. Dahlquist, along with the Excel sheets obtained on May 19th, 2011 (a journal entry for that day does not exist), to see whether it should be changed at all. However, using Red/Green for the fold change, a column was obtained that matched the LogFC originally given in the .gpr file. The next step was to try to perform this in the statistics software R. The original code was changed to use the median, because R was set up to process the mean foreground intensity over the median background intensity. Once this was done, R successfully created a list of LogFC values, of which the first five were compared to the original .gpr file. Of these five, two were duplicates, giving three individual genes to focus on in the file. The first five data points given by R matched the first five genes in the .gpr file, so we know that this was functioning correctly. The main problem was an inconsistency between the LogFC given by the .gpr file and R and the LogFC found in the raw data. These numbers did not match, and after running a separate test on the .gpr file for t30/t0 for the same culture, the same problem occurred.

## Week 2

### May 25, 2011

Spent the morning trying to match up the log fold changes before and after global normalization from the .gpr files. Ideally the graph should show a linear trend if global normalization had subtracted the same constant from every spot. However, some of the files produced a fuzzy region around the origin; this can be seen in the separate PowerPoint with the graphs. The individual red and green intensities were graphed next: the normalized red intensities were graphed against the non-normalized red intensities, and the same was done for the green intensities. These graphs should all be linear as well, but some awkward points occurred. The fuzzy region in the LogFC graph is thought to result from the "error" messages that are scattered throughout the .gpr files. These appear when the intensity ratio is negative, since the log of a negative number does not exist. The number of "error" messages in each file seemed to correlate with the fuzzy section around the origin: as the number of "error" messages went up, the fuzziness increased to the extent that the graph was almost no longer linear.
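The two effects described in this entry can be sketched in base R. This is only an illustration with made-up intensities (the vectors `red` and `green` are assumptions, not values from the .gpr files): a negative ratio has no logarithm, and subtracting one constant from every spot keeps the before/after relationship a straight line with slope 1.

```r
# Made-up background-corrected intensities for five spots (illustrative only).
red   <- c(1200, 850, -40, 300, 15)
green <- c(600, 900, 120, -25, 30)

# log2 of a negative ratio is undefined, so R returns NaN (with a warning).
log_fc <- suppressWarnings(log2(red / green))
n_undefined <- sum(is.nan(log_fc))

# Subtracting one constant from every spot shifts the values but keeps
# the before/after relationship a straight line with slope 1.
before <- log_fc[!is.nan(log_fc)]
after  <- before - median(before)
slope  <- unname(coef(lm(after ~ before))[2])
```

Spots that come back as `NaN` here would correspond to the "error" entries in the .gpr files, under the assumption that those entries come from negative ratios.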

### May 26, 2011

This morning was spent troubleshooting a line of code given by Dr. Fitzpatrick, M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean), meant to average the duplicate spots in the .gpr files after they were locally normalized using loess. We had to come back later and fix this code with a for loop so that R would process all of the .gpr files instead of only the one we had specified. The column index 5 was used because that was the first column with gene names when we were working on this yesterday. However, that was in a .gpr file, where the first four columns were just 1's, whereas the MA data frame that was created did not incorporate these columns. Once this was all figured out, I went to lunch and to check on the S. cerevisiae cultures (see lab notebook for details). Once we got back, we spent the remainder of the time trying to make sure the loess normalization worked. After talking with Dr. Fitzpatrick for a bit, we determined this was best done by graphing the normalized vs. non-normalized log fold changes, after both had been averaged using the tapply function. We spent a short amount of time troubleshooting this, mostly clerical errors made when typing the source code, after which we were able to graph the relationship. It came out relatively linear, showing that loess normalization was working correctly. With this we are able to go back and do the loess normalization on the data from the rest of the strains.
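As a small illustration of what the tapply line does, here is a sketch on invented values (the gene labels and numbers are assumptions, not data from the chips). It groups the log fold changes by gene ID and returns one mean per unique ID.

```r
# Toy log fold changes with duplicated gene IDs (values are invented).
m_values <- c(0.5, 1.5, -0.2, 0.8, -0.4)
gene_ids <- c("geneA", "geneA", "geneB", "geneC", "geneB")

# tapply groups m_values by gene ID and applies mean to each group,
# returning one averaged value per unique gene, ordered by ID.
avg <- tapply(m_values, as.factor(gene_ids), mean)
```

The result is a named vector, so `avg["geneA"]` looks up the averaged value for a single gene.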

### May 27, 2011

We ran into a few problems on our first attempts at running the code, even though we had run it successfully the previous day. The main problem was in the first for loop: MM must have the [,i] afterwards, or else the output sheet will only be one column instead of all the columns that were put into the loop. After a few more typos were taken care of, this was the outcome:

    library(limma)

    par(mfrow=c(2,1))
    targets<-readTargets(file.choose())
    # spot weight: 1 if Flags > -99, otherwise 0
    f<-function(x) as.numeric(x$Flags > -99)
    RG<-read.maimages(targets, source="genepix.median", wt.fun=f)
    plotMA(RG,main="before within array normalization")
    MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
    # average duplicate spots by gene ID (column 5 of MA$genes)
    M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n0<-length(MA$M[1,])
    n1<-length(M1)
    MM<-matrix(nrow=n1,ncol=n0)
    MM[,1]=M1
    for(i in 1:20) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}
    write.table(MM,"wt_M_avg.csv",sep=",")
    plotMA(MA,main="after within array normalization")
    # between array normalization, then average duplicates again
    MAB<-normalizeBetweenArrays(MA,method="scale",targets=NULL)
    b0<-length(MAB$M[1,])
    N2<-tapply(MAB$M[,1],as.factor(MAB$genes[,5]),mean)
    b1<-length(N2)
    MMB<-matrix(nrow=b1,ncol=b0)
    MMB[,1]=N2
    for(i in 1:9) {MMB[,i]<-tapply(MAB$M[,i],as.factor(MAB$genes[,5]),mean)}
    write.table(MMB,"between_array_norm.csv",sep=",")

The point of all this code is to take the .gpr files that we were given, read them into R, create an MA plot before normalization, and then normalize them using the locally weighted scatterplot smoother (loess). Note that the "e" in loess may stand for "estimated", seeing as the "we" in lowess stands for "weighted". After this was done, the data was placed into a matrix, because R has trouble translating an MAList to a data frame, and this matrix was then averaged over every duplicate gene using the line of code discussed on May 26th. This was all written into an Excel table and saved along with the accompanying picture of the MA plots. The data was then put through a between array normalization and placed into a separate matrix, and boxplots were made of the data before and after between array normalization; these were all saved. One note on the code: all the for loops must be changed to match the number of .gpr files listed in the target text file. Work is still progressing on the GCAT chips. First of all, the flask numbering for the GCAT chip as well as flask 4 must be double checked because of some discrepancies, and because we do not want to have to run all the data twelve times. However, I was able to read both targets into R, normalize them, and merge them into a single MAList. This can also be done for the RG list beforehand. At the moment, I am only testing this code to make sure it will accomplish what I need, by testing segments and checking the attributes of the outcomes. Once the flask discrepancies can be sorted out with Dr. Dahlquist, a majority of the "New Master List" can be assembled and the GCAT can be focused on from there.
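The note about changing every for loop by hand suggests one possible alternative: sizing the loop from the data itself. The sketch below is an assumption about how that could look, using an invented matrix `m` as a stand-in for MA$M and an invented factor `genes` as a stand-in for the gene ID column; it is not the code that was actually run.

```r
# Stand-in for MA$M: 6 spots x 4 arrays of invented log ratios.
set.seed(1)
m <- matrix(rnorm(24), nrow = 6, ncol = 4)
genes <- factor(c("g1", "g1", "g2", "g2", "g3", "g3"))

# Size the output from the data instead of hard-coding the array count.
avg <- matrix(nrow = nlevels(genes), ncol = ncol(m),
              dimnames = list(levels(genes), NULL))
for (i in seq_len(ncol(m))) {
  avg[, i] <- tapply(m[, i], genes, mean)
}
```

With `seq_len(ncol(m))` the same loop covers 9, 20, or 94 arrays without editing the bounds.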

## Week 3

### June 1, 2011

How to use Microsoft Access to match tables:

- Open Microsoft Access
- Make sure all columns are labeled.
- Save desired files as tab delimited text files.
- File -> Get External Data -> Import your text files.
- Choose ID's to be Field Name
- Check first box for first row is field names
- ID column can be text and is ok with duplicates, will kick it back out if it doesn't like it.
- All other columns should be numbers
- Repeat for other text file.
- Choose own access key
- Create a new query and add both text files
- Click and drag from one ID field name to the other
- Right click on line and choose join properties
- Choose option three, to keep all "Ontario" data and add only new "GCAT" data, arrow should appear showing relationship.
- Select all the field names and click and drag them to Field on the Query, it will automatically assign them.
- Query -> Make-Table Query and name your table
- Click run to get output, select all and copy and paste values into an excel file, make sure format is correct
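For comparison, roughly the same join can be sketched in base R with merge(). The two small data frames below are invented stand-ins for the imported text files, and the column names are assumptions.

```r
# Invented ID/value tables standing in for the two imported text files.
ontario <- data.frame(ID = c("g1", "g2", "g3"), ont_logfc = c(0.4, -1.1, 0.0))
gcat    <- data.frame(ID = c("g2", "g3", "g9"), gcat_logfc = c(-0.9, 0.2, 1.5))

# all.x = TRUE keeps every Ontario row and attaches GCAT values only where
# the IDs match; GCAT-only IDs (g9) are dropped, unmatched cells become NA.
combined <- merge(ontario, gcat, by = "ID", all.x = TRUE)
```

This mirrors the join option that keeps all "Ontario" data and adds only matching "GCAT" data.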

Spent today working through the GCAT chips. Once this was finished, Dr. Dahlquist showed us how to use Microsoft Access to combine the GCAT and Ontario chips, as well as how to ignore any extra genes on the GCAT chips. After this I spent the afternoon constructing a PowerPoint with all the MA plots and boxplots as well as a process description and the code. The validation graphs still need to be made and placed on the spreadsheet; I will try to get this done tonight so it is ready for June 2nd.

### June 2, 2011

Spent the day going over our results with Dr. Fitzpatrick to fully understand them and make sure everything was scaled correctly. We discovered that the MA plots we originally output from R were in fact only the first .gpr file for each strain, so we went back to fix all of those and get all of the graphs on the same page. This was solved by running a for loop for our MA plot maker:

    for(i in 1:20) {plotMA(MA[,i])}

Next we decided to line up all the boxplots next to each other. This was particularly tough, and upsetting once we discovered that we had to write out the whole code:

boxplot(MA$M[,1],MA$M[,2],MA$M[,3],MA$M[,4],MA$M[,5],MA$M[,6],MA$M[,7],MA$M[,8],MA$M[,9],MA$M[,10],MA$M[,11],MA$M[,12],MA$M [,13],MA$M[,14],MA$M[,15],MA$M[,16],MA$M[,17],MA$M[,18],MA$M[,19],MA$M[,20])
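A possible shortcut, sketched here on an invented matrix `m` rather than on the real MA$M, is to hand boxplot() a data frame, in which case it draws one box per column. The column names are assumptions for the example.

```r
# Invented stand-in for MA$M: 10 spots x 20 arrays.
set.seed(2)
m <- matrix(rnorm(200), nrow = 10, ncol = 20,
            dimnames = list(NULL, paste0("array", 1:20)))

# boxplot() draws one box per column when given a data frame.
# plot = FALSE only computes the statistics; drop it to draw the figure.
stats <- boxplot(as.data.frame(m), plot = FALSE)
```

Assuming MA$M is an ordinary numeric matrix, `boxplot(as.data.frame(MA$M))` would be the equivalent one-line call.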

We made a correction to this code, changing MA$M to MAB$M in my case, to plot the data after normalization. After this we attempted to go through the "multtest" tutorial that Dr. Fitzpatrick had given us but ran into a few problems installing the packages. These problems persisted even after we thought we had solved them all; however, Dr. Dahlquist should have the means to help us install these packages. The last piece of code that I generated was the boxplots for the data before within array normalization, so we could see the centering effects. This was probably the most annoying code I have written; thankfully I only made one error, which was easily corrected. The code is:

    boxplot(log2((RG$R[,1]-RG$Rb[,1])/(RG$G[,1]-RG$Gb[,1])),
            log2((RG$R[,2]-RG$Rb[,2])/(RG$G[,2]-RG$Gb[,2])),
            -log2((RG$R[,3]-RG$Rb[,3])/(RG$G[,3]-RG$Gb[,3])),
            -log2((RG$R[,4]-RG$Rb[,4])/(RG$G[,4]-RG$Gb[,4])),
            log2((RG$R[,5]-RG$Rb[,5])/(RG$G[,5]-RG$Gb[,5])),
            log2((RG$R[,6]-RG$Rb[,6])/(RG$G[,6]-RG$Gb[,6])),
            -log2((RG$R[,7]-RG$Rb[,7])/(RG$G[,7]-RG$Gb[,7])),
            -log2((RG$R[,8]-RG$Rb[,8])/(RG$G[,8]-RG$Gb[,8])),
            log2((RG$R[,9]-RG$Rb[,9])/(RG$G[,9]-RG$Gb[,9])),
            log2((RG$R[,10]-RG$Rb[,10])/(RG$G[,10]-RG$Gb[,10])),
            -log2((RG$R[,11]-RG$Rb[,11])/(RG$G[,11]-RG$Gb[,11])),
            -log2((RG$R[,12]-RG$Rb[,12])/(RG$G[,12]-RG$Gb[,12])),
            log2((RG$R[,13]-RG$Rb[,13])/(RG$G[,13]-RG$Gb[,13])),
            log2((RG$R[,14]-RG$Rb[,14])/(RG$G[,14]-RG$Gb[,14])),
            -log2((RG$R[,15]-RG$Rb[,15])/(RG$G[,15]-RG$Gb[,15])),
            -log2((RG$R[,16]-RG$Rb[,16])/(RG$G[,16]-RG$Gb[,16])),
            log2((RG$R[,17]-RG$Rb[,17])/(RG$G[,17]-RG$Gb[,17])),
            log2((RG$R[,18]-RG$Rb[,18])/(RG$G[,18]-RG$Gb[,18])),
            -log2((RG$R[,19]-RG$Rb[,19])/(RG$G[,19]-RG$Gb[,19])),
            -log2((RG$R[,20]-RG$Rb[,20])/(RG$G[,20]-RG$Gb[,20])))
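The same calculation can be sketched more compactly with a vector of dye-swap signs. Everything below runs on invented intensities; the matrices `R`, `G`, `Rb`, `Gb` are stand-ins for RG$R, RG$G, RG$Rb, RG$Gb, and the name `dye_sign` is an assumption for this example.

```r
# Invented foreground/background intensities: 5 spots x 20 arrays.
set.seed(3)
n_spots <- 5
n_arrays <- 20
R  <- matrix(runif(n_spots * n_arrays, 500, 5000), n_spots, n_arrays)
G  <- matrix(runif(n_spots * n_arrays, 500, 5000), n_spots, n_arrays)
Rb <- matrix(100, n_spots, n_arrays)
Gb <- matrix(100, n_spots, n_arrays)

# +1 for normal arrays, -1 for dye-swapped ones; this pattern (two normal,
# two swapped, repeated) mirrors the signs in the written-out call above.
dye_sign <- rep(c(1, 1, -1, -1), times = 5)

# Background-corrected log2 ratios for all arrays at once.
log_ratio <- log2((R - Rb) / (G - Gb))

# Flip the sign of every dye-swapped column.
log_fc <- sweep(log_ratio, 2, dye_sign, `*`)
```

Passing `as.data.frame(log_fc)` to boxplot() would then draw all twenty boxes, and a mistyped column index can no longer slip in.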

## Week 4

### June 8, 2011

Spent the morning in a lecture on statistics, covering the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR). After a quick intro to MTP we started working through R to get a better understanding of what MTP needed in order to run. This was probably the most tedious point of the day because the output in R consisted of thousands of NaN's (not a number), which was frustrating to say the least. Once Dr. Fitzpatrick looked at what was going wrong, we were able to step through the errors and output the p-values that we wanted, both raw and adjusted, for all the data as well as an average. We used this code to do this:

    library(multtest)

    seed<-99
    group<-c(rep(0,3))
    m1<-MTP(X=matrix,Y=group,test="t.onesamp",standardize=FALSE,B=100,method="?")

The method still needs to be changed for the FDR; without specifying it, MTP defaults to one we don't want. The matrix object in this code is just the GPR data saved as a CSV sheet and read into R. The test also needs to be set to one sample, otherwise it defaults to a two-sample t-test. See the help section in R, or type in ?MTP and look at the examples, for more help. Tomorrow will be the real test to see if we can get this to work for the rest of the data points. We also need to compare the raw p-values that are output with the calculated p-values from our new normalized master spreadsheet to see how they correlate.
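To make the difference between FWER and FDR adjustment concrete, here is a base R sketch that does not use MTP at all. It swaps in ordinary one-sample t-tests and p.adjust() as a simpler stand-in for the resampling-based procedure, and all of the data are invented.

```r
# Invented log fold changes: 50 genes x 4 replicates, with the first
# five genes shifted away from zero.
set.seed(4)
x <- matrix(rnorm(200, sd = 0.5), nrow = 50, ncol = 4)
x[1:5, ] <- x[1:5, ] + 3

# One-sample t-test per gene (row) against a mean of zero.
raw_p <- apply(x, 1, function(row) t.test(row, mu = 0)$p.value)

# Holm controls the family-wise error rate; Benjamini-Hochberg controls
# the false discovery rate.
p_fwer <- p.adjust(raw_p, method = "holm")
p_fdr  <- p.adjust(raw_p, method = "BH")
```

The FDR-adjusted values are never larger than the Holm-adjusted ones, which is why FDR control typically rejects more hypotheses.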

### June 9, 2011

Spent most of the day waiting for R to output the adjusted p-values that we wanted. This took a good deal of time because there are 6000 different genes to calculate the p-value for. We adjusted the number of iterations so that we could get a more accurate result, because with B=100 the p-values that come back are only shown to two decimal places, while B=1000 gives three decimal places. Once all of these were calculated by R, we placed them into a spreadsheet to compare them to the p-values we calculated by hand. After this was done, I spent some time trying to graph the results and compare the number of rejections from each type of adjustment. I managed to get a graph, but I am still researching whether there is a way to specify our data further and get a better graph. Another thing that I need to research is the use of the TPPFP and the gFWER adjustments: where they are applied, exactly why and in which situations, and how to implement them in R. Hopefully the materials sent by Dr. Fitzpatrick and the paper I need to read for Journal Club will give me a better understanding of what is going on.
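The link between B and the number of decimal places can be illustrated with a generic resampling sketch. This is not MTP's internal procedure: the null statistics are simply simulated here, and `observed` and `empirical_p` are hypothetical names for the example.

```r
set.seed(5)
observed <- 2.1  # invented observed test statistic

# Empirical p-value: fraction of B resampled null statistics at least as
# extreme as the observed one. Here the null draws are simulated directly.
empirical_p <- function(B) {
  null_stats <- rnorm(B)
  mean(abs(null_stats) >= abs(observed))
}

p_100  <- empirical_p(100)   # a multiple of 1/100
p_1000 <- empirical_p(1000)  # a multiple of 1/1000
```

Because the p-value is a count divided by B, it can only move in steps of 1/B, which is consistent with two decimals at B=100 and three at B=1000.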

### June 10, 2011

Spent the day comparing the different MTP results using different adjustment methods. We were attempting to see a correlation in the data, but none has been seen yet. This may be due to the methods; however, another and possibly more likely cause is that our calculations are not right. We need to make sure we are comparing the same strains, the same time points, and the same methods, otherwise there is a high possibility that no correlation will be seen between the Excel calculated data and the R calculated data. After seeing little to no relationship between the data, we started compiling a spreadsheet consisting of the 15 genes/transcription factors that we are focusing on, along with all their time points for every strain and the averages and standard deviations for every time point. After going through some of the data, some errors were found. For one, we need to go back and refill our GLN3 data because we started out with only a fraction of the GPR files. We also need to compare the CIN5 values we were using to test our methods today with other CIN5 data; we managed to compare them to wild type data instead and took that as no correlation.

## Week 5

### June 13, 2011

Spent the morning going over the Matlab model that I worked on this previous semester. After this large review we started making the individual input sheets for Matlab from the yeast regulatory sheet we made on Friday. After a few minutes of this we discovered a mistake in the GCAT data in the Master data sheet, so all the Matlab files had to be erased and re-pasted with the corrected data. While going back and fixing the GCAT data we discovered a few more errors in the number of replicates for ZAP1, as well as extra rows due to the controls still being present in the spreadsheet. Just to make sure that nothing was wrong with the rest of the data, we went back and remade the master spreadsheet by combining all of the individual files we got from R. Once all this was done we had to remake the yeast regulatory network spreadsheet, with all of the averages and standard deviations. After this we were able to make the Matlab input sheets for all of the deletion strains. We will start on the Matlab simulations tomorrow morning.

### June 14, 2011

Spent the day reviewing the yeast regulatory network and the Matlab simulation that I focused on these past six months. After making another PowerPoint with all of the expression graphs that Matlab outputs for each gene in the network, we went on to graph the relationship between the lse and the norm values at different alpha levels. The Matlab iterations took up most of the time for the workday so I did not spend a lot of time on the actual code, and the concepts were mostly a review for me.

### June 15, 2011

This morning consisted of running more Matlab simulations to try to find the switching point on the lse vs norm graph. The wild type graph produced a smooth curve; however, the dGLN3 graph had an awkward jump at the alpha value of 0.002. We chose this point to focus on for more statistical analysis. The afternoon was spent finding the relative error between each alpha value and 0.002. This was done for the optimized production rates, optimized weights, and optimized thresholds. Having to do the relative error for every alpha value in all three categories caused this step to take the rest of the day.
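A relative error comparison of this kind could be sketched as follows. The parameter values are invented, and the definition used (norm of the difference divided by the norm of the reference) is an assumption; the entry does not state which formula was actually applied.

```r
# Invented optimized parameter values at several alpha levels
# (one row per alpha, one column per parameter).
alpha  <- c(0.0005, 0.001, 0.002, 0.005, 0.01)
params <- rbind(c(1.10, 0.52, 2.05),
                c(1.04, 0.50, 2.01),
                c(1.00, 0.50, 2.00),
                c(0.90, 0.45, 1.80),
                c(0.70, 0.40, 1.50))

# Reference row: the parameters estimated at alpha = 0.002.
ref <- params[alpha == 0.002, ]

# Relative error of each run against the reference, as the Euclidean norm
# of the difference divided by the norm of the reference.
rel_err <- apply(params, 1,
                 function(p) sqrt(sum((p - ref)^2)) / sqrt(sum(ref^2)))
```

The same few lines would be repeated for the production rates, the weights, and the thresholds.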

### June 17, 2011

Spent the morning comparing the estimated production rates, network weights and network thresholds with the separate p-values that we simulated the other day. After comparing the graphs and seeing that an alpha value of 0.002 was a good choice and that the trends in the graphs are similar, we decided to rebuild the regulation networks and run the simulations with them. The afternoon consisted of building the 10 different Excel files, one for each strain and 1 for both the old regulation matrix and the new one with the genes in the strain names added in to see the effect. The first simulation took the rest of the day and the rest will be done over the coming week.

## Week 6

### June 20, 2011

In the morning we convened with Dr. Dahlquist to work out the details of the Matlab models that we are going through this next week. As we made the gene regulation networks in Yeastract, we decided to figure out where Yeastract was getting its references for the relationships between the genes. As we went through the references given on the website, Dr. Dahlquist asked us to figure out what the different options in the gene regulation matrix menu were. The two options were to test with direct or indirect evidence, or to test for the number of binding sites on the promoters, anywhere from 1-9. The latter gave us fewer and fewer results as the number of binding sites increased, so we knew this wasn't right. We talked with Dr. Dahlquist about the direct and indirect options to decide which to use. They both compare the given transcription factors and genes to a dataset, and we decided that a direct set of matrices and a direct & indirect set of matrices were the best option. However, one thing we are still working out is exactly where they are getting this dataset from. We have started searching the SGD, or Saccharomyces Genome Database; with any luck we will find it there. We spent the afternoon on something other than the reference checking.

We put together more data sheets for the direct and the direct & indirect matrices. The first five matrices for the direct evidence matrix had already been put together; however, Dr. Dahlquist wanted us to check on the AFT1 synonym, RCS1, that we had been using in our list of genes and transcription factors, to see if this would change the original direct matrix. The change was evident, so we had to go back and recreate all of the original data sheets by changing RCS1 to AFT1 in the list and reforming the matrix in Yeastract. Once these had all been corrected, we spent the rest of the day creating the old 21 gene regulatory network matrix so that we could create the target Excel sheets to put in Matlab. These corrections and the formatting for these sheets took the remainder of the day; we should start the simulations tomorrow in the Math PC computer lab.

## Week 8

### July 5, 2011

Spent the morning going over the steps very thoroughly with Dr. Dahlquist and Dr. Fitzpatrick, so that we could iron out all the discrepancies in our process and have the best chance of avoiding typos in our code. We also started compiling the individual code and tutorials for the GCAT and Ontario chips, as well as the processes to output their graphs, both boxplots and MA plots. After going through and matching up our code so that we were as sure as possible that we would get consistent data, we went through R and started outputting all of the data. However, we started finding some problems as we went through the code, mainly that Katrina and I were getting different values, though they differed only on the order of E-10. While this is not a huge difference, one would expect the same code and inputs to output the same data. The bigger problem came when we tried the process with the GCAT chips, for which Katrina was missing the GPR files and any segments of the code, so I copied my own over to her flash drive. The same differences arose when we went through this data, and this started to make me feel a bit uneasy. It got worse when I tried outputting my own data again to see if R was just not computing each run the same way; however, the output I got matched the first run of the data perfectly, in value and sign. The differences got even larger after between array normalization, where they increased to the order of E-04. We took our time going through the code to ensure that no typos were present. We also checked the dye swap csv file and the target files to see if we had them all in the correct order, which we did. Currently we cannot find out where this problem is coming from; hopefully Dr. Fitzpatrick or Dr. Dahlquist will be able to weigh in and help. The code for the GCAT chips is as follows, and the code for the Ontario chips is located further below. The process for between array normalization appears only in the GCAT code.
I added some descriptions of the steps just to break up the code so it is not overwhelming to look at.

    library(limma)

    # Read the top and bottom chips into R separately and rbind them
    # to model the data.frame after ontario chips, 14000 spots in each column
    targets<-readTargets(file.choose())
    f<-function(x) as.numeric(x$Flags > -99)
    RT<-read.maimages(targets, source="genepix.median", wt.fun=f)
    targets<-readTargets(file.choose())
    f<-function(x) as.numeric(x$Flags > -99)
    RB<-read.maimages(targets, source="genepix.median", wt.fun=f)
    RG<-rbind(RT,RB)

    # normalize within each array
    MA<-normalizeWithinArrays(RG,method="loess", bc.method="normexp")

    # tapply an average function to average duplicate spots so that only unique
    # spots remain in the data
    M<-matrix(nrow=6403,ncol=9)
    for(i in 1:9) {M[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,4]),mean)}

    # write the table to put through microsoft access
    write.table(M,"GCAT_WA.csv",sep=",")

    # read csv file back in once all logFC are in a single sheet, model after master data
    # then normalize between arrays
    B<-read.csv(file.choose(),sep=",")
    BX<-as.matrix(B)
    BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

    # write out as a table
    write.table(BA,"Master_data.csv",sep=",")

- Ontario Code

    library(limma)

    # Read Target Files into R and Loess Normalize
    targets<-readTargets(file.choose())
    f<-function(x) as.numeric(x$Flags > -99)
    RG<-read.maimages(targets, source="genepix.median", wt.fun=f)
    MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")

    # Average Duplicate spots so that each GPR has only unique spots left
    M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n0<-length(MA$M[1,])
    n1<-length(M1)
    MM<-matrix(nrow=n1,ncol=n0)
    MM[,1]=M1
    for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

    # Read dye swap and perform dye swap on ontario chips
    ds<-read.csv(file.choose(),sep=",")
    MN<-matrix(nrow=6191,ncol=94)
    for(i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}

    # Write file into excel to delete controls (first two rows)
    write.table(MN,"Ont_WA_ds.csv",sep=",")
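One way to quantify gaps like the E-10 and E-04 differences described in this entry is to compare the two outputs with a tolerance instead of exactly. The sketch below uses two invented vectors, `run_a` and `run_b`, as stand-ins for the two sets of results; it does not explain where the discrepancy comes from, only how to measure it.

```r
# Two invented outputs that differ only at the 1e-10 level.
run_a <- c(0.1523, -1.2045, 0.8831)
run_b <- run_a + c(1e-10, -2e-10, 5e-11)

# identical() demands an exact match, so tiny differences count.
exact_match <- identical(run_a, run_b)

# all.equal() compares within a tolerance (about 1.5e-8 by default).
close_match <- isTRUE(all.equal(run_a, run_b))

# Size of the largest absolute gap between the two runs.
max_diff <- max(abs(run_a - run_b))
```

A difference near E-04 would fail the default tolerance, so `max_diff` is the more informative number to track from step to step.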

## Week 9

### July 11, 2011

Today was spent researching how to create an index in R consisting of the gene names, or just a simple numerical index, so that as the data moves through R the labels are still present after each major step. So far I have managed to add a column of names and a row of headers to a data frame. There are still some problems arising as I move through. One is that a matrix converts every cell to a character instead of keeping the distinction between character and numeric values. The next major step is finding a line of code to keep these indices connected to their corresponding cells in R. We have found a few examples but have nothing to post for these at the moment; we are still working on getting the correct format down. One thing I might want to check on later is keeping everything as a matrix, because some functions have problems with data frames; however, we may have moved past this problem with our change in the scale normalization step. We have changed the line of code from:

BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

to:

for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

More research will need to be done to compile code for the new format of target files we have, that being a csv file with the following columns: FileName, Header, Strain, TimePoint, Flask, Dyeswap. This way everything we need is present in R, so that we won't introduce human errors by adding typos to our data. Here is the compiled code I have so far:

    library(limma)

    # Read the target file, and name file, separate the index, dye swap, and row ID's
    Targets<-read.csv("New_Target.csv",sep=",")
    Names<-read.csv("ONT_Index_ID.csv",sep=",")
    f<-function(x) as.numeric(x$Flags > -99)
    ds<-Targets[,6]
    Index<-Names[,1]
    row<-Names[,2]
    col<-Targets[,2]
    RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

    # normalize within arrays
    MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
    M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n0<-length(MA$M[1,])
    n1<-length(M1)
    MM<-matrix(nrow=n1,ncol=n0)
    MM[,1]=M1
    for(i in 1:15) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

    # dye swap
    M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n2<-length(MA$M[1,])
    n3<-length(M1)
    MN<-matrix(nrow=n3,ncol=n2)
    MN[,1]=M2
    for(i in 1:15) {MN[,i]<-ds[i]*MM[,i]}

    # Assign MAList to a data frame and assign col and row names
    # delete first two rows as they are controls
    MO<-as.data.frame.matrix(MN)
    colnames(MO)<-col
    rownames(MO)<-row
    MP<-MO[-c(1,2),]

    # scale each column by its median absolute deviation
    bn<-readTargets("matrix_names.txt")
    b0<-length(MP)
    b1<-length(MP[,1])
    BH<-matrix(nrow=b1,ncol=b0,dimnames=list(bn))
    for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}
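Two points from this entry can be sketched on invented data: a matrix holding both names and numbers turns everything into character, while a data frame keeps the numbers numeric and carries the IDs as row names through the MAD scaling step. The column names and values below are assumptions for illustration only.

```r
# Mixing names and numbers in one matrix coerces everything to character.
as_matrix <- cbind(c("g1", "g2", "g3"), c(0.5, -1.2, 0.3))
matrix_type <- typeof(as_matrix)

# A data frame keeps each column's own type; row names can carry the IDs.
df <- data.frame(wt_t15 = c(0.5, -1.2, 0.3),
                 wt_t30 = c(1.0, -0.4, 0.6),
                 row.names = c("g1", "g2", "g3"))

# Scale each column by its median absolute deviation. Assigning into
# scaled[] replaces the values but keeps the row and column labels.
scaled <- df
scaled[] <- lapply(df, function(col) col / mad(col))
```

After scaling, every column has a MAD of 1, which is the effect the for loop with mad() is aiming for.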

### July 12, 2011

Spent the morning working through the Ontario code and fixing a few bugs. We managed to get the names to stick, so we moved on to the GCAT chips. Katrina worked on getting the GCAT chips through R using the new segments of the code worked into the original. After reformatting the entire code a bit we have compiled a mostly finished segment of code; however, there is still one part to work out. That is getting R to perform what Microsoft Access did, namely getting rid of all the GCAT spots we don't want, without having to actually touch the data yet. Katrina has been working on subsets for the remainder of the afternoon. While I was going through on my own to try to compile subsets, I stumbled across a function that may do exactly what we want: merge() will combine two data frames and match up the cells based on criteria you provide. However, this is where the hard part comes in, which is figuring out how to tell R the criteria. I have been getting an error telling me that the memory required for the outcome cannot be allocated in R. I have tried using fewer .gpr files for the Ontario chips, seeing as I was using all 94 originally, as well as saving the workspace so that the data is saved in a separate folder. However, none of these have worked yet. Fixing the memory, or getting R to allocate more memory, may be my best bet at this point; this will be explored more tomorrow. For now I will post the compiled code thus far. Note: the last four lines of code are currently being tampered with; they are not permanent and may not be present next time.

>Read the target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) MN[,1]<-M2*ds[1]/mad(M2) for(i in 2:94) {MN[,i]<-ds[i]*MM[,i]/mad(MM[,i])}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),] write.table(MP,"ONT_07122011.csv",sep=",",col.names=NA,row.names=TRUE)

>switch to GCAT directory and start on the GCAT chips

targets<-readTargets("Top.txt") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets("Bottom.txt") f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RGG<-rbind(RT,RB)

>add headers and row names for GCAT

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[,2] Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots

    MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp")
    R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
    r0<-length(MAG$M[1,])
    r1<-length(R1)
    RR<-matrix(nrow=r1,ncol=r0)
    RR[,1]=R1
    for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

    GD<-as.data.frame.matrix(RR)
    colnames(GD)<-Gcol
    rownames(GD)<-Grow

    NAM<-as.data.frame(Names[,2])
    Q<-merge(P,GD,match(P$row.names,GD$row.names),all.NAM=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),by.MP=row.names(MP),by.GD=row.names(MP),all.MP=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),all.MP=all)
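The averaging, dye-swap, and scaling steps in the listing above can be tried on a small made-up data set before running them on the real chips. The sketch below is only an illustration: the spot IDs, the log ratios, and the assumption that the dye-swap column holds 1 or -1 are all hypothetical, and it uses sapply() in place of the preallocated matrix and for loop.

```r
# Toy log ratios: 6 spots x 2 chips (made-up numbers, not real chip data).
# Each gene is spotted twice, so the spot IDs repeat.
M <- matrix(c(1.0, 1.2, -0.5, -0.7, 0.1, 0.3,
              -0.9, -1.1, 0.4, 0.6, 0.0, -0.2), ncol = 2)
ids <- c("geneA", "geneA", "geneB", "geneB", "geneC", "geneC")

# Hypothetical dye-swap indicator: -1 flips the sign of a swapped chip.
ds <- c(1, -1)

# Average the replicate spots chip by chip, as tapply() does in the journal code.
avg <- sapply(seq_len(ncol(M)),
              function(i) tapply(M[, i], as.factor(ids), mean))

# Apply the dye swap and divide each chip by its median absolute deviation.
scaled <- sapply(seq_len(ncol(avg)),
                 function(i) ds[i] * avg[, i] / mad(avg[, i]))

print(avg)
print(scaled)
```

Because sapply() sizes the result from the data, the number of chips does not have to be hard-coded in the loop bounds.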

### July 13, 2011

I continued with the merge() code this morning to figure out the secondary arguments and get it to work. After browsing through a few hundred emails, forums, and instruction manuals, I found an example that showed how to work out the parts I was still having trouble with. My next problem was keeping only the Ontario IDs across both the Ontario and GCAT chips. This should be done by adding the all.x=TRUE argument inside merge(). However, R seemed to ignore the argument and I could not figure out why. After trying a few examples, I moved on to creating a subset of the data frame that consisted of only these IDs. Katrina had a doctor's appointment, so the comparison had to wait until she came back. After a simple typo that led us to think our outputs differed only in the GCAT chips, we confirmed that both of our methods do the same thing. Which one we will use I do not know at the moment. The last hour of the afternoon was spent fixing up our presentation for journal club tomorrow. As usual I will post my compiled code at the end. This code is correct, and as long as the input files are consistent everything should run smoothly. The only break comes when the programmer needs to switch directories; the headers above each block of code list when the change occurs. Here is the code. Note: I originally had a hard delete to get rid of the ONT controls; I will be working in a subsetting procedure to do this instead.

>Read the ONT target file, and name file, separate the index, dye swap, and row ID's

    library(limma)  # provides read.maimages(), normalizeWithinArrays(), readTargets()
    Targets<-read.csv("Targets.csv",sep=",")
    Names<-read.csv("ONT_Index_ID.csv",sep=",")
    f<-function(x) as.numeric(x$Flags > -99)
    ds<-Targets[,6]
    row<-Names[,2]
    col<-Targets[,2]
    RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

    MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
    M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n0<-length(MA$M[1,])
    n1<-length(M1)
    MM<-matrix(nrow=n1,ncol=n0)
    for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step

    M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n2<-length(MA$M[1,])
    n3<-length(M2)
    MN<-matrix(nrow=n3,ncol=n2)
    for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names

>delete first two rows as they are controls

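    MO<-as.data.frame.matrix(MN)
    colnames(MO)<-col
    rownames(MO)<-row
    # MO has no Ontario_ID column, so filter on the row names; the result is
    # named MP because the merge() step further down expects MP
    ont1<-subset(MO,row.names(MO)!="Arabidopsis")
    MP<-subset(ont1,row.names(ont1)!="3XSSC")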

>switch to GCAT directory and start on the GCAT chips

    targets<-readTargets("Top.txt")
    f<-function(x) as.numeric(x$Flags > -99)
    RT<-read.maimages(targets, source="genepix.median", wt.fun=f)
    targets<-readTargets("Bottom.txt")
    f<-function(x) as.numeric(x$Flags > -99)
    RB<-read.maimages(targets, source="genepix.median", wt.fun=f)
    RGG<-rbind(RT,RB)

>add headers and row names for GCAT

    Header<-read.csv("GCAT_Targets.csv",sep=",")
    GNames<-read.csv("GCAT_ID.csv",sep=",")
    Gcol<-Header[,2]
    Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots

    MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp")
    R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
    r0<-length(MAG$M[1,])
    r1<-length(R1)
    RR<-matrix(nrow=r1,ncol=r0)
    for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

    GD<-as.data.frame.matrix(RR)
    colnames(GD)<-Gcol
    rownames(GD)<-Grow

>Deleting unnecessary GCAT IDs using R

    Q<-merge(MP,GD,by="row.names",all=T)
    Z<-subset(Q,Q[,1] %in% Names[,2])
    x0<-length(Z[1,])
    x1<-length(Z[,1])
    XX<-matrix(nrow=x1,ncol=x0)
    XX[,1]=Z[,1]
    for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}
    write.table(XX,"Master_07132011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)
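The merge-then-subset step at the end of the listing above can be checked on two tiny data frames. This is a minimal sketch with made-up IDs and values (the names ont, gcat, and keep are hypothetical); it also shows that, at least on this toy example, all.x=TRUE keeps the same rows as the full merge followed by the subset.

```r
# Two toy data frames keyed by row names (hypothetical IDs and values).
ont  <- data.frame(chip1 = c(0.5, -0.2, 1.1),
                   row.names = c("YAL001C", "YAL002W", "YAL003W"))
gcat <- data.frame(chip2 = c(0.3, 0.9, -0.4),
                   row.names = c("YAL002W", "YAL003W", "EMPTY_SPOT"))

# IDs we want in the final sheet (stands in for the Ontario ID list).
keep <- c("YAL001C", "YAL002W", "YAL003W")

# Full outer join on row names; the key ends up in a column called Row.names.
merged <- merge(ont, gcat, by = "row.names", all = TRUE)

# Keep only the wanted IDs, as in the journal code.
wanted <- subset(merged, merged[, 1] %in% keep)

# Alternative: a left join keeps every row of the first data frame only.
left <- merge(ont, gcat, by = "row.names", all.x = TRUE)

print(wanted)
print(left)
```

IDs that are missing from one chip set come back as NA in that chip's columns, which is why the later mad() call in the journal code uses na.rm=TRUE.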

### July 14, 2011

We are still working out the kinks in our code so that the final output will be the master data sheet, that is, the data without the averages, t statistics, and p-values. Everything has been solved up to keeping the column and row names throughout R. The last step is putting the flasks in the correct order. So far we have developed a hard sort, where we tell R where each column goes. In theory this is fine, but we are trying to develop a way for R to do this on its own to minimize human error, so that the hard sort does not have to be corrected or extended every time a new deletion strain is added to the data set. This is not as big a problem as previously thought, because only the GCAT chips are out of order, so only they would have to be changed. They could even be left alone if the GCAT chips are columns 1-9 and the Ontario chips are 10 onward; then new strains would only have to be appended and the final number adjusted for the total number of chips loaded into R. I found one package, gregmisc, that contains a function matchcols(), which is supposed to match up the columns of a data frame based on the criteria in a vector. I am working out the problems with this code to see if it is a viable option. The other problem is that this code was developed using R 2.13.1, so a function we call may have changed since the earlier version and our data may no longer be consistent. To check this I need to compare the within-array data between versions 2.7.2 and 2.13.1; the between-array step has already been removed, so it is no longer a concern. However, I have a financial problem to deal with, so this must wait until tomorrow. As usual, here is my compiled code. Note: the last few lines with no description above them are still being worked on:

>Read the target file, and name file, separate the index, dye swap, and row ID's

    library(limma)  # provides read.maimages(), normalizeWithinArrays(), readTargets()
    Targets<-read.csv("Targets.csv",sep=",")
    Names<-read.csv("ONT_Index_ID.csv",sep=",")
    f<-function(x) as.numeric(x$Flags > -99)
    ds<-Targets[,6]
    row<-Names[,2]
    col<-Targets[,2]
    RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

    MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
    M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n0<-length(MA$M[1,])
    n1<-length(M1)
    MM<-matrix(nrow=n1,ncol=n0)
    for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap

    M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n2<-length(MA$M[1,])
    n3<-length(M2)
    MN<-matrix(nrow=n3,ncol=n2)
    for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names

>delete first two rows as they are controls

    MO<-as.data.frame.matrix(MN)
    colnames(MO)<-col
    rownames(MO)<-row
    ont1<-subset(MO,row.names(MO)!="Arabidopsis")
    MP<-subset(ont1,row.names(ont1)!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

    targets<-read.csv("GCAT_Targets.csv",sep=",")
    f<-function(x) as.numeric(x$Flags > -99)
    RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f)
    RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f)
    RGG<-rbind(RT,RB)

>normalize within arrays and tapply the data to average all the "empty" spots

    MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp")
    R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
    r0<-length(MAG$M[1,])
    r1<-length(R1)
    RR<-matrix(nrow=r1,ncol=r0)
    for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

    Header<-read.csv("GCAT_Targets.csv",sep=",")
    GNames<-read.csv("GCAT_ID.csv",sep=",")
    Gcol<-Header[1:9,2]
    Grow<-GNames[,2]
    GD<-as.data.frame.matrix(RR)
    colnames(GD)<-Gcol
    rownames(GD)<-Grow

>Getting rid of the Microsoft Access step and disposing of unnecessary GCAT IDs using R

    Q<-merge(MP,GD,by="row.names",all=T)
    Z<-subset(Q,Q[,1] %in% Names[,2])
    x0<-length(Z[1,])
    x1<-length(Z[,1])
    XX<-matrix(nrow=x1,ncol=x0)
    XX[,1]=Z[,1]
    colnames(XX)=colnames(Z)
    for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}

XY<-XX[,c(1,100,2,3,4,96,101,5,6,7,97,102,8,9,98,103,10,11,12,99,104,13,14,15,16:95)]

    XZ<-read.csv("Ordered_headers.csv",sep=",")
    XV<-as.vector(XZ)
    GW<-matchcols(colnames(XX),with=c(as.character(XZ)),method="and")
    C<-grep(XZ[,1],colnames(XX),ignore.case=FALSE,fixed=TRUE)

F<-sort(XX,by=colnames(XZ))

write.table(XY,"Master_07142011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

>optional within array data

write.table(Z,"within_array_07142011.csv",sep=",")

### July 15, 2011

This morning consisted of finishing up the ordering of the chips once they are in R. After going through my code separately to make sure everything worked, we compared our outputs. To our relief they matched perfectly, which means that even though Katrina and I have different code we both get the same output. The afternoon was spent putting the code on OpenWetWare; it is located under the Dahlquist Lab protocols page in the LMU section. A description of each step was added, so there are now roughly as many comments as there are lines of code. Once this was done we went through and calculated the p-values individually. We also applied the Bonferroni correction, which multiplies each p-value by the total number of spots. We also applied the Benjamini-Hochberg correction, which is similar in that each p-value is multiplied by the total number of spots, but it is then divided by the p-value's position in the list sorted from smallest to largest: the smallest is Pval x 6189/1, the second smallest is Pval x 6189/2, and so on. After this we had two options: compute the F statistics and find the p-values from them, or start working on the multcomp package. While Katrina did the F statistics for her data, I worked on learning about two functions, one located in the stats package that ships with R and one located in the multcomp package. The first is lm(), which fits the data to a linear model that you specify; to apply it to our data I will need to know what the model is. The second is glht(), which tests general linear hypotheses; this is a much more detailed process and I am still working through the arguments inside the parentheses. The other part I am working on is linfct, an argument passed inside glht(). These pieces just require some time spent with them to figure out what they do and how to adjust them correctly for our data set.
As usual I will post the compiled code. For now this seems to be the final version, and any changes will be commented on later. Here is the code:

>Read the target file, and name file, separate the index, dye swap, and row ID's

    library(limma)  # provides read.maimages(), normalizeWithinArrays(), readTargets()
    Targets<-read.csv("Targets.csv",sep=",")
    Names<-read.csv("ONT_Index_ID.csv",sep=",")
    f<-function(x) as.numeric(x$Flags > -99)
    ds<-Targets[,6]
    row<-Names[,2]
    col<-Targets[,2]
    RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

    MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")
    M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n0<-length(MA$M[1,])
    n1<-length(M1)
    MM<-matrix(nrow=n1,ncol=n0)
    for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap

    M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean)
    n2<-length(MA$M[1,])
    n3<-length(M2)
    MN<-matrix(nrow=n3,ncol=n2)
    for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

    MO<-as.data.frame.matrix(MN)
    colnames(MO)<-col
    rownames(MO)<-row
    ont1<-subset(MO,row.names(MO)!="Arabidopsis")
    MP<-subset(ont1,row.names(ont1)!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

    targets<-read.csv("GCAT_Targets.csv",sep=",")
    f<-function(x) as.numeric(x$Flags > -99)
    RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f)
    RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f)
    RGG<-rbind(RT,RB)

>normalize within arrays and tapply the data to average all the "empty" spots

    MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp")
    R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean)
    r0<-length(MAG$M[1,])
    r1<-length(R1)
    RR<-matrix(nrow=r1,ncol=r0)
    for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

    Header<-read.csv("GCAT_Targets.csv",sep=",")
    GNames<-read.csv("GCAT_ID.csv",sep=",")
    Gcol<-Header[1:9,2]
    Grow<-GNames[,2]
    GD<-as.data.frame.matrix(RR)
    colnames(GD)<-Gcol
    rownames(GD)<-Grow

>Getting rid of the Microsoft Access step and disposing of unnecessary GCAT IDs using R

    Q<-merge(MP,GD,by="row.names",all=T)
    Z<-subset(Q,Q[,1] %in% Names[,2])
    x0<-length(Z[1,])
    x1<-length(Z[,1])
    XX<-matrix(nrow=x1,ncol=x0)
    XX[,1]=Z[,1]
    colnames(XX)=colnames(Z)
    for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}

    XZ<-read.csv("Ordered_headers.csv",sep=",")
    XV<-as.data.frame.matrix(XX)
    XY<-XV[,match(XZ[,1],colnames(XV))]

write.table(XY,"Master_07142011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)
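The column ordering in the listing above relies on match(), which returns the position of each wanted header among the current column names. A minimal sketch on a made-up data frame follows; the column names and the ordered_headers vector are hypothetical and stand in for the contents of Ordered_headers.csv.

```r
# Toy data frame whose columns are out of order (made-up names and values).
XV <- data.frame(ID        = c("g1", "g2"),
                 wt_t30    = c(0.2, 0.1),
                 wt_t15    = c(0.5, -0.3),
                 dGLN3_t15 = c(1.0, 0.4))

# Desired column order, standing in for the first column of the header file.
ordered_headers <- c("ID", "wt_t15", "wt_t30", "dGLN3_t15")

# Position of each wanted header among the existing column names.
idx <- match(ordered_headers, colnames(XV))

# A header that is misspelled or missing gives NA, so stop before indexing.
stopifnot(!anyNA(idx))

# Reorder the columns without typing any column numbers by hand.
XY <- XV[, idx]
print(colnames(XY))
```

When a new strain is added, only the header file has to change, which avoids the hard-coded column numbers of the earlier hard sort.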

## October

### October 4th, 2011

To analyze the significant changes in our gene expression, p-values were calculated using an F statistic. The first step is to calculate the sum of squares under the null hypothesis, or SSH. The null hypothesis is that no genes experience any significant change in expression, so the population mean ($\mu$) is 0. To calculate the SSH, each gene's log fold change was squared and summed over every flask $i$ ($i$ = 1, 2, 3, 4, 5) and every time point $j$ ($j$ = t15, t30, t60, t90, t120): $SSH=\sum_{i}\sum_{j}Y_{ij}^{2}$. The second step is to calculate the sum of squares under the alternative hypothesis, or SSF; the difference from the SSH is that this hypothesis states there is at least one significantly changed gene. This is represented by subtracting the population mean from the log fold change before squaring it. The population mean is estimated by averaging the log fold changes at each time point, and these averages are subtracted from each gene's log fold changes at the respective time point. The differences are then squared and summed over every flask $i$ and time point $j$: $SSF=\sum_{i}\sum_{j}(Y_{ij}-\mu_{j})^{2}$. Finally, the F statistic is calculated by subtracting the SSF from the SSH and dividing by the SSF, then multiplying this value by $(N-F)/F$, where $N$ is the number of trials and $F$ is the number of time points, that is, the number of means $\mu_{j}$ being estimated. This statistic follows an F-distribution with degrees of freedom $F$ and $N-F$: $\frac{SSH-SSF}{SSF}\cdot\frac{N-F}{F}\sim F(F,N-F)$. The statistic can then be converted to a p-value using the FDIST() command in Excel with the degrees of freedom $F$ and $N-F$. Note that $F$ is 5 for every deletion strain and the wild type, while $N-F$ will vary because the wildtype has 23 repetitions and not 20 like the deletion strains.
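The calculation described above can be sketched in R for a single gene. This is only an illustration under stated assumptions: the log fold changes are made-up numbers, the layout of 4 flasks by 5 time points (N = 20, as for a deletion strain) is hypothetical, the numerator degrees of freedom are taken to be the 5 time-point means, and pf() with lower.tail=FALSE is used as the R counterpart of Excel's right-tailed FDIST().

```r
# Made-up log fold changes for one gene: rows are flasks, columns are time points.
# matrix() fills column by column, so each group of four values is one time point.
Y <- matrix(c( 0.8,  1.1,  0.9,  1.0,    # t15
               1.5,  1.2,  1.4,  1.6,    # t30
               0.3,  0.1,  0.2,  0.4,    # t60
              -0.2,  0.0, -0.1,  0.1,    # t90
              -0.6, -0.4, -0.5, -0.7),   # t120
            nrow = 4, ncol = 5)

N    <- length(Y)   # number of trials (20 here)
n_tp <- ncol(Y)     # number of time-point means (5 here), the F in the formula

# Sum of squares under the null hypothesis (mean of 0 everywhere).
SSH <- sum(Y^2)

# Sum of squares under the alternative: subtract each time point's mean first.
mu  <- colMeans(Y)
SSF <- sum(sweep(Y, 2, mu)^2)

# F statistic and its right-tail p-value.
F_stat  <- ((SSH - SSF) / SSF) * ((N - n_tp) / n_tp)
p_value <- pf(F_stat, df1 = n_tp, df2 = N - n_tp, lower.tail = FALSE)

print(c(F = F_stat, p = p_value))
```

For the real data the same lines would be applied to each gene's row of the master sheet, with N set from the number of chips for that strain.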
While the Benjamini-Hochberg correction is our main focus because it is less conservative than the Bonferroni correction, both were calculated to compare the resulting numbers of significant changes in gene expression. The Bonferroni correction was calculated by multiplying each p-value by the total number of hypotheses, in this case the 6189 genes being expressed. The Benjamini-Hochberg correction was calculated by first sorting all of the p-values from least to greatest, then multiplying each by the total number of hypotheses and dividing by its position in the sorted list.
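Both corrections can be reproduced in a few lines of R. The sketch below uses five made-up p-values instead of the 6189 real ones, and the variable names are hypothetical. It computes the Benjamini-Hochberg values as described above and then compares them with R's built-in p.adjust(), which additionally caps values at 1 and forces the adjusted values to be non-decreasing in the sorted order.

```r
# Made-up p-values standing in for the per-gene p-values.
p <- c(0.001, 0.04, 0.03, 0.2, 0.0005)
m <- length(p)   # total number of hypotheses

# Bonferroni: multiply by the number of hypotheses, capped at 1.
bonf <- pmin(p * m, 1)

# Benjamini-Hochberg as described in the journal:
# sort ascending, multiply by m, divide by the position in the sorted list.
ord       <- order(p)
bh_sorted <- p[ord] * m / seq_len(m)

# Put the corrected values back in the original gene order.
bh_manual      <- numeric(m)
bh_manual[ord] <- bh_sorted

# Built-in versions for comparison.
bonf_builtin <- p.adjust(p, method = "bonferroni")
bh_builtin   <- p.adjust(p, method = "BH")

print(data.frame(p, bonf, bh_manual, bh_builtin))
```

Writing the corrected values back through ord keeps each value attached to its gene, so no separate re-sorting step in a spreadsheet is needed.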

### October 25th, 2011

The STEM clusters were obtained using the STEM package and the instructions located under Assignments: Week 12. Wildtype, dGLN3, and dCIN5 were the three strains examined; dHMO1 had an absence of genes in this case, and the uncorrected values are to be used in obtaining this profile. During the lecture, Dr. Fitzpatrick asked if I had gone over the Hill plots yet, and after thinking about it I suggested using these plots to estimate the cooperativity, or weight, between each transcription factor. This will have to be elaborated on in the future. The master data was also formatted so that an extra sheet at the end summarizes all of the values we were looking for. The sheet was ordered as follows: first, the index; second, the Gene Symbols; third, the normalized log fold changes obtained through R; fourth, the average log fold changes; fifth, the F statistic, the calculation of which is described above on October 4th as well as in the Dahlquist Dry Labs protocol; sixth, the p-values; seventh, the Bonferroni corrections; and eighth, the Benjamini & Hochberg corrections. Each of these columns was in turn grouped by strain, so that these eight groupings exist for the wildtype, then dCIN5, then dGLN3, and so on. While all of the columns up to the Bonferroni corrections were sorted in R, the Benjamini & Hochberg corrections were sorted in Excel. This poses a problem because Excel and R rank symbols, such as a hyphen inside a Gene Symbol, differently. This can be fixed with a simple step: once all of the data, not including the B&H corrections, is in the new sheet with the Gene Symbol column, sort that column from A→Z. This lets Excel sort the hyphenated symbols, with the values associated with them sorted along. Once everything is sorted properly, the B&H corrections can be added to the Excel sheet.

## December

### December 6th, 2011

Most of the day was spent working through a MATLAB tutorial on the website <http://www.math.ufl.edu/help/matlab-tutorial/matlab-tutorial.html#SEC5>. Using this tutorial, a previously prepared MATLAB script was examined to better understand it. This script is to be used in creating a network of significant genes, as well as a network using the transcription factors. The first step in this process, however, is troubleshooting the code. Suffice it to say, this took a majority of the time due to MATLAB's "user-friendly" coding and layout. At the moment I am trying to create a matrix using several lines of code:

    for i = 1:size(ind2)
        outarray{i,1}=b{1+ind2(i),7};
        outarray{i,2}=IDX(i);
        outarray{i,3}=A{ind2(i),1};
        outarray{i,4}=B{ind2(i),2};
        outarray{i,5}=A{ind2(i),3};
        outarray{i,6}=A{ind2(i),4};
        outarray{i,7}=A{ind2(i),5};
        outarray{i,8:12}=C{IDX(i),1:5};
    end

This block of code is currently being tweaked. The first change from the original section is that outarray{i,4} should read from A rather than B:

    outarray{i,4}=A{ind2(i),2};

The second is that the whole group of outarray{i,#}=A lines can be condensed into a single line:

    outarray(i,3:7)=A(ind2(i),1:5);  % parentheses copy a block of cells; braces cannot assign several cells at once

The third piece of code I am currently tweaking concerns A itself. Originally A was a numerical matrix that would not group together with the rest of the columns. This was a problem, so A is now created as a cell array with:

A = cellstr(num2str(q(:,1:5)));

While this did create a cell array in MATLAB, it did not fully separate all of the columns, so several iterations of this code are being tried. At the moment this seems like the correct piece of code to use; however, when trying to split the columns up and place them individually, an error stating incorrect dimensions is produced and no further progress can be made.

Nicholas A. Rohacz 20:34, 6 December 2011 (EST)