# Difference between revisions of "Computational Journal"

(→Week 9: added addition to july 10 and the july 11 journal entry) |
m (→addition) |
||

Line 789: | Line 789: | ||

[[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 20:06, 10 July 2012 (EDT) | [[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 20:06, 10 July 2012 (EDT) | ||

− | ==== | + | ====Addition==== |

The results still did not turn out too well, we are changing some of the parameters in hopes that we can take better step sizes and get a best estimate. | The results still did not turn out too well, we are changing some of the parameters in hopes that we can take better step sizes and get a best estimate. | ||

[[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 20:33, 11 July 2012 (EDT) | [[User:Nicholas A. Rohacz|Nicholas A. Rohacz]] 20:33, 11 July 2012 (EDT) | ||

− | |||

===July 11, 2012=== | ===July 11, 2012=== |

## Revision as of 17:33, 11 July 2012

## Contents

# Summer 2011

## Week 1

### May 20, 2011

After attempting to find a match between each .gpr file and the individual LogFC columns in the original raw data some problems occurred. A single .gpr file was chosen to run a few tests on to try and match with one of the columns. The file that was selected in this case was t15/t0 for flask 3 of the Dahlquist wildtype strain. Flask 3 was chosen for a dye swap of Red/Green to Green/Red, however, when computing the LogFC for each spot on the chip, it was discovered that the LogFC was performed using the Red intensity divided by the Green intensity. This will have to be confirmed with Dr. Dahlquist to see if this should be changed at all or if the Excel sheets obtained from May 19th, 2011, a journal for this day does not exist. However, using Red/Green for the fold change, a column was obtained that matched the LogFC originally performed in the .gpr file. The next step was to try and perform this in the statistics processing software R. The original code given was changed to focus on the median, due to the fact that R was formatted to process the mean foreground intensity over the median background intensity. Once this was done, R successfully created a list of LogFC, of which the first five were focused on to compare to the original .gpr file. Of these five, two were duplicates giving three individual genes to be focused on in the file. The first five data points given by R matched up with the first five genes given in the .gpr file, so we know that this was functioning correctly. The main problem that arose was that when the LogFC given by the .gpr file and R and the LogFC found in the raw data were compared, an inconsistency occurred. These numbers did not match, and after running a seperate test on the .gpr file for t30/t0 for the same culture, the same problem occurred.

## Week 2

### May 25, 2011

Spent the morning trying to match up the Log Fold Changes before global normalization and after global normalization from the .gpr files. Ideally the graph should have a linear trend line if global normalization had subtracted the same constant from every spot. However, some of the files output a fuzzy region around the origin, this can be seen in the separate powerpoint with the graphs. The individual red/green intensities were graphed next. The normalized red intensities was graphed against the non-normalized red intensities and the same for the green intensities. These graphs should all be a linear line as well, however some awkward points occurred. The fuzzy region from the LogFC graph is thought to be a resultant from the "error" messages that are scattered throughout the .gpr files. These are a result of the intensity ratio being negative and the Log of any negative number does not exist. The number of "error" messages in each file seemed to correlate with the fuzzy section around the origin, as the number of "error"'s went up, the fuzziness increased to an extent were it almost was not linear anymore.

### May 26, 2011

This morning was spent troubleshooting a line of code given by Dr. Fitzpatrick, M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean), meant to average the duplicate spots in the .gpr files after they were locally normalized using loess. We had to come back later fix this code using a for loop so that R would normalize all of the .gpr files instead of the one we had specified, 5 given that that was the first column with gene names when we were working on this yesterday. However, this was a .gpr file so the first four columns were just 1's, were as the MA data frame that was created did not incoporate these columns into the data frame. Once this was all figured out, I went to lunch and to check on the S. cerevisiae cultures (see lab notebook for details). Once we got back, we spent the remainder of the time trying to make sure the loess normalization worked, after talking with Dr. Fitzpatrick for a bit we switched our determined this was best done by graphing the normalized vs. nonnormalized log fold changes, after they had both been averaged using the tapply function. We spent a short amount of time trouble shooting this, mostly clerical errors when typing the source code, after which we were able to graph the relationship, which came out to be a relatively linear line, showing that loess normalization was working correctly. With this we are able to go back and do the loess normalization on the data given by the rest of strains.

### May 27, 2011

We ran into a few problems at our first attempts at running the code even though we had had successful previous attempts on the previous day. The main problem we found was for the first for loop, the MM must have the [,i] afterwards, or else the output sheet will only be one column, and not all the columns that were put into the loop. After a few more typos that were taken care of this was the outcome:

par(mfrow=c(2,1)) targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RG<-read.maimages(targets, source="genepix.median", wt.fun=f) plotMA(RG,main="before within array normalization") MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:20) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)} write.table(MM,"wt_M_avg.csv",sep=",") plotMA(MA,main="after within array normalization") MAB<-normalizeBetweenArrays(MA,method="scale",targets=NULL) b0<-length(MAB$M[1,]) N2<-tapply(MAB$M[,1],as.factor(MAB$genes[,5]),mean) b1<-length(M1) MMB<-matrix(nrow=b1,ncol=b0) MMB[,1]=N2 for(i in 1:9) {MMB[,i]<-tapply(MAB$M[,i],as.factor(MAB$genes[,5]),mean)} write.table(MMB,"between_array_norm.csv",sep=",")

The point of all this code is to take the .gpr files that we were given, put them into the R statistics database, create an MA plot for before normalization and then normalize them using the locally weighted scatterplot smoother (Loess), note the e in loess may stand for estimated seeing as how the we in lowess stands for weighted. After this was done the data was placed into a matrix, because R has trouble translating a MAlist to a data frame, this matrix was then averaged for every duplicate gene using the line of code discussed on May 26th. This was all written into an excel table and saved along with the accompanying picture of the MAplots. This data was then put through a between array normalization, placed into a separate matrix, boxplots were done of the before and after between array normalization, these were all saved. Some notes on the code is that all the for loops must be changed for the number of .gpr files that are put into the target text file. Work is still progressing on the GCAT chips, first of all, the flask numbering for the GCAT chip as well as flask 4 must be double checked because of some discrepancies as well as a want to not have to run all the data twelve times. However, I was able to place both targets onto R, normalize them, and merge them into a single MAlist. This can be done for the RG list before, at the moment, I am only testing this code to make sure it will accomplish what I need by testing segments and checking the attributes of the outcomes. Once the flask discrepancies can be sorted out with Dr. Dahlquist, a majority of the "New Master List" can be assembled and the GCAT can be focused on from there.

## Week 3

### June 1, 2011

How to use Microsoft Access to match tables:

- Open microsoft access
- Make sure all columns are labeled.
- Save desired files as tab delimited text files.
- File -> Get External Data -> Import your text files.
- Choose ID's to be Field Name
- Check first box for first row is field names
- ID column can be text and is ok with duplicates, will kick it back out if it doesn't like it.
- All other columns should be numbers
- Repeat for other text file.
- Choose own access key
- Create a new query and add both text files
- Click and drag from one ID field name to the other
- Right click on line and choose join properties
- Choose option three, to keep all "Ontario" data and add only new "GCAT" data, arrow should appear showing relationship.
- Select all the field names and click and drag them to Field on the Query, it will automatically assign them.
- Query -> Make-Table Query and name your table
- Click run to get output, select all and copy and paste values into an excel file, make sure format is correct

Spent today working through the GCAT chips, once this all got finished, Dr. Dahlquist showed us how to us Microsoft Access to combine the GCAT and Ontario chips, as well as how to ignore any extra genes on the GCAT chips. After this I spent the afternoon working on constructing a powerpoint with all the MA plots and boxplots as well as a process description and the code. The validation graphs still need to be graphed and placed on the spreadsheet, I will try and get this done tonight so it is ready for June 2nd.

### June 2, 2011

Spent the day going over our results with Dr. Fitzpatrick to fully understand them and make sure everything was scaled correctly. We discovered that the MA plots that were originally outputted by us through R are in fact only the first .gpr file for each strain. As we went back to fix all of those and get all of the graphs on the same page. This was solved by running a for loop for our MA plot maker:

for(i in 1:20) {plotMA(MA[,1])}

Next we decided to match up all the boxplots next to each other. This was a particularly tough time and upsetting once we made the discovery that we had to write out the whole code:

boxplot(MA$M[,1],MA$M[,2],MA$M[,3],MA$M[,4],MA$M[,5],MA$M[,6],MA$M[,7],MA$M[,8],MA$M[,9],MA$M[,10],MA$M[,11],MA$M[,12],MA$M [,13],MA$M[,14],MA$M[,15],MA$M[,16],MA$M[,17],MA$M[,18],MA$M[,19],MA$M[,20])

We made a correction to this code, changed the MA$M to an MAB$M in my case, to plot out the after within array normalization MA plots. After this we attempted to go through the "multtest" tutorial that Dr. Fitzpatrick had given us but ran into a few problems installing the packages. These problems persisted even after we thought we had solved them all, however, Dr. Dahlquist should have the mean to help us install these packages. The last piece of code that I generated was the boxplots for the before within array normalization data so we could see the centering effects. This was probably the most annoying code I have written, thankfully I only made one error which was easily corrected. The code is:

boxplot(log2((RG$R[,1]-RG$Rb[,1])/(RG$G[,1]-RG$Gb[,1])),log2((RG$R[,2]-RG$Rb[,2])/(RG$G[,2]-RG$Gb[,2])),-log2((RG$R[,3]-RG$Rb[,3])/(RG$G[,3]-RG$Gb[,3])),-log2((RG$R[,4]-RG$Rb[,4])/(RG$G[,4]-RG$Gb[,4])),log2((RG$R[,5]-RG$Rb[,5])/(RG$G[,5]-RG$Gb[,5])),log2((RG$R[,6]-RG$Rb[,6])/(RG$G[,6]-RG$Gb[,6])),-log2((RG$R[,7]-RG$Rb[,7])/(RG$G[,7]-RG$Gb[,7])),-log2((RG$R[,8]-RG$Rb[,8])/(RG$G[,8]-RG$Gb[,8])),log2((RG$R[,5]-RG$Rb[,9])/(RG$G[,9]-RG$Gb[,9])),log2((RG$R[,10]-RG$Rb[,10])/(RG$G[,10]-RG$Gb[,10])),-log2((RG$R[,11]-RG$Rb[,11])/(RG$G[,11]-RG$Gb[,11])),-log2((RG$R[,12]-RG$Rb[,12])/(RG$G[,12]-RG$Gb[,12])),log2((RG$R[,13]-RG$Rb[,13])/(RG$G[,13]-RG$Gb[,13])),log2((RG$R[,14]-RG$Rb[,14])/(RG$G[,14]-RG$Gb[,14])),-log2((RG$R[,15]-RG$Rb[,15])/(RG$G[,15]-RG$Gb[,15])),-log2((RG$R[,16]-RG$Rb[,16])/(RG$G[,16]-RG$Gb[,16])),log2((RG$R[,17]-RG$Rb[,17])/(RG$G[,17]-RG$Gb[,17])),log2((RG$R[,18]-RG$Rb[,18])/(RG$G[,18]-RG$Gb[,18])),-log2((RG$R[,19]-RG$Rb[,19])/(RG$G[,19]-RG$Gb[,19])),-log2((RG$R[,20]-RG$Rb[,20])/(RG$G[,20]-RG$Gb[,20])))

## Week 4

### June 8, 2011

Spent the morning lecturing about statistics and the Family-Wise Error Rate(FWER) and the False Discovery Rate(FDR). After a quick intro to MTP we started working through R to get a better understanding of what MTP needed to process through R. This was probably the most tedious point of the day because the output in R consisted of thousands of NaN's(not a number). This was frustrating to say the least. Once Dr. Fitzpatrick looked at what was going wrong we were able to step through the errors and output the pvalues that we wanted, both raw and adjusted, for all the data as well as an average. We used this line of code to do this:

seed<-99 group<-c(rep(0,3)) m1<-MTP(X=matrix,Y=group,test="t.onesamp",standardize=FALSE,B=100,method="?")

The method still needs to be changed for the FDR, without specifying it defaults to one we don't want. The matrix object in this line of code is just the GPR files saved as a CSV excel sheet and written into R. The test also needs to be one sample otherwise it defaults to a two sample t test. See the help section in R or type in ?MTP and look at the examples for more help. Tomorrow will be the real test to see if we can get this to work for the rest of the data points, we also need to compare the raw pvalues that are output with the calculated pvalues from our new normalized master spreadsheet to see how they correlate.

### June 9, 2011

Spent most of the day waiting for R to output the adjusted p-values that we wanted. This took a good deal of time because there are 6000 different genes to calculate the p-value for. We adjusted the number of iterations so that we could get a more accurate result, because with a B=100, the p-values that come back are only shown up to two decimal places, while a B=1000 gives you three decimal places. Once all of these were calculated by R, we placed them into a spreadsheet to compare them to the p-values we calculated by hand. After this was done, I spent some time trying to graph the results and compare the number of rejects from each type of adjustment, I managed to get a graph but I am still researching to see if there is a way to specifiy our data some more and manage to have a better graph. Another thing that I need to research is the use of the TPPFP and the gFWER adjustments. As to what I need to learn, I need to figure out where they are applied, exactly why and in which situations, and how to implement these into R. Hopefully the materials sent by Dr. Fitzpatrick and the paper I need to read for Journal Club will give me a better understanding of what is going on.

### June 10, 2011

Spent the day comparing the different MTP results using different adjustment methods. Attempting to see a correlation of the data however none has been seen yet. This may be due to the methods, however another and possibly more likely cause is that our calculations are not right. We need to make sure we are comparing the same strains, the same time points, and the same methods, otherwise there is a high possibility that no correlation will be seen between the excel calculated data and the R calculated data. After seeing little to no relationship between the data, we started compiling a spreadsheet consisting of the 15 genes/transcription factors that we are focusing on along with all their time points for every strain and the averages and standard deviations for every time point. After going through some of the data some errors were found, one we need to go back and refill our GLN3 data because we started out with only a fraction of the GPR files. We also need to compare the CIN5 values we were using to test our methods today with other CIN5 data, we managed to compare it to wild type data instead and took that as no correlation.

## Week 5

### June 13, 2011

Spent the morning going over the Matlab model that I did this previous semester. After this large review we started making the individual input sheets for Matlab from the yeast regulatory sheet we made on Friday. After a few minutes of this we discovered a mistake in GCAT's in the Master data sheet. So all the Matlab files had to be erased and re-pasted with the corrected data. While going back and fixing the GCAT data we discovered a few more errors in the number of replicates for ZAP1 as well as extra rows due to the controls still being present in the spreadsheet. Just to make sure that nothing was wrong with the rest of the data, we went back and re made the master spreadsheet by combining all of the individual files we got from R. Once all this was done we had to re make the yeast regulatory network spreadsheet, with all of the averages and standard deviations. After this we were able to make the Matlab input sheets for all of the deletion strains. We will start on the Matlab tomorrow morning.

### June 14, 2011

Spent the day reviewing the yeast regulatory network and the Matlab simulation that I focused on this past six months. After making another powerpoint with all of the expression graphs that Matlab outputs for each gene in the network, we went on to graph the relationship between the lse and the norm values at different alpha levels. The Matlab iterations took up most of the time for the workday so I did not spend a lot of time on the actual code, and the concepts were mostly a review for me.

### June 15, 2011

This morning consisted of running more Matlab simulations to try and find the switching point on the lse vs norm graph. The wild type graph produced a smooth curve, however, the dGLN3 graph had a awkward jump at the alpha value of 0.002. We chose this point to focus on for more statistical analysis. The afternoon was spent finding the relative error between each alpha value and 0.002, this was done for the optimized production rates, optimized weights, and the optimized thresholds. Having to do the relative error for every alpha value in all three categories caused this step to take the rest of the day.

### June 17, 2011

Spent the morning comparing the estimated production rates, network weights and network thresholds with the seperate p-values that we simulated the other day. After comparing the graphs and seeing that an alpha value of 0.002 was a good choice and that the trends in the graphs are similar, we decided to re-build the regulation networks and run the simulations with them. The afternoon consisted of building the 10 different excel files, one for each strain and 1 for both the old regulation matrix and the new one with the genes in the strain names added in to see the effect. The first simulation took the rest of the day and the rest will be done over the coming week.

## Week 6

### June 20, 2011

In the morning we convened with Dr. Dahlquist to work out the details in the Matlab models that we are going through this next week. As we made the gene regulation networks in Yeastract we decided to figure out where Yeastract was getting these references for the relationships between the genes. As we went through the references given on the website, Dr. Dahlquist asked us to figure out what the different options in the gene regulation matrix menu were. The two options were to test with direct or indirect evidence, or to test for the number of binding sites on the promoters, anywhere from 1-9. The latter gave us less and less as the number of binding sites to specify increased, so we knew this wasn't right. We talked with Dr. Dahlquist about the direct and indirect options to decide on which to use, they both compare the given transcription factors and genes to a dataset, and decided that a direct set of matrices and a direct & indirect set of matrices were the best option. However, one thing we are still working is exactly where they are getting this dataset from. We have started searching the SGD, or Saccharomyces Genome Database, with any luck we will find it in here. We spent the afternoon on something different than the reference checking.

We put together more data sheets for the direct and the direct & indirect matrices. The first five matrices for the direct evidence matrix had already been put together, however, Dr. Dahlquist wanted us to check on the AFT1 synonym, RCS1, that we had been using in our list of genes and transcription factors to see if this would provide a change in the original direct matrix. This change was evident so we had to go back and recreate all of the original datasheets we had done by changing RCS1 to AFT1 in the list and reforming the matrix in Yeastract. Once these had all been corrected we spent the rest of the day creating the old 21 gene regulatory network matrix so that we could create the target excel sheets to put in Matlab. These corrections and the formatting for these sheets took the remainder of the day, we should start the simulations tomorrow in the Math PC computer lab.

## Week 8

### July 5, 2011

Spent the morning going over the steps very thoroughly with Dr. Dahlquist and Dr. Fitzpatrick, so that we could iron out all the discrepancies in our process and have the best chance of avoiding a typo error in our code. We also started compiling the individual codes and tutorials for the GCAT and Ontario chips as well as the processes to output their graphs, both boxplots and MA plots. After going through and matching up our code so that we were as sure as possible that we would get consistent data we went through R and started outputting all of the data. However, we started finding some problems as we went through the code. Mainly that Katrina and I were getting different values, but only different by values of E-10. While this is not a huge difference, one would expect the same code and inputs to output the same data. The bigger problem came when we tried the process with the GCAT chips, of which Katrina was missing the GPR files and any segments of the code so I copied my own over to her flash drive. The same differences arose when we went through this data and this started to make me feel a bit uneasy, it got worse when I tried outputting my own data to see if R was just not computing each run the same, however the output I got matched up perfectly, in value and sign, to the first run of the data. These differences got even larger after between array normalization where the value of the differences increased to E-04. We took our time going through the code to ensure that no typos were present. We also checked the dye swap csv file and the target files to see if we had them all in the correct order, which we did. Currently we cannot find out where this problem is coming from, hopefully Dr. Fitzpatrick or Dr. Dahlquist will be able to weigh in and help. The code for the GCAT is as follows, the code for the Ontario chips will be located approximately 10 or so spaces below. The process for between array normalization is only located on the GCAT code. I added some descriptions of the steps just to break up the code so it is not overwhelming to look at.

>Read the top and bottom chips into R separately and rbind them >to model the data.frame after ontario chips, 14000 spots in each column

targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RG<-rbind(RT,RB)

>normalize within each array

MA<-normalizeWithinArrays(RG,method="loess", bc.method="normexp")

>tapply an average function to average duplicate spots so that only unique >spots remain in the data

M<-matrix(nrow=6403,ncol=9) for(i in 1:9) {M[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,4]),mean)}

>write the table to put through microsoft access

write.table(M,"GCAT_WA.csv",sep=",")

>read csv file back in once all logFC are in a single sheet, model after master data >then normalize between arrays

B<-read.csv(file.choose(),sep=",") BX<-as.matrix(B) BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

>write out as a table

write.table(BA,"Master_data.csv",sep=",")

- Ontario Code

>Read Target Files into R and Loess Normalize

targets<-readTargets(file.choose()) f<-function(x) as.numeric(x$Flags > -99) RG<-read.maimages(targets, source="genepix.median", wt.fun=f) MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp")

>Average Duplicate spots so that each GPR has only unique spots left

M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>Read dye swap and perform dye swap on ontario chips

ds<-read.csv(file.choose(),sep=",") MN<-matrix(nrow=6191,ncol=94) for(i in 1:94) {MN[,i]<-ds[i,]*MM[,i]}

>Write file into excel to delete controls (first two rows)

write.table(MN,"Ont_WA_ds.csv",sep=",")

## Week 9

### July 11, 2011

Today was spent researching how to create an index in R consisting of the gene names or just a simple numerical index, so that as the data moves through R, the labels are still present after each major step. So far I have managed to add a column of names and a row of headers to a data frame. There are still some problems arising as I move through. One being that a matrix converts every cell to a character instead of keeping the distinction of a character versus a numeric. The next major step is finding a line of code to keep these indices connected to their corresponding cells in R, we have found a few examples but have nothing to post on R for these at the moment, we are still working through getting the correct format down. One thing I might want to check on later is keeping everything as a matrix, because some functions have problems with data frames, however we may have moved past this problem with our change in the scale normalization step. We have changed the line of code from:

BA<-normalizeBetweenArrays(BX,method="scale",targets=NULL)

to:

for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

More research will need to be done to compile a code for the new format of target files we have, that being a csv file with the corresponding columns:FileName,Header,Strain,TimePoint,Flask,Dyeswap. This is so that everything we need is present in R so that we won't make any human errors by adding typos to our data. Here is the compiled code I have so far:

>Read the target file, and name file, separate the index, dye swap, and row ID's Targets<-read.csv("New_Target.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] Index<-Names[,1] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:15) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M1) MN<-matrix(nrow=n3,ncol=n2) MN[,1]=M2 for(i in 1:15) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),]

bn<-readTargets("matrix_names.txt") b0<-length(MP) b1<-length(MP[,1]) BH<-matrix(nrow=b1,ncol=b0,dimnames=list(bn)) for(i in 1:15) {BH[,i]<-MP[,i]/mad(MP[,i])}

### July 12, 2011

Spent the morning working through the Ontario code and fixing a few bugs, we managed to get the names to stick and so we moved onto the GCAT chips. Katrina worked on getting the GCAT chips through R using the new segments of the code worked into the original. After reformatting the entire code a bit we have compiled a mostly finished segment of code however there is still one part to work out. That is getting R to perform what microsoft access did, that is get rid of all the GCAT spots we don't want, without having the actually touch the data yet. Katrina has been working on subsets for the remainder of the afternoon, while I was going through on my own to try and compile subsets I stumbled across a function that may do exactly what we want. That is the function, merge(), will combine two data frames and match up the cells based on criteria you provide. However, this is where the hard part comes in, which is figuring out how to tell R the criteria. I have been getting an error, that tells me that the memory required for the outcome cannot be allotted in R. I have tried using less gpr files for the Ontario chips, seeing as how I was using all 94 originally, as well as saving the workspace so that the data is saved under in a separate folder. However, none of these have seemed to work yet, trying to fix the memory or R to allot more memory may be my best bet at this point, this will be explored more tomorrow, for now I will post the compiled code thus far: Note: The last four lines of code are currently being tempered with, they are not permanent and may not be present next time.

>Read the target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) MM[,1]=M1 for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) MN[,1]<-M2*ds[1]/mad(M2) for(i in 2:94) {MN[,i]<-ds[i]*MM[,i]/mad(MM[,i])}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row MP<-MO[-c(1,2),] write.table(MP,"ONT_07122011.csv",sep=",",col.names=NA,row.names=TRUE)

>switch to GCAT directory and start on the GCAT chips

targets<-readTargets("Top.txt") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets("Bottom.txt") f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RGG<-rbind(RT,RB)

>add headers and row names for GCAT

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[,2] Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) RR[,1]=R1 for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

NAM<-as.data.frame(Names[,2]) Q<-merge(P,GD,match(P$row.names,GD$row.names),all.NAM=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),by.MP=row.names(MP),by.GD=row.names(MP),all.MP=all)

Z<-merge(MP,GD,match(MP$row.names,GD$row.names),all.MP=all)

### July 13, 2011

Continued on with the merge() code this morning to figure out the secondary commands and try and get it to work. After browsing through a few hundred emails, forums, and instruction manuals, I managed to find an example that showed how to work out the parts I was still having trouble with. My next problem came trying to get only the Ontario IDs on both the Ontario and GCAT chips. This is done by adding the all.x=TRUE line to the inside of merge(). However, R seemed to just ignore the command and I could not figure out why. After trying a few examples to try and make it work, I moved on to creating a subset of the data frame that consisted of only these IDs. Katrina had a doctors appointment so it took us a while to do a comparison because I had to wait for her to come back. After a simple typo that lead us to thinking our outputs were different in only the GCAT chips, we managed to confirm both of our methods were doing the same thing. As to which one we will us I do not know at the moment. The last hour of the afternoon was spent fixing up our presentation for journal club tomorrow. As usual I will post my compiled code at the end, this code is correct and as long as the input files are consistent then everything should run smoothly. The only break that should occur is done by the programmer when they need to switch directories. This can be managed by paying attention to the headers posted above the code, they should list when the change occurs. Here is the code:Note: I originally had a hard delete to get rid of the ONT controls, I will be working in a subsetting procedure to do this instead.

>Read the ONT target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap and scale normalization in the same step

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row ont1<-subset(MO,Ontario_ID!="Arabidopsis") ont2<-subset(ont1,Ontario_ID!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

targets<-readTargets("Top.txt") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets, source="genepix.median", wt.fun=f) targets<-readTargets("Bottom.txt") f<-function(x) as.numeric(x$Flags > -99) RB<-read.maimages(targets, source="genepix.median", wt.fun=f) RGG<-rbind(RT,RB)

>add headers and row names for GCAT

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[,2] Grow<-GNames[,2]

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

>Deleting unnecessary GCAT IDs using R

Q<-merge(MP,GD,by="row.names",all=T) Z<-subset(Q,Q[,1] %in% Names[,2]) x0<-length(Z[1,]) x1<-length(Z[,1]) XX<-matrix(nrow=x1,ncol=x0) XX[,1]=Z[,1] for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)} write.table(XX,"Master_07132011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

### July 14, 2011

Still working out all the kinks in our code so that when the final data sheet is outputted it will be the master data sheet, that is, without all of the averages, tstats and pvals. Everything has been solved up to keeping the column and row names through out R. The last step we are on is ordering the flasks in the correct order. So far we have developed a hard sort, where we tell R where the columns are going to go. In theory this is fine, however, we are trying to develop a way for R to do this on its own so that we can minimize human error. This is for in the future when more chips are added so that the hard sort does not have to be corrected or added to every time a new deletion strain is added to the data set. This is not as big a problem as previously thought because only the GCAT chips are out of order so only they would have to be changed, or even ignored if the GCAT chips are 1-9 and the Ontario chips are 10-..., this way the strains would only have to be added and the final number would have to be adjusted for the total number of chips being loaded into R. There is one package I found, gregmisc, that contains a function matchcols() and is supposed to match up the columns of one data frame based on the criteria of a vector. I am trying to work out the problems with this code to see if it is even a viable option. The other problem with this code is that it was developed using 2.13.1 meaning that some previous function being called could have changed and our data is no longer consistent. To determine this I need to compare the within array data between versions 2.7.2 and 2.13.1, the between array step has already been removed so this is no longer a problem. However I have a financial problem so this must be taken care of tomorrow. As usual, here is my compiled code, note: the last few lines with no description above them are still being worked with:

>Read the target file, and name file, separate the index, dye swap, and row ID's

Targets<-read.csv("Targets.csv",sep=",") Names<-read.csv("ONT_Index_ID.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) ds<-Targets[,6] row<-Names[,2] col<-Targets[,2] RG<-read.maimages(Targets[,1],source="genepix.median",wt.fun=f)

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

>Assign MAList to a data frame and assign col and row names >delete first two rows as the are controls

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row ont1<-subset(MO,row.names(MO)!="Arabidopsis") MP<-subset(ont1,row.names(ont1)!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

targets<-read.csv("GCAT_Targets.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f) RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f) RGG<-rbind(RT,RB)

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[1:9,2] Grow<-GNames[,2] GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

>Getting rid of the microsoft access step and disposing of unnecessary GCAT IDs using R

Q<-merge(MP,GD,by="row.names",all=T) Z<-subset(Q,Q[,1] %in% Names[,2]) x0<-length(Z[1,]) x1<-length(Z[,1]) XX<-matrix(nrow=x1,ncol=x0) XX[,1]=Z[,1] colnames(XX)=colnames(Z) for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}

XY<-XX[,c(1,100,2,3,4,96,101,5,6,7,97,102,8,9,98,103,10,11,12,99,104,13,14,15,16:95)]

XZ<-read.csv("Ordered_headers.csv",sep=",") XV<-as.vector(XZ) GW<-matchcols(colnames(XX),with=c(as.character(XZ)),method="and") C<-grep(XZ[,1],colnames(XX),ignore.case=FALSE,fixed=TRUE)

F<-sort(XX,by=colnames(XZ))

write.table(XY,"Master_07142011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

>optional within array data

write.table(Z,"within_array_07142011.csv",sep=",")

### July 15, 2011

This morning consisted of finishing up the ordering of the chips once they are in R. After going through my code separately to make sure everything worked we compared our outputs. To our relief, they all turned out to match perfectly, which means even though Katrina and I have different codes we still both get the same output. The afternoon was spent putting the code on openwetware, this is located under the Dahlquist Lab protocols page in the LMU section. A description of each step was added so now there are rougly as many comments as there are lines of code. Once this was done we went through and did the Pvals individually. We also did the Bonferroni correction to the p-values, which is done by multiplying each p-value by the total number of spots. We also did the Benjamini Hochberg correction, which is similar in that you multiply each spot by the total number of spots, but you then divide by the location of the spot in the index, i.e. the first spot is Pval x 6189/1, the second is Pval x 6189/2, and so on. After this was all done we had two options, to do the F stats and find the p-values with them, or to start working on the multcomp package. While Katrina did the F stats for her data, I worked on learning about two pieces of code, on located in the base package and one located in the multcomp package. The first was lm(), which fits the data to a linear model that you set, however to apply this to our data I will need to know what the model is. The second is glht() which is the general linear hypotheses, this is a much more detailed process and I am still working on getting through the pieces inside the parentheses. The other part I am working on this is the linfct() which is located inside the parentheses in glht(), these lines of code just require some time spent with them to figure out what they do and how to adjust them correctly for our data set. As usual I will post the compiled code, for now this seems to be the final version, and changes will be commented on later. Here is the code:

>Read the target file, and name file, separate the index, dye swap, and row ID's

>normalize within arrays

MA<-normalizeWithinArrays(RG, method="loess", bc.method="normexp") M1<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n0<-length(MA$M[1,]) n1<-length(M1) MM<-matrix(nrow=n1,ncol=n0) for(i in 1:94) {MM[,i]<-tapply(MA$M[,i],as.factor(MA$genes[,5]),mean)}

>dye swap

M2<-tapply(MA$M[,1],as.factor(MA$genes[,5]),mean) n2<-length(MA$M[1,]) n3<-length(M2) MN<-matrix(nrow=n3,ncol=n2) for(i in 1:94) {MN[,i]<-ds[i]*MM[,i]}

MO<-as.data.frame.matrix(MN) colnames(MO)<-col rownames(MO)<-row ont1<-subset(MO,row.names(MO)!="Arabidopsis") MP<-subset(ont1,row.names(ont1)!="3XSSC")

>switch to GCAT directory and start on the GCAT chips

targets<-read.csv("GCAT_Targets.csv",sep=",") f<-function(x) as.numeric(x$Flags > -99) RT<-read.maimages(targets[1:9,1],source="genepix.median",wt.fun=f) RB<-read.maimages(targets[10:18,1],source="genepix.median",wt.fun=f) RGG<-rbind(RT,RB)

>normalize within arrays and tapply the data to average all the "empty" spots

MAG<-normalizeWithinArrays(RGG,method="loess",bc.method="normexp") R1<-tapply(MAG$M[,1],as.factor(MAG$genes[,4]),mean) r0<-length(MAG$M[1,]) r1<-length(R1) RR<-matrix(nrow=r1,ncol=r0) for(i in 1:9) {RR[,i]<-tapply(MAG$M[,i],as.factor(MAG$genes[,4]),mean)}

>make data frame and attach GCAT row names and headers

Header<-read.csv("GCAT_Targets.csv",sep=",") GNames<-read.csv("GCAT_ID.csv",sep=",") Gcol<-Header[1:9,2] Grow<-GNames[,2] GD<-as.data.frame.matrix(RR) colnames(GD)<-Gcol rownames(GD)<-Grow

>Getting rid of the microsoft access step and disposing of unnecessary GCAT IDs using R

Q<-merge(MP,GD,by="row.names",all=T) Z<-subset(Q,Q[,1] %in% Names[,2]) x0<-length(Z[1,]) x1<-length(Z[,1]) XX<-matrix(nrow=x1,ncol=x0) XX[,1]=Z[,1] colnames(XX)=colnames(Z) for(i in 2:104) {XX[,i]<-Z[,i]/mad(Z[,i],na.rm=TRUE)}

XZ<-read.csv("Ordered_headers.csv",sep=",") XV<-as.data.frame.matrix(XX) XY<-XV[,match(XZ[,1],colnames(XV))]

write.table(XY,"Master_07142011.csv",sep=",",col.names=NA,row.names=TRUE,append=FALSE)

## October

### October 4th, 2011

To analyze the significant changes in our gene expression, p-values were calculated using an F-distribution. The first step to calculating this F-distribution is by calculating the sum of the squares of the null hypothesis, or SSH. The null hypothesis being that no genes experience any significant change in expression, therefore the population mean(μ) is 0. To calculate the SSH each genes log fold change was squared and summed over every flask i; i=1, 2, 3, 4, 5; and every time point j; j=t15, t30, t60, t90, and t120. SSH=Σ_{i}Σ_{j}(Y_{ij})^{2}. The second step is to calculate the sum of squares of the alternate hypothesis, or SSF, the difference from the SSH being that the hypothesis states there is at least one significantly changed gene. This is represented by subtracting the population mean from the log fold change before squaring it. The population mean can be calculated by averaging the log fold change for each time point, these values were subtracted from each log fold change for each gene at their respective time point. These values are then squared, and summed over every flask i, and time point j. SSF=Σ_{i}Σ_{j}(Y_{ij}-μ_{j})^{2}. Finally, the F-distribution is calculated by subtracting the SSF from the SSH and dividing by the SSH, then by multiplying this value by the number of flasks(F) subtracted from the number of trials(N) divided by the number of flasks. This will give you the F-distribution with degrees of freedom F, N-F. [(SSH-SSF)/SSF * (N-F)/F] ~ F(F,N-F). This F-distribution, F(F,N-F), can then be converted to p-values using the FDIST() command in excel with the degrees of freedom F, N-F. Note, the F value is 5 for every deletion strain and the wild type, while N-F will vary because the wildtype has 23 repetitions and not 20 like the deletion strains. While the Benjamini Hochberg correction is our main focus due to its robustness when compared to the Bonferroni, both the Benjamini and Hochberg and Bonferroni were calculated to compare the results of signifcant change in gene expression. The Bonferroni was calculated by multiplying all of the p-values by the total number of hypotheses, in this case the 6189 genes being expressed. The Benjamini and Hochberg was calculated by first sorting all of the p-values from least to greatest, and then multiplying by the total number of hypotheses and divided by its position it was sorted in.

### October 25th, 2011

The stem clusters were obtained using the STEM package and instructions located under Assignments: Week 12. Wildtype, dGLN3, and dCIN5 were the three strains that were examined, dHMO1 had an absence of genes in this case and the uncorrected values are to be used in obtaining this profile. During the lecture, Dr. Fitzpatrick asked if I had gone over the Hill plots yet, and in turn, I thought about it and suggested using these plots to somehow estimate the cooperativity, or weight, between each transcription factor. This will have to be elaborated on more in the future. The master data was also formatted such that an extra sheet was added at the end to summarize all of the values we were looking for. This order that the sheet was formatted in was this: first, the index; second, the Gene Symbols; third, the normalized Log fold changes obtained through R; fourth, the average Log fold changes; fifth, the F-distribution, the calculation of which is described above on October 4th as well as the Dahlquist Dry Labs protocol; sixth, the p-values, seventh, the Bonferroni corrections; and finally eighth, the Benjamini & Hochberg corrections. And in turn, each of these columns was grouped by their strain, such that these eight groupings exist for the wildtype, then dCIN5, then dGLN3, and so on. While the all of the columns up to the Bonferroni corrections were sorted in R, the Benjamini & Hochberg corrections were sorted in excel. This poses a problem seeing how excel and R rank symbols, such as a hyphen inside of the Gene Symbol. This can be fixed by a simple step, once all of the data, not including the B&H corrections, is in the new sheet with the Gene Symbol column, sort said column from A→Z. This will allow the hypenated terms to be sorted in excel, and allow the terms associated with these symbols to be sorted along with them. Once everything is sorted properly, the B&H corrections can be added to the excel sheet.

## December

### December 6th, 2011

Most of the day was spent using a MATLAB tutorial on the website <http://www.math.ufl.edu/help/matlab-tutorial/matlab-tutorial.html#SEC5>. Using this tutorial, a previously prepared matlab worksheet was examined to better this understanding. This script was to be used in creating a network of significant genes, as well as a network using the transcription factors. The first step in this process, however, is troubleshooting the code. Suffice to say, this took a majority of the time due to MATLABs "user-friendly" coding and layout. At the moment a matrix is trying to be created using several lines of code:

for i = 1:size(ind2) outarray{i,1}=b{1+ind2(i),7}; outarray{i,2}=IDX(i); outarray{i,3}=A{ind2(i),1}; outarray{i,4}=B{ind2(i),2}; outarray{i,5}=A{ind2(i),3}; outarray{i,6}=A{ind2(i),4}; outarray{i,7}=A{ind2(i),5}; outarray{i,8:12}=C{IDX(i),1:5}; end

This compilation of code is currently in the process of being tweaked. The first difference from the original section is that outarray{i,4} should read:

outarray{i,4}=A{ind2(i),2);

The second is that the whole section of outarray{i,#}=A can be condensded into a smaller line:

outarray{i,3:7}=A{ind2(i),1:5};

The third piece of code that I am currently tweaking with is that originally the A was a numerical matrix that would not group together with the rest of the columns. This was a problem so A is a cell array that is made from a:

A = cellstr(num2str(q(:,1:5)));

While this did create a cell array in matlab, this did not fully separate all of the columns so several iterations of this code are being used. At the moment this seems like the correct piece of code to use, however when trying to split the columns up and place them individually, an error stating incorrect dimensions is produced and no further progress can be made.

Nicholas A. Rohacz 20:34, 6 December 2011 (EST)

# Summer 2012

## Week 1

### May 14, 2012

Today was spent going through the new ANOVA code that Dr. Fitzpatrick sent to us. This code will be posted later on as revisions are made and a finalized version is reached. For now we are attempting to understand all of the lines of code. Some problems that we have run into are what is happening in the two strain comparison code that causes 15 columns of timepoints to be written in the output file. While the output file could be reproduced using the exact files Dr. Fitzpatrick used, the WildType, dGLN3 and his code, we began running into problems once the strains were changed, or some of the timepoints. Another problem that we ran into was when beta = X\Y <-note this is not division, this is a backslask matrix operator in matlab. The Y array ends up as one column, while the for loop it is located in calls for it to be computed for every gene. We are checking with Dr. Fitzpatrick for exactly what this is. The final problem we had with the code was that the first column of many of the simple ANOVA outputs for different strains stayed the same. That is the values for the whole column did not change once the strains changed, but the other columns in the output did. This also showed us that we were running into a problem, because none of the columns from a simple dGLN3 ANOVA output matched the two strain comparison output. Thus we were unable to identify the unknown columns mentioned above.

Nicholas A. Rohacz 20:36, 14 May 2012 (EDT)

### May 15, 2012

Katrina and I spent the day covering two tasks, the first was to combine all of the deletion strains into the two-strain comparison code so that we could analyze all of the strains at once, a side point to fix was to take into account any deleted values, arrising from the difference in genes between the GCAT and Ontario microarray chips being processed. The second task was to begin coding the stochastic model sheets, the first step being figuring out how to use the code {fminsearch} in matlab. While Katrina focused on the first task I began figuring out the coding process for optimizing some test code created a few weeks back labeled my-F and my-F-pt1. These were created to visualize the minimization process using some sample equations. Most of the day was spent dechipering the instructions left by Dr. Fitzpatrick. After I had to ask him about some of the inputs he wanted in the code, he gave me a better heading and I was able to make a little way, but the biggest problem I have run into is putting a function handle on the my-F function such that I can call it from the newly created sheet. While trying to balance this problem with my code, another problem began to arise, the coding began to stop recognizing some of my parameters, or began altering the outcome of my-F-pt1, which was working fine before this started. I am working on putting a handle on the function from my-F, or finding away to code around it.

Nicholas A. Rohacz 20:08, 15 May 2012 (EDT)

### May 16, 2012

Spent the day working out the fstochmin.m code. After attempting different ways for hours to get the equation connected to a function handle(@), I managed to work out the code. The fstochmin.m function is now being called by a separate script, my-F-pt1.m, which also graphs the minimization of the equation in my-F.m. The next step to focus on removing or ignoring the NaNs from the input file and indices such that the output will have all of the correct data points and none of the blank rows that have been resulting from the NaNs. Some ways around this are possibly using accumarray() or something similar, or to use the ignoreNaN.m function I found online. The latter will take some time to figure out how to tailor the function specifically for our needs. Other options are also being researched at the moment by Katrina. I have been leaving early for the past few days due to MCAT studying. The test occurs this coming thursday, the 24th, so after that these entries will be more complete.

Nicholas A. Rohacz 20:26, 16 May 2012 (EDT)

### May 17, 2012

Dr Fitzpatrick emailed me corrected versions of my code last night. No major changes, mostly just rearranging the code around. my-F-pt1.m is now the fstochmin.m function, this plots the minimization as it happens. This was the only sheet with a change to specify the theta values in their new role. The old fstochmin.m was renamed to my-f-driver.m, which just called on the functions and outputted the final values from fminsearch() in the fstochmin.m function. This script held the parameters to be called on in the function. The last function, new-F.m, simply held the function to be minimized, as well as any necessary parameters. The day was spent trying to remove the NaN values from the input data sheets when calcuting the F distribution and Pvals. This became increasingly more challenging as we discovered that matlab, in fact, sees NaN as a real value. While not a integer, it is still classes as being numeric, and thus is throwing off our code. While Katrina is trying to build new regression matrices for the two strain comparison code, I spent my time focusing on the ignoreNaN.m function I discovered. However, this function uses accumarray to piece together the indices, and applies a sum or prod function to the values. After being unable to find a replacement function for accumarray, I decided to help Katrina with debugging her current code and attempting to create separate regression matrices. We have had no luck yet and my MCAT books are calling my name. Some ideas for tomorrows work are to somehow interpolate the missing NaNs back into the Y matrix to somehow account for the empty spaces. Another is to specify exactly which columns are supposed to be used in the calculations for the beta values.

Nicholas A. Rohacz 19:57, 17 May 2012 (EDT)

## Week 2

### May 21, 2012

Today Katrina managed to get past our problems of having the NaNs from the GCAT chips delete the entire row of wildtype data for those specific genes. An updated version of this code will be posted once the two piece comparisons have been compiled for the all strain comparisons. Dr. Fitzpatrick gave me some more equations to work on in the stochastic model, on being a sum of squares equation and the other being a line fit of a least squares error (LSE) equation. The latter requires much more extra coding than the first equation, which involved solely changing the equation function in matlab. I am working on getting through this coding but my MCAT is this Thursday and I am freaking out mildly whenever I look at a practice test. Thus my studying will take priority so that I do not sit down for my test on Thursday and not have a single idea for any question because I am too nervous to think.

Nicholas A. Rohacz 18:04, 21 May 2012 (EDT)

### May 22, 2012

The day was spent looking over the estimation code that was used in last weeks of the class journals and assignments posted in my openwetware. Albeit, this version of the code has been revised several times, the outcome is still the same in that it maps out the gene network using a differential equation and the ODE45 solver in matlab. The first goal in this code is to partition out the general-parameters-estimation-function.m that was received into separate parts. These being part 1: the input parameters and xls spreadsheets, part 2: the least squares error (lse) function, part 3: generating the graphs, and part 4: generating the output xls files. So far only the first partition has been started, this is due to confusion that arose when going over the code. Much of this confusion was solved by Dr. Fitzpatrick telling us that we no longer use the threshold(tau) pieces of the function and instead are using the threshold(b) options throughout the function. The other part that was not removed by Dr. Fitzpatrick that was the process of "rolling" up the network matrix into a vector that, combined with a new thresholds(b) vector, could be used by the lse and other functions. Much of this code will be continued tomorrow and hopefully a few working partitions can be separated.

Nicholas A. Rohacz 20:37, 22 May 2012 (EDT)

### May 23, 2012 (MCAT DAY TOMORROW!!!!)

Today was spent revising the estimation code that we've been working on this week so far. Some of the major changes today consist of changing the input from the average Log2FC to the originals and averaging in matlab. This is important for later on in the LSE function. There are more changes to be done to this function, such as some changes to the model (possibly) and the averaged log2FC may be incomplete as is. The main problem with today is that I have my MCAT tomorrow and I have had trouble focusing all day. My plan is to come in this memorial day weekend and get some more work done. Hopefully I do well on my MCAT.

## Week 3

### May 28, 2012

Memorial Day, No Work.

### May 29, 2012

The day was spent catching up on a the changes that Katrina made to the code during the days I missed last week. Some of the changes were similar to mine but featured better ways to call up necessary inputs so I replaced much of my code with hers. The second part of the day before journal club was spent making minor tweaks to the code so that I could more easily understand it when I needed to find out certain parts or try and debug a problem. The final half of the day after our Vu and Voradhsky journal club was to continue to alter the code such that we could use it on multiple strains. Some of the major changes today include telling matlab when the deletions happen and altering a good deal of the code to handle all the timepoints and strains instead of just one strain at a time using the average Log2FC. As of now the plans for tomorrow are to debug Katrinas code, which is almost identical in function to mine with some minor difference but will not run all the way through like mine will, and to work on correcting the output because so far the optimized parameters are coming out as the same number continuously or as all 0's. Work on altering the code for multiple strains will also continue tomorrow.

Nicholas A. Rohacz 20:39, 29 May 2012 (EDT)

### May 30, 2012

The entire day was spent trying to debug a certain problem in the code. When we tried to optimize our parameters using the fmincon function, the input all checked out ok, along with everything else that had been called. But after the minimization, the optimized weights only contained seven different values, the rest of the values were same, about 0.00653. This is a problem because it means that our code is not using our entire dataset. Katrina and I spent the morning and afternoon trying to solve this problem but to no avail. Tomorrow I have my last day of traveling this summer during work, after that I will be here the entire time. I will be looking over this code on my computer because I have a version of MATLAB available on there.

Nicholas A. Rohacz 19:40, 30 May 2012 (EDT)

### June 1, 2012

Turns out that we were missing a line of code from our sheets. Even more disheartening, the line of code we were missing was present in Dr. Fitzpatricks original code he sent us. After that was all worked out in my absence, Katrina informed me of a problem with the dz output, which is the production part of the differential equation minus the degradation part of the equation. It turned out though that running the code in pieces gave different outputs while running it in sequence gave the same, even with the old code that Dr. Dahlquist and Dr. Fitzpatrick used originally. The remainder of the day was spent working in the multiple strain deletions for the code. The line of code that ended up working for us was known as strcmp(...) which allowed us to compare individual parts of an array with what we wanted. Katrina is working on getting MATLAB to identify the necessary sheets it wants. I went with the hard code route and just told MATLAB what sheets it was looking for. The hard coding worked for multiple deletions, however I can't get the weights to delete themselves with each deletion strain, but I can get the b values to delete themselves for each deletion strain. The output files also save for each individual deletion strain, allowing for comparisons. The next step is to get the weights to delete themselves with each deletion. I did discover that the way I have the code written, that I recall all the parameters, arrays and matrices each time I go through a successive loop, MATLAB doesn't stack the deletions on top of each other, such that the dCIN5 deletion is not present when it is going through any of the successive loops. While this may slow down the processing a little bit it helps with the problem previously stated. A good idea to look into would be to separate the calls for the parameters, matrices and arrays that need to stay the same with each deletion, and separate any that are changing. This will be looked into Monday.

Nicholas A. Rohacz 20:17, 1 June 2012 (EDT)

## Week 4

### June 4, 2012

The morning was spent working out how to get the deletions in the network of edges to occur step wise and get all the necessary outputs. It wasn't until lunch that we learned we might have being doing something entirely different. In all honesty I don't necessarily understand where we are going at this point and I will raise my questions to Dr. Fitzpatrick when I'm not dying from high blood sugar.

Nicholas A. Rohacz 19:50, 4 June 2012 (EDT)

### June 5, 2012

The morning was spent helping Dr. Dahlquist run the first microarray slides of the summer. The strain was BY4741 of S. cerevisiae, and in this case two extra timepoints were recorded, t5 and t240. After the microarray data was final outputted, we made our way back to UHALL only to realize that we had forgotten which slides matched with timepoints. On the way back up to Seaver again we made a wonderful discovery of a gathering of people trying to view the transit of Venus across the Sun in front of the building. After partaking in this experience I've been waiting for for the past week. Once notifying Dr. Dahlquist and the rest of our lab mates, we gathered the necessary information and made our way back to UHALL to normalize the data. Upon comparison between our two separate trials we found that our outputs matched and our processes were still ok, after a few minor tweaks to deal with the lack of GCAT chips and the different number of microarray slides we processed. Tomorrow we will have an impromptu meeting with Dr. Fitzpatrick about our degradation rate paper because we are actually going to calculate them and then back to the deletions and making the stochastic model.

Nicholas A. Rohacz 20:13, 5 June 2012 (EDT)

### June 6, 2012

This morning was spent going over the Journal Club paper on decay rates. Towards the end of our meeting with Dr. Fitzpatrick, he decided to take a look at another paper that was closer to what we were doing in lab, using DNA levels to measure gene expression, whereas the Grigull, old journal club, paper focused on mRNA levels to determine decay rates. It was then decided that we would switch our journal club paper and go over the new one tomorrow morning. The rest of my afternoon was spent fixng the code to output the multiple deletion strains. The biggest problem we are running into is that the output graphs are graphing the model estimated by ODE45 and the general network dynamics function. The plot line of code calls for the model outputted by the ODE45 function, however, we have not worked a switch in anywhere into the code, thus everything outputs correctly on the graph except for the blue model curve fitted to the data points. The next problem is that nowhere in this code does the general network dynamics function call for a switchable input, thus I am not sure how to get the deletion strains to run through that code. Dr. Fitzpatrick told me that if I put a loop after the general network dynamics function estimation and put the ODE45 line of code inside that loop is should work, but this ran into a problem that it was only looking at the wildtype data still. Much of the work this afternoon was debugging the function as new lines of code were added and new data structures were created to house all of the output data and input data in MATLAB. I am working on getting this to work, at the same time we now have a new journal club to read for tomorrow.

Nicholas A. Rohacz 20:09, 6 June 2012 (EDT)

### June 7, 2012

This morning I finished debugging the multiple strains code, with the help of Dr. Fitzpatrick, and managed to start getting outputs. My problem originally had been I was unable to identify the exact input for the ode45 function in MATLAB. It turns out though that the ode45 function picked its own data according to separate conditions, the one I was failing to specify yesterday was the deletions. After some quick debugging there everything worked and we managed to output graphs. This afternoon was spent assembling the graphs into a powerpoint. The format of the powerpoint was that each deletion strain was separated for the old and new code and both were compared to the multiple strain code. This means that the powerpoint had an astonishing 106 slide, 21 for each gene, x5 for each strain, plus the title slide. Tomorrow will likely be spent working on the stochastic model again to try and build the stochastic model for our experiment.

Nicholas A. Rohacz 21:26, 7 June 2012 (EDT)

### June 8, 2012

Today was spent debugging the code and working on our journal club while matlab ran simulations. Hopefully, the last problem we are having is with the outputs. When the model.mod is being written to the outputs, it is log2 fold changed to standardize the values. However, this is changing the values in the first row, for gene ACE2, and is not writing out the last row, for ZAP1. After numerous attempts at writing the output a separate way to try and maintain the values once they've been log2FC, I've narrowed down the problem to likely resulting from one of the loops being too nested in the graphs sheets. The likely places for this are in the LSE sheet. Another likely error to look at would be the loops in the Graphs sheet and see if they are rewriting the values or performing multiple log2 fold changes on the model.mod optimized values. This problem will be looked at over the weekend and on Monday, before and after journal club if I can't figure out the problem before that.

Nicholas A. Rohacz 20:10, 8 June 2012 (EDT)

## Week 5

### June 11, 2012

This morning I managed to fix the problem we had last week where the outputted log2 optimized fold changes were incorrect for some of the strains. The problem was that the dZAP1 strain was being saved as all of the outputs. Once this was specified to change, all of the optimized log2FC's were being outputted correctly. This afternoon consisted of journal club and a continuation of debugging the matlab code, along with starting the abstracts for the symposium this summer in Tennessee for the Society of Mathematical Biology. We are doing two posters, one on the deterministic model, and one on the stochastic model. There is a third poster, of all the statistical testing, but Nicki and Chiddy are going to handle that poster along with the new S. paradoxis information. The problem we are working on now is to get the code to output a single set of weights and bs from our multiple inputs, that works for all the inputs. Right now it seems as thought all the other optimized outputs match up with the wildtype optimized outputs, when running each strain individually using the new code that takes into account all the time points and replicates. It took a while to really understand that we are looking for this output, which seemed to be happening, but it probably won't match up with any of the individual outputs that we already are outputting. That is the main focus of tomorrow, to output a single optimized weights and bs that fit for every model. And by fit, we are looking for the differences between strains to not flip signs and have drastic changes in magnitude.

Nicholas A. Rohacz 20:42, 11 June 2012 (EDT)

### June 12, 2012

This morning Dr. Fitzpatrick emailed us a cleaned up version of the code I had sent him the previous evening. With the tune up, the outputted Wts and Bs looked better and didn't have as many drastic sign flips and magnitude changes. The All strain Wts also differed slightly from the Wildtype Wts which is better than when we were getting the same line. Whether this is what we are looking for or not is up to Dr. Dahlquist and Dr. Fitzpatrick but as of now it looks good. The rest of the morning was spent fixing the output sheet and renaming some of the variables that were deleated in the clean up. This afternoon was spent putting together some comparison graphs, and even developing a piece of code for the process. This came in handy because now I am performing some alterations to the base parameters, such as the alpha value, and some of the scaling taking place in the code. Work also began on the Stochastic Model abstract for the SMB symposium in Tenn. this summer. I have a general understanding of the stochastic model and what it is trying to represent, I just have a few questions for Dr. Fitzpatrick the rest of the experiement, i.e. whether or not its the same in most aspects, what parts are different, and some questions on the stochastic process.

Nicholas A. Rohacz 20:11, 12 June 2012 (EDT)

### June 13, 2012

Today was spent running alpha value comparisons. The first part of the comparisons was to get some scaling factors in the code for the penalty and the lse out values. The lse out values were scaled by 1/# of data points, that is every gene for every time point for every flask in every strain. This was under the variable name errormat, or the error function, in the general least squares error script and was scaled by the length of this array and not the actual values. The penalty was scaled by 1/# of free parameters, that is every parameter that was not fixed and therefore was changing throughout the estimation code. This was given by the variable named theta, and the penalty was scaled by the length of this theta array. Once these scaling factors were successfully incorporated into the code, different alpha values were tested to determine a "best fit". This "best fit" was determined by graphing the lse out vs penalty out to make an L curve. Where this L curve goes from a mostyl vertical slope to a mostly horizontal slope is where the optimal alpha value is. While the L curve for all strains is complete, the L curve for each individual strain is almost done. A problem arose at the last second where some of the values cause the curve to double back on itself, making the graph look like a hairpin. This is likely do to one of the original scaling factors being present in the code still for some, but not all of the output mat files. This just requires me to re run the code and obtain the correct outputs again, once this is done the graphs shouldn't take more than a few seconds.

Nicholas A. Rohacz 20:47, 13 June 2012 (EDT)

### June 14, 2012

This morning was short because I accidently slept through my alarms after the 3am growth curve shift. The rest of the day was spent looking at the abstract Dr. Fitzpatrick had drawn up, a majority of it was similar to the deterministic model, the rest of it was mainly what I had questions on before I began writing it. The other project for today was to test some extra alpha values, but closer to the 0.01 value we picked using the L curves, instead of the wide range, 1-0.0001, used in those curves. This took a while mostly because the individual strain code that Katrina and I had was actually a jumbled compilation of our work before we got the all strain estimation code running. This created a lot of cleaning up that needed to be done before the alpha comparisons could be made and sent to Dr. Dahlquist and Dr. Fitzpatrick.

Nicholas A. Rohacz 21:28, 14 June 2012 (EDT)

## Week 6

### June 18, 2012

Today was spent reading some papers that Dr. Fitzpatrick emailed us last Friday, which was spent putting together a ppt on all the work we had done up until now for journal club. Journal club took a good 3 hours today because our presentation was on a years worth of work. Once we got back, we decided to run some confidence tests to make sure the code is really doing what we want it to do. The first test was to run the new code for all the strains simultaneously, but instead using one individual strain at a time, along with using all of the means instead of the flask data. This was pretty quick, especially computationally, the only problem was that our newest version of the code did not have all the code it needed. And since we are testing the code and aren't allowed to change it, we can't be sure if this causes problems or not, for example, the standard deviations are calculated in MATLAB in the new code, but the old input format called for having a sheet with the sigmas present. Since this was so, the graphical output did not have the standard deviations present, and some other place in the code may have been affected. The next test was to make sure the the optimized model would output the same data if the optimized parameter values were put back into MATLAB. This was difficult though because we need to make a new model of 3 gene with no connections and rewrite all of the input sheets. This has not been run yet, some problems that may occur are size problems, if MATLAB is hardcoded at any point for # of data points or time or flasks, the other problem is that we aren't sure if we need to run an initial optimization with our new network, or if we can use the optimized values from our 21 gene network run with this new network. The problem with the latter idea is that this may change the optimization of the network if it is specific for the 21 gene network and the 50 connections it has already. The final goal for today was to figure out if our code was running in an AND format, where the exact number of TFs that are required to transcribe the mRNA are required to make the model, or if it is in an OR format, where at least one TF has to bind. Biologically, the latter idea would make more sense, even though some genes do require more than one TF, those would likely be case specific. This needs to be checked for both activators or repressors, and for a combination of the two to give us a good idea of what we are running. At this point Dr. Fitzpatrick assumes that it is somewhere in between the two formats. The last two confidence tests will be run tomorrow, hopefully we can get this minor problems worked out too.

Nicholas A. Rohacz 20:06, 18 June 2012 (EDT)

### June 19, 2012

Today was spent finishing up the comparisons that Dr. Fitzpatrick wanted performed, as a confidence test to make sure our codes are doing the same functions, and also we started going over what an AND vs an OR model would look like. The AND model would be expected to have F=1 in Q1 where all variables are positive, but F=0 in all the other quadrants. This is because the conditions state the gene expresses only if all necessary TFs are bound, which sounds somewhat correct for the model, but not for every gene. The OR model would be expected to have F=1, F being the expression, in all quadrants except when all the input variables are negative, in Q3. This expresses the assumption that the gene will be expressed to some degree as along as there is at least one TF bound to the gene. After talking to Dr. Fitzpatrick and going over what an OR model would look like, we decided to look at what an activator and a repressor would look like. After trying numerous alterations to the OR model equation we got, I finally found the correct transformation and have a graph that looks like it may work. If you go to Wolfram Alpha and graph out 1/((1+exp(-x1))^3*(1+exp(x2))) then the graph looks somewhat like what we want, when F=1 for Q4, F=1/2 for Q1 and F=0 for Q2 and Q3. After trying this equation in the model we already have and butchering any results we had already gotten with the previous model, by that I mean that none of the Wts or B values look remotely the same, in fact many are close to 0 if not equal to 0. After wracking my brain for a little to try and fix this problem, Katrina pointed out she was having a problem with the deletions when testing the accuracy of the old model. After checking the outputs I got for the normal estimation of the model, I noticed there was a problem with my outputs as well. So the remainder of the day was spent cleaning up the code so that everything is correctly specified and outputted. The equation I was working on is posted on the blackboard in UHALL and I had a question for Dr. Fitzpatrick about log2 fold changing the estimated values of the model. Without this change the model starts at 1 for its simtimes, with this change it starts at 0, by simtimes I mean the model graphs outputted by the code.

Nicholas A. Rohacz 20:13, 19 June 2012 (EDT)

### June 20, 2012

This morning was spent trying to come up with an equation for the one activator and one repressor sigmoidal model that was described in the previous journal entry. For the better part of an explanation this idea was put on hold. For now we are just thinking about some other ideas that could possibly work and fit a better biological definition, such as michaelis-menten kinetics. We have a few problems with some of the ideas, such as whether or not the production rates we are using are still valid in some cases, such as michaelis menten protein kinetics when trying to use protein production rates. Also, how exactly we would format some of our ideas to fit a production degradation model. These questions are better left to tomorrow when I will have access to my Biochemistry book, when considering some processes and assumptions, and access to Dr. Fitzpatrick and Dr. Dahlquist.

Nicholas A. Rohacz 20:04, 20 June 2012 (EDT)

### June 21, 2012

Today was spent trying to work on some of the new models we are going over. One in particular, the michaelis menten protein kinetics model, was found to be a good estimation of what is occurring biologically as well as mathematically. Upon finding a paper that outlines a lot of what we are trying to do we decided to move forward with this model for now. The day was spent trying to run a comparison between the two different models to see how the overall analysis would compare. In this case we are only running a forward simulation of the code and not a reverse, meaning we are not estimating and optimizing any of the parameters, just solving for some fixed values. As of this point, we have run into a major problem. It seems as though the weights are not be computed correctly in the forward simulation. This is causing our debugging and any results we are looking at to be false. Our best assumption is that the error lies in the general network dynamics function, in the loop where the wts vector of weight values is being unrolled and put back into its original matrix format. I am not sure what is going on at this point, but this needs to be fixed before we can move on in the forward simulation. Perhaps Dr. Fitzpatrick will have an idea as to what is causing this bizarre error.

Nicholas A. Rohacz 20:12, 21 June 2012 (EDT)

### June 22, 2012

Today was spent constructing the michaelis menten function into the forward model so that we could run some comparisons. Upon doing some tests, we came across a few errors in the original code Dr. Fitzpatrick sent us. First of all, when unrolling the weights back into a matrix, the values were read column by column, and replace back into the rows that we thought they should be in. This was a problem because it put the wrong values into the output due to a sorting problem. Katrina managed to fix this problem with a line of code called sortrows() that sorts the values based on specifications made in matlab. This allowed us to move on to trying to get the general network dynamics to run correctly. The problem we started having with this section of code was that there were NaNs in Matlab that came from any rows that had no connections in the matrix. This caused the calculation of part of the MM function to return a NaN because it was dividing by the 0 connections in the row. After a few ideas as to how to get around this problem, the both of us came up with separate ways to perform the calculations and keep the values where they are. The version Katrina came up with was a more concise version of what I had so we decided to go with that code. Once we finally got the forward simulation to run, we ran a comparison between the two and sent them to Dr. Fitzpatrick, along with a comparison for when the b's are fixed at 0 in the sigmoid model. The next step was to run a reverse simulation and try and estimate and optimize the parameters for these two functions. Some of the graphs are a little funky so they would be best shown to Dr. Fitzpatrick to see if we have an error anywhere in our code causing some of the problems with the ACE2 and GLN3 model graphs.

Nicholas A. Rohacz 20:05, 22 June 2012 (EDT)

## Week 7

### June 25, 2012

Today was spent putting together some comparisons for Dr. Fitzpatrick in preparation for the afternoon lab. We eventually found out that there was a double log2 being performed somewhere in the code and this was skewing our data. Once we went in and took this out of the code, all the expression graphs looked very similar to one another. Then we had journal club for a few hours up in Seaver. Upon returning to Uhall, we started work on L curve simulations that would allow us to pick better alpha values, because some of the expression curves having weird wiggles. These simulations took almost the remainder of the day, at which point I spent the last hour putting together the L curve for Dr. Fitzpatrick to pick a value. At this point its between 0.01 and 0.05, because both are located where the L curve makes its major transition from a mostly vertical slope to a mostly horizontal one. The reason for the need for Dr. Fitzpatrick to take a look is because the two values are located very near to one another and I still don't have a firm idea about which would be better on an L curve. The next step is to run the new code with the michaelis menten model with all strains present, this was done today but the deletions were still commented out in Matlab and thus it ran the whole simulation with only the Wildtype data. Tomorrow, once we have an alpha value, we can run the simulation. It is likely that we may run a comparison between the 0.01 and 0.05 alpha values for the weights and production values being optimized.

Nicholas A. Rohacz 20:14, 25 June 2012 (EDT)

### June 26, 2012

Today was spent running simulations to try and better determine the alpha value to be used in our model simulations. The problem with this was that our simulation counter decided to jump up to a total of 15000 instead of the usual 8000 so our simulations took at least an hour each. On Katrinas computer, which I was using, this meant it took 2 hours each and the third computer wasn't much faster. Tomorrows plan is to run simulations in the math computer lab in Uhall.

Some of the work that took place today was trying to put the legends into the model graphs that are outputted. This is because once the colors are changed for each strain it is much harder to determine which is which. Not to say when all the curves were blue that it was easier. Some of the problems with this were that the graphs were graphing ten sets of data, the model and the original data for each strain, so 2*5 = 10. This caused the legends to match the strains with the data and the model, and not just the model curves. We worked on this for a few hours before I decided that fixing the legends is secondary to getting the production rates correct. This was where the real work began. We needed to program the code such that it outputted the production rates correctly, however, with the deletions and the genes with all repressors, the output had several zeros that did not make sense biologically, but did make sense mathematically. So we set out to correct this discrepancy between the two fields.

We needed to input the base production rates into any gene with all repressors, because the gene still has a production even if it is being influenced by repressors or had no inputs. The other 0's for the deletions needed to be taken care of, because they were originally for the strain selection in the general network dynamics. Upon adding some lines of code that we thought may work and running a few simulation trials, well one for me stupid slow computer, we realized we needed to do some more thinking. Essentially, it made sense that the production rates were were changing for all of the genes with connections, because these connections would influence the production rate. However, the base production rates that we had put into the model, were being optimized. It would also make sense that the outputted production rates can be influenced by the deletions from strain to strain, such that any gene connected to or had a connection with a gene that was deleted, would have slightly different values for their production rates. The all repressor genes were the last problem, however we could not decide if they should maintain their base production rate, or have one optimized with the model since they have a repressor connection influencing the outcome. The problem I had to deal with first was that I was still getting 0's. I am running the code now to determine if I had a minor error or what is the problem.

Nicholas A. Rohacz 20:12, 26 June 2012 (EDT)

### June 27, 2012

Today was spent trying to debug the code such that the production rates for the genes with no in connections were their starting rates, and HAP5 the gene with all repressors had a rate. Sadly this took the whole day only to find out the next day that we had been getting the output that we wanted all day. It turned out that the production rates for the gene with all repressors connections were correct because HAP5's degradation rate was small enough that its production rate, which was nonexistant except for its own base production, was optimized to 0. The genes with no connections were correct too, it turned out we wanted to optimize the initial production rates that were given in the input.

Nicholas A. Rohacz 20:23, 28 June 2012 (EDT)

### June 28, 2012

Today was spent running the simulations to try and have some alpha and parameter comparisons. Unfortunately, all of today was spent running the alpha value simulations because we kept getting some weird kink in the graph. Our solutions were to try and switch computers so we could run them all simultaneously, however this resulted in the same problem. It wasn't until we emailed Dr. Fitzpatrick what he had that he suggested running the first simulation at alpha of 0.4, and then running the next simultaion, in this case 0.2, with the optimized parameters from the previous simulation. This will have to be done tomorrow because the simulations all take about 45 minutes to an hour at least, and with the optimized output being run as the input of the next simulation, we need to run one at a time. Tomorrow will take a while to get the comparison curve done, then next comes the individual strains vs the all strain comparison.

Nicholas A. Rohacz 20:23, 28 June 2012 (EDT)

### June 29, 2012

Today was spent running the alpha comparisons from yesterday. The difference we made today to try and get rid of the kink is that we used all the previous estimation outputs for the next input. For example, an alpha value of 0.4 was our first estimation that we used all 1's, 0's and the base production rates to estimate the parameters. We then used this output as the input for the next alpha value, which was 0.2, and so on and so forth. Once this was done, we finally got a nice graph that we could pick an alpha value with. My concern with this was that it might alter the curve and cause us to pick an alpha value that was not really where the curve turns, i.e. the corner. However, Dr. Fitzpatrick came and talked to us about doing the simulations this way and everything seemed like it would work, because the smaller alpha value would perform a more exact simulation that the previous such that it really just adds on to one another. The alpha value we decided to pick was 0.04, which was next to where the curve turns, 0.05, which is our usual value, was close to this corner, however this point on the new L curve was still on the more vertical section of the curve. Once we had this alpha value, we ran the all strain and individual simulations so that we could compare the estimated weights and production rates. These have all been sent to Dr. Fitzpatrick and we are ready for the next step. It is likely that we still have some cleaning up to do with this code based on how the weights curve looked, however the production rates seemed like they were estimating themselves nicely for all the different conditions we have been debugging for this past week. Hopefully we start the stochastic model soon, or else I won't have much to put on the TN conference poster about it.

Nicholas A. Rohacz 20:03, 29 June 2012 (EDT)

## Week 8

### July 3rd, 2012

Today we finished up the other outputs we wanted for our final summer presentation. Katrina spent the day creating a graphical output comparing the sigmoidal model vs. the michaelis menten model as well as the graphs for all strains vs. individual. All of the graphs of these conditions, four total, were placed on the graph along with the final legend and some labels to help identify which curve is which. I spent the morning running the last of the simulations we need. First was any simulation MAT files that Katrina needed to do the graphs, next came the schade data with the michaelis menten network dynamics, this took a short amount of time. The last simulations I had to run were the normal, all strain, michaelis menten dynamics trial, but with alternate equations for the michaelis menten funciton part of the dynamics. The first equation instead of x/(1+x) was 1/(1+exp(-x)), this took almost 58000 iterations to complete the simulation, with five strains this was approximately 3 hours. The next equation was x^2/(1+x^2), this equation did not take nearly as long to output. I still need to check with Dr. Fitzpatrick to see if he wants the comparisons done and if so for what parameters or what conditions to compare. The last part of the day was spent working on the mystery stochastic code that Katrina worked on, I say mystery because neither of us remember this code being made. While we still have a few questions about parts of the code and the output of the code, I have managed to understand what is going on fairly well, which is good considering I have a presentation to give on this in 3 weeks. One thing we started to work on was to change how the probability jumps were calculated in the stochastic network dynamics. We have been working on the MATLAB notation for these conditions we need to specify but we keep getting weird results and are not sure what the actual results are supposed to look like. Tomorrow we need to ask Dr. Fitzpatrick if the stochastic model is just outputting the model graphs again and if so how similar should they look compared to the old graphs from the deterministic model.

Nicholas A. Rohacz 20:09, 3 July 2012 (EDT)

### July 4, 2012

Happy 4th of July!!!!

Nicholas A. Rohacz 14:32, 4 July 2012 (EDT)

### July 5, 2012

Today was the first day we truly started work on the stochastic model, it seems a little late for our Tennessee conference but all the more urgency then. Our first goal with the stochastic model is to piece it together with the deterministic model, so that we can input our data and have it output the expression and put together the gene regulatory network. Three major switches that needed to be taken care of include switching the ode45 call to the simple stochastic simulation that Dr. Fitzpatrick had emailed us last Tuesday, the 3rd of July. The next switch involved switching the fmincon function call to instead call the fstochmin function that I put together over the summer. The last switch that needed to be made was not a full change, instead it involved taking the general least squares function and altering it such that it ran the stochastic code. While these are all tedious switches, they are a bit confusing when looking at brand new code and trying to figure out the function as well as improve it. Overall, these scripts and functions must be integrated into the overall stochastic code. Along the way, some of the other sheets were altered a good deal, and thus were changed to become the stochastic version of the code. For now we are stuck on a problem with dimension sizes in the fstochmin function of the code. The problem lies in trying to add together the theta array and the d and dx arrays that are multiplied together. As for now we are unsure as to what is wrong with the dimensions, it is likely that this function script still needs some modifications to accomodate the stochastic scripts. My idea now is talk to Dr. Fitzpatrick, because we created the fstochmin code together and there are some parameters that I don't fully understand.

Nicholas A. Rohacz 20:19, 5 July 2012 (EDT)

### July 6, 2012

Today was a rather dull day. We spent the whole day running only a few simulations, in fact we only ran two. The first was a normal simulation that we used to try and troubleshoot the code some. The second simulation was performed after Dr. Fiztpatrick visited and corrected a few lines of code. The second simulation, however, ran until sometime between 9 pm and 9 am, on the night of Sunday the 8th. This is known because we were working on our presentation for journal club Monday and for our SURP presentation next week.

Nicholas A. Rohacz 19:55, 9 July 2012 (EDT)

## Week 9

### July 9, 2012

Today we began looking through our code until Dr. Fitzpatrick arrived to talk about the problems we were having. After a few minor revisions of the code, mostly for book-keeping purposes, Dr. Fitzpatrick showed up and we went to journal club. Once we came back, we went through the code a little bit to try and get it working. Our first move was to lower the computational costs of the stochastic model by only running the wildtype data until we managed to work out the code. The next step was to only run the forward simulation of the code. This was done by setting iestimate = 0 which caused the code to skip over the reverse simulation. We did this to try and find the cost of these computations since our code we ran Friday took all weekend to run. Once this was all set, we did some cleaning up so that we could get the forward simulation to run all the way through, this took some changing of variable names, especially any that dealt with the different ways time was recorded. Whether it was just 15, 30, 60, or if we had steps in between the times that added to the variables and thus were a different name. We continued this cleaning effort until we hit our most recent snag. We can have the forward simulation run all the way through, and we can have an output, however the output is incorrect. The model is starting at the first number in the probability distribution and thus the model graphs are not starting at 0, in the log space, or 1, in the normal data. This is a problem because the model is not being solved correctly if every point its solving is being minimized by our Markov Chain methods in our code. Tomorrow we need to ask Dr. Fitzpatrick what the interp1() line of code does exactly and if our whole model should be changing with it. My guess is that no it should not be and we need to set a fixed starting point for our model to begin stepping from.

Nicholas A. Rohacz 20:05, 9 July 2012 (EDT)

### July 10, 2012

This morning Dr. Fitzpatrick came in to go over the code with us. It turned out that our initial guess for the simple stochastic simulation script was being rewritten with each pass through the for loop. This caused our model to change its initial values. Once this was fixed, by changing the call for making the first row in the Nlist equal to the intial guess x0, we ran the code for just the wildtype to determine the length of the time elapsed during the simulation. This came out to about 71 minutes for the total simulation of one strain. After fixing the code so that we had fixed production rates for the steps in the network_rates script, we had a look at the outputted models. Some of the models simulated did not match the data as well as we had hoped. This was likely due to the low number of minimizations we were performing with the fstochmin code, only 1000 iterations. Dr. Fitzpatrick opted to increase this value to 5000, the problem with this is that this will cause the time it takes the simulation to multiply by 5 as well. Thus our new simulation is estimated to take 355 minutes or just under 6 hours. This means that we will see our results tomorrow morning.

Nicholas A. Rohacz 20:06, 10 July 2012 (EDT)

#### Addition

The results still did not turn out too well, we are changing some of the parameters in hopes that we can take better step sizes and get a best estimate.

Nicholas A. Rohacz 20:33, 11 July 2012 (EDT)

### July 11, 2012

Today was the codefest meeting with Dr. Dahlquist and a number of other people that showed to discuss Biomathematics and other mathematic and code related subjects. I helped set up and got to meet a few people but we didn't stay for long because we had some work to do. We started corrections on the SURP final presentation until Dr. Fitzpatrick showed. Once he arrived, we started messing with a few of the parameters to see if we could get an even better fit, because a lot of the steps that were taken only showed production. It seemed like degradation was being overshadowed, thus we had some work to get done. Our first parameter we were playing with was -a- in the fstochmin.m script we are using. The initial step included a = 0.001/i^(0.5), this was an extremely small step and thus we started changing the 0.001 to higher values such that we could obtain a better fit. These comparisons were sent to Dr. Fitzpatrick and it seemed like the model still wasn't fitting perfectly. He suggested that we run a trial with the number of iterations at 20000 and the a starting at 0.05 instead of 0.001. This will take an estimated 20 hours to complete and so it was my final task of the day. Results to come tomorrow.

Nicholas A. Rohacz 20:33, 11 July 2012 (EDT)