# My Computational Journal Summer 2014

## May 19, 2014

Began putting together MATLAB code to create a stochastic model for a dummy network consisting of 4 genes. In this network, one gene regulates itself and all genes are regulated but at least one other gene.

- Created an Excel workbook that will be the input to the MATLAB code. Has similarities to the input workbook for the sigmoidal and Michaelis-Menten model simulations.
**"data"**: Contains dummy data for each of the four genes for four replicates for each time point (t15, t30, and t60). The data was generated with the following random number generator code where the min (-4) and max (4) values were determined based upon actual microarray data: =RAND()*(4-(-4))+(-4). Data must then be copied and pasted as values.**"network"**: Adjacency matrix for gene network.**"network_weights"**: Copy of adjacency matrix for gene network.**"run_control_parameters"**: Parameters that control the simulation run.**"simulation_times"**: Time points for which to find the model.

- Started to put together the script with some comments about the sequence of commands.

Katrina Sherbina 18:56, 19 May 2014 (EDT)

## May 20, 2014

Began looking over at the MATLAB scripts Nick had on his backup disk for the stochastic model from the Spring of 2013. As a result, I made some changes to the input workbook I began yesterday (Input_4_Gene_Dummy_Network_Stochastic_Model.xls) in order to match his input workbook so that I can use his code:

- Changed the name of the "data" worksheet to "wt".
- Copied the following variables from Nick's "optimization_parameters" sheet to my "run_control_parameters" sheet: fix_P, fix_b, Strain, Sheet, and Deletion.
- Added a "degradation_rates" worksheet with randomly computed degradation rates for each gene using the following formula: (0.001)+(0.3-0.001)*rand(), where the min 0.001 and max 0.3 was determined from looking at the min and max degradation rates in Nick's worksheet.
- Added a "production_rates" worksheet with listing the production rate for each gene as twice its degradation rate.

***NOTE: In Nick's corresponding worksheet, the production rate is not twice the degradation rate, but for most genes, with the exception of two, the production rate is a bit bigger than the degradation rate.**

- Changed the "network_weights" worksheet to estimates of the weight of each edge computed with the following formula: (-1)+(1.4-(-1))*rand(), where the min -1 and max 1 was determined from looking at the min and max weight in Nick's corresponding worksheet.

The new order of the workbook is as follows:

- "wt"
- "run_control_parameters"
- "degradation_rates"
- "production_rates"
- "simulation_times"
- "network"
- "network_weights"

To test out the simplified way Fitzpatrick has suggested to compute the state probabilities for each gene, I began putting together some MATLAB code to run a forward simulation using some of Nick's code as a foundation from the Spring of 2013. The first two scripts are the driver that runs the simulation (Stochastic_Driver_20140520.m) and the script that imports all of the data from the Excel workbook (Stochastic_Parameters_20140520.m). In adapting the second script from Nick's corresponding script, I noticed that the imported production rates and degradation rates are divided by log(2) where the log as it is used in MATLAB is the natural log. I need to ask Dr. Fitzpatrick why this is because this this transformation is not used in the deterministic model code. I created another script called Stochastic_Forward_Simulation_20140520.m that takes the code from Nick's Stochastic_LSE.m that corresponds to just the forward simulation. However, I have a few questions about this script from Nick that may or may not result in changes having to be made to my forward simulation script:

- While the variable "nmc" is set to 10 in the script that imports all the Excel data, nmc is reset in the Stochastic_LSE.m script. First of all, what is nmc (the number of iterations?) and why is the value of it reset?
- In Nick's Stochastic_LSE.m, there are 3 different forward simulation runs between which there do not seem to be any differences other than nmc is set to 100 before the last of these 3 runs. Did Nick construct these runs for troubleshooting purposes?

After the Stochastic_Forward_Simulation_20140520, I will need to modify the simple_stochastic_sim.m and network_rates.m which seem to have a similar purpose to ode45 and the general network dynamics scripts for the deterministic model, respectively. I still don't quite understand how simple_stochastic_sim.m actually works so that is what I will be working on tomorrow. As for network_rates.m, I believe this script can be very easily altered removing many of the variables according to how Dr. Fitzpatrick suggested to modify the state probabilities computations. Namely, I think this will involve expressing the production of a gene (designated by the variable pro) using the equation Dr. Fitzpatrick gave for the probability that the state of a gene will be 1 (when the activators are on) in the next time step given some initial state. Then, the degradation of a gene (designated by the variable deg) can be expressed using the equation Dr. Fitzpatrick gave for the probability that the state of a gene will be -1 (when the repressors are on) in the next time step given some initial state.

Katrina Sherbina 18:25, 20 May 2014 (EDT)

## May 21, 2014

Today, I was working on better understanding and modifying network_rates.m. In commenting the code, a couple of questions arose.

- In the case that a target gene is not controlled by any of the transcription factors in the network, the number of positive steps (designated by the variable upj) is 0. However, why is it not the production rate of the target gene?
- In finding the product of the weights and states of the transcription factors that control a particular target gene, the state of a transcription factor in the product is given by 2^state. I am not sure why this is.

In the case that the target gene is controlled by at least one transcription factor in the network, I changed the network_rates code to reflect the state probabilities that I talked about in the journal entry for May 20, 2014. However, at this point, I am a bit stuck and have some more questions:

- The probabilities for the state of a target gene to be 1 or -1 have a weighting constant that is unique to each gene. I am wondering if this constant should also be dependent upon the number of activators and repressors that control the gene. In other words, there should be a specific weighting constant for the probability of a state of a target gene being 1 when the activators are on and another weighting constant specific to the probability of a state of a target gene to be -1 when the repressors are on.
- Are both these probabilities used to compute the production of a target gene? How should this computation look?
- Should the aforementioned computation be weighted by the production rate constant of the target gene?
- How do you represent the degradation of a target gene?
- What do you do with the probability of the state of a target gene being 0?

Katrina Sherbina 22:26, 21 May 2014 (EDT)

## May 27, 2014

I have been working on altering the network rates function based upon an email Dr. Fitzpatrick sent me May 22nd. This email contained a Word attachment with state probabilities computed differently than Fitzpatrick and I originally discussed, which I have described in previous entries. In these the probabilities that the state of a gene is 1 and the state of the gene is -1, the denominator is only the summation of the weights of all the edges in the network. For the probability that the state of a gene is 0, the alpha parameter in the equation is multiplied by the sum of all of the weights of all transcription factors that control a target gene divided by the summation of the weights of all edges in the network. To find the summation of the weights of all edges, I actually added the following lines of code to the Stochastic_Forward_Simulation script (includes comments designated by %):

%Considering the edges in the network, targ is the index of a gene %controlled by a transcription factor in the network and tfact is the %index of the transcription factor that controls the target gene [targ,tfact] = find(wtmat ~= 0); edgeindex = sortrows([targ,tfact],1); %Sort based on the index of the target gene for ii = 1:nedges allweights(ii,1) = wtmat(edgeindex(ii,1),edgeindex(ii,2)); %Put all of the network weights into a vector end sumall = sum(allweights); %Find the sum of all the weights

After some discussion with Dr. Fitzpatrick today, I will actually try the forward simulation with these altered state probabilities as well as the state probabilities that we initially talked about. In modifying the network rates function for both of these sets of state probabilities, I changed the output structure to a number of genes x 3 dimensional array where the first column corresponds to the probability that the state of a gene will be 1, the second column corresponds to the probability that the state of a gene will be -1, and the third column corresponds to the probability that the state of a gene will be 0. For both network_rates functions, neither the production rates or degradation rates are used.

For both network_rates functions, there is also still some work to be done in computing the state probabilities for the no input genes. Dr. Fitzpatrick explained today that the state probabilities for these genes should be as follows.

- p(1|y) = (1/2)*alpha_i
- p(-1|y) = (1/2)*alpha_i
- p(0|y) = 1-alpha_i

where y is the state of the transcription factor at time t=0. However, as alpha_i is the proportion of transcription factors in the network that regulate the target gene for all of the genes with inputs, I am not sure what alpha_i should be for the no input genes.

- I don't believe that that the dummy network has any no input genes so I will need to redo the dummy network to gauge how these state probabilities will work.

In addition, there are some questions that will need to be answered further down the road:

- Is the degradation rate of mRNA slower than that of protein? This is something to ask Dr. Dahlquist.
- What should a state of 1, -1, or 0 mean once we incorporate the log fold change data in simulations that estimate the parameters?

Katrina Sherbina 19:49, 27 May 2014 (EDT)

## May 29, 2014

I have been working on two network rates functions network_rates_20140527.m and network_rates_20140529.m, the former which contains the altered state probabilities for every gene controlled by at least one transcription factor in the network as Dr. Fitzpatrick described via email (described in the entry for May 27, 2014) with the parameter alpha set as 0.9 as suggested by Dr. Fitzpatrick. The latter script contains the state probabilities as initially discussed on the first day of summer work. Both functions however give the state probabilities for the no input genes as described in the entry for May 27, 2014. After some consultation with Dr. Fitzpatrick, I set the alpha parameter for the no input genes to be 0.05. According to Dr. Fitzpatrick, the alpha parameter should be lower for the no input genes than the other genes. If the alpha parameter is in fact dependent on the transcription factors in the network that control a gene, then naturally the genes controlled by at least one transcription factor in the network should have a higher alpha parameter than the no input genes. I think we will probably need to experiment with what the value should be for the alpha parameter depending on how the models fit the data.

Katrina Sherbina 22:09, 29 May 2014 (EDT)

## May 30, 2014

In order to test out the if statement in the network rates function that determines the state probabilities for the no input genes, I had to alter the network contained in the input file (Input_4_Gene_Dummy_Network_Stochastic_Model.xls) to include a no input gene. I chose GeneA to be the no input gene by setting the regulatory weight of all the transcription factors in the network to 0 for GeneA in the sheet "network_weights". In so doing, I also needed to zero out the row for GeneA in the "network" sheet.

To be able to run the forward simulation, I had to make several changes to the simple_stochastic_sim function of which I created a new copy called simple_stochastic_sim_20140530.m:

- To find the variable nevents, I changed the computation to 3*n_genes as the output of the network_rates function consists of three columns rather than two columns as it initially was.
- I created a new varible rn to take into account the third column of the network_rates function output, which involved adding the following two lines of code:

rn = zeros(n_genes,1); %Probability of no change

rn = rr(:,3); %Probability of no change

In addition, several function calls in some scripts had to be changed to ensure that the network_rates_20140529 and simple_stochastic_sim_20140530 functions were being called on.

There is still more troubleshooting to go as I got the following error message after trying to run the simulation:

??? Attempted to access Nlist(14,5); index out of bounds because size(Nlist)=[14,4].

Error in ==> simple_stochastic_sim_20140530 at 73 Nlist(icount,ii-n_genes) = max(0,Nlist(icount,ii-n_genes)-1/4);

Error in ==> Stochastic_Forward_Simulation_20140520 at 20 [simtime,model] = simple_stochastic_sim_20140530(@network_rates_20140529,simtime,x0);

Error in ==> Stochastic_Driver_20140520 at 8 Stochastic_Forward_Simulation_20140520

Once I have finished debugging the simple_stochastic_sim_20140530 function, I will run the forward simulation with the other network_rates function.

Katrina Sherbina 23:24, 30 May 2014 (EDT)

## June 2, 2014

I spent some time today debugging the Michaelis-Menten simulation scripts that Juan was using. Each time he tried to run the simulation, he got a matrix dimension mismatch error in the general_network_dynamics_mm script. In order to debug the code, I had to comment out the "clc" and "clear all" calls that he added to the beginning of the estimation_driver script. In debugging the general_network_dynamics_mm script, I noticed that the dimensions of the array D is 1x21 while the dimensions of the array zz is 21x1. As a result, it was not possible to do the following computation: D.*zz. The problem ended up being in how the variable D is specified. In Juan's script, he set D to be degrate. However, the correct line of code is D = degrate(:). This corrected line ensures that D is a column vector of dimensions 21x1. To be consistent with the script that I used when I ran simulations for my thesis, I also changed the way the variable P is computed to the following line of code: P = prorate(:). In making these changes, the simulation was able to run.

After helping Juan, I went back to the stochastic model. As I forgot to copy the files I was working on at home to my flash drive, I spent some time recreating the files I wrote about in my entries for May 29, 2014 and May 30, 2014. I tried to run a forward simulation for the stochastic model using the network_rates_function_20140529 and got the same error message that I received on May 30, 2014. At this point, I began troubleshooting the simple_stochastic_sim_20140530 script. In order to do so, I had to rerun the simulation outputting icount, Nlist, ii, and ii-n_genes at each iteration for the for loop in the simple_stochastic_sim_20140530 script. In so doing, I got almost the same error message as I received May 30, 2014 with the exception that the simulation was trying to access Nlist(8,5). At this point, since ii is greater than the number of genes in the network, the simulation is told to access Nlist(icount,ii-n_genes). I believe that the element accessed is given by (icount,ii-n_genes) in order to take into account that the arrays r2 and Ti in the same script have the dimensions n*3 x 1 where n is the number of columns in Nlist. However, the computation ii-n_genes becomes problematic when ii is not a multiple of n_genes. In the error message that I described, the problem is that ii is 9 which is not a multiple of n_genes which is 4. So, the simulation cannot access the correct element in Nlist. To fix this, I replaced Nlist(icount,ii-n_genes) with Nlist(icount,mod(ii,n_genes)). However, after trying to run the simulation with this fix, I got the following error message:

??? Attempted to access Nlist(14,0); index must be a positive integer or logical.

Error in ==> simple_stochastic_sim_20140530 at 76 Nlist(icount,mod(ii,n_genes)) = max(0,Nlist(icount,mod(ii,n_genes))-1/4);

Error in ==> Stochastic_Forward_Simulation_20140520 at 20 [simtime,model] = simple_stochastic_sim_20140530(@network_rates_20140529,simtime,x0);

Error in ==> Stochastic_Driver_20140520 at 8 Stochastic_Forward_Simulation_20140520

From the error message, it looks like there is a problem when ii is greater than the number of genes but is divisible by the number of genes. To fix this problem, I rewrote the for loop in simple_stochastic_sim_20140530.m governed by the condition ii<=n_genes as follows

if ii<=n_genes Nlist(icount,ii) = min(2,Nlist(icount,ii)+1/4); Nlist1 = Nlist %%% For troubleshooting purposes else if mod(ii,n_genes)==0 Nlist(icount,n_genes) = max(0,Nlist(icount,n_genes)-1/4); Nlist3 = Nlist %%% For troubleshooting purposes else Nlist(icount,mod(ii,n_genes)) = max(0,Nlist(icount,mod(ii,n_genes))-1/4); Nlist2 = Nlist %%% For troubleshooting purposes end end

In making the above alterations to the code, I reran the forward simulation. This time around I did not get any error messages. However, the model outputted does not seem correct. In looking at the array log2FC.model, I noticed that most of the elements of the array were -Inf.

I decided to run the forward simulation with the script network_rates_function_20140527.m instead of network_rates_function_20140529.m to see if the model would be any better. In order to do so, I had to change any call to the network_rates_function in any of the scripts in the forward simulation to network_rates_function_20140527.m from network_rates_function_20140529.m. In addition, I had to also add a globals statement to Stochastic_Forward_Simulation_20140520 to add the variable sumall to the globals. The variable sumall denotes the sum of the weights of all edges in the network. Also, I also added the variable sumall to the globals statement in the scripts network_rates_20140527.m and simple_stochastic_sim_20140530.m. However, the model simulated with the network_rates_function_20140527 script was not any better than that simulated with the network_rates_function_20140529 function. The array log2FC.model had many -Inf elements. I will be emailing Dr. Fitzpatrick to see if he has any idea what the problem could be.

Katrina Sherbina 18:59, 2 June 2014 (EDT)

## June 10, 2014

Dr. Fitzpatrick suggested that I simplify the forward simulation that I have been working with because it had too many moving parts as I adopted Nick's code. As I have not done as much work with the stochastic model as Nick did, I spent a couple of days going over some old journal club presentations and reading some online material on Markov Chain Monte Carlo simulations. I also ended up reading some of Nick's journal entries on his work on the stochastic model and came across an entry at the very beginning of the work referencing early simple stochastic simulation and network rates scripts. I dug up these scripts from my email and started to make modifications to it in order to use the network_rates_20140527.m script.

I saved a copy of this early simple stochastic simulation script as simple_stochastic_sim_20140610.m in order to make some changes. Many of these changes were changes I made to the simple stochastic sim and network rates scripts I have already described in past entries. I removed all scripts having to do with production rates because the production rates are not incorporated into the model yet. In using the simple_stochastic_sim_20140610.m, I removed all input sheets from the input Excel workbooks except network, network_rates, and simulation_times and renamed the input Excel workbook as Simplified_Input_4_Gene_Dummy_Network_Stochastic_Model.xls.

The forward simulation now just involves two scripts: simple_stochastic_sim_20140610.m and one of the two network_rates scripts that I have already been working on. The simple_stochastic_sim_20140610.m incorporates a simplified version of Stochastic_Parameters_20140520.m. I tried to run the script but ran into the following error:

??? Index exceeds matrix dimensions.

Error in ==> network_rates_20140527 at 14 wtii = wtmat(ii,jj); % Weight of the regulatory effect of each of the transcription factors

Error in ==> simple_stochastic_sim_20140610 at 92 [rr] = network_rates_20140527(N0);

I am in the process of debugging this error. I am surprised with this error because I the code I used closely resembles the simple_stochastic_sim I have used before. I have also not made any changes to the network_rates scripts in this forward simulation.

Katrina Sherbina 01:14, 11 June 2014 (EDT)

## June 12, 2014

I managed to fix the error that I was getting after making a few changes that I made to the following scripts: simple_stochastic_sim_20140610.m and network_rates_20140527.m:

- Changed the vector x0 from a n_genes x 1 vector of zeros to ones.
- Changed jump_rates in the network_rates script from a 2 column array to a 3 column array.
- Altered the globals statement in simple_stochastic_sim to match the globals statement in the network_rates script.
- Had to modify the for loop that updates the Nlist in the simple_stochastic_sim as discussed in the entry for June 2, 2014.
- Added line of code that takes the log base 2 of the model.
- Add script to save a .mat file of the simulation.

## June 16, 2014

Today, I began creating another network_rates script (called network_rates_201406106.m) with different state probabilities that Dr. Fitzpatrick and I discussed in today's morning meeting. Before I talk about the actual modified network_rates script, I should note that I changed two parameter values at the onset. I made the parameter alpha much smaller (i.e. 0.01 for the genes controlled by at least one transcription factor in the network and 0.005 for the no input genes) and, as a result, made nmc greater (i.e. 1000). I also removed a line of code in the simple_stochastic_sim_20140610 script that found the log base 2 of the model and added a loop to graph the model for each gene.

As for the modifications to the network_rates script, the following new state probabilities were coded for the genes that are controlled by at least one transcription factor in the network:

- The probability that the state of a target gene is 1 in the next time step is given by

(alpha_i/2)+sum_j(w_{ij}*(y_j > 0)) where j is an element of the activators and alpha_i is a parameter dependent on the target gene i

- The probability that the state of a target gene is -1 in the next time step is given by

(alpha_i/2)-sum_j(w_{ij}*(y_j > 0)) where j is an element of the repressors and alpha_i is a parameter dependent on the target gene i

- The probability that the state of a target gene is 0 in the next time step is given by

1-(alpha_i)-sum_j(abs(w_{ij})*(y_j > 0)) where j is an element of the set of transcription factors and alpha_i is a parameter dependent on the target gene i

After running the forward simulation using the scripts simple_stochastic_sim_20140610.m and network_rates_20140616.m with the network that I have been working with without any errors popping up in the process, I ran the forward simulation for a few other specific networks:

- Each gene is regulated by one transcription factor in the network and the regulatory weight is positive (Simplified_Input_4_Gene_Dummy_Network_All_Positive_Weights_Stochastic_Model.xls).
- Each gene is regulated by one transcription factor in the network and the regulatory weight is negative (Simplified_Input_4_Gene_Dummy_Network_All_Negative_Weights_Stochastic_Model.xls).
- Each gene is regulated by at least one transcription factor in the network and there is a mixture of positive and negative regulatory weights in the network (Simplified_Input_4_Gene_Dummy_Network_Mixed_Weights_Stochastic_Model.xls).

However, I obtained strange expression profile plots for each gene in the network for the first two aforementioned networks (in bulleted list). For both of these networks, the expression plots for each gene began at 1 and then sharply decreased to 0. I would expect this of the network with all negative weights but not of the network with all positive weights. For the simulation using the network with all positive weights, the model for each gene zeros out by 38 time points. For the simulation using the network with all negative weights, the model for each gene zeros out by 42 time points. The models for each of the networks are not identical but do follow the same trend. In the case of the third of the aforementioned networks (in bulleted list), the simulation could not even finish outputting the line Tnow = 0 many times before getting stuck. Taking a look at the model once I stopped the simulation n_model_times x n_genes array of 0's.

I experimented with two more values of alpha for the genes that are controlled by at least one transcription factor: 0.1 and 0.9. I thought that maybe the expression profiles for the genes in the all positive weights network could be decreasing to 0 because maybe alpha was too small and perturbations to the network would result in very small probabilities (close to 0) that the state of a gene would be 1. However, I observed the same trends as described when I originally ran the simulation with the alpha set to 0.01. I tried to run the simulation with the mixed weights network with an alpha of 0.9 and the simulation again would not stop outputting several lines of Tnow = 0 before getting stuck.

In looking at the scripts simple_stochastic_sim_20140610.m and network_rates_20140616.m I noticed what is most likely screwing up the simulation. In the simple_stochastic_sim_20140610 script, I forgot to call network_rates_20140616 and not another network_rates script. In addition, in the function description in network_rates_20140616.m, I called network_rates and not network_rates_20140616. In looking at other versions of the network rates, I noticed the same problem (i.e. that the name in the function description line did not match the name of the actual script). I remedied this issue in all of the network_rates scripts that I have.

I reran the simulation with the simple_stochastic_sim_20140610 and network_rates_20140616 scripts reverting the value of alpha for all genes controlled by at least one transcription factor in the network to 0.01 and got much better results. For the simulation with the all positive weights network, the expression profiles outputted for each gene oscillated a lot but the oscillations began to decrease with time. For the simulation with the all negative weights network, the expression profiles outputted for each gene decreased sharply to 0. The results are to be expected for each of the networks considering the sign of the regulatory weights in each network. For the simulation with the mixed weights network, the expression profiles outputted were similar between GeneA and GeneB and between GeneC and GeneD. For GeneA and GeneB, the plot increased plateuing a bit above 1.6. For GeneC and GeneD, the plot decreased to 0. These results are expected since the transcription factors regulating GeneA and GeneB have positive regulatory weights while the transcription factors regulating GeneC and GeneD have negative regulatory weights.

Katrina Sherbina 17:27, 16 June 2014 (EDT)

## June 23, 2014

For the past few days, I spent some time browsing the resources on stochastic models and gene regulatory networks. Dr. Fitzpatrick and I had briefly discussed last week the notion of noise in the model being indicative of a pattern of expression of gene. I looked up a few articles by a professor I met at UC Irvine as I remembered that one of his students talked about his research into that topic. However, I am not sure if we can use the work presented in a few of those articles that I read.

Today, Dr. Fitzpatrick clarified what he meant by simplifying the code. In talking with him, I realized that there were many extraneous elements to the "simplified" code that I have been working with that remained from trying to base my work off of Nick's. These extraneous elements are in part as a result of the fact that Nick's simulation used time steps of different sizes whereas the simulation we are trying to do now uses a fixed time step. However, with Dr. Fitzpatrick's suggestion, I stripped away many of these elements and created a much simpler simple_stochastic_sim script, which I have saved as simple_stochastic_sim_20140623.m.

In this new, simplified script, I cleaned up the loop that calls the network_rates function (I am still using network_rates_20140616.m). First of all, I changed it from a for loop to a while loop that executes a set of commands given that a counter (which is initialized prior to the loop) is less than or equal to the number of increments created given the time range imported from the input workbook containing simulation times. I increased the number of these increments from 101 to 1001. In removing many unnecessary variables, I created a new for loop in this while loop that assigns a state to each gene by comparing a random number generated for each gene with the probability that the state of a gene is 1 and the probability that the state of a gene is -1 in the next time step. The loop looks as follows

for g = 1:n_genes if u(g) <= up(g) x1(g) = 1; % State of a gene goes to 1 model(icount,g) = 1; % Store state to output matrix elseif (u(g) > up(g)) && (u(g) <= up(g)+dwn(g)) x1(g) = -1; % State of a gene goes to -1 model(icount,g) = -1; % Store state to output matrix elseif u(g) > up(g)+dwn(g) x1(g) = 0; % State of a gene goes to 0 model(icount,g) = 0; % Store state to output matrix end end

where *up(g)* is the probability that the state of a gene *g* goes to 1, *dwn(g)* is the probability that the state of a gene *g* goes to -1, and *icount* is a counter.

I ran this new script using the network with all positive weights a few times to see if the genes in the network had a similar profile (by this I mean the plot of the state of each gene for each time step) between the runs. For two genes, specifically GeneA and GeneC, I saw that the genes went to a state of 1 more than a state of -1. However, the profile for GeneB and GeneD changed between the runs.

There are still some things left to explore/implement with this script as it is now:

- If there is some consistency in the profile of each gene given more runs are performed with the simulation.
- The profile of the genes in the network using the all negative weights network and the mixed weights network.
- Incorporating a no input gene into the network.
- Looping the entire script so as to automate the runs.
- Look at different values of the gene specific parameter alpha.

Katrina Sherbina 19:17, 23 June 2014 (EDT)

## June 24, 2014

Today, I spent some time modifying the simple_stochastic_sim_20140623 script to being able to loop several runs of the simulation and save the output from each run automatically. This involved creating a new script called driver_20140624.m:

global A wtmat prorate degrate wts log2FC alpha i_forced no_inputs n_genes counter nmc tspan sumall nrun TX1

input_file_name = 'Simplified_Input_4_Gene_Dummy_Network_All_Negative_Weights_Stochastic_Model'; igraph = 1;

for nrun = 1:5 simple_stochastic_sim_20140623 end

input_file_name needs to be changed by the user to whatever workbook contains the input data (network and simulation times) the user wishes to work with. The for loop runs the simple stochastic simulation as many times as the user specifies.

I modifying the code in simple_stochastic_sim_20140623 that creates names for the graphs of the change of a gene's state over time to include the number of the run that is being executed. To ensure that the appropriate graphs would be saved for each run, I added the statement "close all" to the end of the script. I also added the following for loop to store some data to a mat file that describes the frequency of a specific state for each gene:

for j = 1:n_genes % Ouput some information about the states of the genes info.run = nrun; % Number of run info.gene{j,1} = TX1{1+(j),1}; % Name of gene info.nup(j,1) = length(find(model(:,j)==1)); % At how many time points does the state of a gene goes to 1? info.ndwn(j,1) = length(find(model(:,j)==-1)); % At how many time points does the state of a gene goes to -1? info.compare(j,1) = info.nup(j,1)>info.ndwn(j,1); % Does the state of a gene go to 1 more so than -1? end

I performed five runs of the simulation for the all positive weights network and the all negative weights network. I did notice that the state of GeneA, GeneB, and GeneC did go to 1 more so than -1 when performing the simulation for the all positive weights network. Eyeballing the plots generated from performing the simulation for the all negative weights network, it looks the state of the genes went to -1 more so than 1. These results make sense considering the networks.

It will be interesting to see the kinds of results I get when I run the simulation with a mixed weights network. I am also interested in seeing how the value of the gene specific parameter alpha will affect the results. I am also wondering whether I should try to perform more runs to get a better, more reliable interpretation of the genes' profiles.

Katrina Sherbina 00:10, 25 June 2014 (EDT)

## June 25, 2014

Today, I ran five runs of a simulation with the code I was working on yesterday (driver_20140624.m, simple_stochastic_sim_20140623.m, and network_rates_20140616.m) with the mixed weights network (Simplified_Input_4_Gene_Dummy_Network_Mixed_Weights_Stochastic_Model.xls). After completing all the runs, I looked at the frequency with which the state of each gene went to 1 and to -1. GeneA, GeneB, and GeneD showed the clearest difference between these two frequencies. The state of GeneA and GeneB went to 1 more so than -1. The state of GeneD went to -1 more so than 1. It would seem that, at least in this mixed weights network, GeneA and GeneB are up-regulated while GeneD is downregulated. These observations are also evident from the plots of the change in state of a gene over time. Whether or not the state of GeneC goes to 1 more so than -1 or vice versa is not as clear. For two runs, the frequency the state of GeneC went to 1 was equal to the frequency that the state went to -1. For another two runs, he frequency the state of GeneC went to -1 was greater than the frequency that the state went to 1. From looking at the input sheet, since the only regulator that controls GeneC (GeneD) has a negative regulatory weight, GeneC should be down-regulated. I think this implies that the frequency that the state of GeneC goes to -1 should be greater than the frequency that the state goes to 1.

I tried experimenting with the number of time steps. I increased it to 1,000,000 but then it was not possible to make sense of the plots. I might try setting the number of time steps to 10,000 or 100,000 (I have been specifying 1,000 time steps in the simulations for which I have described the results), but I am not sure that that will have much of an effect.

On a different note, Dr. Dahlquist sent me an email about a script that determines if a gene has a significant change in expression between two different strains. In looking into her question, I found that the script I had one my computer had several lines of code to create plots to see the expression profile for each gene between two different strains. I tried to quickly run the script but I ran into an error with one of the variables. I will try to figure out the source of this problem and see if I can re-integrate the lines of code to create the plots.

Katrina Sherbina 22:23, 25 June 2014 (EDT)

## June 27, 2014

I've taken a break from the stochastic model and went back to a MATLAB script that determines if a gene has a significant change in expression between two different strains that is described in a protocol called Comparing Significant Changes in Expression Between Two Strains. I was able to troubleshoot the script per Dr. Dahlquist's request to add a few old lines of code that create plots to see the expression profile for each gene between two different strains to the code in the protocol. This involved adding some new variables that serve as independent variables necessary to create the plots. At the moment, plots are generated for all genes that have an unadjusted p value of less than 0.05. Since hundreds of genes satisfy this p value requirement, I will ask Dr. Dahlquist if she wants to create a more stringent filter (i.e. Benjamini & Hochberg adjusted p value of less than 0.05) to generate these plots.

I am currently working on adding more comments to the protocol to allow the user to understand what each part of the code means.

## June 30, 2014

Today, I went back to the stochastic model. The latest iteration of it didn't seem to be doing what Dr. Fitzpatrick and I would expect given what we know about transcription.

- When we took another look at the gene profile plots that I sent last week during today's morning meeting, in particular the plots from the all positive network simulation, the gene was more likely to stay at a state of 0 rather than a state of 1. We plotted the probability of a state of 1 at each time step and the probability of a state of 0 at each time step and found that the latter probability was much higher than the former. Whereas, we would think that if a gene is activated in our current network (which is a 4-gene cyclic network), then the gene would stay on.
- We tried to deal with the above issue by gradually increasing the gene-dependent parameter alpha. In so doing, the probability that the state of a gene is 1 did become greater than the probability that the state of a gene is 0. However, in so doing, this random chance parameter alpha has a greater influence on the model dynamics than the regulatory weights. We would hope that the converse is the case.
- In the gene profile plots, the time a gene spent at a state of 1 was relatively short. However, biologically, we think that the transcription factor that first activates a gene stays one allowing many mRNAs to be produced in a relatively short time. According to the chapter "How Cells Read the Genome: From DNA to Protein" from the book
*Molecular Biology of the Cell 4th Edition*by Alberts*et al.*, it is the case that mRNAs are produced quickly because the synthesis of a new mRNA is begun before an RNA polymerase even finishes synthesizing the prior mRNA.

With the aforementioned problems, Dr. Fitzpatrick and I discussed a different way to express the state probabilities (in the following, X_i is the state of a gene *i* and y is the state of the gene at the current time step):

- P(X_i(t+h) = 1 | X(t) = y) =
- (alpha_i/2) + sum_ j (w_{ij}*(y_ j > 0) when y_i = 0 where
*j*is an element of the activators that control the target gene*i* - beta_i when y_i > 0
- 0 when y_i < 0

- (alpha_i/2) + sum_ j (w_{ij}*(y_ j > 0) when y_i = 0 where

- P(X_i(t+h) = -1 | X(t) = y) =
- (alpha_i/2) - sum_ j (w_{ij}*(y_ j > 0) when y_i = 0 where
*j*is an element of the repressors that control the target gene*i* - beta_i when y_i < 0
- 0 when y_i > 0

- (alpha_i/2) - sum_ j (w_{ij}*(y_ j > 0) when y_i = 0 where

- P(X_i(t+h) = 0 | X(t) = y) = 1 - P(X_i(t+h) = 1) - P(X_i(t+h) = -1)

For both probabilities when the current state y_ j > 0, the gene is up- or down-regulated based on an OR probability (either due to the random chance alpha_i/2 or by the sum of the regulatory weights). The parameter alpha_i should be relatively small. On the other hand, the parameter beta_i should be relatively high since the high affinity of the transcription factor to its binding site should ensure that a gene stays up- or down-regulated (depending on the type of transcription factor) until the gene has a higher chance of being in another state.

I modified the existing network_rates script renaming it network_rates_20140630.m to reflect the revised state probabilities above. I added a few lines of code to simple_stochastic_sim_20140623.m, for troubleshooting purposes, to create plots of the change in the probability that the state of a genes goes to 1, the change in the probability that the state of a genes goes to -1, and the change in the probability that the state of a genes is 0.

I ran a simulation using driver_20140624.m, simple_stochastic_sim_20140623.m, and network_rates_20140630.m using the all positive weights network (Simplified_Input_4_Gene_Dummy_Network_All_Positive_Weights_Stochastic_Model.xls). I set alpha to 0.01 and beta to 0.5. However, I observed what I spoke about previously as a problem. The probability of no change was higher than the probability that the state of a gene goes to 1 at most time steps. I then reran the simulation decreasing alpha to 0.001 but keeping beta at 0.5 to see if increasing the difference in magnitude between the two parameters would force the state of a gene to stay at 1 once it reached 1. However, with these changes, the frequency with which the state of a gene went to 1 was much less than with the initial value of alpha. I think I need to experiment some more with the alpha and beta parameters.

In addition, with this new network_rates script, when the state of a gene went to 1, the gene stayed at this state longer than when the simulation was run with network_rates_20140616.m. While we were thinking that maybe the time for which the state of a gene stayed at 1 should be longer than what it was when running the simulation with network_rates_20140616.m, I am not sure how long we really expect the gene to stay at a state of 1 given an all positive weight network. The synthesis of mRNA occurs quickly but the x-axis of the gene profile plots goes from 0 minutes to 60 minutes. Therefore, maybe a genes stays at a state for such a short time that it is difficult to tell on the scale of minutes.

Katrina Sherbina 18:32, 30 June 2014 (EDT)

## July 2, 2014

I have continued working with the most current stochastic model simulation (driver_20140624.m, simple_stochastic_sim_20140623.m, and network_rates_20140630.m) experimenting with different values of the regulatory weights in the all positive weight regulatory network (4 gene cyclic network) and the parameters alpha and beta. I began by doing a little bit of scratch work to see what the state probabilities of GeneD in the network would be given that the initial state of all genes is 0 and that GeneA turned on (the state of GeneA went to 1) by chance. In doing this scratch work, I decided to increase the magnitude of the weights in the all positive weights gene regulatory network from 0.5 to 0.7 in order to see how the probability that the state of GeneD will go to 1 with GeneA at a state of 1 would change the gene profile plots. In so doing, I found that all of the genes went to a state of 1 more frequently.

I also decided to see what would happen if I increased the value of beta keeping the value of alpha the same. By increasing beta from 0.5 to 0.8, the maximum probability that the state of a gene would go to 1 went from 0.5 to 0.8, which is expected since beta is the probability that the state of a gene stays at 1 given that its current state is greater than 0. I thought that with higher regulatory weights and a greater value of beta, there would not be any time step during which the state of a gene went to -1. Since this was not the case, I went back to doing some more scratch work and decided to see if I could decrease the chances that the state of a gene would go to -1 by decreasing the value of alpha but keeping the value of beta at 0.8. In so doing, there were less spikes to a state of 1 for all genes.

Before I send Dr. Fitzpatrick the plots that I have been compiling of the gene profiles and the changes in the state probabilities, I want to experiment with increasing the value of alpha a bit but keeping beta at 0.8 and the regulatory weights at 0.7.

Katrina Sherbina 17:40, 2 July 2014 (EDT)

## July 3, 2014

I ran a simulation setting alpha to 0.03, beta to 0.8, and all of the regulatory weights to 0.7. Looking at the plots of the change in a gene's state over time, I didn't really notice anything glaringly different or any improvement from the other simulations I have run this week. I do not want to increase alpha too much because, as I have said before, the model should be more influenced by the regulatory weights than the parameter alpha. It would be worthwhile I think to experiment with different values of alpha and beta again when we start to fit the model to the microarray data.

Other than that, I finished compiling two different sets of plots into a PowerPoints that I sent off to Dr. Fitzpatrick: plots of the change in a gene's state over time and plots of the probabilities of a state for one of the genes in the network.

Katrina Sherbina 18:25, 3 July 2014 (EDT)

## July 7, 2014

In an effort to make sense of the state dynamics of each gene, I made some modifications to the stochastic model scripts in order to find the average model for each gene. By average model, I mean the average state for each time point over many simulation runs for each gene. In order to find the average model, I made the following modifications:

- Added a new sheet to the input Excel workbook called "run_parameters" that contains three parameters:
- alpha1 = alpha associated with each gene that has at least one network input
- alpha2 = alpha associated with each gene that has no inputs
- beta = gene specific parameters that is the probability that the state of a target gene in the next time step is the same as the one in the current time step

- Separated all of the lines in simple stochastic sim (which was then resaved as simple_stochastic_sim_20140707.m) that imported parameters from the input Excel workbook and put it into a new script parameters_20140707.m
- In the network rates code (renamed network_rates_20140707.m), in order to not replace all the existing variables, bi (beta) was assigned the value of the imported parameter beta and ai (alpha) was assigned the value of the imported parameter alpha1 or alpha2 dependening on whether the target gene has at least one network input or not.
- In the driver code (renamed driver_20140707.m), in the for loop that runs the simple stochastic sim as many times as the user specifies, I put in the following lines of code to find the average model for each gene:

models(:,:,nrun) = model; % Create a n_model_times x n_genes x nruns array that stores the model from each run modelAvg = mean(models,3)+ model/N; % Find the average for each gene and model time point across all the simulation runs

I run the simulation using the scripts driver_20140707.m, parameters_20140707.m, simple_stochastic_sim_20140707.m, and network_rates_20140707.m using an all positive weights network (magnitude of each weight is 0.5) for 100 runs. I then created a plots of the average model for each gene using the following code:

for ii=1:n_genes figure(ii),hold on plot(time,modelAvg(:,ii),'LineWidth',1) title(TX1{1+(ii),1},'FontSize',14) xlabel('Time (minutes)','FontSize',14) ylabel('State','FontSize',14) end figHandles = findobj('Type','figure'); %Find the figure number of all open figures nfig = max(figHandles); for ii = 1:nfig saveas(figure(ii),[input_file_name '_' num2str(nrun) '_Simulation_Runs_' TX1{1+(ii),1} '.jpg']); %Save each figure as a .jpeg saveas(figure(ii),[input_file_name '_' num2str(nrun) '_Simulation_Runs_' TX1{1+(ii),1} '.fig']); %Save each figure as a .fig (opens through MATLAB) end

In creating plots of the average model for each gene, I thought the plots would be smoother and would show a clearer trend, namely that the genes get up-regulated. However, all the plots were still pretty choppy. I have emailed Dr. Fitzpatrick my results. I will try to run more simulations increasing the simulation runs and trying the higher positive weights regulatory network.

Katrina Sherbina 17:42, 7 July 2014 (EDT)

## July 8, 2014

The plots that I outputted yesterday were choppy because I was not resetting the counter for the simple stochastic simulation after each run. As a result, the model was never re-run. In addition, I was also computing the average model incorrectly. Here is the correction made to the loop in driver_20140707.m to do multiple runs:

nrun = 1;

while nrun <= N nrun icount = 2; % Reset the counter for the simple stochastic sim % Beginning point for counter correponds to the first % index of first nonzero entry in the array time simple_stochastic_sim_20140707 % Run stochastic model simulation modelAvg = modelAvg + model/N; % Find the average for each gene and model time point across all the simulation runs modelUp = modelUp + (model>0)/N; % Fraction of runs in which the model is in state = 1 modelDn = modelDn + (model<0)/N; % Fraction of runs in which the model is in state = -1 model0 = model0 + (model==0)/N; % Fraction of runs in which the model is in state = 0 nrun = nrun + 1; % Increment the counter end

In addition, Dr. Fitzpatrick also added some lines to find some stats, namely for what fraction of the runs does the model have a state of 1, -1, and 0. Also in the driver script, the loop to create the plots was modified in order to plot the model average as well as those stats.

I ran the simulation (100 runs) with the all positive weights network (magnitude of 0.5), alpha = 0.01, and beta = 0.8. The simulation took about 20 seconds to run on my personal computer. In observing the plot outputted for each gene, I noticed that the model average was similar to the fraction of runs in a state of 1. This is what I would expect considering that the input is an all positive weights network.

Katrina Sherbina 17:01, 9 July 2014 (EDT)

## July 9, 2014

Today, I ran more simulations with both all positive weights networks (magnitude of the weights 0.5 and 0.7) with different values of alpha and beta. I am currently putting together all of the plots to email to Dr. Fitzpatrick and see what kinds of trends there are.

## July 10, 2014

Today, I finished compiling the PowerPoint of plots from the simulation using all positive weights networks experimenting with different alpha and beta parameters. There are a few patterns that I noticed from these plots:

- The plots of the model average and the fraction of runs with the model at a state of 1 closely agree.
- Increasing the beta parameter causes more noise in the plots, specifically looking at those with regulatory weights with a magnitude of 0.7.
- Increasing the alpha parameter brings plots of the model average, fraction of runs with the model at a state of 1, and fraction of runs with the model at a state of 0 closer together.
- In the figures for the all positive regulatory network magnitude 0.7, increasing the alpha parameter even brought the plot of the fraction of runs with the model state 0 beneath the plots of the model average and fraction of runs with model state 1.

- Plots of the fraction of runs with a model state of -1 always remained close to 0.

I also ran simulations with the all negative weights network (regulatory weight -0.5) experimenting with different alpha and beta parameters. I found that it was more difficult to separate the plots of the model average, the fraction of runs with a model state of -1. and the fraction of runs with a model state of 1. To do so, I need to greatly increase the alpha or beta parameter. In addition, in some of the figures, if my eyes aren't deceiving me after staring at my monitor for awhile, it seems as if the plot of the fraction of runs with the model state of -1 looks like a mirror image of the model average plot. I would expect that these two plots would closely agree seeing as that the regulatory network inputted into code consists of only negative weights.

Tomorrow, I will run some simulations with a regulatory network that has greater negative regulatory weights, most likely -0.7.

Katrina Sherbina 01:08, 11 July 2014 (EDT)

## July 11, 2014

Today, I ran some simulations with two networks: a mixed weights 4-gene fully connected network and the network we have used to model *Saccharomyces cerevisiae* wild type microarray data. In the mixed weights 4-gene fully connected network, each gene has two positive regulatory weights (0.5) and two negative regulatory weights (-0.5). The wildtype network consisted of only positive regulatory weights (0.5).

I ran one simulation to observe the no input genes with the wild type network using an alpha of 0.005 for the no input genes and 0.01 for the genes with at least one input and a beta of 0.8. In the figures produced for the three no input genes (FHL1, SWI6, and SKO1), the plot of the fraction of runs with the model state of 0 stayed close to if not equal to 1 while the other plots were centered at 0. All in all, the plots in the figures for the genes had a lot of noise particularly for those genes that have a high in-degree, such as CIN5 and SWI4, or are controlled by a gene with a high in-degree, such as HAP5.

In the figures outputted during the simulation using the 4-gene fully connected network, I saw some trends with regards to changing the alpha and beta parameters as I have seen in other simulations with other 4-gene dummy networks. Increasing the value of beta increased the noise and moved the plot of the fraction of runs with a model state -1 and fraction of runs with a model state of 1 away from the plot of the model average. Increasing the alpha parameter separated the plot of the fraction of runs with model state -1 and fraction of runs with model state 1 away from the plot of the model average. It also shifted the center of the plot of the fraction of runs with model state 0 lower (you also see this when increasing the beta parameter. Something I noticed in particular was that the plots in the figure generated for GeneB (alpha of 0.01 and beta of 0.8) were more crowded together than for the figures for the other genes. I'm not exactly sure why this is considering that all genes are controlled by the same number of magnitude and direction of regulatory weights. Maybe this is just something due to chance.

Katrina Sherbina 00:39, 12 July 2014 (EDT)

## July 14, 2014

I ran some simulations with an all positive weights 4-gene fully connected network using the scripts driver_20140707.m, parameters_20140707.m, simple_stochastic_sim_20140707.m, and network_rates_20140707.m. Each positive weight has a magnitude of 0.5. As with the previous simulations with other networks, I have compiled plots for each gene of the model average and stats, namely the fraction of runs during which the model state is 0, 1, and -1. There were a couple of things I noticed in these plots:

- The center of the plot of the fraction of runs during which the model state is 0 lowered as beta was increased so much so that it was beneath the center of the plot of the fraction of runs during which the model state is 1 (when alpha = 0.01 and beta = 0.8).
- The center of the plot of the fraction of runs during which the model state is 1 increased as beta is increased.
- Changing the alpha parameter didn't seem to have much of an effect on the plots.
- The plot of the model average closely followed the plot of the fraction of runs during which the model state is 1 in all plots.

Tomorrow, I will work on some simulations with random networks.

Katrina Sherbina 21:24, 14 July 2014 (EDT)