Fall 2012 Computational Journal
September 5, 2012
Today, I started working on refreshing my memory of a clustering code (load_expr_sig.m) that we briefly worked on last academic year, which involved commenting the code. To run this code, you need an excel file (avgs_and_signif.xls) that has, for each strain data set, the average log fold changes for each time point and a list of 1's and 0's denoting which genes are significantly differentially expressed and which are not as determined by an ANOVA and B&H correction. The original spreadsheet with the significantly differentially expressed genes determined after the B&H corrections was accidentally deleted. Therefore, I went back to the simple_anova_code.m to do an ANOVA and B&H corrections for each strain after which a reconstruccted the avgs_and_signif.xls file.
At the moment, I am a bit confused about the variables T and T2 that are calculated in the for loop. First of all, I made both T and T2 matrices because the indx would be constructed incorrectly after the first for loop because, as the code was originally, only the T and T2 from the last round of the first for loop (corresponding to dZAP1) would be used. However, I am not sure what function T and T2 serve. Later, they are only used to construct the indx which it would seem could be constructed just as well from expr_data.sign which is a list of 1's and 0's for each data set where a 1 represents a gene that has a significant change in gene expression and 0 represents a gene that does not have a significant change in gene expression. From the code, T and T2 are both reproductions of expr_data.sign.
Into the first for loop (which constructs the data frame expr_data for each data set, I added a loop to try and include the standard names and systematic names of the genes in the data frame to maybe later use to determine which genes are in which cluster:
for jj = 1:n_genes expr_data(ii).systematic_name = b{jj+1,1}; expr_data(ii).standard_name = b{jj+1,2}; end
n_genes is previously calculated in the outer loop.
However, only one gene name is copied to expr_data.systematic_name YPR204W. In fact, the same gene is copied for expr_data.systematic_name for each data set. I will be troubleshooting this issue next week.
Katrina Sherbina 21:19, 5 September 2012 (EDT)
September 12, 2012
I understand now what T, T2, indx, and ind2 represent. T calculates which genes have significant change in gene expression in each experimental data set (assigning a 1 to them) and all those genes for which this is not true are assigned a 0. So indx (which find what genes in T are assinged a 1)represents an "AND" model. T2 calculates in how many data sets did a gene have a signficiant change in gene expression. So ind2 (which finds what genes in T are assigned a value greater than 0) represents the "OR" model.
Today I focused on the "OR" model in the clustering code. I expanded the data frame expr_data to include a list of the names of the genes. I added variables indx_genes and ind2_genes to see what genes are represented by indx and ind2, respectively. In the last loop, I constructed a data frame called dataset which contains the information about each cluster for each strain such as the index of genes in the cluster, the names of the genes in the cluster, and what transcription factors from the current 21 network we have for the determinstic and stochastic model codes are in each cluster. For the latter, I inserted another for loop that uses strcmp to find which transcription factors are in which cluster. Also, I added to lines of code that import the transcription factor network from the file Input_21_Gene_Network_MM_Model.xls and saves the transcription factor names as a cell array.
At the moment, it is somewhat tedious to look at the information in the data frame dataset to glean any information about the clusters considering the various levels to the data frame. Next week, I will try to write the information in the data frame to an excel file and then maybe get rid of a level or two in the data frame.
I am unclear about one thing in the j loop in the last loop in the code. I am not sure whether there would ever be a case when n1 does not equal 1 and so n2 does not equal nn (which is the length of a cluster).
Katrina Sherbina 21:21, 12 September 2012 (EDT)
October 10, 2010
Using the clustering code that we've been working on since the beginning of the semester, the genes (with a significant change in gene expression as determined by an ANOVA and Benjamini & Hochberg correction) were separated into 10 clusters. When looking at the transcription factors in each cluster, we found that not all of the transcription factors in the current network appeared in at least one cluster for each strain. Thus, we decided to cluster all of the genes into 10 clusters (using the same code with some modifications to take into account all of the genes not just genes with significant changes in gene expression). In so doing, each of the transcription factors appeared in one of the clusters. We decided to try clustering all of the genes into just 6 clusters. In so doing, we found that not only was each trancription factor in one cluster but that there were clusters that had more transcription factors in comparison to any of the clusters generated when setting the cluster number to 10.
An ANOVA was performed (for each gene) for each of the 6 clusters of the wildtype strain expressiond data with the null hypothesis that the average expression of a gene was 0 for all time points and the alternative hypothesis that the average expression of a gene was not 0 for at least one time point. The average log fold change data was used for each gene since the this was the data (not the raw data) that was used in the clustering. For some clusters, none of the transcription factors in that clustered showed up as significant. For other clusters, some but not all of the transcription factors showed up as significant. For each cluster, more that 5% of the genes in a given cluster showed up as significant. Thus, I want to clarify with Dr. Fitzpatrick that a B&H correction should also be performed.
Another statistical test that has to be done is to determine whether there are any genes in the t120 data that have a mean average log fold change that is not 0. This will be done using a t test.
Katrina Sherbina 19:53, 10 October 2012 (EDT)
October 17, 2012
I performed a t-test to determine whether or not there is any gene during the t120 timepoint that does not have a mean log fold change of 0. I computed the t statistic using the formula T = (mean(X)-0)/(standard deviation(X)/sqrt(n)) where X is the log fold change of a gene and n is the number of replicates at t120 for a particular gene. I found the p value using the t cumulative distribution with 3 degrees of freedom. The results of the t-test showed that 302 genes had a mean log fold change that was not equal to 0. A Benjamini & Hochberg correction was then performed. The results of this correction showed that only 1 gene had a mean log fold change that was not eqaul to 0 (YNL208W). From the results of the Benjamini & Hochberg correction, it can be said that a majority of the genes reach equilibrium by the t120 timepoint. Using this information, we will explore the dynamics of the network if we go back to the start of the cold shock experiment (t0) from the t120 time point.
Today I also went back to the clustering code for the wildtype data to try and do a Benjamini & Hochberg correction. When I ran the code with the Benjamini & Hochberg correction written into it, I noticed several bugs in the output. First, some transcription factors appeared in duplicate in some of the clusters. I am not sure why this happended because the only change I made from last week was inserting code to do a Benjamini & Hochberg correction. In addition, I don't think the Benjamini & Hochberg correction is being done correctly because for some clusters there are less p values after the Benjamini & Hochberg correction than there were after first performing an ANOVA. I will troubleshooting these issues next week.
Katrina Sherbina 20:37, 17 October 2012 (EDT)
October 31, 2012
I continued modifiying the MATLAB code for the Michaelis-Menten determinstic model to perform the simulation for the recovery time points with just the wildtype data. Thus, the first thing I did was edit the input sheet so that only the wildtype strain is listed for "Strain" and that the time points are 90, 120, and 150. The data for the wildtype was modified to reflect the t90 and t120 data collected. A column of 0's was added to represent the t150 data because, from previous statistical tests, we believe that steady state should be reached by 150 minutes into the experiment. In addition, I new spreadsheet was added titled "wt_initial_guess" which replaces the current x0 vector in the code (which represents the intial guess we use when solving the differential equation) with the average of the t60 data collected for the wildtype. Accordingly, a call was added in the paramters to import the intial guess and then the variable x0 in the Parameters and general_least_squares_error code was changed to the vector that was imported corresponding to the initial guess. When the simulation is done for other strains, x0 may need to become a matrix where each column represents the appropriate intial guess for solving the differential equation using the data for a specific strain. Since the intial guess has negative values because the average log fold change for a gene at t60 may have been negative, the following modifications were made to the general_network_dynamics_mm script in order to ensure that the array of production rates prod has positive values:
- The absolute value of the output from the michaelis_menten function is taken to generate the matrix f.
- The absolute value of sam is taken.
- In the calculation for the value pro, the absolute value of any entry of zz is used.
I have begun to run the entire deterministic model code. It is currently still running. The graph of fit being done to the data is a cause for concern. The lines of fit are not going thorugh the data. They are short and some seem connected at their ends. I will let the simulation run and check on it sometime this week to see whether any progress has been made.
It may be worth while to run the simulation with just the t90 and t120 data to see if maybe the probably is just in including guesses in the way we are not for what the t150 data may look like.
Katrina Sherbina 20:46, 31 October 2012 (EDT)
November 7, 2012
Most of the changes to code described in the previous entry were undone by reverting the general_least_squares_error and general_network_dynamics script back to what it was at the end of the summer. After the code was reverted, the following changes were made to address the fact that log fold change data from the t60 timepoint is being used for the initial guess rather than a vector of 1's:
- After the average t60 data for the wildtype is imported through the Parameters script, x0 is set as 2 raised to the power of the average log fold change at time t60 for each gene in the network.
- x0 is set in the general_least_squares_error_code the same way as it is set in the Parameters script.
In addition, the minimum x value in the graphs that are outputted when running the Graphs script was changed to 60 because in the data was all pushed to the left when the minimum x value was set at 0.
Looking at the gene expression change over time graphs generated for each of the genes in the network, some of the models seem to fit the recovery data for the genes very well. However, there are quite a number of genes for which the model does not fit through the data for the gene at all, for example ZAP1. The graph for PHD1 is particularly interesting. the data collected at the recovery time points (t90 and t120) suggests that the expression of PHD1 increases drastically during the recovery phase. It may however still be the case that PHD1 recovers (returns to steady state) at a later time for which we do not have data.
Katrina Sherbina 20:17, 7 November 2012 (EST)
November 14. 2012
Today, I did some tweaking to the graphs of the change in expression over time for each gene in the network. In trying to add the data point for the t60 time point for each gene in the network, I realized that I did not include 60 as one of the simulation times in the input sheet when running the code last week. In addition, the average log fold change was added for each time point (the t60 data point was added in this way). After rerunning the code, the fit of the model to the data did improve for some genes such as CIN5 and FHL1. However, there were also genes for which the model didn't seem to fit the data any better than before, such as for ZAP1. The graph for PHD1 did not change. It still suggests that PDH1 recovers after the t150 time point.
Katrina Sherbina 19:34, 14 November 2012 (EST)
November 29, 2012
Today, I ran the the simulation for the recovery phase using the data for all of the strains in order to start working on two tasks:
- Comparing the production rates and weights between the cold shock and recovery phases
- Running the simulation for different randomly chosen networks for both the cold shock and recovery phases.
The simulation just finished after several bugs were fixed. In order to run the simulation, first, the input sheet necessary to run the simulation for the recovery phase was modified:
- The log 2 fold change data for dCIN5, dGLN4, dHMO1, and dZAP1 was added for t90 and t120. A column of 0's was also added to represent the t150 time point (at which all of the genes in the network have presumably recovered).
- The intial guesses to the solution of the differential equation for each strain was added to the existing spreadsheet with the intial guesses for the wildtype strain. These initial guesses were the average log 2 fold change for t60 for each gene for each strain.
- The alpha value was at first set to 0.1. It was changed to 0.04 since this was the alpha value decided upon over the summer to use for the simulation using Michaelis-Menten kinetics.
The code used to run the simulation for the recovery phase was also modified:
- The intial guesses for each strain were all read in as one matrix. Each value in the matrix was transformed by taking 2 to the power of that value. Then, each column of the matrix was saved into separate arrays in the data structure log2FC.
- In the LSE and general_least_squares_error scripts, the initial guess to run the ode45 function (called x0) was put into the for loop that uses the ode45 function to solve the differential equation. This way the appropriate initial guesses were used as each deletion strain was simulated.
- In the graphs script, the commands to plot the average log fold change for each time point for each gene (log2FC.m) and to plot the initial guess (the average log fold change at t60) for each gene were put into a separate loop so that only the log fold change data for t90, t120, and t150 and the simulated model were included in the legend.
The next step is to compare the outputted production rates and weights from the recovery phase to the cold shock phase. The MAT files for each deletion strain used in the cold shock phase simulation were saved so it will not be necessary to run the cold shock phase simulation again. In addition, we still need to run the simulation for the cold shock and recovery phases using different networks.
Katrina Sherbina 01:20, 30 November 2012 (EST)