# Fall 2011 Computational Journal

## October 4, 2011

To determine which genes are significantly differentially expressed, a p value was calculated for each gene in each strain across all of the timecourse data using an F-test. First, the average log fold change, $\mu_j$ where $j$ is the timepoint, was calculated for each timepoint for each strain. Second, for each strain, $\mu_j$ was subtracted from the log fold change of each flask at each timepoint. Third, the squares of these differences were summed for each strain to give the SSF (the sum of squares of the error):

$$\mathrm{SSF} = \sum_j \sum_i (Y_{ij} - \mu_j)^2$$


where $Y_{ij}$ is the log fold change for flask $i$ at timepoint $j$. There should be only one SSF value for each gene for each strain.

Fourth, the sum of the squares of the log fold changes for each flask at each timepoint was calculated for each strain:

$$\mathrm{SSH} = \sum_j \sum_i Y_{ij}^2$$


where i is the flask and j is the timepoint.

Fifth, the F statistic was calculated for each gene for each strain. The SSF was subtracted from the SSH, and the difference was divided by the SSF. This quotient was then multiplied by the difference between the total number of data points and the total number of timepoints for the strain, divided by the total number of timepoints for the strain:

$$F = \frac{\mathrm{SSH} - \mathrm{SSF}}{\mathrm{SSF}} \cdot \frac{N - T}{T}$$


where N is the total number of data points for the strain and T is the total number of timepoints for the strain. The p value for each gene for each strain was calculated using the following function in Excel:

=FDIST(F statistic, T, N-T)


where T and N-T are passed to FDIST as the numerator and denominator degrees of freedom, respectively. Then, the total number of genes with a p value below 0.05 was counted. These genes are considered to be significantly differentially expressed.
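The steps above can be sketched in a few lines of Python for a single gene in a single strain. This is only an illustration of the arithmetic, not the Excel workbook that was actually used: the function name `f_statistic`, the nested-list layout of the input, and the example values are all made up for this sketch.

```python
def f_statistic(flasks_by_timepoint):
    """Return (SSF, SSH, F) for one gene in one strain.

    flasks_by_timepoint is a hypothetical layout: a list with one entry per
    timepoint j, each entry being the list of log fold changes Y_ij of the
    replicate flasks i at that timepoint.
    """
    T = len(flasks_by_timepoint)                            # number of timepoints
    N = sum(len(flasks) for flasks in flasks_by_timepoint)  # number of data points
    ssf = 0.0
    ssh = 0.0
    for flasks in flasks_by_timepoint:
        mu_j = sum(flasks) / len(flasks)             # average log fold change
        ssf += sum((y - mu_j) ** 2 for y in flasks)  # squared deviations from mu_j
        ssh += sum(y ** 2 for y in flasks)           # squared log fold changes
    # Note: this divides by SSF, so it fails if every flask equals its mean.
    f_stat = ((ssh - ssf) / ssf) * ((N - T) / T)
    return ssf, ssh, f_stat


# Made-up data: two timepoints with two flasks each.
example = [[1.0, 3.0], [2.0, 2.5]]
ssf, ssh, f_stat = f_statistic(example)
print(ssf, ssh, f_stat)
```

The p value would then come from the right tail of the F distribution with T and N-T degrees of freedom, which is what FDIST returns in Excel; in Python, `scipy.stats.f.sf(f_stat, T, N - T)` should give the same quantity if SciPy is available.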

Bonferroni and Benjamini and Hochberg corrections were applied to the p values for each gene for each strain. For the Bonferroni correction, the p value was multiplied by the total number of genes (i.e. 6189). To perform the Benjamini and Hochberg correction, first the p values for each strain were sorted from the smallest p value to the largest p value. Second, a numerical index was created (i.e. 1, 2, 3, ... , 6189). Then each p value was multiplied by the total number of genes (i.e. 6189) and divided by its corresponding value in the numerical index. The total number of genes with a corrected p value below 0.05 was counted for each strain for both the Bonferroni and the Benjamini and Hochberg corrections.
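As a rough sketch of the two corrections exactly as described above, the following Python applies them to a short list of made-up p values. The function names and the example values are assumptions for illustration only. Note that, like the description above, this sketch neither caps corrected values at 1 nor enforces that the Benjamini and Hochberg values increase with rank, both of which standard adjusted p value routines (for example `p.adjust` in R) do.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests (genes)."""
    m = len(p_values)
    return [p * m for p in p_values]


def benjamini_hochberg(p_values):
    """Multiply each p value by the number of tests and divide by its rank.

    Ranks run from 1 (smallest p value) to m (largest). Results are returned
    in the original gene order, not in sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    corrected = [0.0] * m
    for rank, k in enumerate(order, start=1):
        corrected[k] = p_values[k] * m / rank
    return corrected


# Made-up p values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(p)
bh = benjamini_hochberg(p)

# Count the genes that stay below 0.05 after each correction.
n_bonf = sum(v < 0.05 for v in bonf)
n_bh = sum(v < 0.05 for v in bh)
print(bonf, bh, n_bonf, n_bh)
```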

Katrina Sherbina 20:19, 4 October 2011 (EDT)

## October 18, 2011

The significantly differentially expressed genes in the wildtype (i.e. those with a Benjamini & Hochberg corrected p value of less than 0.05) were clustered using STEM. First, the wildtype normalized log fold change data for the significantly differentially expressed genes was copied into a separate Excel spreadsheet. This spreadsheet was formatted as described in "Clustering and Gene Ontology Analysis with STEM" in BIOL398-01/S11:Week 12. The procedure for clustering in STEM is outlined in BIOL398-01/S11:Week 12.

Katrina Sherbina 23:08, 25 October 2011 (EDT)

## October 25, 2011

The Excel spreadsheet with the latest statistical analysis was reformatted. The protocol for performing all of the statistical analysis in Excel (i.e. modified ANOVA, Bonferroni p value correction, and Benjamini & Hochberg p value correction) is currently being written in the protocols section of the Dahlquist Lab wiki. A link to it will be posted once the protocol is complete.

Katrina Sherbina 23:11, 25 October 2011 (EDT)

## November 1, 2011

The protocol for the statistical analysis of the normalized microarray data can be found on the page Dahlquist:Modified ANOVA and p value Corrections for Microarray Data.

Katrina Sherbina 22:05, 1 November 2011 (EDT)

## December 6, 2011

We are trying to use MATLAB to create a network of significant genes and transcription factors. For each strain's data, the script clusters the genes with significant changes in expression according to the similarity of their expression profiles. Currently, we are working on outputting an Excel spreadsheet for each strain that lists, for each gene, the gene name, the cluster number, the average log fold change data, and the centroid of the gene's cluster. The following script is currently being edited to create a matrix for the data to be outputted:

    % Convert the numerical matrices q and C to cell arrays.
    % q is a matrix containing the average log fold change data.
    % C is a 10x5 matrix containing the cluster centroid at each timepoint for each cluster.
    A = num2cell(q);
    c = num2cell(C);

    % Build one output row per gene: gene name, cluster number,
    % the five log fold changes, and the five centroid values.
    % numel(ind2) is used because 1:size(ind2) only reads the first dimension.
    for i=1:numel(ind2)
        outarray{i,1}=b{1+ind2(i),7};
        outarray{i,2}=IDX(i);
        outarray{i,3}=A{ind2(i),1};
        outarray{i,4}=A{ind2(i),2};
        outarray{i,5}=A{ind2(i),3};
        outarray{i,6}=A{ind2(i),4};
        outarray{i,7}=A{ind2(i),5};
        outarray{i,8}=c{IDX(i),1};
        outarray{i,9}=c{IDX(i),2};
        outarray{i,10}=c{IDX(i),3};
        outarray{i,11}=c{IDX(i),4};
        outarray{i,12}=c{IDX(i),5};
    end


The current error, raised at outarray{i,3}=A{ind2(i),1};, is that the index exceeds matrix dimensions, which means that some value of ind2(i) is larger than the number of rows in A. This error may be occurring because the matrix b does not have the same number of rows as the matrix A. We need to look into creating an array of the names of the genes that have a significant change in expression in order to replace the matrix b in the for loop.
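To make the intended table layout and the index problem concrete, here is a minimal Python analogue of the loop, not the MATLAB script itself. All names (`build_rows`, `gene_names`, `profiles`, `centroids`, `sig_idx`, `cluster_of`) and the data are hypothetical, indices are 0-based as in Python, and the sketch assumes the gene names are already aligned row for row with the log fold change data. It checks the gene indices against the number of data rows before building anything, which is one way to see which indices are out of range.

```python
def build_rows(gene_names, profiles, centroids, sig_idx, cluster_of):
    """Assemble one row per significant gene.

    Each row is: gene name, cluster number, the gene's log fold changes,
    and the centroid of the gene's cluster (hypothetical layout).
    """
    n_genes = len(profiles)
    # Report every gene index that does not point at an existing data row.
    bad = [g for g in sig_idx if not 0 <= g < n_genes]
    if bad:
        raise IndexError(f"gene indices out of range for {n_genes} rows: {bad}")
    rows = []
    for g, cluster in zip(sig_idx, cluster_of):
        rows.append([gene_names[g], cluster] + list(profiles[g]) + list(centroids[cluster]))
    return rows


# Made-up data: three genes, two timepoints, two clusters.
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
profiles = [[0.1, 0.2], [0.5, -0.4], [1.0, 1.5]]
centroids = [[0.9, 1.4], [0.2, 0.1]]
rows = build_rows(gene_names, profiles, centroids, sig_idx=[0, 2], cluster_of=[1, 0])
print(rows)
```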

Katrina Sherbina 23:26, 6 December 2011 (EST)