Dahlquist:Modified ANOVA and p value Corrections for Microarray Data

From OpenWetWare

(Difference between revisions)
Jump to: navigation, search
(Calculating the F statistic and p value corrections in MATLAB: Corrected the last segment of code that outputs the data)
(changed most instances of the word "spreadsheet" to "worksheet" to be clear)
Line 30: Line 30:
====Calculate the F Statistic====
====Calculate the F Statistic====
-
*Create an Excel workbook containing a separate spreadsheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
+
*Create an Excel workbook containing a separate worksheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
-
*For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective spreadsheets.  
+
*For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective worksheets.  
*In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space.
*In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space.
-
*Complete the following steps for each of the spreadsheets containing the normalized data:
+
*Complete the following steps for each of the worksheets containing the normalized data:
*After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.  
*After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.  
*Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
*Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
Line 77: Line 77:
====Calculate the Bonferroni p value Correction====
====Calculate the Bonferroni p value Correction====
-
*Select the spreadsheet labelled "Bonferroni".  
+
*Select the worksheet labelled "Bonferroni".  
-
*Paste the "Gene ID" column from one of the earlier five spreadsheets into the colum A of this spreadsheet.  
+
*Paste the "Gene ID" column from one of the earlier five worksheets into the column A of this worksheet.  
-
*Paste the entire column of p values for the wildtype into the column B of the spreadsheet as values. Paste the p values for the other strains in the same manner into columns C-F.
+
*Paste the entire column of p values for the wildtype into the column B of the worksheet as values. Paste the p values for the other strains in the same manner into columns C-F.
*In cell G1 and H1, type "wt_Bonferroni_pval".  
*In cell G1 and H1, type "wt_Bonferroni_pval".  
Line 103: Line 103:
====Calculate the Benjamini & Hochberg p value Correction====
====Calculate the Benjamini & Hochberg p value Correction====
-
*Click on the spreadsheet named "B&H".
+
*Click on the worksheet named "B&H".
*Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190.
*Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190.
-
*Copy and paste the column of Gene ID's from one of the previous spreadsheets into columns C, G, K, O, and S. Paste the column of p values from the wildtype spreadsheet into column D. Paste the column of p values from the dCIN5 spreadsheet into column H. Paste the column of p values from the dGLN3 spreadsheet into column L. Paste the column of p values from the dHMO1 spreadsheet into column P. Paste the column of p values from the dZAP1 spreadsheet into column T.  
+
*Copy and paste the column of Gene ID's from one of the previous worksheets into columns C, G, K, O, and S. Paste the column of p values from the wildtype worksheet into column D. Paste the column of p values from the dCIN5 worksheet into column H. Paste the column of p values from the dGLN3 worksheet into column L. Paste the column of p values from the dHMO1 worksheet into column P. Paste the column of p values from the dZAP1 worksheet into column T.  
*Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".
*Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".
Line 130: Line 130:
===Calculating the F statistic and p value corrections in MATLAB===
===Calculating the F statistic and p value corrections in MATLAB===
-
*The MATLAB script for the statistical analysis needs as input a spreadsheet with all of the normalized data. To prepare this input,  
+
*The MATLAB script for the statistical analysis needs as input a Microsoft Excel spreadsheet with all of the normalized data. To prepare this input,  
:#Open the .csv file created after normalizing the data (i.e. GCAT_and_Ontario_Final_Normalized_Data.csv).
:#Open the .csv file created after normalizing the data (i.e. GCAT_and_Ontario_Final_Normalized_Data.csv).
-
:#Rename the spreadsheet as "Master_Sheet".
+
:#Rename the worksheet as "Master_Sheet".
:#Type the identifier "Gene ID" in cell A1.
:#Type the identifier "Gene ID" in cell A1.
:#Insert a new column after column A.
:#Insert a new column after column A.
Line 143: Line 143:
*Close the .csv file and open MATLAB. Make sure the directory is set to where the input file is located.
*Close the .csv file and open MATLAB. Make sure the directory is set to where the input file is located.
-
*Specify the names of the Excel file and spreadsheet that contains the data and strain for which the data will be import into MATLAB.  
+
*Specify the names of the Excel file and worksheet that contains the data and strain for which the data will be import into MATLAB.  
  filename    = 'Katrina_Latest_Statistical_Analysis_Normalized_Data_20111101.xls';
  filename    = 'Katrina_Latest_Statistical_Analysis_Normalized_Data_20111101.xls';
  sheetname  = 'Master_Sheet';
  sheetname  = 'Master_Sheet';

Revision as of 15:36, 21 July 2014

Image:HeaderBlueBig.gif Home        Research        Protocols        Notebook        People        Publications        Courses        Contact       


Contents

Introduction

The protocol below describes the statistical tests that have been performed to determine the significance of the gene expression changes seen in the microarray data. The Bonferroni and Benjamini & Hochberg corrections were used to correct for false positives after performing the statistical tests.

Analyzing Significant Gene Expression Changes Within Each Strain

For each strain, an ANOVA statistic was calculated to determine what genes have a significant change in expression during at least one time point.

  • To analyze significant changes in gene expression, we need to compute an F statistic for each gene, which is the ratio of the MST (among time points mean square) to the MSE (within time points mean square).
  • The MSE is the within time points sum of squares of deviations from the mean (henceforth, sum of squares (SSF) divided by the corresponding degrees of freedom (i.e. the number of time points T).
  • The SSF is calculated by finding the sum of the squared difference between a gene’s log fold change for time point i and flask j and the mean log fold change for the time point.
SSF=ΣiΣj(Yijj)2 
  • The MST is the among time points sum of squares (SST) divided by the corresponding degrees of freedom (the difference between the total number of time points and replicates N and the number of time points T). The SST is the within time points sum of squares (SSF) subtracted from the sum of squares for all of the data for a gene (SSH).
  • The SSH is calculated by squaring the log fold change and then summing over all of the time points i and flasks j.
SSH=ΣiΣj(Yij)2 
  • So, the F statistic with degrees of freedom N-T and T is
 [(SSH-SSF)/SSF] * [(N-T)/T]
  • Find the p value by using the F distribution with your computed F statistic, significance value, and N-T and T degrees of freedom.

In the case, such as this, when multiple statistical tests are being performed simultaneously, a correction should be applied to the p values in order to mitigate the false positive rate. The Bonferroni method is one such method performed by multiplying each p value by the total number of hypotheses, which in this case corresponds to the number of genes.

p*n≤α

A downside to this correction is that the final will consist of only the most extreme outliers, thus ignoring some potentially significant genes.

A more robust method of correcting the p values is the Benjamini & Hochberg correction (B&H). This method is performed by first sorting all the p values from least to greatest and each gene is given a rank i. Then, the p values are multiplied by the total number of hypotheses n and divided by their rank i.

p*n/i≤α

Statistical Analysis in Excel

The F statistic, Bonferroni correction, and the Benjamini and Hochberg correction can be performed in Excel.

Calculate the F Statistic

  • Create an Excel workbook containing a separate worksheet for each of the following labels: "wt", "dCIN5", "dGLN3", "dHMO1", "dZAP1", "Bonferroni", and "B&H".
  • For each of the strains, copy the gene names and the normalized data from the output after normalization into the respective worksheets.
  • In order to continue with the calculations, it is very important that there are no cells with the value "NA". Replace all of the “NA” values in the data with a space.
  • Complete the following steps for each of the worksheets containing the normalized data:
  • After the last column of the normalized data, label the next five columns (one for each time point), with the designation “<strain>_Avg_LogFC_<timepoint> in row 1.
  • Calculate the average log fold change for each time point. For the data calculated at the t15 time point for the wild type strain, type the following formula into the cell beneath the column header “wt_Avg_LogFC_t15”:
=AVERAGE(B2:E2)

Press enter. Double click on the lower right hand corner of the cell so that Excel computes the average log fold change for the rest of the genes.

  • Calculate the average log fold change for the remaining time points by using the formula aforementioned substituting the range for the cell corresponding to the first replicate measurement for the time point to the cell corresponding to the last replicate measurement for the time point.
  • After the last column of the average log fold changes, label the next few columns (the number of which corresponds to the number of columns of the normalized data) with the header “<strain>_mean_centered_LogFC_<timepoint>-<flask>” in row 1.
  • For each time point, subtract the mean fold change for the time point from each fold change measurement for the time point for each gene (henceforth, mean centered log fold changes). For examples, in row 2 of the column labeled “wt_mean_centered_LogFC-t15-1”, type the following formula and hit enter:
=B2-$Y2. 

Double click on the lower right hand corner of the cell to make the computation for each gene. Use the same formula for the remainder of the time points and genes substituting the first term for the cell corresponding to the average log fold change for the time point and the second term for a cell containing the fold change for a replicate of the time point.

  • After the last column of the mean centered normalized log fold changes, label the next column with the header "<strain>_SSH" in row 1.
  • Calculate the SSH by finding the sum of squares of the log fold changes. For example, in row 2 of the column labeled “wt_SSH” type the following formula
=SUMSQ(B2:X2)

and hit enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • Label the next column after the SSH calculates with the header "<strain>_SSF in row 1.
  • Calculate the SSF by find the sum of squares of all of the mean centered log fold changes. For example, in row 2 of the column labeled “wt_SSF” type the following formula
=SUMSQ(AD2:AZ2)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • In the first cell of the next column after the SSF calculations, enter the header “<strain>_Fstat” in row 1.
  • Calculate the F statistic as previously described. For example, in row 2 of the column labeled “wt_Fstat” type the formula
=((23-5)/5)*((BA2-BB2)/BB2)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • In the first cell of the next column after F stat calculations, type the header “<strain>_Pval”.
  • To calculate the p value, use the formula
=FDIST(F statistic, T, N-T)

For example, in row 2 of the column labeled “wt_Pval” type the formula

=FDIST(BC2,5,23-5)
  • In the first cell of the next column after p value calculations, type the header "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed (i.e. the p value for that gene is less than 0.05). For example, for the wild type data, type the following formula into the second cell of the column.
=IF(BD2<0.05,1,0)

and press enter. Double click on the lower right hand corner of the cell to perform the calculation for all of the genes.

  • Find the sum is the total number of genes that have a p value of less than 0.05.
  • For the wild type data, scroll down to the bottom of the column and type the following formula in the first blank cell
=SUM(BD2:BD6190)

and press enter.

Calculate the Bonferroni p value Correction

  • Select the worksheet labelled "Bonferroni".
  • Paste the "Gene ID" column from one of the earlier five worksheets into the column A of this worksheet.
  • Paste the entire column of p values for the wildtype into the column B of the worksheet as values. Paste the p values for the other strains in the same manner into columns C-F.
  • In cell G1 and H1, type "wt_Bonferroni_pval".
  • Calculate the Bonferroni corrected p value for the wildtype data. To do so, type the following formula into cell G2:
=B2*6189

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into cell H2:
=IF(G2>1,1,G2)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • In cell I2, type "Significantly_Expressed_Genes".
  • Find which genes are significantly differentially expressed by typing the following formula into the cell I2:
=IF(G2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column.
  • In cell I6192, type the following formula:
=SUM(I2:I6190)
  • Find the Bonferroni corrected p values for the rest of the strains following the procedure outlined for the wildtype above. Change the column names so that they reflect the strain on which you are working on. Change the cell designations in the formulas to reflect the cells corresponding to the strain on which you are working on.

Calculate the Benjamini & Hochberg p value Correction

  • Click on the worksheet named "B&H".
  • Create an index column by first typing "Index" into cell A1. Then type "1" into cell A2 and "2" into cell A3. Select both cells A2 and A3. Click, hold, and then drag the lower right hand corner until you reach row 6190. Release once you get to row 6190.
  • Copy and paste the column of Gene ID's from one of the previous worksheets into columns C, G, K, O, and S. Paste the column of p values from the wildtype worksheet into column D. Paste the column of p values from the dCIN5 worksheet into column H. Paste the column of p values from the dGLN3 worksheet into column L. Paste the column of p values from the dHMO1 worksheet into column P. Paste the column of p values from the dZAP1 worksheet into column T.
  • Select all of column D. Sort by ascending values. You can click the menu option Data < Sort... In the new window that opens, keep the option "Expand the selection" and click "Sort...". Keep the "Sort by..." option as "Ascending" and click "OK." Alternately, you can click the sort button from A to Z on the toolbar, keep the option "Expand the selection" and click "OK".

Sort columns, H, L, P, and T in the same way.

  • Calculate the Benjamini and Hochberg p value correction for the wildtype. Type "wt_BH_pval" in cell E1. Type the following formula in cell E2:
=(D2*6189)/A2 

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Calculate the Benjamini and Hochberg correction for the rest of the strains in column I for dCIN5, in column M for dGLN3, in column Q for dHMO1, and in column U for dZAP1.
  • Type "Significantly_Expressed_Genes" into cell F1, J1, N1, R1, and V1.
  • Find which genes are significantly differentially expressed. In order to do so for the wildtype, type the following formula into cell F2:
=IF(E2<0.05,1,0)

and press enter. Instead of typing the cell designations, click on the beginning cell and then double click on the lower right hand corner.

  • Scroll down to the bottom of the column F.
  • In cell 6191 of the column, type the following formula:
=SUM(F2:F6190)
  • Find which genes are significantly differentially expressed for the rest of the strains using columns J, N, R, and V. You can use the formulas above but make sure to alter the cell number.

Calculating the F statistic and p value corrections in MATLAB

  • The MATLAB script for the statistical analysis needs as input a Microsoft Excel spreadsheet with all of the normalized data. To prepare this input,
  1. Open the .csv file created after normalizing the data (i.e. GCAT_and_Ontario_Final_Normalized_Data.csv).
  2. Rename the worksheet as "Master_Sheet".
  3. Type the identifier "Gene ID" in cell A1.
  4. Insert a new column after column A.
  5. Type the identifier "Standard Name" in cell B1. Column B will contain the common names for the genes on the microarray.
  6. Copy the gene names in column A.
  7. Paste the names into the "Value" field of the List <-> Gene List tool in YEASTRACT [1]. Then, click on the "Transform" button.
  8. Select all of the names in the "Gene Name" column of the resulting table.
  9. Copy and paste these names into column B of the .csv file.
  10. Save the .csv file as a .xls file.
  • Close the .csv file and open MATLAB. Make sure the directory is set to where the input file is located.
  • Specify the names of the Excel file and worksheet that contains the data and strain for which the data will be import into MATLAB.
filename    = 'Katrina_Latest_Statistical_Analysis_Normalized_Data_20111101.xls';
sheetname   = 'Master_Sheet';
strain      = 'wt';

Make sure that the strain name is exactly as it appears in the column headers in the Excel spreadsheet.

  • Import the Excel spreadsheet and create a separate matrix containing only the numerical data in the spreadsheet.
[log2FC,b]=xlsread(filename,sheetname);
a = log2FC(:,2:end);
  • For each time point in the experiment, identify the columns in the input file that contain the data for the specified strain and time point.
strainT15 = strfind(b(1,:),[strain '_LogFC_t15']);
strainT30 = strfind(b(1,:),[strain '_LogFC_t30']);
strainT60 = strfind(b(1,:),[strain '_LogFC_t60']);
strainT90 = strfind(b(1,:),[strain '_LogFC_t90']);
strainT120 = strfind(b(1,:),[strain '_LogFC_t120']);
  • Create vectors of column indicies that were identified for each time point for the strain specified.
ind15 = find(not(cellfun('isempty',strainT15)))-(length(b(1,:))-length(a(1,:)));
ind30 = find(not(cellfun('isempty',strainT30)))-(length(b(1,:))-length(a(1,:)));
ind60 = find(not(cellfun('isempty',strainT60)))-(length(b(1,:))-length(a(1,:)));
ind90 = find(not(cellfun('isempty',strainT90)))-(length(b(1,:))-length(a(1,:)));
ind120 = find(not(cellfun('isempty',strainT120)))-(length(b(1,:))-length(a(1,:)));
  • Specify the significance value alpha for the ANOVA, the total number of genes n to be analyzed, the number of independent variables p, and the number of constraints q on the linear model.
alpha = 0.05; 
n = length(a(:,1));
p = 5;
q = 5;
  • Concatenate all of the indices into one vector.
ind = [ind15,ind30,ind60,ind90,ind120];
N = length(ind);
  • To perform the ANOVA, it is necessary to set up a regression matrix X, which is a matrix of ones and zeros that specifies the independent variables. Set up the regressor matrix so that it has as many rows as the number of indices and as many columns as there are independent variables.
X = zeros(N,p);
  • If the indices do not start at 1, then the regressor matrix will not be constructed correctly. To remedy this, create a new set of indices by manipulating the initial indices to begin at 1. Then, create the regressor matrix X.
if ind(1)~= 1
 ind151 = ind15-ind15(1)+1;
 ind301 = ind30-ind30(1)+ind151(end)+1;
 ind601 = ind60-ind60(1)+ind301(end)+1;
 ind901 = ind90-ind90(1)+ind601(end)+1;
 ind1201 = ind120-ind120(1)+ind901(end)+1;
 X(ind151,1) = 1;
 X(ind301,2) = 1;
 X(ind601,3) = 1;
 X(ind901,4) = 1;
 X(ind1201,5) = 1;
else 
  X(ind15,1) = 1;
  X(ind30,2) = 1;
  X(ind60,3) = 1;
  X(ind90,4) = 1;
  X(ind120,5) = 1;
end
  • Specify the dimensions of the output matrix. This output matrix will contain two columns of gene names, a column for the average log fold change for each time point, the F statistic, the p value, the SSH, the SSF, the Benjamini & Hochberg corrected p values, whether or not the p value is significant, and the rank of each gene's p value after performing the Benjamini & Hochberg correction.
out_data = zeros(n,12);
  • To compute the F statistic and p value for each gene for the wild type strain, first the indices have to be modified. This modification involves ignoring the indices that correspond to the time points for which a gene has "NaN" as the log fold change.
for ii = 1:n
if strcmpi(strain,'wt') == 1
   I = find(~isnan(a(ii,ind)));
   ind15x = I(I>=ind15(1)&I<=ind15(end));
   ind30x = I(I>=ind30(1)&I<=ind30(end));
   ind60x = I(I>=ind60(1)&I<=ind60(end));
   ind90x = I(I>=ind90(1)&I<=ind90(end));
   ind120x = I(I>=ind120(1)&I<=ind120(end));
  • Concatenate these modified indices into a new vector indx. Find the length N of the newly created vector.
   indx = [ind15x,ind30x,ind60x,ind90x,ind120x];
   N = length(indx);
  • The continuity of the indices from one time point to another was lost in modifying the indices for the wild type data. Manipulate the modified indices so that they are continuous from one time point to the next.
   if N<23
       ind15x = find(ind15x);
       ind30x = find(ind30x)+ind15x(end);
       ind60x = find(ind60x)+ind30x(end);
       ind90x = find(ind90x)+ind60x(end);
       ind120x = find(ind120x)+ind90x(end);
   end
  • Set up the regressor matrix using the indices ind15x, ind30x, ind60x, ind90x, and ind120x.
   X = zeros(N,p);
   X(ind15x,1) = 1;
   X(ind30x,2) = 1;
   X(ind60x,3) = 1;
   X(ind90x,4) = 1;
   X(ind120x,5) = 1;
  • The aforementioned modifications to the indices do not need to be performed for the deletion strains. Therefore, the following condition ensures that the indices remain as they were computed at the beginning of the code.
else
   indx = ind;
end
  • For each gene, create a vector fo the log fold change values that correspond to that gene.
 Y = a(ii,indx)'; 
  • Compute the average log fold change.
 beta = X\Y;
  • Specify which columns in the output matrix will store the average log fold changes.
 out_data(ii,1:q) = beta';
  • Find the sum of squares of error for the hypothesized model s2h and the full model s2. Then, find the F statistic and p value.
 s2 = (1/N)*(Y-X*beta)'*(Y-X*beta); 
 s2h = (1/N)*(Y'*Y); 
 W      = ((N-p)/q)*((s2h-s2)/s2); 
 pval    = 1-fcdf(W,q,N-p); 
  • Specify the columns that are to be outputted.
 out_data(ii,6) = W;
 out_data(ii,7) = pval;
 out_data(ii,8) = s2*N;
 out_data(ii,9) = s2h*N;
end
  • Perform the Benjamini & Hochberg correction.
p1 = out_data(:,7);
q1 = (alpha/n)*[1:n]';
[ps,is] = sort(p1); 
tests   = ps<=q1; 
itests  = find(tests==1);
itestm  = max(itests);
qs = zeros(size(q1));
rs = qs;
for ii = 1:n
   qs(is(ii)) = ps(ii)*n/ii; 
   rs(is(ii)) = ii;
end
  • Output the data.
signif                  = zeros(n,1);
signif(is(1:itestm))    = 1;
out_data(:,10) = qs;
out_data(:,11) = signif;
out_data(:,12) = rs;
eval(['save ' strain '_out_data out_data;']);
out_data_cells{1,1} = 'Systematic Name';
out_data_cells{1,2} = 'Standard Name';
out_data_cells{1,3} = [strain '_t15'];
out_data_cells{1,4} = [strain '_t30'];
out_data_cells{1,5} = [strain '_t60'];
out_data_cells{1,6} = [strain '_t90'];
out_data_cells{1,7} = [strain '_t120'];
out_data_cells{1,8} = 'f stat';
out_data_cells{1,9} = 'p val';
out_data_cells{1,10} = 'SS full';
out_data_cells{1,11} = 'SS hyp';
out_data_cells{1,12} = 'B&H comps';
out_data_cells{1,13} = '? signif ?';
out_data_cells{1,14} = 'B&H rank';

for ii = 1:n
   out_data_cells{1+ii,1} = b{ii+1,1};
   out_data_cells{1+ii,2} = b{ii+1,2};
   for jj = 1:12
       out_data_cells{1+ii,2+jj} = out_data(ii,jj);
   end
end
xlswrite([strain '_one_strain_ANOVA_out_data.xls'],out_data_cells)

Comparing Significant Changes in Expression Between Two Strains

A general linear model is used to test the hypothesis that each gene shows the same pattern of gene expression over the time course of the experiment in each strain. The null hypothesis is that the average log fold change of a gene g for a time point j in one strain s is the same as that in another strain s'. The statistical analysis was performed in MATLAB. As with the statistical test to determine significant changes in expression in each strain, the analysis of gene expression between two genes involves finding the sum of squares of error for the models under the null and alternative hypothesis and, then, finding the F statistic and p value for each gene.

Finding a General Linear Model in MATLAB

The following is a code with comments as implemented in MATLAB.

  • Designate the input Excel workbook as well as what sheet in the notebook MATLAB should import.
filename = 'Katrina_Latest_Statistical_Analysis_Normalized_Data_20111101.xls';
sheetname  = 'Master_Sheet';
  • Designate one strain as strain1 and the other as strain2. If one of the strains in the statistical test is the wildtype, make sure that wt is specified as strain1.
% % If one of the two strains you are working on is the wildtype, keep that
% % wildtype as strain 1.
strain1    = 'wt'; %wt, dCIN5, dGLN3, dHMO1, or dZAP1
% % Select strain 2 to be one of the other strains you would like to
% % compare with the first strain.
strain2    = 'dCIN5'; %dCIN5, dGLN3, dHMO1, or dZAP1
  • Load numerical and character data. Then, isolate the numerical data from the numerical column headers.
[log2FC,b]=xlsread(filename,sheetname);
a = log2FC(:,2:end);
%
% a = numerical data from sheet in matrix form
% b = character data from sheet in cell array form
%
%
  • Identify where in the cell array b does the first strain strain1 and the specified time point appear.
T15S1 = strfind(b(1,:),[strain1 '_LogFC_t15']); %15 minutes
T30S1 = strfind(b(1,:),[strain1 '_LogFC_t30']); %30 minutes
T60S1 = strfind(b(1,:),[strain1 '_LogFC_t60']); %60 minutes
T90S1 = strfind(b(1,:),[strain1 '_LogFC_t90']); %90 minutes
T120S1 = strfind(b(1,:),[strain1 '_LogFC_t120']); %120 minutes
  • Identify where in the cell array b does the second strain strain2 and the specified time point appear.
T15S2 = strfind(b(1,:),[strain2 '_LogFC_t15']); %15 minutes
T30S2 = strfind(b(1,:),[strain2 '_LogFC_t30']); %30 minutes
T60S2 = strfind(b(1,:),[strain2 '_LogFC_t60']); %60 minutes
T90S2 = strfind(b(1,:),[strain2 '_LogFC_t90']); %90 minutes
T120S2 = strfind(b(1,:),[strain2 '_LogFC_t120']); %120 minutes
  • Create an index designating which column of the data array a corresponds to each time point for the first strain strain1.
ind15S1 = find(not(cellfun('isempty',T15S1)))-(length(b(1,:))-length(a(1,:))); %15 minutes
ind30S1 = find(not(cellfun('isempty',T30S1)))-(length(b(1,:))-length(a(1,:))); %30 minutes
ind60S1 = find(not(cellfun('isempty',T60S1)))-(length(b(1,:))-length(a(1,:))); %60 minutes
ind90S1 = find(not(cellfun('isempty',T90S1)))-(length(b(1,:))-length(a(1,:))); %90 minutes
ind120S1 = find(not(cellfun('isempty',T120S1)))-(length(b(1,:))-length(a(1,:))); %120 minutes
  • Combine all data column indices into one array
indS1   = [ind15S1,ind30S1,ind60S1,ind90S1,ind120S1];
  • Create an index designating which column of the data array a corresponds to each time point for the second strain strain2.
ind15S2 = find(not(cellfun('isempty',T15S2)))-(length(b(1,:))-length(a(1,:))); %15 minutes
ind30S2 = find(not(cellfun('isempty',T30S2)))-(length(b(1,:))-length(a(1,:))); %30 minutes
ind60S2 = find(not(cellfun('isempty',T60S2)))-(length(b(1,:))-length(a(1,:))); %60 minutes
ind90S2 = find(not(cellfun('isempty',T90S2)))-(length(b(1,:))-length(a(1,:))); %90 minutes
ind120S2 = find(not(cellfun('isempty',T120S2)))-(length(b(1,:))-length(a(1,:))); %120 minutes
  • Combine all data column indices into one array
indS2   = [ind15S2,ind30S2,ind60S2,ind90S2,ind120S2];
N = length(indS1)+length(indS2);
  • Determine the significance value for the statistical test and the total number of genes. The variable p is number of time points multiplied by the number of strains while q = p/number of strains.
alpha = 0.05; %Significance value for statistical test.
n = length(a(:,1)); %Total number of genes.
p = 10; % Number of time points multiplied by the number of strains.
q = 5; % p/number of strains
  • To compute the sum of the square of the error in the models associated with the null and alternative hypotheses, it is necessary to construct two matrices (X and Xh) representing the independent variables in the linear models associated with the two hypotheses. The following code constructs these matrices in the case that neither of the two strains in the statistical analysis are the wild type.
if strcmpi(strain1,'wt') == 0
   indS1x   = find(indS1);
   i0 = ind15S1(1);
   i1 = indS1x(end);
   i2 = indS2(1);
   
   %Create new indices without any breaks so as to properly construct the
   %regressor matrices X and Xh.
   ind151 = find(ind15S1);
   ind301 = ind30S1-i0+1;
   ind601 = ind60S1-i0+1;
   ind901 = ind90S1-i0+1;
   ind1201 = ind120S1-i0+1;
   
   ind152 = ind15S2-i2+1+i1;
   ind302 = ind30S2-i2+1+i1;
   ind602 = ind60S2-i2+1+i1;
   ind902 = ind90S2-i2+1+i1;
   ind1202 = ind120S2-i2+1+i1;
   
   %Length of each array of indices
   n151 = length(ind151);
   n301 = length(ind301);
   n601 = length(ind601);
   n901 = length(ind901);
   n1201 = length(ind1201);
   n152 = length(ind152);
   n302 = length(ind302);
   n602 = length(ind602);
   n902 = length(ind902);
   n1202 = length(ind1202);
   
   %ts1 and ts2 are arrays of time points reflecting the number of
   %replicates in the data for strain1 and strain2, respectively. 
   %This will be necessary to later plot the linear model for each gene.
   ts1 = [ones(n151,1)*15;ones(n301,1)*30;ones(n601,1)*60;ones(n901,1)*90;ones(n1201,1)*120];
   ts2 = [ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120];
   t   = [ts1;ts2];
   [tsort,is] = sort(t); %tsort contains the sorted list of time points after combining arrays ts1 and ts2.
   
   % X and Xh are regressor matrices for the full and hypothesized
   % models, respectively
   
   X   = zeros(N,p);
   Xh  = zeros(N,p-q);
   
   Xh(ind151,1) = 1;
   Xh(ind301,2) = 1;
   Xh(ind601,3) = 1;
   Xh(ind901,4) = 1;
   Xh(ind1201,5) = 1;
   Xh(ind152,1) = 1;
   Xh(ind302,2) = 1;
   Xh(ind602,3) = 1;
   Xh(ind902,4) = 1;
   Xh(ind1202,5) = 1;
   
   X(ind151,1) = 1;
   X(ind301,2) = 1;
   X(ind601,3) = 1;
   X(ind901,4) = 1;
   X(ind1201,5) = 1;
   X(ind152,6) = 1;
   X(ind302,7) = 1;
   X(ind602,8) = 1;
   X(ind902,9) = 1;
   X(ind1202,10) = 1;
end


%
% out_data includes
%
out_data = zeros(n,12);
nsig = 0;
for ii=1:n
   if strcmpi(strain1,'wt') == 1
       %     Excludes all fold changes that have are an NaN.
       I = find(~isnan(a(ii,indS1)));
       ind15x = I(I>=ind15S1(1)&I<=ind15S1(end));
       ind30x = I(I>=ind30S1(1)&I<=ind30S1(end));
       ind60x = I(I>=ind60S1(1)&I<=ind60S1(end));
       ind90x = I(I>=ind90S1(1)&I<=ind90S1(end));
       ind120x = I(I>=ind120S1(1)&I<=ind120S1(end));
       
       %indx represents the list of columns from which the data will be drawn
       indx = [ind15x,ind30x,ind60x,ind90x,ind120x];
       c = length(indx);
       
       if c<23
           ind15x = find(ind15x);
           ind30x = find(ind30x)+ind15x(end);
           ind60x = find(ind60x)+ind30x(end);
           ind90x = find(ind90x)+ind60x(end);
           ind120x = find(ind120x)+ind90x(end);
       end
       
       indX = find(indx);
       i1 = indX(end);
       i2 = indS2(1);
       
       ind152m = ind15S2-i2+1+i1;
       ind302m = ind30S2-i2+1+i1;
       ind602m = ind60S2-i2+1+i1;
       ind902m = ind90S2-i2+1+i1;
       ind1202m = ind120S2-i2+1+i1;
       
       N = length(indx)+length(indS2);
       
       % X and Xh are regressor matrices for the fulll and hypothesized
       % mdoels, respectively
       X   = zeros(N,p);
       Xh  = zeros(N,p-q);
       
       Xh(ind15x,1) = 1;
       Xh(ind30x,2) = 1;
       Xh(ind60x,3) = 1;
       Xh(ind90x,4) = 1;
       Xh(ind120x,5) = 1;
       Xh(ind152m,1) = 1;
       Xh(ind302m,2) = 1;
       Xh(ind602m,3) = 1;
       Xh(ind902m,4) = 1;
       Xh(ind1202m,5) = 1;
       
       X(ind15x,1) = 1;
       X(ind30x,2) = 1;
       X(ind60x,3) = 1;
       X(ind90x,4) = 1;
       X(ind120x,5) = 1;
       X(ind152m,6) = 1;
       X(ind302m,7) = 1;
       X(ind602m,8) = 1;
       X(ind902m,9) = 1;
       X(ind1202m,10) = 1;
       
       %Length of each array of indices
       n151 = length(ind15x);
       n301 = length(ind30x);
       n601 = length(ind60x);
       n901 = length(ind90x);
       n1201 = length(ind120x);
       n152 = length(ind152m);
       n302 = length(ind302m);
       n602 = length(ind602m);
       n902 = length(ind902m);
       n1202 = length(ind1202m);
       
       %ts1 and ts2 are arrays of time points reflecting the number of
       %replicates in the data for strain1 and strain2, respectively.
       %This will be necessary to later plot the linear model for each gene.
       ts1 = [ones(n151,1)*15;ones(n301,1)*30;ones(n601,1)*60;ones(n901,1)*90;ones(n1201,1)*120];
       ts2 = [ones(n152,1)*15;ones(n302,1)*30;ones(n602,1)*60;ones(n902,1)*90;ones(n1202,1)*120];
       t   = [ts1;ts2];
       [tsort,is] = sort(t); %tsort contains the sorted list of time points after combining arrays ts1 and ts2.
       
   else
       indx = indS1;
   end
   
   Ys1 = a(ii,indx)'; %Data for a gene for the first strain
   Ys2 = a(ii,indS2)'; %Data for a gene for the second strain
   Y   = [Ys1;Ys2];
   
   beta    = X\Y; %Average log fold change for each time point in the full model.
   betah   = Xh\Y; %Average log fold change for each time point in the hypothesized model.
   
   out_data(ii,1:10) = beta';
   out_data(ii,11:15) = betah';
   
   s2 = (1/N)*(Y-X*beta)'*(Y-X*beta); %Sum of squares of error in the full model
   s2h = (1/N)*(Y-Xh*betah)'*(Y-Xh*betah); %Sum of squares of error in the hypothesized model
   
   W       = ((N-p)/q)*((s2h-s2)/s2); % F statistic
   pval    = 1-fcdf(W,q,N-p); %P value
   
   out_data(ii,16) = W;
   out_data(ii,17) = pval;
   out_data(ii,18) = s2*N;
   out_data(ii,19) = s2h*N;
   
   if strcmpi(strain1,'wt') == 1
       Ym1 = X(ind15x(1):ind120x(end),1:5)*beta(1:5,1);
       Ym2 = X(ind152m(1):ind1202m(end),6:10)*beta(6:10,1);
   else
       Ym1 = X(ind151(1):ind1201(end),1:5)*beta(1:5,1);
       Ym2 = X(ind152(1):ind1202(end),6:10)*beta(6:10,1);
   end
   Yh  = Xh*betah;
   
   %Graph the expression profile of each gene as determined by the general
   %linear model.
   
   pause on
   if pval<0.05
       figure(1),plot(ts1,Ys1,'ro'),hold on,plot(ts2,Ys2,'mo'),plot(ts1,Ym1,'r','LineWidth',2),plot(ts2,Ym2,'m','LineWidth',2),...
           plot(tsort,Yh(is),'k','LineWidth',2),title([b{ii+1,1} '/' b{ii+1,2},'   W = ',num2str(W),' pval = ',num2str(pval)]),...
           xlabel('time, minutes'),ylabel('log fold change'),...
           legend([strain1 ' data'],[strain2 ' data'],[strain1 ' model'],[strain2 ' model'],'hypothesized model'),drawnow
       hold off
       nsig = nsig + 1;
       [ii,nsig]
       drawnow
       pause
   end
end
p1 = out_data(:,17); %P values
q1 = (alpha/n)*[1:n]';
[ps,is] = sort(p1); %Sort the p values.
tests   = ps<=q1;
itests  = find(tests==1);
itestm  = max(itests);
qs = zeros(size(q1));
rs = qs;
for ii = 1:n
   qs(is(ii)) = ps(ii)*n/ii; % Perform B&H Correction
   rs(is(ii)) = ii;
end
signif                  = zeros(n,1);
signif(is(1:itestm))    = 1; % Specify whether or not the corrected p value is significant
% Output all data 
out_data(:,20) = qs;
out_data(:,21) = signif;
out_data(:,22) = rs;
eval(['save ' strain1 '_' strain2 '_out_data out_data;']);
out_data_cells{1,1} = 'Systematic Name';
out_data_cells{1,2} = 'Standard Name';
out_data_cells{1,3} = [strain1 '_t15'];
out_data_cells{1,4} = [strain1 '_t30'];
out_data_cells{1,5} = [strain1 '_t60'];
out_data_cells{1,6} = [strain1 '_t90'];
out_data_cells{1,7} = [strain1 '_t120'];
out_data_cells{1,8} = [strain2 '_t15'];
out_data_cells{1,9} = [strain2 '_t30'];
out_data_cells{1,10} = [strain2 '_t60'];
out_data_cells{1,11} = [strain2 '_t90'];
out_data_cells{1,12} = [strain2 '_t120'];
out_data_cells{1,13} = 'hyp_t15';
out_data_cells{1,14} = 'hyp_t30';
out_data_cells{1,15} = 'hyp_t60';
out_data_cells{1,16} = 'hyp_t90';
out_data_cells{1,17} = 'hyp_t120';
out_data_cells{1,18} = 'f stat';
out_data_cells{1,19} = 'p val';
out_data_cells{1,20} = 'SS full';
out_data_cells{1,21} = 'SS hyp';
out_data_cells{1,22} = 'B&H comps';
out_data_cells{1,23} = '? signif ?';
out_data_cells{1,24} = 'B&H rank';
for ii = 1:n
   out_data_cells{1+ii,1} = b{ii+1,1};
   out_data_cells{1+ii,2} = b{ii+1,2};
   for jj = 1:22
       out_data_cells{1+ii,2+jj} = out_data(ii,jj);
   end
end
xlswrite([strain1 '_' strain2 '_ANOVA_out_data.xls'],out_data_cells)
Personal tools