Evan Montz Week 8

Excel File Spreadsheet
[[Media:Merrell_Compiled_Raw_Data_Vibrio_EM_20101024.xls | Evan Montz Excel Spreadsheet]]

Text File Spreadsheet
[[Media:Merrell_Compiled_Raw_Data_Vibrio_EM_20101024.txt‎ | Evan Montz Text Spreadsheet]]

Electronic Notebook

 * I first downloaded the file, opened it, and re-saved it to make it unique.
 * I inserted the "scaled_centered" worksheet into the file.
 * Then I copied all the raw data into the new worksheet and added the rows for Average and Standard Deviation.
 * Using the "=Average" and "STDEV" functions, I was able to calculate the values for the log ratios for each chip.
 * It was noticed that there were some cells within the raw data that had no data in them. This yielded a minor error saying that some cells had no data. This error was ignored.
 * Next I added in the "scaled_centered" columns for each run of each patient.
 * Now using the "=(B4-$B$2)/$B$3" equation along with the drag down feature, this value was calculated for each cell for every run.
 * NOTE: It is vital to use the "$" symbol while writing this equation because without it, when we used the drag down feature, excel would continue to reference different cells and that was not what we desired.
 * Next, I created the new worksheet titled "statistics"
 * In this worksheet, I copied the first column along with all the columns that were scaled and centered.
 * Then I created the three average columns for each patient and using the =AVERAGE command along with the drag down feature again, I computed the values for each cell for all three patients.
 * Next, I created the column that was to display the average of all three averages. The averages of these values were taken using the same method as the previous step, except I took the average of the three averages rather than the scaled and centered data.
 * Next, I created the Tstat column and computed the results using the function: =AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates)) where the number of replicates was 3 for our case. The drag down feature was again used to obtain the values for all cells.
 * The last column that was created was the Pvalue column. The values were calculated using the function: =TDIST(ABS(R2),degrees of freedom,2) where the degrees of freedom was 2 for our case.  The values were again dragged down to get the rest of the numbers.
 * Finally, I created the "forGenMAPP" worksheet.
 * Here, the data was manipulated to get the correct number of decimal places for the different values. The statistical columns were also cut and shifted to the beginning of the spreadsheet.  I also added the "system code" column after the ID column and added an "N" to each cell using the drag down method.
 * Lastly, the file was re-saved as an .xls file and then saved as a .txt file.
 * Both were then uploaded to this page at the top.

Sanity Check: Number of genes significantly changed
p<0.05 = 948 genes fit into this criteria p<0.01 = 235 genes fit into this criteria p<0.001 = 24 genes fit into this criteria p<0.0001 = 2 genes fit into this criteria p<0.05 and "Avg_LogFC_all">0 = 352 genes fit into this criteria p<0.05 and "Avg_LogFC_all"<0 = 596 genes fit into this criteria p<0.05 and "Avg_LogFC_all">0.25 = 339 genes fit into this criteria p<0.05 and "Avg_LogFC_all"<-0.25 = 579 genes fit into this criteria
 * I clicked on the A1 cell and applied the Autofilter.
 * The autofilter was applied to the pvalue column according to the following criterion:
 * The pvalue criteria was returned to p<0.05
 * The "Avg_LogFC_all" column was then filtered to display all genes with an average log fold change greater than zero.
 * The "Avg_LogFC_all" column was then filtered to display all genes with an average log fold change less than zero.
 * The "Avg_LogFC_all" column was then filtered to display all genes with an average log fold change greater than 0.25 or less than -0.25 (This is more realistic).
 * From what I can tell, our experiment was very similar to that of Merrell et al. (2002). The main difference I see is that they used Statistical Analysis for Microarrays (SAM) program and we used Excel.  Assuming that both programs do the same thing, both experiments were very similar.

Sanity Check: Compare individual genes with known data

 * VC0028 (in the table there were two genes with this title)
 * fold change = 1.65 and 1.27 respectively
 * p value = 0.0474 and 0.0629 respectively
 * significantly changed? NO p<0.05 and YES p>0.05 respectively
 * VC0941 (in the table there were two genes with this title)
 * fold change = 0.09 and -.28 respectively
 * p value = 0.6759 and 0.1636 respectively
 * significantly changed? YES p>0.05 and YES p>0.05 respectively
 * VC0869 (in the table there were five genes with this title)
 * fold change = 1.59, 1.95, 2.20, 1.50 and 2.12 respectively
 * p value = 0.0463, 0.0227, 0.0020, 0.0174, and 0.0200 respectively
 * significantly changed? NO p<0.05, NO p<0.05, NO p<0.05, NO p<0.05 and NO p<0.05 respectively
 * VC0051 (in the table there were two genes with this title)
 * fold change = 1.92 and 1.89 respectively
 * p value = 0.0139 and 0.0160 respectively
 * significantly changed? NO p<0.05 and NO p<0.05 respectively
 * VC0647 (in the table there were three genes with this title)
 * fold change = -1.11, -0.94 and -1.05 respectively
 * p value = 0.0003, 0.0125, 0.0051 respectively
 * significantly changed? NO p<0.05, NO p<0.05 and NO p<0.05 respectively
 * VC0468
 * fold change = -0.17
 * p value = 0.3350
 * significantly changed? YES p>0.05
 * VC2350
 * fold change = -2.40
 * p value = 0.0130
 * significantly changed? NO p<0.05
 * VCA0583
 * fold change = 1.06
 * p value = 0.1011
 * significantly changed? YES p>0.05

Evan Montz 22:55, 24 October 2010 (EDT)