Aherman week 8

Journal of Manipulations Performed on Data

 * 1) Downloaded data from week 8 assignment
 * 2) Renamed file as Raw_Data_Vibrio_AH_20101019.xls
 * 3) Created new workbook and named it scaled_centered
 * 4) Copied and pasted critical data into new workbook
 * 5) Computed averages and standard deviations for each column
 * 6) Computed scaled and centered values for each column using a formula such as =(A4-$A$2)/$A$3, where A2 holds the column average and A3 the column standard deviation
 * 7) Created a new workbook and labeled it statistics, then copied and pasted the scaled and centered data for each data column
 * 8) Averaged the log ratio fold changes for A, B, and C replicates by inserting new columns and using the equation =average(B4:E4)
 * 9) Computed an average of the averages for the A, B, and C columns using the same technique as above
 * 10) Created a T_Stat column that performed a t test on the data set.
 * 11) Copied and pasted data into a new worksheet labeled forGenMAPP.
 * 12) Rearranged the data so that the t test and p values are at the front, and added a system code line containing N.
 * 13) Performed a sanity check on the data so far using the filter/find technique:
 * How many genes have a p value < 0.05? - 948
 * What about p < 0.01? - 235
 * What about p < 0.001? - 24
 * What about p < 0.0001? - 2
 * Keeping the p value filter at p < 0.05:
 * Filtered Avg_LogFC_all greater than zero - 352
 * Filtered Avg_LogFC_all less than zero - 596
 * Filtered Avg_LogFC_all > 0.25 or < -0.25 - 918
 * In summation: the p value is a cut-off that can be set by the individual researcher. A researcher who wants to include more genes in the results, at the cost of less confidence, should raise the p value cut-off (for example, from 0.01 to 0.05). A researcher who wants fewer genes but more confidence in the fold change difference should lower the cut-off, as the counts above show (948 genes at p < 0.05 versus 2 at p < 0.0001). For GenMAPP, the cut-off will be set at a fold change of > 0.25 or < -0.25, with the p value cut off at < 0.05. This will result in the inclusion of several hundred genes.
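The spreadsheet steps above (scaling and centering each column, averaging replicates, computing a t statistic, and counting genes past a fold change cut-off) can be sketched in Python. This is only an illustration, not the workbook itself: the sample numbers are made up, the function names are hypothetical, and the t statistic is assumed to be a one-sample test against zero, which the journal does not state. The p value step is left out because it needs a t distribution lookup that the Python standard library does not provide.

```python
import math
import statistics


def scale_and_center(column):
    """Mirror the spreadsheet formula =(A4-$A$2)/$A$3 for one column."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in column]


def t_statistic(values):
    """One-sample t statistic against zero (an assumed choice of test)."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


def count_passing(avg_log_fc, cutoff=0.25):
    """Count genes whose average log fold change is > cutoff or < -cutoff."""
    return sum(1 for fc in avg_log_fc if fc > cutoff or fc < -cutoff)


# Made-up log ratios: one list per chip (column), one entry per gene (row).
chips = [
    [0.5, -1.0, 1.5, 0.0],
    [1.0, -0.5, 2.0, -0.5],
    [0.0, -1.5, 1.0, 0.5],
]

scaled = [scale_and_center(col) for col in chips]

# Regroup the scaled columns so that each tuple holds one gene's replicates.
per_gene = list(zip(*scaled))

avg_log_fc = [statistics.mean(gene) for gene in per_gene]
t_stats = [t_statistic(gene) for gene in per_gene]
n_passing = count_passing(avg_log_fc)
```

After scaling and centering, every column has an average of 0 and a standard deviation of 1, which is what makes the chips comparable before the replicates are averaged.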
 * 1) Merrell et al. (2002) used a method which quantified, normalized and corrected their data to yield an intensity ratio relative to a control reference signal.
 * 2) The Merrell method used the Statistical Analysis for Microarrays (SAM) program, whereas we used Excel spreadsheets.
 * Merrell et al. adjusted data:
 * Gene ID VC0028 - Fold change = 1.65 p value = 0.0474
 * Gene ID VC0941 - Fold change = 0.09 p value = 0.06759
 * Gene ID VC0869 - Fold change = 1.59 p value = 0.0463
 * Gene ID VC0051 - Fold change = 1.92 p value = 0.0139
 * Gene ID VC0647 - Fold change = -0.94 p value = 0.0125
 * Gene ID VC0468 - Fold change = -.017 p value = 0.3350
 * Gene ID VC2350 - Fold change = -2.40 p value = 0.0130
 * Gene ID VCA0583 - Fold change = 0.61 p value = 0.1457
 * Computed data:
 * Gene ID VC0028 - Fold change = 1.28/0.99 p value = 0.9479/0.0776 - the fold change is lower, while the p value is higher
 * Gene ID VC0941 - Fold change = -0.26/-0.47 p value = 0.1647/0.9719 - the fold change is lower, and the p value is considerably higher
 * Gene ID VC0869 - Fold change = 1.16 p value = 0.0009/0.4391 - the fold change is lower; one p value is lower and the other is considerably higher
 * Gene ID VC0051 - Fold change = 1.47/1.41 p value = 0.2379/0.0148 - the fold change is lower, while the p value is higher
 * Gene ID VC0647 - Fold change = -1.02/-1.11 p value = 0.1775/0.2667 - the fold change is pretty close, while the p value is considerably higher
 * Gene ID VC0468 - Fold change = 0.10 p value = 0.9856 - both values are considerably higher
 * Gene ID VC2350 - Fold change = -2.15 p value = 0.1056- both values are higher
 * Gene ID VCA0583 - Fold change = 1.12/0.99 p value = 0.1479/0.2002- both the fold change and the p value are higher
 * As can be seen from the results above, the two methods of data analysis give considerably different results. This matters because some genes may be left out or misrepresented as a result of poor data analysis.

Andrew Herman 18:03, 24 October 2010 (EDT)
 * Raw Data File
 * Andrew Herman's Final Data file in .txt format
 * Andrew Herman's Final Data file in .xls format