Richard Brous Week 8

Richard Brous

 * Week 8 Assignment
 * [[Media:RAB_10_23_2010_Merrell_Compiled_Raw_Data_Vibrio.xls|Vibrio Excel Data File]]
 * [[Media:RAB_10_23_2010_Merrell_Compiled_Raw_Data_Vibrio.txt|Vibrio Tab Deliminated GenMAPP File]]

Downloaded the "Merrill_Compiled_Raw_Data_Vibrio.xls data file to desktop

 * Used a PC so beware they are on an older rev of MS Excel than the newer Mac systems.
 * Added initials and date to my file so I can easily locate it later and not worry about other students using a duplicate file name and potentially overwriting my file.
 * Filename is: RAB_10_19_2010_Merrell_Compiled_Raw_Data_Vibrio.xls

Normalize the log ratios for the set of slides in the experiment

 * Be sure to fully expand both the application and the document so you can see the worksheets at the bottom
 * Added a new worksheet and named it "scaled_centered" and copied the raw data into it.
 * click on top left corner of spreadsheet for fast way to select all.
 * Insert two new top rows and name them Average and Standard Deviation respectivly
 * AVERAGE of each data column (each chip)
 * formula is =AVERAGE(start:finish)
 * =AVERAGE(B4:B5224)
 * If you don't know the exact range, click on the start cell, then hold shift and select a few cells below. Release Shift then quickly scroll to bottom of column. Hold shift again and click on the last cell at bottom of column, they should all be highlighted now. Press ENTER.
 * Copy and paste this formula as is across all columns to average each one.
 * STANDARD DEVIATION of each data column
 * formula is =STDEV(start:finish)
 * =STDEV(B4:B5224)
 * If you don't know the exact range, click on the start cell, then hold shift and select a few cells below. Release Shift then quickly scroll to bottom of column. Hold shift again and click on the last cell at bottom of column, they should all be highlighted now. Press ENTER.
 * Copy and paste this formula as is across all columns to average each one.
 * Scaling and Centering
 * Insert a new column to the right of each existing data column and append the label with "scaled_centered"
 * A1 | A1 scaled_centered
 * Starting formula in cell c4
 * typed: =(B4-$B$2)/$B$3 which will subtract the average from B4 then divide it by the stdev.
 * $ are used in the formula to show that certain data is used as a REFERENCE and must not be modified.
 * copied and pasted the formula into the entire column.
 * Repeated for all remaining scaled_centered columns adjusting for each specific column of data.

Perform statistical analysis on the ratios

 * Created a new worksheet labeled statistics
 * copied ID column into first column of it
 * Copied A1_centered_scaled column from scaled_centered worksheet into column B of it.
 * Copied A2_centered_scaled column from scaled_centered worksheet into column C of it.
 * Copied A3_centered_scaled column from scaled_centered worksheet into column D of it.
 * Copied A4_centered_scaled column from scaled_centered worksheet into column E of it.
 * Copied B1_centered_scaled column from scaled_centered worksheet into column F of it.
 * Copied B2_centered_scaled column from scaled_centered worksheet into column G of it.
 * Copied B3_centered_scaled column from scaled_centered worksheet into column H of it.
 * Copied B4_centered_scaled column from scaled_centered worksheet into column I of it.
 * Copied C1_centered_scaled column from scaled_centered worksheet into column J of it.
 * Copied C2_centered_scaled column from scaled_centered worksheet into column K of it.
 * Copied C3_centered_scaled column from scaled_centered worksheet into column L of it.
 * Copied C4_centered_scaled column from scaled_centered worksheet into column M of it.
 * Performed Avg_LogFC_A (average of A1 - A4) and copied through each cell of column N
 * =AVERAGE(B2:E2) then copied throughout column
 * Performed Avg_LogFC_B (average of B1 - B4) and copied through each cell of column O
 * =AVERAGE(F2:I2) then copied throughout column
 * Performed Avg_LogFC_C (average of C1 - C4) and copied through each cell of column P
 * =AVERAGE(J2:M2) then copied throughout column
 * Performed Avg_LogFC_all (average of other 3 averages) and copied through each cell of column O
 * =AVERAGE(N2:P2) then copied throughout column
 * Inserted a new column to the right Avg_LogFC_all called Tstat
 * Computes a T statistic which shows whether the scaled and centered data av log ratios differ significantly
 * = AVERAGE(N2:P2)/(STDEV(N2:P2)/SQRT(number of replicates)
 * our example uses 3 replicates
 * copied throughout column
 * Inserted a new column to right of Tstat called Pvalue
 * =TDIST(ABS(R2),degrees of freedom, 2)
 * degrees of freedom is number of replicates minus one = 2 in our case
 * copied throughout column
 * Created new worksheet named forGenMAPP
 * copied statistics worksheet data into it using paste special values
 * Columns B through Q
 * Format cells: numbers to 2 decimal places
 * Columns R and S
 * Format cells: numbers to 4 decimal places
 * Cut Columns N through S and insert cut cells to Column B
 * This will insert columns N through S in between columns B and H without overwriting anything
 * Deleted rows 2 and 3
 * Now data is directly below header row 1 of each column
 * Inserted new column called SystemCode to the right of ID column
 * filled column with letter "N"
 * Saved the forGenMAPP worksheet as a text (tab deliminated) file *.txt
 * Ignored warnings from Excel
 * Uploaded both to Wiki
 * filename1: RAB_10_23_10_Merrell_Compiled_Raw_Data_Vibrio.xls
 * filename2: RAB_10_23_10_Merrell_Compiled_Raw_Data_Vibrio.txt

Sanity check: Number of genes significantly changed

 * Opened Spreadsheet to forGenMAPP worksheet
 * Selected cell A1 and selected Data > Filter> AutoFilter to bring up the pull downs across each column header
 * Enabled the ability to filter data by column by criteria I set
 * Selected pull down from Pvalue column
 * Chose Custom to allow me to set my own criterion for filter of this column
 * Pvalue < 0.05
 * 948 returned
 * Pvalue < 0.01
 * 235 returned
 * Pvalue < 0.001
 * 24 returned
 * Pvalue < 0.0001
 * 2 returned
 * Pvalue cutoff of p < 0.05
 * Says you would have seen a gene expression change that deviates this far from zero, less than 5% of the time
 * Performed 5221 T tests for significance
 * which means with p < 0.05, would expect to see this magnitude of gene expression change in about 5% of our T tests (261 times)
 * since more than 261 genes pass this cutoff, it shows that some genes have significantly changed - BUT we don't know which ones.
 * Avg_LogFC_all colum
 * Provides the size of the gene expression change and in what direction.
 * Positive means increased values relative to control values
 * Negative means decreased values relative to control values
 * Keep Pvalue filter at p < 0.05
 * Filter Avg_LogFC_all show all genes with an average log fold change greater than zero
 * 352 returned
 * Filter Avg_LogFC_all show all genes with an average log fold change less than zero
 * 596 returned
 * Filter Avg_LogFC_all show all genes with an average log fold change p > 0.25 && p < -0.25greater than zero
 * 918 returned
 * Summary: p value cut-off should be thought of as a moveable confidence level
 * to be very confident means the p value cut-off should be small but will reduce data size
 * If ok to be less confident means the p value cut-off can be larger which allows us to include more genes in the analysis
 * For GenMAPP
 * Will use the fold change cut-off of greater than 0.25 and less than -0.25 and a p cut-off of p < 0.05 because we want to include several hundred genes in the analysis
 * Merrell et al. (2002) used a method which quantified, normalized and corrected their data to yield significant gene expression change of each gene as an intensity ratio relative to a control reference signal.
 * The Merrell method seemed to be similar to our method of data normalization and correction except we did ours using Excel while they used the Statistical Analysis for Microarrays (SAM) program. They also used a two-class SAM analysis where class I was grown in vitro while II was taken from each individual (Class II based on collapsing individual samples). Our analysis was based on 3 samples each from 3 individuals.

Sanity Check - Compare individual genes with known data

 * Merrell et al. data format or their adjusted data: [Patient A fold change] [Patient B fold change]  [Patient C fold change]


 * Gene ID VC0028
 * Fold changes: 1.65
 * p values: 0.0474
 * Merrell et al. results: VC0028 dihydroxy-acid dehydratase (ilvD) 3.5  2.2  2.1
 * Our data seems to be consistently lower


 * Gene ID VC0941
 * Fold changes: 0.09
 * p values: 0.6759
 * Merrell et al. results: VC0941 serine hydroxymethyltransferase (glyA-1) 3.7  4.5  3.0
 * Our data is consistently significantly lower


 * Gene ID VC0869
 * Fold changes: 1.59
 * p values: 0.0463
 * Merrell et al. results: VC0869 phosphoribosylformylglycinamidine synthase(purL) 2.4 2.7  2.1
 * Our data seems to be consistently much lower


 * Gene ID VC0051
 * Fold changes: 1.92
 * p values: 0.0139
 * Merrell et al. results: VC0051	phosphoribosylaminoimidazole carboxylase, ATPase subunit (purK) 2.3  3.0  2.1
 * Our data is consistently slightly lower


 * Gene ID VC0647
 * Fold changes: -0.94
 * p values: 0.0125
 * Merrell et al. results: VC0647 polyribonucleotide nucleotidyltransferase (pnp) 3.0  2.2  2.3
 * Our data is consistently significantly lower


 * Gene ID VC0468
 * Fold changes: -0.17
 * p values: 0.3350
 * Merrell et al. results: VC0468 glutathione synthetase (gshB) 0.29  0.27  0.35
 * Our data is consistently slightly lower


 * Gene ID VC2350
 * Fold changes: -2.40
 * p values: 0.0130
 * Merrell et al. results: VC2350 deoxyribose-phosphate aldolase (deoC) 0.31  0.13  	0.27
 * Our data is consistently significantly lower

Richard Brous 20:30, 23 October 2010 (EDT)
 * Gene ID VCA0583
 * Fold changes: 0.61
 * p values: 0.1457
 * Merrell et al. results: VCA0583 hypothetical protein 0.09  0.18  0.23
 * Our data seems to be consistently much higher