Carmen E. Castaneda: Week 11

Background
This is a list of steps required to analyze DNA microarray data.


 * 1) Quantitate the fluorescence signal in each spot
 * 2) Calculate the ratio of red/green fluorescence
 * 3) Log transform the ratios
 * 4) Normalize the ratios on each microarray slide
     * Steps 1-4 are performed by the GenePix Pro software.
     * You will perform the following steps:
 * 5) Normalize the ratios for a set of slides in an experiment
 * 6) Perform statistical analysis on the ratios
 * 7) Compare individual genes with known data
     * Steps 5-7 are performed in Microsoft Excel
 * 8) Pattern finding algorithms (clustering)
 * 9) Map onto biological pathways
     * We will use software called STEM for the clustering and mapping
 * 10) Create mathematical model of transcriptional network
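
Steps 2 and 3 in the list above (ratio, then log transform) can be sketched in a few lines of Python. This is only an illustration: the spot names and intensities are made up, and base 2 is assumed for the logarithm, since the list does not say which base is used.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for one spot (base 2 is an assumption)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (red, green) fluorescence intensities, not real data
spots = {
    "spotA": (2000.0, 500.0),   # red 4x brighter than green
    "spotB": (300.0, 1200.0),   # green 4x brighter than red
    "spotC": (800.0, 800.0),    # equal signal
}

log_ratios = {name: log_ratio(r, g) for name, (r, g) in spots.items()}
for name, value in log_ratios.items():
    print(name, round(value, 2))
```

After the log transform, a 4-fold increase and a 4-fold decrease sit at the same distance from zero with opposite signs, which is why the ratios are log transformed before further analysis.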

Experimental Design
For the Schade data, the timepoints are t0, t10, t30, t120, t12h (12 hours), and t60h (60 hours) of cold shock at 10°C.


 * Begin by recording in your wiki the number of replicates for each time point in your data. For the group assigned to the Schade data, compare the number of replicates with what is stated in the Materials and Methods for the paper.  Is it the same?  If not, how is it different?

There are 3 replicates for t0, 7 replicates for t10, 6 replicates for t30, 4 replicates for t120, 4 replicates for t12h, and 6 replicates for t60h.

No, it is not the same. According to the paper, there were two replicates for t0, t120, and t12h, and three replicates for t10, t30, and t60h.

Normalize the ratios for a set of slides in an experiment
Why is this important? The $ in front of the row number makes that part of the cell reference absolute: no matter where the formula is copied or filled, the number after the $ stays the same. In our case, it lets us move the formula around without losing the reference to that row.
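
As a rough illustration of what the fixed row reference accomplishes, here is a minimal Python sketch. It assumes the normalization is "scale and center" (subtract the slide's average and divide by its standard deviation), which is an assumption and not spelled out on this page; the values are made up. In the spreadsheet, the average and standard deviation sit in fixed rows and the $ keeps every formula pointing at them; in code, the same role is played by computing them once and reusing them for every gene.

```python
import statistics

def scale_and_center(column):
    """Subtract the column mean and divide by the column standard deviation."""
    mean = statistics.mean(column)   # plays the role of the fixed "average" row
    sd = statistics.stdev(column)    # plays the role of the fixed "stdev" row
    return [(x - mean) / sd for x in column]

# Hypothetical log ratios for one slide, not real data
slide = [0.5, -1.2, 0.3, 2.0, -0.6]
normalized = scale_and_center(slide)
print([round(x, 3) for x in normalized])
```

After this step every slide has an average of 0 and a standard deviation of 1, so slides can be compared with each other.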

Perform statistical analysis on the ratios
Record the number of replacements made in your wiki page.
 * I made zero replacements in this step.

Sanity Check: Number of genes significantly changed

 * Answer these questions for each timepoint in your dataset.
 * How many genes have p value < 0.05?
 * For t0, there are 269 genes with a p value < 0.05.
 * For t10, there are 451 genes with a p value < 0.05.
 * For t30, there are 474 genes with a p value < 0.05.
 * For t120, there are 1335 genes with a p value < 0.05.
 * For t12h, there are 2463 genes with a p value < 0.05.
 * For t60h, there are 972 genes with a p value < 0.05.


 * What about p < 0.01?
 * For t0, there are 86 genes with a p value < 0.01.
 * For t10, there are 76 genes with a p value < 0.01.
 * For t30, there are 63 genes with a p value < 0.01.
 * For t120, there are 514 genes with a p value < 0.01.
 * For t12h, there are 1204 genes with a p value < 0.01.
 * For t60h, there are 304 genes with a p value < 0.01.


 * What about p < 0.001?
 * For t0, there are 15 genes with a p value < 0.001.
 * For t10, there are 4 genes with a p value < 0.001.
 * For t30, there are 2 genes with a p value < 0.001.
 * For t120, there are 110 genes with a p value < 0.001.
 * For t12h, there are 254 genes with a p value < 0.001.
 * For t60h, there are 28 genes with a p value < 0.001.


 * What about p < 0.0001?
 * For t0, there is 1 gene with a p value < 0.0001.
 * For t10, there are 0 genes with a p value < 0.0001.
 * For t30, there are 0 genes with a p value < 0.0001.
 * For t120, there are 10 genes with a p value < 0.0001.
 * For t12h, there are 32 genes with a p value < 0.0001.
 * For t60h, there is 1 gene with a p value < 0.0001.
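
The counts above come from filtering the p value column in Excel. The same tally can be sketched in Python as a check; the p values below are made up for illustration and are not from the Schade dataset.

```python
def count_below(p_values, threshold):
    """Count how many p values fall strictly below the threshold."""
    return sum(1 for p in p_values if p < threshold)

# Hypothetical p values for one timepoint, not real data
p_values = [0.2, 0.03, 0.004, 0.0005, 0.00002, 0.7]
thresholds = [0.05, 0.01, 0.001, 0.0001]

counts = {t: count_below(p_values, t) for t in thresholds}
for t, n in counts.items():
    print(f"p < {t}: {n} genes")
```

Each stricter threshold can only keep the same number of genes or fewer, which is a quick sanity check on the counts recorded for each timepoint.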

Perform this correction and determine whether and how many of the genes are still significantly changed at p < 0.05 after the Bonferroni correction.
 * For t0, there are 0 genes with a Bonferroni-corrected p value < 0.05.
 * For t10, there are 0 genes with a Bonferroni-corrected p value < 0.05.
 * For t30, there are 0 genes with a Bonferroni-corrected p value < 0.05.
 * For t120, there is 1 gene with a Bonferroni-corrected p value < 0.05.
 * For t12h, there are 2 genes with a Bonferroni-corrected p value < 0.05.
 * For t60h, there are 0 genes with a Bonferroni-corrected p value < 0.05.
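
The Bonferroni correction multiplies each p value by the number of tests performed (here, the number of genes tested) and caps the result at 1. A minimal Python sketch, using made-up p values and not the real dataset:

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]

# Hypothetical uncorrected p values, not real data
p_values = [0.2, 0.03, 0.004, 0.0005, 0.00002, 0.7]

corrected = bonferroni(p_values)
still_significant = sum(1 for p in corrected if p < 0.05)
print(corrected)
print(still_significant, "genes remain significant at p < 0.05")
```

With thousands of genes on a microarray, the multiplier is large, so only genes with very small uncorrected p values survive the correction. This matches the handful of genes that remain in the list above.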


 * The "AvgLogFC" tells us the magnitude of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.  For the timepoint that had the greatest number of genes significantly changed at p < 0.05, answer the following:
 * Keeping the "Pval" filter at p < 0.05, filter the "AvgLogFC" column to show all genes with an average log fold change greater than zero. How many meet these two criteria?
 * For t0, there are 1050 genes that meet these two criteria.
 * For t10, there are 1186 genes that meet these two criteria.
 * For t30, there are 1130 genes that meet these two criteria.
 * For t120, there are 1094 genes that meet these two criteria.
 * For t12h, there are 1179 genes that meet these two criteria.
 * For t60h, there are 1187 genes that meet these two criteria.


 * Keeping the "Pval" filter at p < 0.05, filter the "AvgLogFC" column to show all genes with an average log fold change less than zero. How many meet these two criteria?
 * For t0, there are 1406 genes that meet these two criteria.
 * For t10, there are 1277 genes that meet these two criteria.
 * For t30, there are 1333 genes that meet these two criteria.
 * For t120, there are 1369 genes that meet these two criteria.
 * For t12h, there are 1284 genes that meet these two criteria.
 * For t60h, there are 1276 genes that meet these two criteria.


 * Keeping the "Pval" filter at p < 0.05, how many have an average log fold change of > 0.25?
 * For t0, there are 502 genes that meet these two criteria.
 * For t10, there are 524 genes that meet these two criteria.
 * For t30, there are 568 genes that meet these two criteria.
 * For t120, there are 762 genes that meet these two criteria.
 * For t12h, there are 1155 genes that meet these two criteria.
 * For t60h, there are 776 genes that meet these two criteria.


 * How many have an average log fold change of < -0.25 and p < 0.05?
 * For t0, there are 673 genes that meet these two criteria.
 * For t10, there are 521 genes that meet these two criteria.
 * For t30, there are 720 genes that meet these two criteria.
 * For t120, there are 967 genes that meet these two criteria.
 * For t12h, there are 1227 genes that meet these two criteria.
 * For t60h, there are 801 genes that meet these two criteria.
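
The two-column filters used above ("Pval" together with "AvgLogFC") can be sketched in Python as follows. The gene names and values are hypothetical; only the column names and the cutoffs (0.05, 0, and 0.25) come from the questions above.

```python
# Hypothetical rows for one timepoint, not real data
genes = [
    {"id": "geneA", "AvgLogFC": 0.80, "Pval": 0.010},
    {"id": "geneB", "AvgLogFC": 0.10, "Pval": 0.030},
    {"id": "geneC", "AvgLogFC": -0.50, "Pval": 0.002},
    {"id": "geneD", "AvgLogFC": -0.05, "Pval": 0.040},
    {"id": "geneE", "AvgLogFC": 1.20, "Pval": 0.300},
]

def count_up(rows, p_cutoff=0.05, fc_cutoff=0.0):
    """Genes with Pval below the cutoff and AvgLogFC above +fc_cutoff."""
    return sum(1 for r in rows
               if r["Pval"] < p_cutoff and r["AvgLogFC"] > fc_cutoff)

def count_down(rows, p_cutoff=0.05, fc_cutoff=0.0):
    """Genes with Pval below the cutoff and AvgLogFC below -fc_cutoff."""
    return sum(1 for r in rows
               if r["Pval"] < p_cutoff and r["AvgLogFC"] < -fc_cutoff)

print(count_up(genes), count_down(genes))
print(count_up(genes, fc_cutoff=0.25), count_down(genes, fc_cutoff=0.25))
```

Because both conditions must hold at once, the number of genes passing a combined filter can never be larger than the number passing the p value filter alone.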


 * What criteria did Schade et al. (2004) use to determine a significant gene expression change? How does it compare to our method?

 * Find NSR1 in your dataset. Is its expression significantly changed at any timepoint? Record the average fold change and p value for NSR1 for each timepoint in your dataset.
 * For t0, the average fold change is -0.68 and the p value is 0.4618.
 * For t10, the average fold change is 0.72 and the p value is 0.0103.
 * For t30, the average fold change is 3.19 and the p value is 0.0120.
 * For t120, the average fold change is 3.60 and the p value is 0.0009.
 * For t12h, the average fold change is 1.76 and the p value is 0.0004.
 * For t60h, the average fold change is 0.41 and the p value is 0.3110.


 * Which gene has the smallest p value in your dataset (at any timepoint)? You can find this by sorting your data based on p value (but be careful that you don't cause a mismatch in the rows of your data!)  Look up the function of this gene at the Saccharomyces Genome Database and record it in your notebook.  Why do you think the cell is changing this gene's expression upon cold shock?


 * At t120, the gene YNL316C had the smallest p value, 1.26031877294559E-06. I think the cell is changing this gene's expression upon cold shock because the gene is involved in converting prephenate to phenylpyruvate; at that temperature this is probably not a vital function, so the cell represses the gene.
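
One way to find the smallest p value without risking a mismatch between rows is to keep each gene, timepoint, and p value together as a single record and search the records instead of sorting one column. A minimal Python sketch with hypothetical gene names and values:

```python
# Hypothetical (gene, timepoint, p value) records, not real data
records = [
    ("geneA", "t10", 0.0103),
    ("geneB", "t30", 0.0120),
    ("geneC", "t120", 0.0000013),
    ("geneD", "t12h", 0.0004),
]

# min() compares only the p value, but returns the whole record,
# so the gene name can never be separated from its p value.
best = min(records, key=lambda rec: rec[2])
print(best)
```

In Excel, the equivalent precaution is to select every column of the table before sorting, so that whole rows move together.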