BIOL368/F14:Chloe Jones Week 13: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(→‎Calculating fold change: replace error with single space character)
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Here is the calculations that I did so you can compare your calculations with mine.  [[Media:Overton_MicroarrayData_20141119_CJ_downloaded_20141202_editedKD.xlsx | Overton_MicroarrayData_20141119_CJ_downloaded_20141202_editedKD.xlsx]]  ''— [[User:Kam D. Dahlquist|Kam D. Dahlquist]] 18:54, 2 December 2014 (EST)''
===Calculating fold change===
===Calculating fold change===
*Here is the [[Media:Overton_MicroarrayData_20141119.xlsx | Overton_MicroarrayData_20141119.xlsx]] used for this analysis of MRSA with Ranalexin.  
*Here is the [[Media:Overton_MicroarrayData_20141119.xlsx | Overton_MicroarrayData_20141119.xlsx]] used for this analysis of MRSA with Ranalexin.
** Start from the sheet called "SAR_only"
** Start from the sheet called "SAR_only"
*<u>Step. 1</u> Rename the columns with the item isolated, biological replicate, and dye used. For each array data file there will be 2 columns: signal and background.  
*<u>Step 1.</u> Rename the columns with the item isolated, biological replicate, and dye used. For each array data file there will be 2 columns: signal and background.
*<u>Step. 2</u> Minus the background from the signal for each data file. Then take the answers from each replicate and divide it by its corresponding dye (i.e. Cy5 Ranalexin(B1)/Cy3 MRSA252 (B1)). NOw, the fold change is calculated.
*<u>Step 2.</u> Minus the background from the signal for each data file. Then take the answers from each replicate and divide it by its corresponding dye (i.e. Cy5 Ranalexin(B1)/Cy3 MRSA252 (B1)). NOw, the fold change is calculated.
* Step 3.  Take the log2 of each fold change.
*<u>Step 3.</u> Take the log2 of each fold change.
  =LOG(number, base)
  =LOG(number, base)
  for example, =LOG(A1,2) takes the log2 of the number in cell A1.
  for example, =LOG(A1,2) takes the log2 of the number in cell A1.
* Fix the dye-swapped samples so that the orientation is alwasys ranalexin/control by multiplying the log fold changes of the samples where the control was labeled with Cy5 by negative 1 (-1).
*<u>Step 4.</u> Homogenize the data so the samples orientation always reads ranalexin/control, do this by multiplying where the control was labeled with Cy5 by (-1).
  =(-1)*A1 would multiply the number in cell A1 by negative 1.
  =(-1)*A1 would multiply the number in cell A1 by negative 1.
* create a new worksheet. Copy and paste the following columns into the new sheet, ID, index, the log fold changes of the non-dye-swapped samples and the fold changes of the dye-swapped samples that have been flipped. Be sure to use the option, Paste special > paste values instead of a regular copy and paste.
*<u>Step 5.</u> New Worksheet. Input the ID, index, the log fold changes of the non-dye-swapped samples and the fold changes of the dye-swapped samples that have been flipped into new worksheet.Paste specials>values.
* there are probably error messages in certain cells after you have done these calculations. We need to replace all of these with a single space character using the Find/replace option. Record the number of replacements you make.
*<u>Step 6.</u> Some cells contain error (i.e #NUM!, #DIV/0!), get rid of these errors by find/replace method. Record number of replacements.
**#NUM!: <b>535</b>
**#DIV/0: <b>16</b>


=== Scaling and Centering the Data ===
=== Scaling and Centering the Data ===
''To scale and center the data (between chip normalization) perform the following operations:''
*<u>Step 1.</u> New worksheet, name it "scaled_centered".
*<u>Step 2.</u> Previous worksheet, copy the log fold changes and input in new "scaled_centered". worksheet.
*<u>Step 3.</u> Insert two rows of headers between the top row of headers
*<u>Step 4.</u> Label A2 cell "Average" and A3 cell "StdDev."
*<u>Step 5.</u> Average log ration for each chip (column of data). Cell B2 type equation below. Instead of highlighting till the bottom, press the top and the scroll down to bottom>Shift/click>Ctrl/enter.
=AVERAGE(B4:B5483)
*<u>Step 6.</u> Compute Standard deviation for each chip (column of data). Cell B3 type equation below.
=STDEV(B4:B5483)
*<u>Step 7.</u> Utilize Excel.Copy equations from cells B2 and B3. Excel will make the equations match the column numbers. Now the standard deviation and average is computed for the log ratios of each chip. Scaling and centering based on these values.
*<u>Step 8.</u> Copy the headers and and place them to the right of the existing dataset. They should be headers with blank columns underneath. Change the names of the headers so they now read <previous name>_scaled_centered.
*<u>Step 9.</u> Cell N4 type equation below.  B4 is getting average (Cell B2) subtracted and then divided by standard deviation (cell B3). Dollar symbols make sure that each cell in a column is getting subtracted and divided by the same thing. This is important because if you were to copy the whole column with no dollar signs it would move to correlate to that particular row and column.
  =(B4-B$2)/B$3
*<u>Step 10.</u>Copy and paste the scaling and centered equation above for the "_scaled_centered" column headers." Use the information of the right that correlate with the name minus the "scaled-centered" part. 
=== Perform statistical analysis on the ratios ===
''We are going to perform this step on the scaled and centered data you produced in the previous step.''
*<u>Step 1.</u> New worksheet title it "statistics".
*<u>Step 2.</u> Copy the first two columns ("ID" and "Index") from the "scaling_centering" worksheet. Paste in new worksheet.
*<u>Step 3.</u> Copy the "_scaled_centered" columns from the scaling_centering" worksheet. Paste in new worksheet with paste special>value.
*<u>Step 4.</u> Remove rows 2 and 3 ("Average" and "StDev" ), now you have gene IDs right below the headers.
*<u>Step 5.</u> Go to the next empty new column on the right. Input headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C"into it individual column.
*<u>Step 6.</u> Compute average fold changes for for the ''technical'' replicates for each sample(A, B, and C). Use equation below, type into I2 and use for column I.
  =AVERAGE(C2:D2)
*<u>Step 7.</u>Follow same equation for samples B and C, paste into appropriate columns.
*<u>Step 8.</u> Compute average of biological replicates. Use header "Avg_LogFC_all", input into next available empty column. Type =average first cell and then parenthesis and click the three averages that were just calculated. Essentially an average of an average.
*<u>Step 9.</u>New column next to  "Avg_LogFC_all" column, input header  "Tstat". This column will tell us if scaled and centered average log ratios are significantly different than 0 aka no change. Use equation below. (Number of replicates=3). Calculate for whole column. 
=AVERAGE(I2:K2)/(STDEV(I2:K2)/SQRT(number of replicates))
*<u>Step 10.</u>Empty column to the right, input header "Pvalue".Use equation below. (degree of freedom=number of replicates minus one)
  =TDIST(ABS(M2),degrees of freedom,2)
*<u>Step 11.</u>New worksheet title it "forGenMAPP".
*<u>Step 12.</u> Copy all information from "statistics" worksheet, into new worksheet. Paste special>values
*<u>Step 13.</u>Change columns  C through L (all the fold changes) into 2 decimal places. Right click>Format>Numbers>2. Change columns M and N to four decimal places.
*<u>Step 14.</u>Input column right to "ID" column. Label header "SystemCode", fill column with "N".
*<u>Step 15.</u>Save file as a Text (Tab-delimited) (*.txt)" from the file type drop-down menu. Will show warnings. Upload .xls and .txt files to wiki page with unique name.
*Here is the [[Media:Overton_MicroarrayData_20141119.txt | Overton_MicroarrayData_20141119.txt]] used for this analysis of MRSA with Ranalexin. The text version that will be used for GenMAPP.
=== Sanity Check: Number of genes significantly changed ===
''Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly.  We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.''
*<u>Step 1.</u> Go to "forGenMAPP" worksheet. On A1 cell right click filter>filter by selected cells value
*<u>Step 2.</u> Click on drop down menu on "Pvalue" column. Text filters>custom filters. Set criteria for P-value:
** How many genes have p value < 0.05? <b>140</b>
** What about p < 0.01? <b>31</b>
** What about p < 0.001?<b>3</b>
** What about p < 0.0001?<b>1</b>
''* We have just performed 5480 T tests for significance.  Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 274 times.  If have more than 274 genes that pass this cut off, we know that some genes are significantly changed.  However, we don't know ''which'' ones.''
*"Avg_LogFC_all" column tells the size of gene expression change, positive values correlates to increases relative to control. Negative values correlate to decreases relative to the control.
*<u>Step 3.</u>
**"Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. There are<b>71</b>.
**"Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. There are <b>69</b>.
** What about an average log fold change of > 0.25 and p < 0.05? There are <b>55</b>.
** What about an average log fold change of < -0.25 and p < 0.05?There are <b>54</b>.More realistic.
**''For the GenMAPP analysis below, we will use the fold change cut-off of greater than 0.25 or less than -0.25 and the p value cut off of p < 0.05 for our analysis because we want to include several hundred genes in our analysis.''
* What criteria did your paper use to determine a significant gene expression change?  How does it compare to our method?
**In my paper the genes were analyzed using a software called GeneSpring v7. ImaGene V5.5 was used to normalize the data  which then allowed for the calculation of each gene. To identify difference in response the ranalexin treatment the data was filtered to detect genes that had a greater than two-fold expression difference with a t-test p-value < 0.05. The method used in the paper was very similar to the method for this assignment. Both methods utilized a T-test with certain parameters one being haivng a P-value of less that 0.05,  however the fold expression had to be more in the paper to obtain significance (i.e. two fold in comparison to .25)
=== Sanity Check:  Compare individual genes with known data ===
* Look in your paper for genes that are specifically mentioned.  What are their fold changes and p values in the paper? Are they significantly changed in your analysis?
**They discussed individual genes, but in terms of mentioning the p-value it was only discussed for modules. They provided a table with modules and what genes were present, however they provided the  false discovery adjusted p-value rather than just the p-value. The false discovery p-value relates to the number of false discoveries in that test. I am not as familiar with the FDR, but I know that it is more conserved than a P-value.  They mention the p value and fold expression in  terms of the 93 upregulated and 105 downregulated genes (>two-fold expression difference,p < 0.05). The paper mention certain genes such as  VraR (SAR1974, 2.45 fold, RanaUP),  SAR1689 (GreA, 4.00-fold), and SAR0625 (SarA, 4.36-fold). SAR1374 (msrR, 2.24-fold). When I compared my dataset to the fold changes calculated in the paper I saw significant differences. Most of my fold changes don’t exceed a decimal.
{{Template:Chloe Jones}}

Latest revision as of 00:58, 3 December 2014

Here is the calculations that I did so you can compare your calculations with mine. Overton_MicroarrayData_20141119_CJ_downloaded_20141202_editedKD.xlsx Kam D. Dahlquist 18:54, 2 December 2014 (EST)

Calculating fold change

  • Here is the Overton_MicroarrayData_20141119.xlsx used for this analysis of MRSA with Ranalexin.
    • Start from the sheet called "SAR_only"
  • Step 1. Rename the columns with the item isolated, biological replicate, and dye used. For each array data file there will be 2 columns: signal and background.
  • Step 2. Minus the background from the signal for each data file. Then take the answers from each replicate and divide it by its corresponding dye (i.e. Cy5 Ranalexin(B1)/Cy3 MRSA252 (B1)). NOw, the fold change is calculated.
  • Step 3. Take the log2 of each fold change.
=LOG(number, base)
for example, =LOG(A1,2) takes the log2 of the number in cell A1.
  • Step 4. Homogenize the data so the samples orientation always reads ranalexin/control, do this by multiplying where the control was labeled with Cy5 by (-1).
=(-1)*A1 would multiply the number in cell A1 by negative 1.
  • Step 5. New Worksheet. Input the ID, index, the log fold changes of the non-dye-swapped samples and the fold changes of the dye-swapped samples that have been flipped into new worksheet.Paste specials>values.
  • Step 6. Some cells contain error (i.e #NUM!, #DIV/0!), get rid of these errors by find/replace method. Record number of replacements.
      1. NUM!: 535
      2. DIV/0: 16

Scaling and Centering the Data

To scale and center the data (between chip normalization) perform the following operations:

  • Step 1. New worksheet, name it "scaled_centered".
  • Step 2. Previous worksheet, copy the log fold changes and input in new "scaled_centered". worksheet.
  • Step 3. Insert two rows of headers between the top row of headers
  • Step 4. Label A2 cell "Average" and A3 cell "StdDev."
  • Step 5. Average log ration for each chip (column of data). Cell B2 type equation below. Instead of highlighting till the bottom, press the top and the scroll down to bottom>Shift/click>Ctrl/enter.
=AVERAGE(B4:B5483)
  • Step 6. Compute Standard deviation for each chip (column of data). Cell B3 type equation below.
=STDEV(B4:B5483)
  • Step 7. Utilize Excel.Copy equations from cells B2 and B3. Excel will make the equations match the column numbers. Now the standard deviation and average is computed for the log ratios of each chip. Scaling and centering based on these values.
  • Step 8. Copy the headers and and place them to the right of the existing dataset. They should be headers with blank columns underneath. Change the names of the headers so they now read <previous name>_scaled_centered.
  • Step 9. Cell N4 type equation below. B4 is getting average (Cell B2) subtracted and then divided by standard deviation (cell B3). Dollar symbols make sure that each cell in a column is getting subtracted and divided by the same thing. This is important because if you were to copy the whole column with no dollar signs it would move to correlate to that particular row and column.
 =(B4-B$2)/B$3
  • Step 10.Copy and paste the scaling and centered equation above for the "_scaled_centered" column headers." Use the information of the right that correlate with the name minus the "scaled-centered" part.


Perform statistical analysis on the ratios

We are going to perform this step on the scaled and centered data you produced in the previous step.

  • Step 1. New worksheet title it "statistics".
  • Step 2. Copy the first two columns ("ID" and "Index") from the "scaling_centering" worksheet. Paste in new worksheet.
  • Step 3. Copy the "_scaled_centered" columns from the scaling_centering" worksheet. Paste in new worksheet with paste special>value.
  • Step 4. Remove rows 2 and 3 ("Average" and "StDev" ), now you have gene IDs right below the headers.
  • Step 5. Go to the next empty new column on the right. Input headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C"into it individual column.
  • Step 6. Compute average fold changes for for the technical replicates for each sample(A, B, and C). Use equation below, type into I2 and use for column I.
 =AVERAGE(C2:D2)
  • Step 7.Follow same equation for samples B and C, paste into appropriate columns.
  • Step 8. Compute average of biological replicates. Use header "Avg_LogFC_all", input into next available empty column. Type =average first cell and then parenthesis and click the three averages that were just calculated. Essentially an average of an average.
  • Step 9.New column next to "Avg_LogFC_all" column, input header "Tstat". This column will tell us if scaled and centered average log ratios are significantly different than 0 aka no change. Use equation below. (Number of replicates=3). Calculate for whole column.
=AVERAGE(I2:K2)/(STDEV(I2:K2)/SQRT(number of replicates))
  • Step 10.Empty column to the right, input header "Pvalue".Use equation below. (degree of freedom=number of replicates minus one)
 =TDIST(ABS(M2),degrees of freedom,2)
  • Step 11.New worksheet title it "forGenMAPP".
  • Step 12. Copy all information from "statistics" worksheet, into new worksheet. Paste special>values
  • Step 13.Change columns C through L (all the fold changes) into 2 decimal places. Right click>Format>Numbers>2. Change columns M and N to four decimal places.
  • Step 14.Input column right to "ID" column. Label header "SystemCode", fill column with "N".
  • Step 15.Save file as a Text (Tab-delimited) (*.txt)" from the file type drop-down menu. Will show warnings. Upload .xls and .txt files to wiki page with unique name.

Sanity Check: Number of genes significantly changed

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

  • Step 1. Go to "forGenMAPP" worksheet. On A1 cell right click filter>filter by selected cells value
  • Step 2. Click on drop down menu on "Pvalue" column. Text filters>custom filters. Set criteria for P-value:
    • How many genes have p value < 0.05? 140
    • What about p < 0.01? 31
    • What about p < 0.001?3
    • What about p < 0.0001?1

* We have just performed 5480 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 274 times. If have more than 274 genes that pass this cut off, we know that some genes are significantly changed. However, we don't know which ones.

  • "Avg_LogFC_all" column tells the size of gene expression change, positive values correlates to increases relative to control. Negative values correlate to decreases relative to the control.
  • Step 3.
    • "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. There are71.
    • "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. There are 69.
    • What about an average log fold change of > 0.25 and p < 0.05? There are 55.
    • What about an average log fold change of < -0.25 and p < 0.05?There are 54.More realistic.
    • For the GenMAPP analysis below, we will use the fold change cut-off of greater than 0.25 or less than -0.25 and the p value cut off of p < 0.05 for our analysis because we want to include several hundred genes in our analysis.


  • What criteria did your paper use to determine a significant gene expression change? How does it compare to our method?
    • In my paper the genes were analyzed using a software called GeneSpring v7. ImaGene V5.5 was used to normalize the data which then allowed for the calculation of each gene. To identify difference in response the ranalexin treatment the data was filtered to detect genes that had a greater than two-fold expression difference with a t-test p-value < 0.05. The method used in the paper was very similar to the method for this assignment. Both methods utilized a T-test with certain parameters one being haivng a P-value of less that 0.05, however the fold expression had to be more in the paper to obtain significance (i.e. two fold in comparison to .25)

Sanity Check: Compare individual genes with known data

  • Look in your paper for genes that are specifically mentioned. What are their fold changes and p values in the paper? Are they significantly changed in your analysis?
    • They discussed individual genes, but in terms of mentioning the p-value it was only discussed for modules. They provided a table with modules and what genes were present, however they provided the false discovery adjusted p-value rather than just the p-value. The false discovery p-value relates to the number of false discoveries in that test. I am not as familiar with the FDR, but I know that it is more conserved than a P-value. They mention the p value and fold expression in terms of the 93 upregulated and 105 downregulated genes (>two-fold expression difference,p < 0.05). The paper mention certain genes such as VraR (SAR1974, 2.45 fold, RanaUP), SAR1689 (GreA, 4.00-fold), and SAR0625 (SarA, 4.36-fold). SAR1374 (msrR, 2.24-fold). When I compared my dataset to the fold changes calculated in the paper I saw significant differences. Most of my fold changes don’t exceed a decimal.

Electronic Lab Notebook

Weekly Assignments

Class Journals


Chloe Jones 03:46, 15 October 2014 (EDT)Chloe Jones