BIOL368/F14:Chloe Jones Week 13: Difference between revisions


Latest revision as of 00:58, 3 December 2014

Here are the calculations that I did so you can compare your calculations with mine: Overton_MicroarrayData_20141119_CJ_downloaded_20141202_editedKD.xlsx. Kam D. Dahlquist 18:54, 2 December 2014 (EST)

Calculating fold change

  • Here is the Overton_MicroarrayData_20141119.xlsx used for this analysis of MRSA with Ranalexin.
    • Start from the sheet called "SAR_only"
  • Step 1. Rename the columns with the item isolated, biological replicate, and dye used. For each array data file there will be 2 columns: signal and background.
  • Step 2. Subtract the background from the signal for each data file. Then divide the background-subtracted value for each replicate by the value for its corresponding dye (e.g., Cy5 Ranalexin (B1) / Cy3 MRSA252 (B1)). This gives the fold change.
  • Step 3. Take the log2 of each fold change.
=LOG(number, base)
for example, =LOG(A1,2) takes the log2 of the number in cell A1.
  • Step 4. Homogenize the data so that the orientation of every sample reads ranalexin/control. Do this by multiplying the log fold changes of the samples where the control was labeled with Cy5 by -1.
=(-1)*A1 would multiply the number in cell A1 by negative 1.
  • Step 5. Create a new worksheet. Copy the ID, index, the log fold changes of the non-dye-swapped samples, and the flipped log fold changes of the dye-swapped samples into the new worksheet using Paste Special > Values.
  • Step 6. Some cells contain errors (e.g., #NUM!, #DIV/0!). Replace each error with a single space character using Find/Replace, and record the number of replacements.
    • #NUM!: 535
    • #DIV/0!: 16
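
The spreadsheet steps above can also be sketched in Python. This is only an illustration, not the method used for this assignment: the function name `log2_fold_change` and its parameters are hypothetical, and returning `None` stands in for the cells where Excel would show #DIV/0! or #NUM!.

```python
import math

def log2_fold_change(cy5_signal, cy5_background, cy3_signal, cy3_background,
                     control_is_cy5=False):
    """Return log2(ranalexin/control) for one spot, or None on an error.

    All names here are hypothetical; None marks the cases where the
    spreadsheet would show #DIV/0! or #NUM!.
    """
    # Step 2: subtract the background from each signal, then take the ratio.
    cy5 = cy5_signal - cy5_background
    cy3 = cy3_signal - cy3_background
    if cy3 == 0:
        return None  # division by zero (#DIV/0! in Excel)
    ratio = cy5 / cy3
    if ratio <= 0:
        return None  # log of zero or a negative number (#NUM! in Excel)
    # Step 3: log2 of the fold change, like =LOG(A1,2).
    log_ratio = math.log2(ratio)
    # Step 4: flip dye-swapped samples so the orientation is ranalexin/control.
    return -log_ratio if control_is_cy5 else log_ratio
```

For example, a spot with Cy5 signal 900, Cy3 signal 300, and background 100 in both channels has a fold change of 800/200 = 4, so its log2 fold change is 2 (or -2 if the control was the Cy5-labeled sample).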

Scaling and Centering the Data

To scale and center the data (between chip normalization) perform the following operations:

  • Step 1. Insert a new worksheet and name it "scaled_centered".
  • Step 2. Go back to the previous worksheet, copy the log fold changes, and paste them into the new "scaled_centered" worksheet.
  • Step 3. Insert two rows between the top row of headers and the first data row.
  • Step 4. Label cell A2 "Average" and cell A3 "StdDev".
  • Step 5. Compute the average log ratio for each chip (column of data). In cell B2, type the equation below. Instead of typing the cell range, you can click on the first cell, scroll down to the bottom, shift-click on the last cell, and press Ctrl+Enter.
=AVERAGE(B4:B5483)
  • Step 6. Compute the standard deviation for each chip (column of data). In cell B3, type the equation below.
=STDEV(B4:B5483)
  • Step 7. Copy the equations from cells B2 and B3 and paste them into the empty cells in the rest of the columns. Excel will adjust the equations to match each column. The average and standard deviation are now computed for the log ratios of each chip; the scaling and centering are based on these values.
  • Step 8. Copy the headers and place them to the right of the existing dataset, so that there is a second set of headers with blank columns underneath. Change the names of the headers so they now read <previous name>_scaled_centered.
  • Step 9. In cell N4, type the equation below. The value in B4 has the average (cell B2) subtracted from it and is then divided by the standard deviation (cell B3). The dollar signs make sure that every cell in a column has the same average subtracted and is divided by the same standard deviation. This is important because without the dollar signs, pasting the equation down the column would shift the references to rows 2 and 3 along with each row.
 =(B4-B$2)/B$3
  • Step 10. Copy and paste the scaling and centering equation above into each of the "_scaled_centered" columns. Make sure each equation refers to the data column whose name matches the header minus the "_scaled_centered" part.
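
As a cross-check on the spreadsheet, the scaling and centering of one chip can be sketched in Python. The function name `scale_and_center` is hypothetical, and skipping `None` entries (the cells that held errors) is an assumption about how those cells should be treated. `statistics.stdev` is the sample standard deviation, the same quantity that =STDEV computes.

```python
from statistics import mean, stdev

def scale_and_center(column):
    """Subtract the column average and divide by the column standard deviation.

    `column` is one chip's log fold changes; None marks cells that held
    errors and are skipped (an assumption, not part of the protocol).
    """
    values = [v for v in column if v is not None]
    avg = mean(values)   # like =AVERAGE(B4:B5483)
    sd = stdev(values)   # like =STDEV(B4:B5483), the sample standard deviation
    # Like =(B4-B$2)/B$3: the same avg and sd are used for every row.
    return [None if v is None else (v - avg) / sd for v in column]
```

Because `avg` and `sd` are computed once per column and reused for every row, they play the role of the fixed references B$2 and B$3.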


Perform statistical analysis on the ratios

We are going to perform this step on the scaled and centered data you produced in the previous step.

  • Step 1. Insert a new worksheet and title it "statistics".
  • Step 2. Copy the first two columns ("ID" and "Index") from the "scaled_centered" worksheet. Paste them into the new worksheet.
  • Step 3. Copy the "_scaled_centered" columns from the "scaled_centered" worksheet. Paste them into the new worksheet with Paste Special > Values.
  • Step 4. Remove rows 2 and 3 ("Average" and "StdDev") so that the gene IDs are right below the headers.
  • Step 5. Go to the next empty column on the right. Enter the headers "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C", each in its own column.
  • Step 6. Compute the average log fold change for the technical replicates of each sample (A, B, and C). Type the equation below into I2 and paste it into the rest of column I.
 =AVERAGE(C2:D2)
  • Step 7. Create the same equation for samples B and C and paste it into the appropriate columns.
  • Step 8. Compute the average of the biological replicates. Enter the header "Avg_LogFC_all" in the next available empty column. In the first cell, type =AVERAGE followed by an opening parenthesis and select the three averages that were just calculated. This is essentially an average of averages.
  • Step 9. In a new column next to the "Avg_LogFC_all" column, enter the header "Tstat". This column will tell us whether the scaled and centered average log ratio is significantly different from 0 (no change). Use the equation below (number of replicates = 3) and paste it into the whole column.
=AVERAGE(I2:K2)/(STDEV(I2:K2)/SQRT(number of replicates))
  • Step 10. In the empty column to the right, enter the header "Pvalue" and use the equation below. (Degrees of freedom = number of replicates minus one, so 2 in this case.)
 =TDIST(ABS(M2),degrees of freedom,2)
  • Step 11. Insert a new worksheet and title it "forGenMAPP".
  • Step 12. Copy all of the information from the "statistics" worksheet into the new worksheet using Paste Special > Values.
  • Step 13. Change columns C through L (all the fold changes) to 2 decimal places (Right-click > Format > Number > 2). Change columns M and N to 4 decimal places.
  • Step 14. Insert a column to the right of the "ID" column. Label the header "SystemCode" and fill the column with "N".
  • Step 15. Save the file as "Text (Tab-delimited) (*.txt)" from the file type drop-down menu. Excel will show warnings; click through them. Upload the .xls and .txt files to the wiki page with a unique name.
  • Here is the Overton_MicroarrayData_20141119.txt used for this analysis of MRSA with Ranalexin. This is the text version that will be used for GenMAPP.
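
The Tstat and Pvalue columns can be checked with a short Python sketch. The function names are hypothetical. The p-value function is not a general replacement for =TDIST: it uses the closed form of the t distribution that only holds for 2 degrees of freedom, which is the case here because there are 3 biological replicates.

```python
import math
from statistics import mean, stdev

def t_statistic(replicates):
    """One-sample T statistic against 0, like
    =AVERAGE(I2:K2)/(STDEV(I2:K2)/SQRT(number of replicates))."""
    n = len(replicates)
    return mean(replicates) / (stdev(replicates) / math.sqrt(n))

def two_tailed_p_df2(t):
    """Two-tailed p-value for exactly 2 degrees of freedom.

    For 2 degrees of freedom the t distribution has the closed-form CDF
    F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)), so the two-tailed p-value is
    1 - |t| / sqrt(t^2 + 2). This matches =TDIST(ABS(t),2,2) only when
    the degrees of freedom equal 2.
    """
    return 1.0 - abs(t) / math.sqrt(2.0 + t * t)
```

For three replicate averages of 1, 2, and 3, the T statistic is about 3.46 and the two-tailed p-value is about 0.074. Note that `t_statistic` divides by zero if all three replicates are identical, which corresponds to a #DIV/0! error in the spreadsheet.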

Sanity Check: Number of genes significantly changed

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

  • Step 1. Go to the "forGenMAPP" worksheet. Click on cell A1, then right-click and choose Filter > Filter by Selected Cell's Value.
  • Step 2. Click on the drop-down menu on the "Pvalue" column and choose Text Filters > Custom Filter. Set the criteria for the p value:
    • How many genes have p value < 0.05? 140
    • What about p < 0.01? 31
    • What about p < 0.001? 3
    • What about p < 0.0001? 1

  • We have just performed 5480 T tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change in about 5% of our T tests, or 274 times. If we have more than 274 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know which ones.

  • The "Avg_LogFC_all" column tells us the size of the gene expression change and its direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
  • Step 3.
    • Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. There are 71.
    • Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. There are 69.
    • What about an average log fold change of > 0.25 and p < 0.05? There are 55.
    • What about an average log fold change of < -0.25 and p < 0.05? There are 54. These cut-offs are more realistic.
    • For the GenMAPP analysis below, we will use the fold change cut-off of greater than 0.25 or less than -0.25 and the p value cut off of p < 0.05 for our analysis because we want to include several hundred genes in our analysis.
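
The filtering above can be sketched in Python as a way to double-check the counts. The function names and the `(avg_log_fc, p_value)` row layout are hypothetical; they stand in for the "Avg_LogFC_all" and "Pvalue" columns of the "forGenMAPP" worksheet.

```python
def count_significant(rows, p_cutoff=0.05, fc_cutoff=0.0):
    """Count increased and decreased genes that pass both filters.

    `rows` is a list of (avg_log_fc, p_value) pairs, one per gene
    (a hypothetical layout for the two spreadsheet columns).
    """
    up = sum(1 for fc, p in rows if p < p_cutoff and fc > fc_cutoff)
    down = sum(1 for fc, p in rows if p < p_cutoff and fc < -fc_cutoff)
    return up, down

def expected_by_chance(n_tests, p_cutoff):
    """Number of tests expected to pass the p value cut-off by chance alone."""
    return n_tests * p_cutoff
```

With 5480 T tests and a cut-off of 0.05, `expected_by_chance(5480, 0.05)` gives about 274, the number quoted above. Calling `count_significant(rows, fc_cutoff=0.25)` applies the > 0.25 and < -0.25 cut-offs together with p < 0.05.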


  • What criteria did your paper use to determine a significant gene expression change? How does it compare to our method?
    • In my paper, the genes were analyzed using software called GeneSpring v7. ImaGene v5.5 was used to normalize the data, which then allowed for the calculations for each gene. To identify differences in the response to ranalexin treatment, the data were filtered to detect genes that had a greater than two-fold expression difference with a t-test p-value < 0.05. The method used in the paper was very similar to the method for this assignment. Both methods used a t-test with a p-value cut-off of less than 0.05; however, the paper required a larger fold change for significance (two-fold, compared with a log fold change of 0.25 here).

Sanity Check: Compare individual genes with known data

  • Look in your paper for genes that are specifically mentioned. What are their fold changes and p values in the paper? Are they significantly changed in your analysis?
    • They discussed individual genes, but p-values were only given for modules. They provided a table of modules and the genes present in each; however, they reported the false discovery rate (FDR) adjusted p-value rather than the plain p-value. The FDR-adjusted p-value relates to the number of false discoveries expected in that test. I am not as familiar with the FDR, but I know that it is more conservative than an unadjusted p-value. They mention the p-value and fold expression in terms of the 93 upregulated and 105 downregulated genes (> two-fold expression difference, p < 0.05). The paper mentions certain genes such as VraR (SAR1974, 2.45-fold, RanaUP), SAR1689 (GreA, 4.00-fold), SAR0625 (SarA, 4.36-fold), and SAR1374 (msrR, 2.24-fold). When I compared my dataset to the fold changes calculated in the paper, I saw large differences. Most of my log fold changes don’t exceed 1.


Chloe Jones 03:46, 15 October 2014 (EDT)Chloe Jones