Kara M Dismuke Week 11 Journal: Difference between revisions

Revision as of 22:14, 31 March 2015

Electronic Journal Notebook

Summary of what you need to turn in for the individual Week 11 assignment:
- Upload your updated Excel spreadsheet to LionShare that has today's calculations in it. Use the same filename as before so that the download link that you already provided to Drs. Dahlquist and Fitzpatrick will still work.
- Create, upload to OpenWetWare, and link to a PowerPoint presentation that contains the p value table and the screenshots of your stem results. Each slide in the presentation should have a meaningful title that describes the main message of the slide. These slides will form the basis of your final presentation in the class.
- Zip together all of the tab-delimited text files that you created for and from stem and upload them to LionShare:
  - the file that was saved from your original spreadsheet that you used to run stem
  - each of the genelist and GOlist files for each of your significant profiles

Methods

Background

To analyze microarray data:

Using GenePix Pro Software...
- In each spot, quantitate the fluorescence signal
- Calculate ratio of red/green fluorescence
- Log transform ratios
- Normalize ratios on each microarray slide
Using a script in R...
- Normalize ratios for set of slides
Using Microsoft Excel...
- Perform statistical analysis on ratios
- Compare genes w/ known data
Using STEM software...
- Map to biological pathways
Using MATLAB...
- Create mathematical model of transcriptional network
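As a rough illustration of the "calculate ratio" and "log transform" steps listed above, here is a minimal Python sketch. The base-2 logarithm and the intensity values are assumptions made for illustration only; the notebook does not state the base, and in the actual workflow GenePix Pro performs these steps.

```python
import math

def log_ratio(red, green):
    """Return log2 of the red/green fluorescence ratio for one spot."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return math.log2(red / green)

# Hypothetical spot intensities (not taken from the data set).
up = log_ratio(2000.0, 500.0)    # red is 4x green
down = log_ratio(500.0, 2000.0)  # green is 4x red
print(up, down)
```

With a log-transformed ratio, a fourfold increase and a fourfold decrease end up the same distance from zero, which is one common reason for taking the log before averaging replicates.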

Project Partner: Kristen Horstmann

Dahlquist lab microarray data set: Wild type vs. dZAP1

Note: Kristen will be analyzing the Wild type strain data and I (Kara) will be analyzing the dZAP1 strain data.

For dZAP1: Media:Dahlquist_Lab_Microarray_Data_dZAP1_20150319_KMD
- t15: 4 replicates
- t30: 4 replicates
- t60: 4 replicates
- t90: 4 replicates
- t120: 4 replicates
Procedural Notes:

Statistical Analysis: ANOVA

In the Excel data file, we created a new sheet, and named it "stats"

In first row...
Copied first two columns into the stats sheet (ID, Standard name)
1. Labeled columns C-G using the format: (STRAIN)_xbar_(TIME) in first row
  - In the dZAP1 case: "dZAP1_xbar_t15", "dZAP1_xbar_t30", "dZAP1_xbar_t60", "dZAP1_xbar_t90", "dZAP1_xbar_t120"
2. Labeled columns H and I using the format: (STRAIN)_xbar_grand and (STRAIN)_ss_HO
  - In the dZAP1 case: "dZAP1_xbar_grand", "dZAP1_ss_HO"
3. Labeled columns J-N for the sum of squares at each time point, and columns O, P, and Q using the format: (STRAIN)_SS_full, (STRAIN)_Fstat, and (STRAIN)_p-value
  - In the dZAP1 case: "dZAP1_SS_full", "dZAP1_Fstat", "dZAP1_p-value"
Performed Computations
1. In C2, typed =AVERAGE(
  - Clicked the tab containing the data, highlighted all of the data in row 2 associated with dZAP1 and t15, then closed the parenthesis with ) and pressed "Enter"
    - We then did the same for t30, t60, t90, and t120 (to compute each average)
  - Clicked cell C2, positioned the cursor at the bottom right corner and, once we saw a plus sign, double-clicked so the formula would be copied down the column for all the other genes
2. Performed similar operation for D2 (with t30 data), E2 (with t60 data), F2 (with t90 data), and G2 (with t120 data)
3. Performed similar operation for all dZAP1 data in H2 (entire row 2 instead of specific time points)
  - Note that this computes the "grand" average of all of our data for a particular gene. For dZAP1, this can be computed either by averaging all of the data (every replicate at every time point) or by averaging the five time-point averages computed previously, because there were 4 replicates at each time point. Done both ways, you should get the same value.
4. In I2, typed =SUMSQ(
  - Clicked on the data sheet and highlighted all data for dZAP1, then closed the parenthesis and pressed "Enter"
5. We noted there were 4 data points at each time point and, with 5 time points, 20 data points in total.
6. Typed =SUMSQ(data!C2:F2)-4*stats!C2^2 in J2 and pressed "Enter"
  - Fill formula to entire column using previously discussed procedure
7. Perform step 6 for K2 (t30), L2 (t60), M2 (t90), and N2 (t120)
8. Set total number of data points equal to n (n=20)
9. In P2, typed =((n-5)/5)*(I2-O2)/O2 and hit "Enter" (note: n=20; O2 is the dZAP1_SS_full column)
  - Fill formula to entire column using previously discussed procedure
10. In Q2, typed =FDIST(P2, 5, n-5) (note: n=20)
  - Fill formula to entire column using previously discussed procedure
11. Adjusted p-value to correct for the multiple testing problem
  1. Labeled column R (in first row): "dZAP1_Bonferroni_p-value"
  2. Typed =Q2*6189 in R2 and hit "Enter"
    - Fill formula to entire column using previously discussed procedure
  3. In S2, typed formula =IF(R2>1,1,R2) to replace any adjusted p value greater than 1 with the number 1
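The spreadsheet arithmetic for one gene can be sketched in Python as follows. This is a minimal illustration, not the notebook's actual workflow: the function names are hypothetical, and the test compares a model with one mean per time point against the hypothesis that the mean is 0 at every time point, using the number of time points as the numerator degrees of freedom as the worksheet does (5 and n-5). The p-value step (Excel's FDIST) is left out because it needs an F-distribution tail function that the Python standard library does not provide.

```python
def anova_f_stat(groups):
    """F statistic for H0: the mean log fold change is 0 at every time point.

    groups: one list of replicate values per time point.
    """
    values = [v for g in groups for v in g]
    n = len(values)                      # total data points (20 in the notebook)
    k = len(groups)                      # time points (5 in the notebook)
    ss_h0 = sum(v * v for v in values)   # like =SUMSQ(all data for the gene)
    ss_full = 0.0
    for g in groups:
        mean = sum(g) / len(g)           # like =AVERAGE(...) for one time point
        ss_full += sum(v * v for v in g) - len(g) * mean ** 2
    # Raises ZeroDivisionError if every replicate equals its time-point mean.
    return ((n - k) / k) * (ss_h0 - ss_full) / ss_full

def bonferroni(p, n_tests):
    """Multiply the p-value by the number of tests and cap the result at 1."""
    return min(1.0, p * n_tests)

# Hypothetical toy data: 2 time points with 2 replicates each.
print(anova_f_stat([[1.0, 3.0], [2.0, 4.0]]))  # 6.5
print(bonferroni(0.5, 6189))                    # 1.0
```

In the toy example the sum of squares under the null hypothesis is 30 and the full-model sum of squares is 4, so the statistic is ((4-2)/2)*(30-4)/4 = 6.5.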

Calculate the Benjamini Hochberg p value Correction

Named new worksheet "B&H"
Created index column
- in A1, typed "Index"
- in A2, typed "1"
- in A3, typed "2"
- Selected cells A2 and A3 and used the previously discussed procedure to fill the cells in column A
  - this creates a series of numbers from 1 to 6189 in column A
In Column B, copied ID Column from one of previous worksheets
- To paste, used Paste special > Paste values
Copied column Q (unadjusted p-values) from "stats" worksheet and pasted into Column C
Selected columns A-C, and then sorted by Column C in ascending order
In D1, typed "Rank"
- Created series of numbers from 1-6189 using previously discussed procedure
In E1, typed "STRAIN_B-H_p-value" (dZAP1_B-H_p-value)
In E2, typed =(C2*6189)/D2 and pressed "Enter"
- Fill equation in cells of entire E column
In F1, typed "STRAIN_B-H_p-value"
- In F2, typed =IF(E2>1,1,E2) and pressed "Enter"
- Fill equation in cells of entire F column
Selected columns A-F, and sorted them by Index's ascending order (in Column A)
Copied column F and used Paste special > Paste values, to paste into Column T of stats sheet
- Labeled Column T in stats sheet "dZAP1_B-H_p-value"
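The worksheet steps above (sort by p-value, rank, multiply by the number of genes, divide by rank, cap at 1, restore the original order) can be sketched in Python as follows. The function name is hypothetical and the p-values are made up for illustration. Note that this mirrors the worksheet exactly; many statistics packages additionally force the adjusted values to be non-decreasing with rank, which this sketch deliberately does not do.

```python
def bh_worksheet(p_values):
    """Benjamini-Hochberg-style adjustment following the worksheet steps."""
    m = len(p_values)
    # Sorting the original positions by p-value plays the role of the
    # "Index" column: it lets us put results back in the original order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(1.0, p_values[i] * m / rank)  # =(C2*m)/D2, capped at 1
    return adjusted

# Hypothetical unadjusted p-values for four genes.
result = bh_worksheet([0.04, 0.01, 0.5, 0.03])
print(result)
```

In this toy case the gene with p = 0.03 (rank 2) receives 0.06 while the gene with p = 0.04 (rank 3) receives about 0.053, which shows how the unmodified worksheet formula can reorder neighboring genes.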

RESULTING EXCEL FILE: Media:Dahlquist_Lab_Microarray_Data_dZAP1_20150326_KMD.xls

Sanity Check: Number of genes significantly changed

Perform sanity check to ensure we performed our data analysis correctly. We need to find the number of genes that are significantly changed at certain cut-off points.

In "stats" worksheet, selected Menu item Data > Filter > Autofilter
Clicked drop-down arrow on Column Q to select "Custom" filter according to the following criteria:
- How many genes have p<.05? What is the percentage (out of 6189)?
- Choosing p<.05 means that we would expect to see a gene expression change for at least one of the time points by chance less than 5% of the time.
- How many genes have p<.01? What is the percentage (out of 6189)?
- How many genes have p<.001? What is the percentage (out of 6189)?
- How many genes have p<.0001? What is the percentage (out of 6189)?
We note that performing this procedure doesn't give us which genes pass the cut-off. We also note that the Bonferroni p-value correction is very stringent and the Benjamini-Hochberg correction is less stringent. Filter data to find:
- For the Bonferroni-corrected p value, how many genes are p<.05? What is the percentage (out of 6189)?
- For the Benjamini and Hochberg-corrected p value, how many genes are p<.05? What is the percentage (out of 6189)?
We note that the p-value is not a magical number to determine significance; rather, it is a confidence level on a movable scale (with the smaller p-values used for higher levels of confidence)
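The counting done with the Excel filter can also be sketched in a few lines of Python. The function name and the p-values below are hypothetical; with the real data the list would hold 6189 values.

```python
def count_below(p_values, cutoff):
    """Return (count, percent) of p-values strictly below the cutoff."""
    count = sum(1 for p in p_values if p < cutoff)
    return count, 100.0 * count / len(p_values)

# Hypothetical p-values for eight genes.
p_values = [0.2, 0.03, 0.004, 0.0005, 0.00002, 0.7, 0.049, 0.05]
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    count, percent = count_below(p_values, cutoff)
    print(f"p < {cutoff}: {count} genes ({percent:.2f}%)")
```

The comparison is strict, so a value of exactly 0.05 is not counted at the .05 cut-off, matching the "p<.05" wording used above.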

Clustering and Gene Ontology Analysis with STEM

STEP ONE: SET UP SOFTWARE
1. Download STEM software. Click here to go to the STEM web site.
2. Click on the following download link, register, and download stem.zip file to Desktop.
  - Unzip the file. Then inside the folder created called stem, double-click on the stem.cmd to launch STEM program
STEP TWO: PREPARE DATA FOR SOFTWARE
1. Create new worksheet entitled "stem" in Excel file
  - Copy "index" column from "stats" worksheet and Paste values into column A of "stem" worksheet
2. Copy data from "stats" worksheet and Paste values into column B of "stem" worksheet
  - Rename column A "ID"
  - Rename column B: "SPOT"
3. Filter data on B-H corrected p value to be >.05, then delete these rows (except the header row)
  - then after deleting these, unfilter the data
4. Delete all data columns except for Average Log Fold change columns for each time point (e.g. "dZAP1_xbar_t15")
5. Rename data columns with just time and units
  - I.e. rename "dZAP1_xbar_t15" "15m"
  - Save work and then "Save As" the file in the "Text (Tab-delimited) (*.txt)" format
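For readers who prefer a script to manual filtering, the preparation steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name, the dictionary keys ("id", "spot", "bh_p", "means"), and the gene values are hypothetical, while the header names and the .05 cut-off follow the steps above.

```python
import csv
import io

TIME_LABELS = ["15m", "30m", "60m", "90m", "120m"]

def write_stem_table(genes, out, cutoff=0.05):
    """Write genes with a B-H p-value at or below the cutoff as tab-delimited text.

    genes: list of dicts with hypothetical keys "id", "spot", "bh_p", "means".
    Returns the number of genes written.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "SPOT"] + TIME_LABELS)
    kept = 0
    for gene in genes:
        if gene["bh_p"] > cutoff:   # rows with p > .05 are deleted in the notebook
            continue
        writer.writerow([gene["id"], gene["spot"]] + list(gene["means"]))
        kept += 1
    return kept

# Hypothetical genes, written to an in-memory buffer instead of a file.
genes = [
    {"id": 1, "spot": "GENE_A", "bh_p": 0.01, "means": [0.5, 1.0, 1.5, 0.2, 0.0]},
    {"id": 2, "spot": "GENE_B", "bh_p": 0.30, "means": [0.1, 0.0, -0.1, 0.0, 0.1]},
]
buffer = io.StringIO()
kept = write_stem_table(genes, buffer)
print(buffer.getvalue())
```

Passing an open text file in place of the `io.StringIO` buffer would produce a .txt file of the kind STEM reads in the next step.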
RUN STEM
1. In section 1 (Expression Data Info) of the STEM interface window...
  - click "Browse" and then choose your file
  - click radio button "No normalization/add 0"
  - Check box: "Spot IDs included in the data file"
2. In section 2 (Gene Info) of the STEM interface window...
  - for Gene Annotation Source, select Saccharomyces cerevisiae (SGD)
  - for Cross Reference Source, select "No cross references"
  - for Gene Location Source, select "No Gene Locations"
3. In section 3 (Options) of the STEM interface window...
  - check that Clustering Method is "STEM Clustering Method"
  - don't change defaults for Maximum Number of Model Profiles or Maximum Unit Change in Model between Time Points
4. In section 4 (Execute) of the STEM interface window...
  - click on yellow Execute button to run STEM
VIEW & SAVE STEM RESULTS
1. In new window ("All STEM Profiles (1)") that opens, note that each box represents a model expression profile.
  - If it's colored, this indicates that a statistically significant number of genes were assigned to it
    - If two profiles have the same color, then they belong to the same cluster of profiles
  - All of them are arranged in order of p value: from most significant to least significant
  - The software assigns a number to each box to serve as an ID for each
2. Click on "Interface Options..." and at the bottom of this window, click on radio button: "Based on Real Time" (this will be next to "X-axis scale should be"). Then close this window.
3. Take a screenshot of this window, and paste into PPT presentation to save the generated figures
4. Click on each profile and take screenshot of this more detailed plot that opens in a new window
  - Take screenshot of each and save each image in your presentation
  - At bottom of each profile window, click on "Profile Gene Table" button to see the list of genes belonging to that profile
  - In window that appears, click "Save Table" button and save file to your Desktop
    - Make name of file descriptive of the contents (in format of: "dZAP1_profile#_genelist.txt")
  - Upload these files to LionShare and provide link to Dr. Dahlquist and Dr. Fitzpatrick
    - It will be easier to zip all the files together and upload them as one file
5. For each profile, click "Profile GO Table" to see list of Gene Ontology terms belonging to the profile
  - In window that appears, click "Save Table" button and save file to your Desktop
    - Make name of file descriptive of contents (in format of: "dZAP1_profile#_GOlist.txt")
      - At this point, all data needed from STEM software is saved and we can begin interpreting it
  - Upload these files to LionShare and provide link to Dr. Dahlquist and Dr. Fitzpatrick.
    - It will be easier to zip all the files together and upload them as one file
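Zipping the saved tables can be done by hand, but as a minimal sketch it could also be scripted with Python's standard `zipfile` module. The profile numbers and file contents below are placeholders, and `zip_tables` is a hypothetical helper; only the file-name format comes from the steps above.

```python
import io
import zipfile

def zip_tables(files, out):
    """Write a mapping of file name -> text content into one zip archive."""
    with zipfile.ZipFile(out, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)

# Hypothetical profile numbers and placeholder contents.
files = {
    f"dZAP1_profile{number}_{kind}.txt": "placeholder\n"
    for number in (1, 2)
    for kind in ("genelist", "GOlist")
}
buffer = io.BytesIO()  # an open file in "wb" mode would work the same way
zip_tables(files, buffer)
names = zipfile.ZipFile(buffer).namelist()
print(names)
```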

Results

PART 1: ANOVA

Comparison with known data (NSR1)

NSR1 (ID: YGR159C) (Note in my Excel file the data for this gene is found in row 2362)
Data for NSR1
- unadjusted p value: 5.8618x10^-8
- Bonferroni-corrected p-value: .000362787
- B-H-corrected p-value: 1.25099x10^-5
- average Log fold change at each of the time points: (did i do this right??...come back and check this)
  - t15: 3.5895
  - t30: 3.394075
  - t60: 3.609025
  - t90: -1.9752
  - t120: .026325
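The corrected p-values reported for NSR1 can be cross-checked against the worksheet formulas with a short Python calculation. The implied rank is only an inference from the rounded values listed above, assuming the B-H value was computed as p * 6189 / rank.

```python
n_genes = 6189
p_unadjusted = 5.8618e-8

# Bonferroni: multiply by the number of genes and cap at 1.
p_bonferroni = min(1.0, p_unadjusted * n_genes)
print(p_bonferroni)

# Rank implied by the reported B-H value under the formula p * n / rank.
p_bh = 1.25099e-5
implied_rank = p_unadjusted * n_genes / p_bh
print(round(implied_rank))
```

The Bonferroni product agrees with the reported .000362787 to the digits shown, and the reported B-H value is consistent with NSR1 having roughly the 29th smallest unadjusted p-value.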

p-value results

unadjusted p value <.05

wild-type: 38.42% (2378/6189)
dZAP1: 36.58% (2264/6189)

unadjusted p-value < .01

wild-type: 24.67% (1527/6189)
dZAP1: 23.35% (1445/6189)

unadjusted p-value < .001

wild-type: 13.90% (860/6189)
dZAP1: 12.80% (792/6189)

unadjusted p-value < .0001

wild-type: 7.43% (460/6189)
dZAP1: 6.69% (414/6189)

Benjamini & Hochberg-adjusted p-value < .05

wild-type: 26.76% (1656/6189)
dZAP1: 24.85% (1538/6189)

Bonferroni-adjusted p-value <.05

wild-type: 3.68% (228/6189)
dZAP1: 3.10% (192/6189)

Note: We've organized this data into a table in a PPT slide: Table 1 (wildtype and dZAP1 p-values)

Biological Interpretation of STEM Results

- STEM Text file: Media:Dahlquist_Lab_Microarray_Data_dZAP1_20150326_KMD_stem.txt

Conclusions

Answers to Questions
