Kristen M. Horstmann Week 11 Journal

Electronic Notebook

Procedure

Background

Kara and I were assigned to the microarray data analysis.
We will be comparing the wild type strain with the dZAP1 (Δzap1) deletion strain. I will be analyzing wild type, and Kara will be doing the data for dZAP1.
Timepoints at 15, 30, 60, 90, and 120 minutes
File of data for dZAP1: dZap1

This is a list of steps required to analyze DNA microarray data.

Quantitate the fluorescence signal in each spot
Calculate the ratio of red/green fluorescence
Log transform the ratios
Normalize the ratios on each microarray slide
Normalize the ratios for a set of slides in an experiment
Perform statistical analysis on the ratios using Excel
Compare individual genes with known data using Excel
Pattern finding algorithms (clustering) using STEM
Map onto biological pathways using STEM
Create mathematical model of transcriptional network using MATLAB

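The ratio and log-transform steps in the list above can be sketched for a single spot as follows. This is only an illustration: the function name and the intensity values are hypothetical and are not taken from the dataset.

```python
import math

def log2_ratio(red, green):
    """Return log2(red/green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical intensities: red is four times green, so the log2 ratio is 2.
print(log2_ratio(800.0, 200.0))
# Equal intensities give 0, meaning no change in expression.
print(log2_ratio(500.0, 500.0))
```

A positive value means the red channel is brighter than the green one, and a negative value means the reverse.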
Data will be shared using Lion Share
For Wild Type:
- t15: 4 replicates
- t30: 5 replicates
- t60: 4 replicates
- t90: 5 replicates
- t120: 5 replicates

Statistical Analysis Part 1: ANOVA

In the Excel spreadsheet, there is a worksheet labeled "data". In this worksheet, each row contains the data for one gene (one spot on the microarray). The first column (labeled "ID") contains the gene identifier from the Saccharomyces Genome Database. The second column contains the Standard Name for each of the genes. Each subsequent column contains the log2 ratio of the red/green fluorescence from each microarray hybridized in the experiment.
Each of the column headings from the data begin with the experiment name ("wt" for wild type S. cerevisiae data, "dCIN5" for the Δcin5 data, etc., and Spar for the S. paradoxus data). "LogFC" stands for "Log2 Fold Change" which is the Log2 red/green ratio. The timepoints are designated as "t" followed by a number in minutes. Replicates are numbered as "-0", "-1", "-2", etc. after the timepoint.
The timepoints are t15, t30, t60 (cold shock at 13°C) and t90 and t120 (cold shock at 13°C followed by 30 or 60 minutes of recovery at 30°C). The specific replicates for Wild Type Data are listed in the Background section above.

Created a new worksheet, naming it stats
Copied the first two columns of the data worksheet (containing ID and Standard Name) into the stats sheet.
In the first row, columns c through g, created column labels of the form wt_xbar_t15, wt_xbar_t30, wt_xbar_t60, wt_xbar_t90, wt_xbar_t120. Xbar stands for the average of the selected data.
In the first row, columns h and i, created the column labels wt_xbar_grand and wt_ss_HO.
In the first row, columns j through n, created the column labels wt_ss_t15, wt_ss_t30, wt_ss_t60, wt_ss_t90, wt_ss_t120 as in (3). SS stands for sum of squares.
In the first row, columns o, p, and q, created the column labels wt_SS_full, Fstat and p-value.
In cell c2, typed =AVERAGE( to start computations
Then clicked on the tab containing the data, and highlighted all the data in row 2 associated with wild type AND t15. Closed the parentheses and entered.
Clicked on the tab for the stats sheet. Cell c2 contains the average of the log fold change data from the first gene at t=15 minutes.
Clicked on cell c2 and positioned the cursor at the bottom right corner. Double-clicked to copy the formula down the entire column for the 6188 other genes.
Clicked on cell d2 and repeated (7) through (10) with the t30 data, then e2 with the t60 data, f2 with the t90 data, and g2 with the t120 data.
Clicked on cell h2 and repeated (7) through (10), this time highlighting all the wild type data in row 2 instead of the individual timepoints.
Cell i2. Typed =SUMSQ(
Went to data sheet's tab again, and highlighted all the data in row 2 for wild type. Closed the parentheses and entered
- The data highlighted here was the same as in (12).
For wild type there are 23 data points.
In cell j2, typed =SUMSQ(data!C2:F2)-4*stats!C2^2 and hit enter.
- The phrase data!C2:F2 is the data associated with t15. The number "4" is the number of data points (note that cells c2, d2, e2, f2 contain 4 data points). The phrase stats!c2 gets the average computed in Step (8) for t15, and the "^2" squares that value. Upon completion of this single computation, used the Step (10) trick to copy the formula throughout the column.
In cells k2 through n2, repeated (16) for the t30 through t120 data points.
In o2 typed =sum(j2:n2) and hit enter. Copied to the whole column.
In cell p2, typed =((n-5)/5)*(i2-o2)/o2 and hit enter, substituting n with the number of data points from (15) (23 for wild type). Copied to whole column.
In cell q2, typed =FDIST(P2,5,n-5), again replacing n with the total number of data points. Copied to the whole column.
Labeled column r wt_Bonferroni_p-value.
In cell r2, typed the equation =q2*6189 and copied it through the column.
Replaced any corrected p value greater than 1 with the number 1 by typing the following formula into cell s2: =IF(r2>1,1,r2)
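
The spreadsheet formulas above can be mirrored for a single gene in a few lines of Python. This is a minimal sketch, not the spreadsheet itself: the function names and the example values are hypothetical, and the number of timepoints k is generalized from the 5 used in the worksheet. The p-value step is left out; the worksheet obtains it with FDIST(F, 5, n-5), and in Python the upper tail of the F distribution could be taken from scipy.stats.f.sf, assuming SciPy is installed.

```python
def within_strain_anova(groups):
    """groups: dict mapping a timepoint label to the list of log2 ratios
    measured for one gene at that timepoint.

    SS_HO is the sum of squares of all values (null hypothesis: the mean
    log fold change is 0 at every timepoint). SS_full adds up, for each
    timepoint, SUMSQ(values) - count * mean^2.
    """
    k = len(groups)                                    # timepoints (5 in the worksheet)
    values = [v for reps in groups.values() for v in reps]
    n = len(values)                                    # data points (23 for wild type)
    ss_h0 = sum(v * v for v in values)
    ss_full = 0.0
    for reps in groups.values():
        mean = sum(reps) / len(reps)
        ss_full += sum(v * v for v in reps) - len(reps) * mean ** 2
    # Raises ZeroDivisionError if every replicate equals its timepoint mean.
    f_stat = ((n - k) / k) * (ss_h0 - ss_full) / ss_full
    return f_stat, k, n - k                            # F and its degrees of freedom


def bonferroni(p, n_tests=6189):
    """Multiply by the number of tests and cap at 1, like columns r and s."""
    return min(1.0, p * n_tests)


# Hypothetical gene with two timepoints and two replicates each.
f_stat, df1, df2 = within_strain_anova({"t15": [1.0, 3.0], "t30": [2.0, 4.0]})
print(f_stat, df1, df2)
```

With the wild type layout (4, 5, 4, 5, and 5 replicates), k is 5 and n is 23, so the expression reduces to the ((n-5)/5)*(i2-o2)/o2 formula typed into cell p2.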

Further Notes

Encountered a few typing issues where I mixed up the replicate numbers (i.e., put 4 instead of 5 for the sum of squares of t90), but these were quickly found and amended.
When the data was sorted to find the significant genes (p-value < .05) in Column Q, 2378 out of 6189 genes were found to be significant.
- Very promising: over a third of all genes (38.4% to be exact) are significant, which suggests that the majority of this data is likely not in error
wt_Bonferroni (column R) numbers seem high, but the corrected p-values appear correct once the "if, then" function caps them at 1
- By filtering these values, 519 out of 6189 Bonferroni-corrected values were less than one.

Calculate the Benjamini & Hochberg p value Correction

Inserted a new worksheet named "B&H".
Created an index column by first typing "Index" into cell A1. Then typed "1" into cell A2 and "2" into cell A3. Selected both cells A2 and A3. Double-clicked the fill handle to fill the column with a series of numbers from 1 to 6189.
Copied and pasted the column of ID's from one of the previous worksheets into column B.
For the following, used Paste special > Paste values. Copied Column Q (the unadjusted p values) from the stats worksheet and pasted it into Column C.
Selected all of columns A, B, and C. Sorted by ascending values on Column C. Clicked the sort button from A to Z from smallest to largest.
Typed the header "Rank" in cell D1. Repeated step 2 and created a series of numbers in ascending order from 1 to 6189. This is the p value rank, smallest to largest.
Calculating Benjamini and Hochberg p value correction: Typed "STRAIN_B-H_p-value" in cell E1. Typed in cell E2: =(C2*6189)/D2 and entered. Copied that equation to the entire column.
Typed "wt_B-H_p-value" into cell F1.
Typed the following formula into cell F2: =IF(E2>1,1,E2) and pressed enter. Copied to entire column.
Selected columns A through F. Sorted them by Index in Column A in ascending order.
Copied column F and used Paste special > Paste values to paste it into column T of stats sheet.
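
The sort, rank, and re-sort steps above amount to the following calculation. This is a sketch of the worksheet's version of the correction (p * m / rank, capped at 1), with a hypothetical function name and made-up p-values. Note that the standard Benjamini-Hochberg adjusted p-value also enforces monotonicity by taking a running minimum from the largest p-value down; that step is not in the worksheet, so it is deliberately omitted here.

```python
def bh_worksheet(p_values):
    """Return p * m / rank capped at 1, in the original (Index) order."""
    m = len(p_values)
    # Indices sorted by ascending p-value, like sorting on Column C.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        adjusted[i] = min(1.0, p_values[i] * m / rank)
    return adjusted


# Hypothetical unadjusted p-values for four genes.
for p, adj in zip([0.04, 0.01, 0.03, 0.5], bh_worksheet([0.04, 0.01, 0.03, 0.5])):
    print(p, round(adj, 4))
```

Writing the results back by index plays the same role as the final sort on Column A, so the corrected values line up with the genes in the stats sheet.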

Results for Excel Portion

Resulting WildType Spreadsheet

Excel Sheet for Wild Type: Horstmann Wild Type Excel Sheet

Comparison with Known Data NSR1

NSR1 (ID: YGR159C) (Note: in wildtype excel file the data for this gene is found in row 603)
Data for NSR1
- unadjusted p value: 1.43E-08
- Bonferroni-corrected p-value: 8.86E-05
- B-H-corrected p-value: 3.85E-06
- average Log fold change at each of the time points:
  - t15: 3.068
  - t30: 3.394
  - t60: 3.414
  - t90: -1.415
  - t120: -.0570

Sanity Check: Number of genes significantly changed

Performed "sanity check" to ensure we performed our data analysis correctly. We need to find the number of genes that are significantly changed at certain cut-off points.

In "stats" worksheet, selected Menu item Data > Filter > Autofilter
Clicked drop-down arrow on Column Q to select "Custom" filter according to the following criterion:
- How many genes have p<.05? What is the percentage (out of 6189)?
- This is saying that less than 5% of the time, we would expect to see a gene expression change for at least one of the time points by chance.
  - wild-type: 38.42% (2378/6189)
  - dZAP1: 36.58% (2264/6189)
- How many genes have p<.01? What is the percentage (out of 6189)?
  - wild-type: 24.67% (1527/6189)
  - dZAP1: 23.35% (1445/6189)
- How many genes have p<.001? What is the percentage (out of 6189)?
  - wild-type: 13.90% (860/6189)
  - dZAP1: 12.80% (792/6189)
- How many genes have p<.0001? What is the percentage (out of 6189)?
  - wild-type: 7.43% (460/6189)
  - dZAP1: 6.69% (414/6189)
We would like to add that performing this procedure doesn't give us which genes pass the cut-off. Also, the Bonferroni p-value correction is very stringent and the Benjamini-Hochberg correction is less stringent. Filtered data to find:
- For the Bonferroni-corrected p value, how many genes are p<.05? What is the percentage (out of 6189)?
  - wild-type: 3.68% (228/6189)
  - dZAP1: 3.10% (192/6189)
- For the Benjamini and Hochberg-corrected p value, how many genes are p<.05? What is the percentage (out of 6189)?
  - wild-type: 26.76% (1656/6189)
  - dZAP1: 24.85% (1538/6189)
We note that the p-value is not a magical number to determine significance; rather, it is a confidence level on a movable scale (with the smaller p-values used for higher levels of confidence)
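
The filter counts and percentages in this sanity check can be reproduced with a short helper. This is a minimal sketch with hypothetical function names; the short list of p-values is made up, while the counts passed to percent() are ones reported in the list above.

```python
def count_below(p_values, cutoff):
    """Count p-values strictly below the cutoff and give the share in percent."""
    hits = sum(1 for p in p_values if p < cutoff)
    return hits, 100.0 * hits / len(p_values)


def percent(hits, total=6189):
    """Percentage of genes out of the 6189 on the array, to two decimals."""
    return round(100.0 * hits / total, 2)


# Made-up p-values: two of the four are strictly below 0.05.
print(count_below([0.001, 0.2, 0.04, 0.05], 0.05))
# Counts reported above for wild type at p < .01 and for the Bonferroni filter.
print(percent(1527), percent(228))
```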

*This data is also made clear in a powerpoint under STEM results

Clustering and Gene Ontology Analysis with STEM

STEP ONE: SET UP SOFTWARE
1. Downloaded STEM software. Click here to go to the STEM web site.
2. Clicked on the download link, registered, and downloaded the stem.zip file to Desktop.
  - Unzipped the file. Then, inside the folder created called stem, clicked on stem.cmd to launch the STEM program
STEP TWO: PREPARE DATA FOR SOFTWARE
1. Created new worksheet entitled "stem" in Excel file
  - Copied "index" column from "stats" worksheet and Paste values into column A of "stem" worksheet
2. Copied data from "stats" worksheet and Paste values into column B of "stem" worksheet
  - Renamed column A "ID"
  - Renamed column B: "SPOT"
3. Filtered data on B-H corrected p value to be >.05, then deleted these rows (except the header row)
  - then after deleting these, unfiltered the data
4. Deleted all data columns except for Average Log Fold change columns for each time point (e.g. "dZAP1_xbar_t15")
5. Renamed data columns with just time and units
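The filtering and column selection in steps 3 through 5 can be sketched as below. This is only an illustration under stated assumptions: the function name, the dictionary-per-row layout, the example gene values, and the "15m"-style labels are all hypothetical, since the notebook only says the columns were renamed with time and units.

```python
def prepare_stem_rows(rows, p_column, column_labels, cutoff=0.05):
    """Drop genes whose B-H corrected p-value is above the cutoff and keep
    only the ID, SPOT, and average log fold change columns, renamed."""
    kept = []
    for row in rows:
        if float(row[p_column]) > cutoff:
            continue  # these rows were deleted in step 3
        out = {"ID": row["ID"], "SPOT": row["SPOT"]}
        for source, label in column_labels.items():
            out[label] = row[source]
        kept.append(out)
    return kept


# Two made-up genes; only the first passes the 0.05 cutoff.
rows = [
    {"ID": 1, "SPOT": "GENE_A", "wt_xbar_t15": 1.2, "wt_xbar_t30": 1.5, "wt_B-H_p-value": 0.01},
    {"ID": 2, "SPOT": "GENE_B", "wt_xbar_t15": 0.1, "wt_xbar_t30": 0.0, "wt_B-H_p-value": 0.40},
]
labels = {"wt_xbar_t15": "15m", "wt_xbar_t30": "30m"}
print(prepare_stem_rows(rows, "wt_B-H_p-value", labels))
```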
RUN STEM
1. In section 1 (Expression Data Info) of the STEM interface window...
  - clicked "Browse" and then chose the file
  - clicked radio button "No normalization/add 0"
  - Checked box: "Spot IDs included in the data file"
2. In section 2 (Gene Info) of the STEM interface window...
  - for Gene Annotation Source, selected Saccharomyces cerevisiae (SGD)
  - for Cross Reference Source, selected "No cross references"
  - for Gene Location Source, selected "No Gene Locations"
3. In section 3 (Options) of the STEM interface window...
  - checked that Clustering Method is "STEM Clustering Method"
  - kept the defaults for Maximum Number of Model Profiles and Maximum Unit Change in Model between Time Points
4. in section 4 (Execute)
  - clicked on yellow Execute button to run STEM
VIEW & SAVE STEM RESULTS
1. In new window ("All STEM Profiles (1)") that opens, each box represents a model expression profile.
  - If it's colored, this indicates a statistically significant number of genes assigned
    - If two profiles have the same color, then they belong to the same cluster of profiles
  - All of them are arranged in order of p value: from most significant to least significant
  - The software assigns a number to each box to serve as an ID for each
2. Clicked on "Interface Options..." and at the bottom of this window, clicked on radio button: "Based on Real Time".
3. Took a screenshot of this window, and pasted into PPT presentation to save the generated figures
4. Clicked on each profile and took screenshot of a more detailed plot that opens in a new window
  - Took screenshot of each and saved each image in presentation
  - At bottom of each profile window, clicked on "Profile Gene Table" button to see the list of genes belonging to that profile
  - In window that appears, clicked "Save Table" button and saved file to Desktop
    - Made name of file descriptive of the contents (in format of: "dZAP1_profile#_genelist.txt")
  - Uploaded these files to LionShare and provided link to Dr. Dahlquist and Dr. Fitzpatrick as a zipped file
5. For each profile, clicked "Profile GO Table" to see list of Gene Ontology terms belonging to the profile
  - In window that appears, clicked "Save Table" button and saved file to your Desktop
    - Made name of file descriptive of contents (in format of: "dZAP1_profile#_GOlist.txt")
      - At this point, all data needed from STEM software is saved and we will begin interpreting it
  - Uploaded these files to LionShare and provided link to Dr. Dahlquist and Dr. Fitzpatrick as a zipped file

STEM Results

Powerpoint of STEM Figures and included table of p-values: Horstmann Results
GO Results for Profile 28 (discussed below): GO Profile 28

Analyzing and Interpreting STEM Results

Select one of the profiles you saved in the previous step for further interpretation of the data. We suggest that you choose one that has a pattern of up- or down-regulated genes at the early (first three) timepoints. Answer the following:
- Why did you select this profile? In other words, why was it interesting to you?
  - I chose profile 28 because it shows a mixture of behaviors: some genes increase dramatically, and a few decrease in expression while the majority increase. Some genes' expression spikes high and early, while others decrease at first and then eventually increase. I also chose it because some of the other profiles didn't have as much early movement as this one.
- How many genes belong to this profile?
  - 95.0 Genes are Assigned to Profile 28
- How many genes were expected to belong to this profile?
  - 18.1 Genes were expected
- What is the p value for the enrichment of genes in this profile?
  - 6.7E-38 so this p-value is considered significant
- Open the GO list file you saved for this profile in Excel. This list shows all of the Gene Ontology terms that are associated with genes that fit this profile. Select the third row and then choose from the menu Data > Filter > Autofilter. Filter on the "p-value" column to show only GO terms that have a p value of < 0.05.
- How many GO terms are associated with this profile at p < 0.05?
  - Excel found 53 of 249 terms with a p value of <.05, around 21.3%
- The GO list also has a column called "Corrected p-value". This correction is needed because the software has performed thousands of significance tests. Filter on the "Corrected p-value" column to show only GO terms that have a corrected p value of < 0.05.
- How many GO terms are associated with this profile with a corrected p value < 0.05?
  - Excel found 4 of 249 terms with a corrected p-value of <.05 (1.6%)
- Select 10 Gene Ontology terms from your filtered list (p < .05). Look up the definitions for each of the terms at http://geneontology.org.
1. GO:0008652 cellular amino acid biosynthetic process: The chemical reactions and pathways resulting in the formation of amino acids, organic acids containing one or more amino substituents
2. GO:2000113 negative regulation of cellular macromolecule biosynthetic process: Any process that stops, prevents, or reduces the frequency, rate or extent of cellular macromolecule biosynthetic process.
3. GO:1901605 alpha-amino acid metabolic process: The chemical reactions and pathways involving an alpha-amino acid
4. GO:0006950 response to stress: Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a disturbance in organismal or cellular homeostasis, usually, but not necessarily, exogenous (e.g. temperature, humidity, ionizing radiation)
5. GO:0010565 regulation of cellular ketone metabolic process: Any process that modulates the chemical reactions and pathways involving any of a class of organic compounds that contain the carbonyl group, CO, and in which the carbonyl group is bonded only to carbon atoms. The general formula for a ketone is RCOR', where R and R' are alkyl or aryl groups
6. GO:0009067 aspartate family amino acid biosynthetic process: The chemical reactions and pathways resulting in the formation of amino acids of the aspartate family, comprising asparagine, aspartate, lysine, methionine and threonine
7. GO:0009064 glutamine family amino acid metabolic process: The chemical reactions and pathways involving amino acids of the glutamine family, comprising arginine, glutamate, glutamine and proline
8. GO:1901566 organonitrogen compound biosynthetic process: The chemical reactions and pathways resulting in the formation of organonitrogen compound.
9. GO:0006082 organic acid metabolic process: The chemical reactions and pathways involving organic acids, any acidic compound containing carbon in covalent linkage
10. GO:0010604 positive regulation of macromolecule metabolic process: Any process that increases the frequency, rate or extent of the chemical reactions and pathways involving macromolecules, any molecule of high relative molecular mass, the structure of which essentially comprises the multiple repetition of units derived, actually or conceptually, from molecules of low relative molecular mass
- Write a paragraph that describes the biological interpretation of these GO terms. In other words, why does the cell react to cold shock by changing the expression of genes associated with these GO terms?
  - The cell reacted to the cold shock mostly by increasing the expression of these genes. Some of the genes decreased expression at first and then increased after cold shock, while others showed slow expression at first and then a spike after the shock. I would assume that this means that the genes are responding positively to the shock. Many of the GO terms describe pathways that form amino acids, macromolecules, organonitrogen compounds, etc. If the genes in these pathways are the same ones that increased with the cold, it could mean that the cell is being forced to produce more of these nutrients as it is introduced to the cold. Furthermore, the definition of my last GO term, "Any process that increases the frequency, rate or extent of the chemical reactions and pathways involving macromolecules...", supports my thought that the cold is increasing the rate of these reactions, as this specific group of genes was definitely prominent in the profile. The cell is always trying to return to homeostasis (also shown by the "response to stress" term in my definitions, "any process that results in a change in state or activity of a cell or an organism ... as a result of a disturbance in organismal or cellular homeostasis, usually, but not necessarily, exogenous (e.g. temperature)"), so restoring homeostasis is the main goal of increasing the expression of these genes.

Template

Back to User: User: Kristen M. Horstmann

BIOL398-04/S15
