# Kara M Dismuke Week 11 Journal

## Electronic Journal Notebook

### Procedure

#### Background

To analyze microarray data:

• Using GenePix Pro Software...
• In each spot, quantitate the fluorescence signal
• Calculate ratio of red/green fluorescence
• Log transform ratios
• Normalize ratios on each microarray slide
• Using a script in R...
• Normalize ratios for set of slides
• Using Microsoft Excel...
• Perform statistical analysis on ratios
• Compare genes w/ known data
• Using STEM software...
• Map to biological pathways
• Using MATLAB...
• Create mathematical model of transcriptional network

Project Partner: Kristen M. Horstmann
Dahlquist lab microarray data set: Wild type vs. dZAP1

• Note: Kristen will be analyzing the Wild type strain data and I (Kara) will be analyzing the dZAP1 strain data.
• For dZAP1...
• t15: 4 replicates
• t30: 4 replicates
• t60: 4 replicates
• t90: 4 replicates
• t120: 4 replicates
• Procedural Notes:

#### Statistical Analysis: ANOVA

1. In the Excel data file, we created a new sheet, and named it "stats"
1. In first row...
2. Copied first two columns into the stats sheet (ID, Standard name)
1. Labeled columns C-G using the format: (STRAIN)_xbar_(TIME) in first row
• In the dZAP1 case: "dZAP1_xbar_t15", "dZAP1_xbar_t30", "dZAP1_xbar_t60", "dZAP1_xbar_t90", "dZAP1_xbar_t120"
2. Labeled columns H and I using the format: (STRAIN)_xbar_grand and (STRAIN)_ss_HO
• In the dZAP1 case: "dZAP1_xbar_grand", "dZAP1_ss_HO"
3. Labeled columns J-N for the per-time-point sums of squares (one column per time point, t15 through t120), and columns O-Q using the format: (STRAIN)_SS_full, (STRAIN)_Fstat, and (STRAIN)_p-value
• In the dZAP1 case, for columns O-Q: "dZAP1_SS_full", "dZAP1_Fstat", "dZAP1_p-value"
3. Performed Computations
1. In C2, typed =AVERAGE(
• Clicked the tab containing the data, highlighted all the cells in row 2 associated with dZAP1 and t15, closed the parenthesis with ), and pressed "Enter"
• Then, we did this for t30, t60, t90, and t120 (to compute the average at each time point)
• Clicked cell C2, positioned the cursor at the bottom right corner, and once we saw a plus sign, double-clicked so the formula was copied down the column for all the other genes
2. Performed similar operation for D2 (with t30 data), E2 (with t60 data), F2 (with t90 data), and G2 (with t120 data)
3. Performed similar operation for all dZAP1 data in H2 (entire row 2 instead of specific time points)
• Note: this computes the "grand" average of all of our data for a particular gene. For dZAP1, it could be computed either by averaging all 20 data points directly or by averaging the five time-point averages computed previously. Because every time point has the same number of replicates (4), both methods give the same value.
4. In I2, typed =SUMSQ(
• Clicked on the data sheet, highlighted all the data in row 2 for dZAP1, then closed the parenthesis and pressed "Enter"
5. We noted there were 4 data points at each time point, and with 5 time points, there were 20 data points in total.
6. Typed =SUMSQ(data!C2:F2)-4*stats!C2^2 in J2 and pressed "Enter"
• Fill formula to entire column using previously discussed procedure
7. Perform step 6 for K2 (t30), L2 (t60), M2 (t90), and N2 (t120)
8. Set total number of data points equal to (n=20)
9. In P2, typed =((n-5)/5)*(I2-O2)/O2 and hit "Enter" (note: n=20; O2 is dZAP1_SS_full, which in this layout would be the sum of the per-time-point values in J2:N2)
• Fill formula to entire column using previously discussed procedure
10. In Q2, typed =FDIST(P2, 5, n-5) (note: n=20; P2 is the F statistic from step 9)
• Fill formula to entire column using previously discussed procedure
11. Adjusted p-value to correct for multiple testing problem
1. Labeled column R (in first row): "dZAP1_Bonferroni_p-value"
2. Typed =Q2*6189 in R2 and hit "Enter"
• Fill formula to entire column using previously discussed procedure
3. In S2, typed formula =IF(R2>1,1,R2) to replace any adjusted p value greater than 1 by number 1
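The spreadsheet steps above can be sketched in Python for a single gene. This is a minimal illustration rather than the lab's script: the function names `anova_row` and `bonferroni` are hypothetical, and it assumes that column O (SS_full) is the sum of the per-time-point values in J-N. The p-value itself (Excel's `FDIST`, the upper tail of the F distribution with 5 and n-5 degrees of freedom) is not computed here; a statistics library such as SciPy would be needed for that.

```python
def anova_row(groups):
    """Within-gene F statistic testing H0: every time-point mean is 0.

    groups: one list of log ratios per time point (5 lists of 4 values here).
    Returns (means, ss_h0, ss_full, f_stat).
    """
    n = sum(len(g) for g in groups)   # total data points (20 in this data set)
    k = len(groups)                   # number of time points (5)
    means = [sum(g) / len(g) for g in groups]          # columns C-G
    # Sum of squares under H0 (all means zero), like =SUMSQ(all data) in column I
    ss_h0 = sum(x * x for g in groups for x in g)
    # Per-time-point SUMSQ(...) - replicates * mean^2 (columns J-N), then summed.
    # Assumption: this sum is what column O (SS_full) holds.
    ss_full = sum(sum(x * x for x in g) - len(g) * m * m
                  for g, m in zip(groups, means))
    f_stat = ((n - k) / k) * (ss_h0 - ss_full) / ss_full   # column P
    return means, ss_h0, ss_full, f_stat


def bonferroni(p, n_tests=6189):
    """Bonferroni correction, capped at 1 (columns R and S)."""
    return min(1.0, p * n_tests)
```

Note that a gene whose replicates are identical at every time point gives `ss_full == 0`, which would raise a division error here, just as the spreadsheet would show `#DIV/0!`.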
##### Calculate the Benjamini Hochberg p value Correction
1. Named new worksheet "B&H"
2. Created index column
• in A1, typed "Index"
• in A2, typed "1"
• in A3, typed "2"
• Selected cells A2 and A3 and used the previously discussed fill procedure to fill the cells in column A
• This creates a series of numbers from 1 to 6189 in column A
3. In Column B, copied ID Column from one of previous worksheets
• To paste, used Paste special > Paste values
4. Copied column Q (unadjusted p-values) from the "stats" worksheet and pasted it into Column C
5. Selected columns A-C, and then sorted by Column C in ascending order
6. In D1, typed "Rank"
• Created series of numbers from 1-6189 using previously discussed procedure
7. In E1, typed "STRAIN_B-H_p-value" (dZAP1_B-H_p-value)
8. In E2, typed =(C2*6189)/D2 and pressed "Enter"
• Filled the equation down the entire E column
9. In F1, typed "STRAIN_B-H_p-value"
• In F2, typed =IF(E2>1,1,E2) and pressed "Enter"
• Filled the equation down the entire F column
10. Selected columns A-F, and sorted them by the Index (Column A) in ascending order
11. Copied column F and used Paste special > Paste values, to paste into Column T of stats sheet
• Labeled Column T in stats sheet "dZAP1_B-H_p-value"
12. Uploaded Excel file to LionShare
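As a rough cross-check, the Benjamini & Hochberg steps above can be written as a short Python function. This is a sketch of the spreadsheet procedure as described (p-value times the number of genes, divided by rank, capped at 1), and the name `bh_adjust` is hypothetical. Standard implementations usually add a further step that forces the adjusted values to be non-decreasing with rank; that step is not part of the procedure recorded here, so it is left out.

```python
def bh_adjust(p_values):
    """Benjamini & Hochberg adjustment following the worksheet steps.

    Returns adjusted p-values in the same order as the input, which
    mirrors re-sorting the sheet by the Index column at the end.
    """
    m = len(p_values)                      # 6189 genes in this data set
    # Sort positions by ascending p-value (the "sort by Column C" step)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    for rank, i in enumerate(order, start=1):
        # Column E: p * m / rank; column F: cap at 1
        adjusted[i] = min(1.0, p_values[i] * m / rank)
    return adjusted
```

Writing each result back to its original position `i` takes the place of the final sort on the Index column.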
##### Sanity Check: Number of genes significantly changed
• Did sanity check to ensure we performed the data analysis correctly
• need to find the number of genes that are significantly changed at certain cut-off points
1. In "stats" worksheet, selected Menu item Data > Filter > Autofilter
2. Clicked the drop-down arrow on Column Q to select a "Custom" filter according to the following criteria:
• How many genes have p<.05? What is the percentage (out of 6189)?
• A cutoff of p<.05 means that, for a gene whose expression is not really changing, we would expect to see a change this large at one or more of the time points by chance less than 5% of the time.
• How many genes have p<.01? What is the percentage (out of 6189)?
• How many genes have p<.001? What is the percentage (out of 6189)?
• How many genes have p<.0001? What is the percentage (out of 6189)?
3. Noted: performing this procedure doesn't give us which genes pass the cut-off.
4. Noted: the Bonferroni p-value correction is very stringent and the Benjamini-Hochberg correction is less stringent. Filtered data to find:
• For the Bonferroni-corrected p value, how many genes are p<.05? What is the percentage (out of 6189)?
• For the Benjamini and Hochberg-corrected p value, how many genes are p<.05? What is the percentage (out of 6189)?
5. Noted: p-value is not a magical number to determine significance; rather, it is a confidence level on a movable scale (with the smaller p-values used for higher levels of confidence)
6. Compared results with expression of gene NSR1 (ID: YGR159C)
• What are its unadjusted, Bonferroni-corrected, and B-H-corrected p values?
• What is its average Log fold change at each of the timepoints in the experiment?
7. Compared my numbers (for dZAP1) to Kristen's numbers (for wildtype) and organized findings in a table in a PowerPoint slide
• Gave slide meaningful title
• Uploaded this slide to individual journal page
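The filter-and-count questions above amount to counting how many p-values fall under each cutoff. A minimal Python sketch, assuming the p-value column has been read into a list (the names `count_below` and `sanity_table` are hypothetical):

```python
def count_below(p_values, cutoff):
    """Number and percentage of genes with p < cutoff."""
    hits = sum(1 for p in p_values if p < cutoff)
    return hits, 100.0 * hits / len(p_values)


def sanity_table(p_values, cutoffs=(0.05, 0.01, 0.001, 0.0001)):
    """Counts and percentages at each cutoff used in the sanity check."""
    return {c: count_below(p_values, c) for c in cutoffs}
```

The same `count_below` call with a cutoff of 0.05 would apply to the Bonferroni and B-H columns.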
##### Sanity Check Results
###### p-value

Unadjusted p-value < .05

• wild-type: 38.42% (2378/6189)
• dZAP1: 36.58% (2264/6189)

Unadjusted p-value < .01

• wild-type: 24.67% (1527/6189)
• dZAP1: 23.35% (1445/6189)

Unadjusted p-value < .001

• wild-type: 13.90% (860/6189)
• dZAP1: 12.80% (792/6189)

Unadjusted p-value < .0001

• wild-type: 7.43% (460/6189)
• dZAP1: 6.69% (414/6189)

Benjamini & Hochberg-adjusted p-value < .05

• wild-type: 26.76% (1656/6189)
• dZAP1: 24.85% (1538/6189)

Bonferroni-adjusted p-value < .05

• wild-type: 3.68% (228/6189)
• dZAP1: 3.10% (192/6189)

Note: We've organized this data into a table in a PPT slide: Table 1 (wildtype and dZAP1 p-values)

• Note that the wild-type data used came from Kristen.
###### Comparison with NSR1 (ID: YGR159C)
• Note in my Excel file the data for this gene is found in row 2362
• Data for NSR1
• Bonferroni-corrected p-value: .000362787
• B-H-corrected p-value: 1.25099x10^-5
• average Log fold change at each of the time points
• t15: 3.5895
• t30: 3.394075
• t60: 3.609025
• t90: -1.9752
• t120: .026325

#### Clustering and Gene Ontology Analysis with STEM

1. STEP ONE: SET UP SOFTWARE
2. Clicked on the provided link, registered, and downloaded the stem.zip file to the Desktop.
• Unzipped file. Clicked on stem.jar to launch STEM program
• If a problem is encountered: inside the folder created called stem, double-click on stem.cmd to launch the STEM program
2. STEP TWO: PREPARE DATA FOR SOFTWARE
1. Created new worksheet entitled "stem" in Excel file
• Copied "index" column from "stats" worksheet and pasted values into column A of "stem" worksheet
2. Copied data from "stats" worksheet and pasted values into column B of "stem" worksheet
• Renamed column A "ID"
• Renamed column B: "SPOT"
3. Filtered data on B-H corrected p value to be >.05, then deleted these rows (except the header row)
• then after deleting these, unfiltered the data
4. Deleted all data columns except for Average Log Fold change columns for each time point (e.g. "dZAP1_xbar_t15")
5. Renamed data columns with just time and units
• I.e., renamed "dZAP1_xbar_t15" to "15m"
• Saved work and then used "Save As" to save the file in the "Text (Tab-delimited) (*.txt)" format
3. STEP THREE: RUN STEM
1. In section 1 (Expression Data Info) of the STEM interface window...
• clicked "Browse" and then chose the tab-delimited text file
• checked box: "Spot IDs included in the data file"
2. In section 2 (Gene Info) of the STEM interface window...
• for Gene Annotation Source, selected Saccharomyces cerevisiae (SGD)
• for Cross Reference Source, selected "No cross references"
• for Gene Location Source, selected "No Gene Locations"
3. In section 3 (Options) of the STEM interface window...
• left as is...
• Clustering Method is "STEM Clustering Method"
• Defaults for Maximum Number of Model Profiles and Maximum Unit Change in Model between Time Points
4. in section 4 (Execute)
• clicked on Execute button to run STEM
4. STEP FOUR: VIEW & SAVE STEM RESULTS
1. In new window ("All STEM Profiles (1)") that opens, noted that each box represents a model expression profile.
• Noted: A colored box indicates that a statistically significant number of genes was assigned to that profile
• Noted: If two profiles have the same color, then they belong to the same cluster of profiles
• Noted: All of them are arranged in order of p value: from most significant to least significant
• Noted: The software assigns a number to each box to serve as an ID for each
2. Clicked on "Interface Options..." and, at the bottom of this window, clicked the radio button "Based on Real Time" (next to "X-axis scale should be"). Then closed this window.
3. Took a screenshot of this window (using Snip program), and pasted into PPT presentation to save the generated figures
4. Clicked on each profile to open a more detailed plot in a new window
• Took a screenshot of each plot, then saved it and inserted it in my presentation
• For each profile, clicked on "Profile Gene Table" button to see the list of genes belonging to that profile
• In the window that appears, clicked "Save Table" button and saved the file to the Desktop
• Named each file in the format: "dZAP1_profile#_genelist.txt"
5. For each profile, clicked "Profile GO Table" to see list of Gene Ontology terms belonging to the profile
• In the window that appears, clicked "Save Table" button and saved the file to the Desktop
• Named each file in the format: "dZAP1_profile#_GOlist.txt"
• At this point, all data needed from STEM software is saved and interpretation can begin
• Zipped all the genelist, all of the GOlist files, and the STEM text file and uploaded them as one file to LionShare and provided link to Dr. Dahlquist and Dr. Fitzpatrick.
5. STEP FIVE: ANALYSIS & INTERPRETATION OF RESULTS
• Selected profile X then answered:
• Why did you select this profile? In other words, why was it interesting to you?
• How many genes belong to this profile?
• How many genes were expected to belong to this profile?
• What is the p value for the enrichment of genes in this profile?
• Opened the GO list file you saved for this profile in Excel. This list shows all of the Gene Ontology terms that are associated with genes that fit this profile.
• Selected 3rd row and then chose from the menu Data > Filter > Autofilter to filter "p-value" column to show only GO terms that have a p value of < 0.05
• Answered: How many GO terms are associated with this profile at p < 0.05?
• Filtered "Corrected p-value" column to show only GO terms that have a corrected p value of < 0.05. How many GO terms are associated with this profile with a corrected p value < 0.05?
• Selected 10 Gene Ontology terms from the filtered list (either p < 0.05 or corrected p < 0.05) and looked up the definitions for each of the terms at http://geneontology.org.
• Copied and pasted the GO ID (e.g. GO:0044848) into search field "Search GO Data"
• In the results page, clicked on "Link to detailed information about <term>" (in this case, "biological phase").
• Wrote a paragraph that describes the biological interpretation of these GO terms to answer why the cell reacts to cold shock by changing the expression of genes associated with these GO terms
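The data preparation in STEP TWO above (drop genes whose B-H corrected p value is above .05, keep only the identifier and average log fold change columns, rename the time columns, save as tab-delimited text) could also be scripted. The following is a sketch under stated assumptions, not the procedure actually used: `write_stem_table` is a hypothetical helper, and the demo rows are made-up values that do not come from the dZAP1 data set.

```python
import csv
import io


def write_stem_table(rows, id_columns, mean_columns, bh_column, out, cutoff=0.05):
    """Write a tab-delimited table for STEM and return the number of genes kept.

    rows: one dict per gene, keyed by the "stats" sheet column headers.
    Genes with a B-H corrected p value above `cutoff` are dropped.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Rename e.g. "dZAP1_xbar_t15" to "15m" (time and units only)
    header = list(id_columns) + [c.rsplit("_t", 1)[1] + "m" for c in mean_columns]
    writer.writerow(header)
    kept = 0
    for row in rows:
        if float(row[bh_column]) > cutoff:
            continue
        writer.writerow([row[c] for c in id_columns] + [row[c] for c in mean_columns])
        kept += 1
    return kept


# Hypothetical demo rows (made-up values, for illustration only)
demo_rows = [
    {"SPOT": "1", "ID": "GENE_A", "dZAP1_xbar_t15": "0.5",
     "dZAP1_xbar_t30": "-0.2", "dZAP1_B-H_p-value": "0.01"},
    {"SPOT": "2", "ID": "GENE_B", "dZAP1_xbar_t15": "0.1",
     "dZAP1_xbar_t30": "0.0", "dZAP1_B-H_p-value": "0.2"},
]
buffer = io.StringIO()
kept = write_stem_table(demo_rows, ["SPOT", "ID"],
                        ["dZAP1_xbar_t15", "dZAP1_xbar_t30"],
                        "dZAP1_B-H_p-value", buffer)
```

Passing an open file instead of the `io.StringIO` buffer would produce the text file that STEM reads in STEP THREE.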

#### STEM Results

• STEM Text file
• PPT with p-value table and screen shot of dZAP1 stem results
• This PPT will serve as the basis for my final presentation (more to be added in the future)
• Note: individual genelist and GOlist files were zipped together with the STEM Text file and uploaded to LionShare
##### Analysis of STEM Results

dZAP1: profile 45

• I selected this profile because it yielded the most significant results (its p-value was reported as 0.00 at the precision displayed) and because of its seemingly oscillating behavior.
• 456 genes were assigned to this profile.
• 40.6 genes were expected to belong to this profile.
• As noted above, the p value for this profile was reported as 0.00 (limited by the precision displayed).
• Opened the GO list file you saved for this profile in Excel. This list shows all of the Gene Ontology terms that are associated with genes that fit this profile.
• Filtered p-value (p < .05) in Excel to find that Excel found 229 out of 803 records
• Filtered "Corrected p-value" (p < .05) in Excel to find that Excel found 21 of 803 records
• Selected 10 Gene Ontology terms from your filtered list (either p < 0.05 or corrected p < 0.05). Look up the definitions for each of the terms at http://geneontology.org. Write a paragraph that describes the biological interpretation of these GO terms. In other words, why does the cell react to cold shock by changing the expression of genes associated with these GO terms?
• 10 Gene Ontology terms (chosen from list after corrected-p-value filter was applied):
1. GO:0000472...endonucleolytic cleavage to generate mature 5'-end of SSU-rRNA from (SSU-rRNA, 5.8S rRNA, LSU-rRNA)
• Endonucleolytic cleavage between the 5'-External Transcribed Spacer (5'-ETS) and the 5' end of the SSU-rRNA of a tricistronic rRNA transcript that contains the Small Subunit (SSU) rRNA, the 5.8S rRNA, and the Large Subunit (LSU) rRNA in that order from 5' to 3' along the primary transcript, to produce the mature end of the SSU-rRNA.
2. GO:0006400...tRNA modification
• The covalent alteration of one or more nucleotides within a tRNA molecule to produce a tRNA molecule with a sequence that differs from that coded genetically.
3. GO:0032040...small-subunit processome
• A large ribonucleoprotein complex that is an early preribosomal complex. In S. cerevisiae, it has a size of 80S and consists of the 35S pre-rRNA, early-associating ribosomal proteins most of which are part of the small ribosomal subunit, the U3 snoRNA and associated proteins.
4. GO:0008186...RNA-dependent ATPase activity
• Catalysis of the reaction: ATP + H2O = ADP + phosphate; this reaction requires the presence of RNA, and it drives another reaction.
5. GO:0042255...ribosome assembly
• The aggregation, arrangement and bonding together of the mature ribosome and of its subunits.
6. GO:0043231...intracellular membrane-bounded organelle
• Organized structure of distinctive morphology and function, bounded by a single or double lipid bilayer membrane and occurring within the cell. Includes the nucleus, mitochondria, plastids, vacuoles, and vesicles. Excludes the plasma membrane.
7. GO:0006399...tRNA metabolic process
• The chemical reactions and pathways involving tRNA, transfer RNA, a class of relatively small RNA molecules responsible for mediating the insertion of amino acids into the sequence of nascent polypeptide chains during protein synthesis. Transfer RNA is characterized by the presence of many unusual minor bases, the function of which has not been completely established.
8. GO:0043227...membrane-bounded organelle
• Organized structure of distinctive morphology and function, bounded by a single or double lipid bilayer membrane. Includes the nucleus, mitochondria, plastids, vacuoles, and vesicles. Excludes the plasma membrane.
9. GO:0000480...endonucleolytic cleavage in 5'-ETS of tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)
• Endonucleolytic cleavage within the 5'-External Transcribed Spacer (ETS) of a tricistronic rRNA transcript that contains the Small Subunit (SSU) rRNA, the 5.8S rRNA, and the Large Subunit (LSU) rRNA in that order from 5' to 3' along the primary transcript. Endonucleolytic cleavage within the 5'-ETS of the pre-RNA is conserved as one of the early steps of rRNA processing in all eukaryotes, but the specific position of cleavage is variable.
10. GO:0070035...acyl binding
• Interacting selectively and non-covalently with an acyl group, any group formally derived by removal of the hydroxyl group from the acid function of a carboxylic acid.
###### Biological Interpretation of STEM Results

In reacting to cold shock, gene expression increased until the t15 mark, stayed relatively steady through the t60 mark, decreased slightly from t60 to t90, and then leveled off from t90 until t120. The most extreme cases of gene expression occurred at t15, where one gene increased by approximately 9 (more expression), and at t90, where one gene decreased by approximately 11 (less expression). It is hard for me to relate many of these terms to this discussion, as my background in biology is limited, but looking at the profile, there does seem to be a tendency for most genes to experience minimal amounts of expression change in response to the cold shock. This is evidenced by the fact that most of the lines shown in the profile (each line representing a specific gene) lie around 0 (roughly between 4 and -2), and thus it appears as though this is a reflection of a cell's tendency towards homeostasis.