Sahil Patel Week 4/5
Electronic Lab Notebook
- In this assignment, we used data analysis softwares and techniques in order to create a visual representation, or model, of the results and findings of the experiment to better understand the data and observations that were acquired.
Statistical Analysis Part 1: ANOVA
- We began by creating a new worksheet in Microsoft Excel using the Master Sheet given to us from our instructors; this new sheet was given the name "wild_type_ANOVA" and the information from the Master sheet was copied and pasted onto the new sheet.
- At the right of the data set, five column headers were created using the notation, wt_AvgLogFC_(TIME), where TIME was replaced by the five time points: t15, t30, t60, t90, t120.
- Then in the cell directly beneath each time point, the averages were taken for those specific timepoints. The formula used was: =AVERAGE(range of cells associated with that time point. Double click black plus icon in the bottom right corner of the cell to complete averages for entire column.
- Repeat these steps for the t30, t60, t90, and the t120 data.
- Next, in the column to the right of the wt_AvgLogFC_t120 calculation, create the column header wt_ss_HO.
- In the first cell below this header, we used the equation =SUMSQ(D2:Z2); again used the black plus sign to complete the column. This method will be used throughout the analysis to expedite the calculation process.
- In the next 5 columns to the right of wt_ss_HO, create the column headers wt_ss_(TIME), where TIME again is the five timepoints. The wild type had 23 datapoints taken. This information will be used later.
- In the first cell below the header wt_ss_t15, type =SUMSQ(D2:G2)-COUNTA(D2:G2)*AA2^2 and hit enter.
- Repeat this computation for the t30 through t120 data points. For each data point use the correct range of data associated with the timepoints and copy the formula to the whole column for each computation.
- In the first column to the right of wt_ss_t120, create the column header wt_SS_full.
- In the first cell below this header, use the formula (=SUM(AG2:AK2)) and hit enter.
- In the next two columns to the right, create the headers wt_Fstat and wt_p-value.
- In the first cell of the wt_Fstat column, type =((23-5)/5)*(<wt_ss_HO> - <wt_SS_full>)/<wt_SS_full>) and hit enter. Select the cells so in this case it would be ((23-5)/5)*(AF2-AL2)/AL2.
- In the first cell below the wt_p-value header, type =FDIST(<wt_Fstat>,5,n-5) replacing the phrase wt_Fstat with the cell designation and the n with the number of data points total, 23. Again, note that the number of timepoints is actually "4" for the dSWI4 strain. Copy to the whole column.
- Use the filter icon to to filter data to display data with a p-value < 0.05 and in the lower left corner the number of rows that match this criteria will be displayed
Calculating the Bonferroni and p-value Correction
- Label the next two columns to the right with the same label, wt_Bonferroni_p-value.
- In the cell beneath the first header use the formula: =<wt_p-value>*6189.
- The second header is used to replace any adjusted p-value that is greater than one, by the number "1". This is done by using the formula =IF(STRAIN_Bonferroni_p-value>1,1,STRAIN_Bonferroni_p-value), where STRAIN_Bonferroni_p-value refers to the cell in which the first Bonferroni p-value was calculated.
Calculating the Benjamini and Hochberg p-value Correction
- Insert a new worksheet named "(STRAIN)_ANOVA_B-H".
- Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from your previous worksheet into the first two columns of the new worksheet.
- For the following, use Paste special > Paste values. Copy your unadjusted p values from your ANOVA worksheet and paste it into Column D.
- Select all of columns A, B, C, and D from smallest to largest by clicking the sort button from A to Z on the toolbar.
- Type the header "Rank" in cell E1. This will be used to rank numbers in ascending order from 1 to 6189 in this column based on p value rank. Type "1" into cell E2 and "2" into cell E3. Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
- Now calculate the Benjamini and Hochberg p-value correction. Type wt_B-H_p-value in cell F1. Type the following formula in cell F2: =(D2*6189)/E2 and press enter. Copy that equation to the entire column.
- Type "STRAIN_B-H_p-value" into cell G1.
- Type the following formula into cell G2: =IF(F2>1,1,F2) and press enter. Copy that equation to the entire column.
- Select columns A through G. Now sort them by your MasterIndex in Column A in ascending order.
- Copy column G and use Paste special > Paste values to paste it into the next column on the right of your ANOVA sheet.'
- After completing this procedure, the excel spreadsheets were submitted to the class BOX and shared with the instructors via emails.
Clustering and GO Term Enrichment Using Stem Part 2
Prepare your microarray data file for loading into STEM
- Insert a new worksheet into your Excel workbook, and name it "wt_stem".
- Select all of the data from your "wt_ANOVA" worksheet and Paste special > paste values into your "wt_stem" worksheet.
- Rename "Master Index" to "Spot"and "ID" to "Gene Symbol", also delete column "Standard Name".
- Cluster genes that have significant change by filtering data using B-H corrected p-value.
- Delete all columns except for Average Log Fold Change Columns for each time points and rename them as 15m, 30m, etc...
- Save work as Tab-Delimited file type (.txt).
- Download and extract STEM software and open "stem.jar"
- Download files: "gene_ontology.obo" and "gene_association.sgd.gz".
- Run stem and in section 1 (Expression Data Info), load the .txt file that was saved earlier.
- click on radio button for no normalization/add 0.
- click box for Spot IDs included in the data file.
- In section 2 (Gene Info), leave defaults but for the "Gene Annotation File" browse for and open "gene_association.sgd.gz".
- In section 3 (Options), leave defaults so no modifications
- In section 4, click Execute to run the STEM program.
Viewing and Saving STEM Results
- Click on Interface Options on the new window and set the X-axis to based on real time.
- Select every significant (colored) profile and screenshot it and add it to the powerpoint.
- At the bottom of each profile window, there are two yellow buttons "Profile Gene Table" and "Profile GO Table". For each of the profiles, click on the "Profile Gene Table" button to see the list of genes belonging to the profile and click on the "Profile GO Table" to see the list of Gene Ontology terms belonging to the profile.
- For each profile click on the Save table for each of the lists that are viewed and upload these to OpenWetWare. These are located under STEM Results.
Analyzing and Interpreting STEM Results
- Profile 9 was selected and interpreted
- Open the GO list file for Profile 9 in Excel and select the third row and then choose from the menu Data > Filter > Autofilter.
- Filter on the "p-value" column to show only GO terms that have a p value of < 0.05.
- Then filter on the "Corrected p-value" column to show only GO terms that have a corrected p value of < 0.05.
- Select 6 Gene Ontology terms from your filtered list using the definitions from http://geneontology.org.
Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes
- Open Profile 9 in Excel, copy its gene list and paste into the YEASTRACT online database.
- Select the box for Check for all TF and Rank accordingly
- Along with HAP4, GLN3, and ZAP1, select the other transcription factors that were either green or yellow and list them (14 in total).
- Copy and paste the list of transcription factors you identifiedinto both the "Transcription factors" field and the "Target ORF/Genes" field.
- Generate csv file using the "Regulations Filter" options of "Documented", "Only DNA binding evidence".
- Save the .csv file reading Regulation Matrix.
Visualizing Your Gene Regulatory Networks with GRNsight
- Format the output files from YEASTRACT in Excel and transpose the adjacency matrix so that it is usable in our visualization software.
- The labels for the genes in the columns and rows need to match and title Cell A1 "rows genes affected/cols genes controlling".
- Alphabatize the gene labels both across the top and side.
- Visualize in GRNsight by opening file via website and select grid layout to easily view results.
- Delete any nodes or genes that are unattached and screenshot this and add it to the powerpoint.
- The Excel spreadsheet for this experimental analysis can be found here.
- To view a powerpoint of the figures which were developed, click here.
Sanity Check Protocol Questions
- How many genes have p < 0.05? and what is the percentage (out of 6189)?
- How many genes have p < 0.01? and what is the percentage (out of 6189)?
- How many genes have p < 0.001? and what is the percentage (out of 6189)?
- How many genes have p < 0.0001? and what is the percentage (out of 6189)?
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?
- Find NSR1 in your dataset. What is its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average Log fold change at each of the timepoints in the experiment?
- Unadjusted = 2.87*10^-10
- Bonferroni-corrected = 1.77*10^-6
- B-H-corrected = 8.88*10^-7
- Average Log Fold Change: t-15 = 3.28, t-30 = 3.62, t-60 = 3.53, t-90 = -2.05, t-120 = -0.61
Why did you select this profile? In other words, why was it interesting to you?
- Profile 9 followed a down regulation until the 60m point at which it began to up regulate back to the baseline and it potentially could have continued with this trend. This profile intrigued me because the number of genes which belonged to this profile was much higher than what was expected.
How many genes belong to this profile?
How many genes were expected to belong to this profile?
What is the p-value for the enrichment of genes in this profile?
How many GO terms are associated with this profile at p<0.05?
- 116 of 498
How many GO terms are associated with this profile with a corrected p-value<0.05?
- 22 of 498
Definitions for GO terms (All definitions taken from http://geneontology.org/)
- Small molecule catabolic process - The chemical reactions and pathways resulting in the breakdown of small molecules, any low molecular weight, monomeric, non-encoded molecule.
- Bounding membrane of organelle - The lipid bilayer that forms the outer-most layer of an organelle.
- Endoplasmic reticulum to Golgi vesicle-mediated transport - The directed movement of substances from the endoplasmic reticulum (ER) to the Golgi, mediated by COP II vesicles. Small COP II coated vesicles form from the ER and then fuse directly with the cis-Golgi. Larger structures are transported along microtubules to the cis-Golgi.
- Protein transport - The directed movement of proteins into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore.
- Vesicle-mediated transport - A cellular transport process in which transported substances are moved in membrane-bounded vesicles; transported substances are enclosed in the vesicle lumen or located in the vesicle membrane.
- Organelle subcompartmet - A compartment that consists of a lumen and an enclosing membrane, and is part of an organelle.
How many transcription factors are green or "significant"?
Are GLN3, HAP4, and/or ZAP1 on the list? If so, what is their "% in user set", "% in YEASTRACT", and "p value"?
- GLN3: % in user set - 21.50%, % in YEASTRACT - 3.62%, p value ~ 0.02793
- HAP4: % in user set - 20.50%, % in YEASTRACT - 4.11%, p-value ~ 0.003855
- ZAP1: % in user set - 24.50%, % in YEASTRACT - 3.18%, p-value ~ 0.134785
List of Additional Transcription Factors Selected (14)
In this assignment the microarray data of wild type yeast was analyzed . Excel was used to conduct ANOVA test and find p-values which were also corrected using the Bonferroni and the Benjamini-Hochberg methods. There were more than 309 genes (>5%) which allows us to conclude that some genes were significantly changed. These genes were then clustered using the STEM software and each significant profile was viewed. Profile 9 was then interpreted and analyzed using YEASTRACT to infer which transcription factors regulated certain genes of which their were 8 in total. This low of a number was expected because the wild type was analyzed in this experiment thus there was little to no variation to the expected results.
I worked with my homework partner Angela Abarquez in class on February 12th where we started our ANOVA testing in Microsoft Excel. Outside of class we met on February 13th and met in the library to compare our final spreadsheets and powerpoint data tables. We also communicated via text on February 20th to confirm that we were selecting different profiles for our analysis.
Except for what is noted above, this individual journal entry was completed by me and not copied from another source.
The Week 4 Assignment Information was used as the instruction resource of this assignment. The methods and protocol questions were also developed from this page. The p-value powerpoint table was also taken from a template provided by Dr. Dahlquist. Dr. Dahlquist's provided the Excel file "BIOL388_S19_microarray-data_wt" with the raw data obtained from her lab.
Gene Ontology Resource. (n.d.). Retrieved February 24, 2019, from http://geneontology.org/
GRNsight. (n.d.). Retrieved February 24, 2019, from http://dondi.github.io/GRNsight/
Stem software. (n.d.). Retrieved February 24, 2019, from http://www.cs.cmu.edu/~jernst/stem/
YEASTRACT. (n.d.). Retrieved February 24, 2019, from http://www.yeastract.com