Dahlquist:Clustering Microarray Data with STEM
From OpenWetWare
This page contains the protocol used during SURP 2019.
Clustering and GO Term Enrichment with STEM
- Go to the Dahlquist Lab repository on GitHub to obtain the microarray data.
- Microarray data for each of the strains used in analysis:
- Prepare your microarray data file for loading into STEM.
- Insert a new worksheet into your Excel workbook, and name it "(STRAIN)_stem".
- Select all of the data from your "(STRAIN)_ANOVA" worksheet and Paste special > paste values into your "(STRAIN)_stem" worksheet.
- The leftmost column (column A) is named "Systematic Name". Copy this column and paste it into column B, which is named "Standard Name". Rename column B to "Gene Symbol".
- In column A, number each of the rows (row 2 with a "1" and row 3 with a "2"). Select cells A2 and A3, hover the cursor over the lower right corner of the black selection box, and double-click to fill in the rest of the rows. Rename this column to "SPOT".
- Filter the data on the B-H corrected p value to be > 0.05 (that's greater than in this case).
- Once the data has been filtered, select all of the rows (except for your header row) and delete the rows by right-clicking and choosing "Delete Row" from the context menu. Undo the filter. This ensures that we will cluster only the genes with a "significant" change in expression and not the noise.
- For the dHMO1 strain, filter the data on the normal, uncorrected p value to be > 0.05 instead.
- Delete all of the data columns EXCEPT for the Average Log Fold change columns for each timepoint (for example, wt_AvgLogFC_t15, etc.).
- Rename the data columns with just the time and units (for example, 15m, 30m, etc.).
- Save your work. Then use Save As to save this spreadsheet as Text (Tab-delimited) (*.txt). Click OK to the warnings and close your file.
- Note that you should turn on the file extensions if you have not already done so.
- Excel workbook for each strain:
- dCIN5: dCIN5_one_strain_ANOVA_out_data_add_stem.xls
- dGLN3: dGLN3_one_strain_ANOVA_out_data_add_stem.xls
- dHAP4: dHAP4_one_strain_ANOVA_out_data_add_stem.xls
- dHMO1: dHMO1_one_strain_ANOVA_out_data_add_stem.xls
- dZAP1: dZAP1_one_strain_ANOVA_out_data_add_stem.xls
- Wild-Type: wt_one_strain_ANOVA_out_data_add_stem.xls
- Text file spreadsheet for each strain:
- dCIN5: dCIN5_one_strain_ANOVA_out_data_add_STEM.txt
- dGLN3: dGLN3_one_strain_ANOVA_out_data_add_stem.txt
- dHAP4: dHAP4_one_strain_ANOVA_out_data_add_stem.txt
- dHMO1: dHMO1_one_strain_ANOVA_out_data_add_stem.txt
- dZAP1: dZAP1_one_strain_ANOVA_out_data_add_stem.txt
- Wild-Type: wt_one_strain_ANOVA_out_data_add_stem.txt
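The spreadsheet preparation steps above can also be sketched in code. This is a minimal Python illustration, not part of the original protocol. It assumes the "(STRAIN)_ANOVA" worksheet has already been read into a list of rows keyed by column header. The function names, the `p_column` argument, and the example header `wt_BH_p` are hypothetical, because the protocol does not give the actual name of the p value column. "Systematic Name" and the `wt_AvgLogFC_t15` naming pattern come from the steps above.

```python
import csv
import io
import re


def prepare_stem_table(rows, p_column, strain, p_cutoff=0.05):
    """Build the STEM input table from ANOVA rows (a list of dicts).

    p_column is the header of the p value column to filter on. Its real
    name in the workbook is an assumption, so the caller supplies it.
    """
    # Find columns such as wt_AvgLogFC_t15 and rename them to 15m, 30m, etc.
    pattern = re.compile(rf"^{re.escape(strain)}_AvgLogFC_t(\d+)$")
    fc_columns = []
    for name in rows[0]:
        match = pattern.match(name)
        if match:
            fc_columns.append((name, match.group(1) + "m"))

    table = [["SPOT", "Gene Symbol"] + [new for _, new in fc_columns]]
    # SPOT numbers are assigned before filtering, as in the spreadsheet steps.
    for spot, row in enumerate(rows, start=1):
        if float(row[p_column]) > p_cutoff:
            continue  # rows with p > cutoff are the ones deleted
        values = [row[old] for old, _ in fc_columns]
        table.append([spot, row["Systematic Name"]] + values)
    return table


def to_tab_delimited(table):
    """Serialize the table as tab-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(table)
    return buffer.getvalue()
```

Because the SPOT numbers are assigned before the filter is applied, the surviving rows keep their original numbering, just as they do in the Excel procedure.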
- Now download and extract the STEM software.
- Click here to download an archived version of the stem.zip file to your Desktop.
- Unzip the file. Right-click on the file icon and select the menu item 7-zip > Extract Here.
- This will create a folder called stem.
- You now need to download the Gene Ontology and yeast GO annotations and place them in this folder.
- Click here to download the file "gene_ontology.obo".
- Click here to download the file "gene_association.sgd.gz".
- Inside the folder, double-click on the stem.jar to launch the STEM program.
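Before launching STEM, it can help to confirm that the stem folder contains everything the steps above put there. The following Python sketch is an optional check, not part of the original protocol. The function name `missing_stem_files` is hypothetical; the three file names are the ones named above.

```python
from pathlib import Path

# Files that the steps above place in the extracted "stem" folder.
REQUIRED_FILES = ("stem.jar", "gene_ontology.obo", "gene_association.sgd.gz")


def missing_stem_files(stem_folder):
    """Return the names of required files that are not in stem_folder."""
    folder = Path(stem_folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means all three files are present; for example, `missing_stem_files("stem")` could be run from the Desktop after extracting the archive.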
- Running STEM
- In section 1 (Expression Data Info) of the main STEM interface window, click on the Browse... button to navigate to and select your file.
- Click on the radio button No normalization/add 0.
- Check the box next to Spot IDs included in the data file.
- In section 2 (Gene Info) of the main STEM interface window, leave the default selection for the three drop-down menu selections for Gene Annotation Source, Cross Reference Source, and Gene Location Source as "User provided".
- Click the "Browse..." button to the right of the "Gene Annotation File" item. Browse to your "stem" folder and select the file "gene_association.sgd.gz" and click Open.
- In section 3 (Options) of the main STEM interface window, make sure that the Clustering Method says "STEM Clustering Method" and do not change the defaults for Maximum Number of Model Profiles or Maximum Unit Change in Model Profiles between Time Points.
- In section 4 (Execute) click on the yellow Execute button to run STEM.
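Since the "Spot IDs included in the data file" box is checked, the input file is expected to have the layout prepared earlier: a "SPOT" column, a "Gene Symbol" column, and then one numeric column per time point. The sketch below is an optional sanity check to run before clicking Execute. It is not part of STEM, the function name `check_stem_input` is hypothetical, and flagging empty cells as problems is a deliberately strict assumption for manual review rather than a statement about what STEM accepts.

```python
def check_stem_input(text):
    """Return a list of problems found in tab-delimited STEM input text."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]

    problems = []
    header = lines[0].split("\t")
    if header[:2] != ["SPOT", "Gene Symbol"]:
        problems.append("first two columns should be SPOT and Gene Symbol")
    if len(header) < 3:
        problems.append("no time point columns found")

    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} fields, found {len(fields)}"
            )
            continue
        # Every column after the first two should hold a number.
        for value in fields[2:]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {number}: non-numeric value {value!r}")
                break
    return problems
```

An empty result suggests the file matches the layout described in the preparation steps; any messages point to the line that needs a second look in Excel.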
- Viewing and Saving STEM Results
- A new window will open called "All STEM Profiles (1)". Each box corresponds to a model expression profile. Colored profiles have a statistically significant number of genes assigned; they are arranged in order from most to least significant p value. Profiles with the same color belong to the same cluster of profiles. The number in each box is simply an ID number for the profile.
- Click on the button that says "Interface Options...". At the bottom of the Interface Options window that appears below where it says "X-axis scale should be:", click on the radio button that says "Based on real time". Then close the Interface Options window.
- Take a screenshot of this window (on a PC, simultaneously press the Alt and PrintScreen buttons to save the view in the active window to the clipboard) and paste it into a PowerPoint presentation to save your figures.
- Click on each of the SIGNIFICANT profiles (the colored ones) to open a window showing a more detailed plot containing all of the genes in that profile.
- Take a screenshot of each of the individual profile windows and save the images in your PowerPoint presentation, giving each slide an appropriate title.
- STEM profile screenshots for each strain:
- A STEM summary slide was created for the comparison of the profiles between the strains: All_Strain_STEM_Profile_Summary.pptx
- At the bottom of each profile window, there are two yellow buttons "Profile Gene Table" and "Profile GO Table". For each of the profiles, click on the "Profile Gene Table" button to see the list of genes belonging to the profile. In the window that appears, click on the "Save Table" button and save the file to your folder in the T: drive. Make your filename descriptive of the contents, e.g. "wt_profile#_genelist.txt", where you replace the number symbol with the actual profile number.
- Zip all the files together and store them as one file; upload to the Dahlquist Lab repository on GitHub. There is one folder per strain; upload the data to the correct strain's folder.
- Gene-list zip file for each strain:
- dCIN5: dCIN5_Genelist_Profiles.zip
- dGLN3: dGLN3_Genelist_Profiles.zip
- dHAP4: dHAP4_Genelist_Profiles.zip
- dHMO1: dHMO1_Genelist_Profiles.zip
- dZAP1: dZAP1_Genelist_Profiles.zip
- Wild-Type: wt_Genelist_Profiles.zip
- For each of the significant profiles, click on the "Profile GO Table" to see the list of Gene Ontology terms belonging to the profile. In the window that appears, click on the "Save Table" button and save the file to your folder. Make your filename descriptive of the contents, e.g. "wt_profile#_GOlist.txt", where you use "wt" to indicate the dataset and where you replace the number symbol with the actual profile number. At this point you have saved all of the primary data from the STEM software and it's time to interpret the results!
- Zip all the files together and store them as one file; upload to the Dahlquist Lab repository on GitHub. There is one folder per strain; upload the data to the correct strain's folder. Make sure that each strain's folder contains the input .txt file and all the results files.
- GO list zip file for each strain:
- dCIN5: dCIN5_GOlist_Profiles.zip
- dGLN3: dGLN3_GOlist_Profiles.zip
- dHAP4: dHAP4_GOlist_Profiles.zip
- dHMO1: dHMO1_GOlist_Profiles.zip
- dZAP1: dZAP1_GOlist_Profiles.zip
- Wild-Type: wt_GOlist_Profiles.zip
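The zipping steps above can be done by hand, or sketched with Python's standard `zipfile` module as below. This is an illustration under stated assumptions, not part of the original protocol. It assumes the saved tables follow the naming examples given above ("wt_profile#_genelist.txt" and "wt_profile#_GOlist.txt") and that archives are named like the ones listed (for example, wt_Genelist_Profiles.zip). The function name `zip_profile_tables` is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_profile_tables(folder, strain, kind):
    """Zip all saved tables of one kind ("genelist" or "GOlist") for a strain.

    Returns the path of the archive that was written.
    """
    folder = Path(folder)
    # Collect files such as wt_profile45_genelist.txt.
    files = sorted(folder.glob(f"{strain}_profile*_{kind}.txt"))
    # "genelist" -> "Genelist", "GOlist" -> "GOlist", matching the listed archives.
    label = kind[0].upper() + kind[1:]
    archive = folder / f"{strain}_{label}_Profiles.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in files:
            # Store only the file name, not the full folder path.
            bundle.write(path, arcname=path.name)
    return archive
```

Calling it once with "genelist" and once with "GOlist" gives the two archives per strain that are then uploaded to that strain's folder.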
- Analyzing and Interpreting STEM Results
- Each person in the class will select one profile for further analysis. Answer the following:
- Why did you select this profile? In other words, why was it interesting to you?
- How many genes belong to this profile?
- How many genes were expected to belong to this profile?
- What is the p value for the enrichment of genes in this profile for each of the strains? Bear in mind that we just finished computing p values to determine whether each individual gene had a significant change in gene expression at each time point. The p value reported by STEM determines whether the number of genes that show this particular expression profile across the time points is significantly more than expected.
- Open the GO list file you saved for this profile in Excel. This list shows all of the Gene Ontology terms that are associated with genes that fit this profile. Select the third row and then choose from the menu Data > Filter > Autofilter. Filter on the "p-value" column to show only GO terms that have a p value of < 0.05. How many GO terms are associated with this profile at p < 0.05? The GO list also has a column called "Corrected p-value". This correction is needed because the software has performed thousands of significance tests. Filter on the "Corrected p-value" column to show only GO terms that have a corrected p value of < 0.05. How many GO terms are associated with this profile with a corrected p value < 0.05?
- Select 10 Gene Ontology terms from your filtered list (either p < 0.05 or corrected p < 0.05) that you will present and analyze in your final report.
- Create a table for your final report with just those 10 terms. Your table should include the following data from the GO list file:
- Category ID
- Category Name
- #Genes Category
- #Genes Assigned
- #Genes Expected
- #Genes Enriched
- p-value
- Corrected p-value
- Fold
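The two Autofilter counts asked for above can be cross-checked with a short script. This Python sketch is an optional illustration, not part of the original protocol. It assumes the saved GO list is tab-delimited with its header on the third line (as the "select the third row" instruction suggests) and that the columns are named "p-value" and "Corrected p-value" as listed above. Treating a value written with a leading "<" (such as "<0.001") as that number is a further assumption. The function names are hypothetical.

```python
import csv


def read_go_table(text, header_line=3):
    """Parse GO list text into a list of dicts, skipping lines above the header."""
    lines = text.splitlines()[header_line - 1:]
    return list(csv.DictReader(lines, delimiter="\t"))


def parse_p(value):
    """Convert a p value string to a float, tolerating a leading '<'."""
    return float(value.strip().lstrip("<"))


def filter_terms(rows, column, cutoff=0.05):
    """Keep the GO terms whose value in the given column is below the cutoff."""
    return [row for row in rows if parse_p(row[column]) < cutoff]
```

For example, `len(filter_terms(rows, "p-value"))` and `len(filter_terms(rows, "Corrected p-value"))` give the two counts; the second is normally smaller because the correction accounts for the many tests performed.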
- Look up the definitions for each of the terms at http://geneontology.org. For your final lab report, you will supply the definition and discuss the biological interpretation of these GO terms. In other words, why does the cell react to cold shock by changing the expression of genes associated with these GO terms?
- To easily look up the definitions, go to http://geneontology.org.
- Copy and paste the GO ID (e.g. GO:0044848) into the search field on the left of the page.
- In the results page, click on the button that says "Link to detailed information about <term>", where <term> is the name of the GO term (in this case, "biological phase").
- The definition will be on the next results page, e.g. here.
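To make the profile-level question above ("is the number of genes assigned significantly more than expected?") more concrete, here is a small worked illustration in Python. It computes a binomial upper-tail probability from the total number of genes, the number assigned to a profile, and the number expected. This is only a sketch of the general idea under that binomial assumption; it is not presented as the exact calculation STEM performs, and the p values to report are the ones shown by STEM itself. The function name and the example numbers are hypothetical.

```python
from math import exp, lgamma, log


def binomial_upper_tail(n_genes, n_assigned, n_expected):
    """P(X >= n_assigned) for X ~ Binomial(n_genes, n_expected / n_genes)."""
    if not 0 < n_expected < n_genes:
        raise ValueError("n_expected must be between 0 and n_genes")
    p = n_expected / n_genes
    total = 0.0
    for k in range(n_assigned, n_genes + 1):
        # Work in log space so large gene counts do not overflow.
        log_pmf = (
            lgamma(n_genes + 1) - lgamma(k + 1) - lgamma(n_genes - k + 1)
            + k * log(p) + (n_genes - k) * log(1 - p)
        )
        total += exp(log_pmf)
    return min(1.0, total)
```

With made-up numbers, `binomial_upper_tail(1000, 60, 20.0)` asks how surprising it would be to see 60 genes in a profile when about 20 were expected out of 1000; the result is far below 0.05, which is the kind of situation in which a profile is colored as significant.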