Katherine Grace Johnson Electronic Lab Notebook
This is my lab notebook
Spring 2015
February 6, 2015
Repeat microchip data normalization for Ontario and GCAT from protocol Dahlquist:Microarray Data Processing in R. Data processed 1/30/15, but repeated today in order to record protocol to this notebook. Both normalized Excel data sheets will be compared to each other and to Natalie's to determine if there is a difference in normalization from computer to computer.
R x64 3.1.0 version used
Within Array Normalization for the Ontario Chips and Within Array Normalization for the GCAT Chips (includes between chip normalization)
- Change Directory - Must scroll down to "User" to locate kjohn102, then select folder "Microarray Data"
- to unzip files - right click, 7Zip, Extract here - this will place the unzipped file in the folder you are currently in
- R asks you to call the data file (.script), then an Excel target file (.csv) in which to put the normalized data. These must both be in the same folder (Microarray data), and downloaded before R is run
- Excel files are not generated until both normalizations are run
- Two Excel files generated: GCAT_and_Ontario_Within_Array_Normalization.csv and GCAT_and_Ontario_Final_Normalized_Data.csv. File desired is Normalized Data. Rename with suffix _date_GJ
- created Excel file, Comparison_Finalized_Normalized_Data_GJNW_20150206.csv to compare three sets of Normalized data: GJ1, GJ2, and NW
- GJ1 vs NW results - avg 10^-11 difference
- GJ1 vs GJ2 results - 0 difference
- Another normalization was run, named GJ3. This was compared to GJ2 in the Excel comparison document. Computer restarted, another normalization created - GJ4
- GJ2 vs GJ 3 results - 0 difference
- GJ3 vs GJ 4 results - 0 difference
Conclusions: Data normalization did not change from trial to trial on paradoxus computer, no matter the time of normalization. Normalization produced a slight difference between boulardii and paradoxus computers.
April 14, 2015
Completing Week 11 and Week 12 assignments from [BIOL398]. I will complete statistical testing of wild type data, and generate a network from this data.
Notes for improvement:
- use COUNTIF function instead of filtering the numbers when looking at p-values
- To prepare for analysis in STEM, columns containing #VALUE! had to be removed by using custom filter: does not equal #VALUE!. Remaining number values had to be copied and pasted into a new sheet.
- On macs, cluster files from STEM are not recognized by Excel. Textedit files must be converted to csv by the following procedure:
- Select a tab character and press Command F, Paste into top bar
- Click replace, then type a comma into the replace bar. Click replace all.
- Save with file extension .csv (type manually if it is not a drop down option)
YEASTRACT analysis of profile cluster #45
- 19 significant transcription factors
Sfp1 Fkh2 Yhp1 Yox1 Cyc8 YLR278C Ace2 Rif1 Msn2 Stb5 Asg1 Msn4 Mig2 Swi5 Snf6 Pdr1 Gcr2 Gat3 Mcm1
- Our transcription factors from deletion strains (CIN5, GLN3, HMO1, ZAP1) are not included on this list.
- Use "Only DNA binding evidence" selection choice when generating networks in YEASTRACT
- Network should have 40-60 edges
April 29, 2015
Completing Week 13 and Week 14 assignments from [BIOL398]. I will use profile #45 from the YEASTRACT database as the basis for the network to be run through GRNmap. Including the four deletion strains, this network has 23 nodes and 46 edges.
- Protocol for Week 13 and 14 assignments was followed to produce:
- Outputs keeping b parameter fixed (i.e. fix_b is set to 1 on the optimization_parameters sheet of input workbook)
- Outputs allowing b to be estimated (i.e. fix_b is set to 0 on the optimization_parameters sheet of input workbook)
- Outputs for both runs include:
- Estimation Excel sheet containing estimated production rates, estimated b values (if applicable), and optimized weights for each transcription factor in the network
- Output graphs for each transcription factor in the network
- Both networks were visualized using GRNsight:
- The output Excel sheet can be used in GRNsight with one minor edit: change name of "out_network_optimized_weights" to "network_optimized_weights"
SURP
May 18, 2015
Microarray Data Analysis
- Correct formatting:
- Show file extensions
- When downloading files, change Firefox settings to default save to Desktop; Firefox->Options->Change to Desktop->select always ask where to save
- Store files with your name, then yyyymmdd
- Microarray data
- Chips used are of two types: GCAT and Ontario
- GCAT has two gene blocks on each chip. WT data.
- Ontario has technical replicates, duplicate spots directly next to eachother. Ontario contains more genes than GCAT. WT and deletion data.
- Chips used are of two types: GCAT and Ontario
- Zipped files use compression algorithms
- To unzip files with Microsoft:
- right click on desired file
- click arrow on 7-zip line
- To unzip files with Microsoft:
Steps 4-5: Within- and Between-chip Normalization
- A more detailed protocol can be found on this page. An abbreviated protocol is summarized below.
Installing R 3.1.0 and the limma package
The following protocol was developed to normalize GCAT and Ontario DNA microarray chip data from the Dahlquist lab using the R Statistical Software and the limma package (part of the Bioconductor Project).
- The normalization procedure has been verified to work with version 3.1.0 of R released in April 2014 (link to download site) and and version 3.20.1 of the limma package ( direct link to download zipped file) on the Windows 7 platform.
- Note that using other versions of R or the limma package might give different results.
- Note also that using the 32-bit versus the 64-bit versions of R 3.1.0 will give different results for the normalization out in the 10-13 or 10-14 decimal place. The Dahlquist Lab is standardizing on using the 64-bit version of R.
- To install R for the first time, download and run the installer from the link above, accepting the default installation.
- To use the limma package, unzip the file and place the contents into a folder called "limma" in the library directory of the R program. If you accept the default location, that will be C:\Program Files\R\R-3.1.0\library (this will be different on the computers in S120 since you do not have administrator rights).
Running the Normalization Scripts
- Create a folder on your Desktop to store your files for the microarray analysis procedure.
- Download the zipped file that contains the
.gpr
files and save it to this folder (or move it if it saved in a different folder).- Unzip this file using 7-zip. Right-click on the file and select the menu item, "7-zip > Extract Here".
- Download the GCAT_Targets.csv file and Ontario_Targets_wt-dCIN5-dGLN3-dHAP4-dHMO1-dSWI4-dZAP1-Spar_20150514.csv files and save them to this folder (or move them if they saved to a different folder).
- Download the Ontario_Chip_Within-Array_Normalization_modified_20150514.R script and save (or move) it to this folder.
- Download the Within-Array_Normalization_GCAT_and_Merged_Ontario-GCAT_Between-Chip_Normalization_modified_20150514.R script and save (or move) it to this folder.
Within Array Normalization for the Ontario Chips
- Launch R x64 3.1.0 (make sure you are using the 64-bit version).
- Change the directory to the folder containing the targets file and the GPR files for the Ontario chips by selecting the menu item File > Change dir... and clicking on the appropriate directory. You will need to click on the + sign to drill down to the right directory. Once you have selected it, click OK.
- In R, select the menu item File > Source R code..., and select the Ontario_Chip_Within-Array_Normalization_modified_20150514.R script.
- You will be prompted by an Open dialog for the Ontario targets file. Select the file Ontario_Targets_wt-dCIN5-dGLN3-dHAP4-dHMO1-dSWI4-dZAP1-Spar_20150514.csv and click Open.
- Wait while R processes your files.
Within Array Normalization for the GCAT Chips and Between Array Normalization for All Chips
- These instructions assume that you have just completed the Within Array Normalization for the Ontario Chips in the section above.
- In R, select the menu item File > Source R code..., and select the Within-Array_Normalization_GCAT_and_Merged_Ontario-GCAT_Between-Chip_Normalization_modified_20150514.R script.
- You will be prompted by an Open dialog for the GCAT targets file. Select the file GCAT_Targets.csv and click Open.
- Wait while R processes your files.
- When the processing has finished, you will find two files called GCAT_and_Ontario_Within_Array_Normalization.csv and GCAT_and_Ontario_Final_Normalized_Data.csv in the same folder.
- Save these files to LionShare and/or to a flash drive.
Visualizing the Normalized Data
Create MA Plots and Box Plots for the GCAT Chips
Input the following code, line by line, into the main R window. Press the enter key after each block of code.
GCAT.GeneList<-RGG$genes$ID
lg<-log2((RGG$R-RGG$Rb)/(RGG$G-RGG$Gb))
- If you get a message saying "NaNs produced" this is OK, proceed to the next step.
r0<-length(lg[1,]) rx<-tapply(lg[,1],as.factor(GCAT.GeneList),mean) r1<-length(rx) MM<-matrix(nrow=r1,ncol=r0)
for(i in 1:r0) {MM[,i]<-tapply(lg[,i],as.factor(GCAT.GeneList),mean)}
MC<-matrix(nrow=r1,ncol=r0)
for(i in 1:r0) {MC[,i]<-dw[i]*MM[,i]}
MCD<-as.data.frame(MC) colnames(MCD)<-chips rownames(MCD)<-gcatID
la<-(1/2*log2((RGG$R-RGG$Rb)*(RGG$G-RGG$Gb)))
- If you get these Warning messages, it's OK:
- 1: In (RGG$R - RGG$Rb) * (RGG$G - RGG$Gb) :
- NAs produced by integer overflow
- 2: NaNs produced
r2<-length(la[1,]) ri<-tapply(la[,1],as.factor(GCAT.GeneList),mean) r3<-length(ri) AG<-matrix(nrow=r3,ncol=r2)
for(i in 1:r2) {AG[,i]<-tapply(la[,i],as.factor(GCAT.GeneList),mean)}
par(mfrow=c(3,3))
for(i in 1:r2) {plot(AG[,i],MC[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))} browser()
- Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, press Enter.
- To make sure that you save the clearest image, do not scroll in the window because a grey bar will appear if you do so.
- The next set of code is for the generation of the GCAT boxplots for the wild-type data.
x0<-tapply(MAG$A[,1],as.factor(MAG$genes$ID),mean) y0<-length(MAG$A[1,]) x1<-length(x0) AAG<-matrix(nrow=x1,ncol=y0)
for(i in 1:y0) {AAG[,i]<-tapply(MAG$A[,i],as.factor(MAG$genes$ID),mean)}
par(mfrow=c(3,3))
for(i in 1:y0) {plot(AAG[,i],MG2[,i],main=chips[i],xlab='A',ylab='M',ylim=c(-5,5),xlim=c(0,15))} browser()
- Maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window. To continue with the rest of the code, press Enter.
par(mfrow=c(1,3))
boxplot(MCD,main="Before Normalization",ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
axis(1,at=xy.coords(chips)$x,tick=TRUE,labels=FALSE)
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
boxplot(MG2,main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
axis(1,at=xy.coords(chips)$x,labels=FALSE)
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
boxplot(MAD[,Gtop$MasterList],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n')
axis(1, at=xy.coords(chips)$x,labels=FALSE)
text(xy.coords(chips)$x-1,par('usr')[3]-0.6,labels=chips,srt=45,cex=0.9,xpd=TRUE)
- Maximize the window in which the plots have appeared. You may not want to actually maximize them because you might lose the labels on the x axis, but make them as large as you can. Save the plots as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window.
Create MA Plots and Box Plots for the Ontario Chips
Input the following code, line by line, into the main R window. Press the enter key after each block of code.
Ontario.GeneList<-RGO$genes$Name
lr<-log2((RGO$R-RGO$Rb)/(RGO$G-RGO$Gb))
- Warning message: "NaNs produced" is OK.
z0<-length(lr[1,]) v0<-tapply(lr[,1],as.factor(Ontario.GeneList),mean) z1<-length(v0) MT<-matrix(nrow=z1,ncol=z0)
for(i in 1:z0) {MT[,i]<-tapply(lr[,i],as.factor(Ontario.GeneList),mean)}
MI<-matrix(nrow=z1,ncol=z0)
for(i in 1:z0) {MI[,i]<-ds[i]*MT[,i]}
MID<-as.data.frame(MI) colnames(MID)<-headers rownames(MID)<-ontID
ln<-(1/2*log2((RGO$R-RGO$Rb)*(RGO$G-RGO$Gb)))
- Warning messages are OK:
- 1: In (RGO$R - RGO$Rb) * (RGO$G - RGO$Gb) :
- NAs produced by integer overflow
- 2: NaNs produced
z2<-length(ln[1,]) zi<-tapply(ln[,1],as.factor(Ontario.GeneList),mean) z3<-length(zi) AO<-matrix(nrow=z3,ncol=z2)
for(i in 1:z0) {AO[,i]<-tapply(ln[,i],as.factor(Ontario.GeneList),mean)}
strains<-c('wt','dCIN5','dGLN3','dHAP4','dHMO1','dSWI4','dZAP1','Spar')
- After entering the call browser() below, maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
- The last graph to appear will be the spar graphs.
- The graphs generated from this code are the before Ontario chips
- Be sure to save the 8 graphs before moving on to the next step
for (i in 1:length(strains)) { st<-strains[i] lt<-which(Otargets$Strain %in% st) if (st=='wt') { par(mfrow=c(3,5)) } else { par(mfrow=c(4,5)) } for (i in lt) { plot(AO[,i],MI[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15)) } browser() }
- To continue generating plots, press enter.
j0<-tapply(MAO$A[,1],as.factor(MAO$genes[,5]),mean) k0<-length(MAO$A[1,]) j1<-length(j0) AAO<-matrix(nrow=j1,ncol=k0)
for(i in 1:k0) {AAO[,i]<-tapply(MAO$A[,i],as.factor(MAO$genes[,5]),mean)}
- Remember, that after entering the call readline('Press Enter to continue'), maximize the window in which the graphs have appeared. Save the graphs as a JPEG (File>Save As>JPEG>100% quality...). Once the graphs have been saved, close the window and press Enter for the next set of graphs to appear.
- Again, the last graphs to appear will be the spar graphs.
- These graphs that are produced are for the after Ontario chips
- Again, be sure to save 8 graphs before moving on to the next part of the code.
for (i in 1:length(strains)) { st<-strains[i] lt<-which(Otargets$Strain %in% st) if (st=='wt') { par(mfrow=c(3,5)) } else { par(mfrow=c(4,5)) } for (i in lt) { plot(AAO[,i],MD2[,i],main=headers[i],xlab="A",ylab="M",ylim=c(-5,5),xlim=c(0,15)) } browser() }
- To continue generating plots, press enter.
for (i in 1:length(strains)) { par(mfrow=c(1,3)) st<-strains[i] lt<-which(Otargets$Strain %in% st) if (st=='wt') { xcoord<-xy.coords(lt)$x-1 fsize<-0.9 } else { xcoord<-xy.coords(lt)$x-1.7 fsize<-0.8 } boxplot(MID[,lt],main='Before Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n') axis(1,at=xy.coords(lt)$x,labels=FALSE) text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE) boxplot(MD2[,lt],main='After Within Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n') axis(1,at=xy.coords(lt)$x,labels=FALSE) text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE) ft<-Otargets$MasterList[which(Otargets$Strain %in% st)] boxplot(MAD[,ft],main='After Between Array Normalization',ylab='Log Fold Change',ylim=c(-5,5),xaxt='n') axis(1,at=xy.coords(lt)$x,labels=FALSE) text(xcoord,par('usr')[3]-0.65,labels=headers[lt],srt=45,cex=fsize,xpd=TRUE) browser() }
- To continue generating the box plots, press enter.
- You will have to save 8 plots before you have completed the procedure. The last box plot is for spar.
- Warnings are OK.
- Zip the files of the plots together and upload to LionShare and/or save to a flash drive.
Summary
- Followed protocol of steps 4 and 5 from Dahlquist:Microarray_Data_Analysis_Workflow
- MA plots generated show the log change vs intensity of red and green. Unbiased data should show straight, horizontal lines. Normalization forces conformation to straight lines, therefore after-normalization plots should be more straight than before normalization.
- Data produced for these strains:
- wt
- dCIN5
- dGLN3
- dHAP4
- dHMO1
- dSWI4
- dZAP1
- Spar, or Saccharomyces paradoxus
Step 6: Statistical Analysis
- For the statistical analysis, we will begin with the file "GCAT_and_Ontario_Final_Normalized_Data.csv" that you generated in the previous step.
- Open this file in Excel and Save As an Excel Workbook
*.xlsx
. It is a good idea to add your initials and the date (yyyymmdd) to the filename as well. - Rename the worksheet with the data "Compiled_Normalized_Data".
- Type the header "ID" in cell A1.
- Insert a new column after column A and name it "Standard Name". Column B will contain the common names for the genes on the microarray.
- Copy the entire column of IDs from Column A.
- Paste the names into the "Value" field of the ORF List <-> Gene List tool in YEASTRACT. Then, click on the "Transform" button.
- Select all of the names in the "Gene Name" column of the resulting table.
- Copy and paste these names into column B of the
*.xlsx
file. Save your work.
- Insert a new column on the very left and name it "MasterIndex". We will create a numerical index of genes so that we can always sort them back into the same order.
- Type a "1" in cell A2 and a "2" in cell A3.
- Select both cells. Hover your mouse over the bottom-right corner of the selection until it makes a thin black + sign. Double-click on the + sign to fill the entire column with a series of numbers from 1 to 6189 (the number of genes on the microarray).
- Insert a new worksheet and call it "Rounded_Normalized_Data". We are going to round the normalization results to four decimal places because of slight variations seen in different runs of the normalization script.
- Copy the first three columns of the "Compiled_Normalized_Data" sheet and paste it into the first three columns of the "Rounded_Normalized_Data" sheet.
- Copy the first row of the "Compiled_Normalized_Data" sheet and paste it into the first row of the "Rounded_Normalized_Data" sheet.
- In cell C2, type the equation
=ROUND(Compiled_Normalized_Data!C2,4)
. - Copy and paste this equation in the rest of the cells of row 2.
- Select all of the cells of row 2 and hover your mouse over the bottom right corner of the selection. When the cursor changes to a thin black "plus" sign, double-click on it to paste the equation to all the rows in the worksheet. Save your work.
- Insert a new worksheet and call it "Master_Sheet".
- Go back to the "Rounded_Normalized_Data" sheet and Select All and Copy.
- Click on cell A1 of the "Master_Sheet" worksheet. Select Paste special > Paste values to paste the values, but not the formulas from the previous sheet. Save your work.
- There will be some #VALUE! errors in cells where there was missing data for genes that existed on the Ontario chips, but not the GCAT chips.
- Select the menu item Find/Replace and Find all cells with "#VALUE!" and replace them with a single space character.
- 477 items found and replaced
- Select the menu item Find/Replace and Find all cells with "#VALUE!" and replace them with a single space character.
- This will be the starting point for our statistical analysis below.
Within-strain ANOVA
- Purpose of ANOVA: is there significance at any time point?
- Within-Strain ANOVA testing begun for dHAP4 strain with Monica Hong
- The purpose of the witin-stain ANOVA test is to determine if any genes had a gene expression change that was significantly different than zero at any timepoint.
- Each student in the lab group will be assigned one strain to analyze from this point forward.
- Anu/Natalie: wt
- Grace/Monica: Δhap4
- Kevin M./Nicole: Δgln3
- Kevin W./KD: Δswi4
- Tessa/Trixie:Δcin5
- Note that we chose not to do Δhmo1, Δzap1 or S. paradoxus so that students could work in pairs and check each others' work.
- Create a new worksheet, naming it "dHAP4_ANOVA"
- Copy all of the data from the "Master_Sheet" worksheet for your strain and paste it into your new worksheet.
- At the top of the first column to the right of your data, create five column headers of the form dHAP1_AvgLogFC_(TIME) where (TIME) is 15, 30, etc.
- In the cell below the dHAP4_AvgLogFC_t15 header, type
=AVERAGE(
- Then highlight all the data in row 2 associated with dHAP4 and t15, press the closing paren key (shift 0),and press the "enter" key.
- This cell now contains the average of the log fold change data from the first gene at t=15 minutes.
- Click on this cell and position your cursor at the bottom right corner. You should see your cursor change to a thin black plus sign (not a chubby white one). When it does, double click, and the formula will magically be copied to the entire column of 6188 other genes.
- Repeat steps (4) through (8) with the t30, t60, t90, and the t120 data.
- Now in the first empty column to the right of the dHAP4_AvgLogFC_t120 calculation, create the column header dHAP4_ss_HO.
- In the first cell below this header, type
=SUMSQ(
- Highlight all the LogFC data in row 2 for your dHAP4 (but not the AvgLogFC), press the closing paren key (shift 0),and press the "enter" key.
- In the next empty column to the right of dHAP4_ss_HO, create the column headers dHAP4_ss_(TIME) as in (3).
- Note: dHAP4 strain has 18 chips (4 replicates for t15, t30, t60 and 3 replicates for t90 and t120)
- In the first cell below the header dHAP4_ss_t15, type
=SUMSQ(<range of cells for logFC_t15>)-<number of data points>*<AvgLogFC_t15>^2
and hit enter.- The phrase <range of cells for logFC_t15> should be replaced by the data range associated with t15.
- The phrase <number of data points> should be replaced by the number of data points for that timepoint (either 3,or 4).
- The phrase <AvgLogFC_t15> should be replaced by the cell number in which you computed the AvgLogFC for t15, and the "^2" squares that value.
- Upon completion of this single computation, use the Step (7) trick to copy the formula throughout the column.
- Repeat this computation for the t30 through t120 data points. Again, be sure to get the data for each time point, type the right number of data points, and get the average from the appropriate cell for each time point, and copy the formula to the whole column for each computation.
- In the first column to the right of dHAP4_ss_t120, create the column header dHAP4_SS_full.
- In the first row below this header, type
=sum(<range of cells containing "ss" for each timepoint>)
and hit enter. - In the next two columns to the right, create the headers dHAP4_Fstat and dHAP4_p-value.
- Recall the total number of data points - 18.
- In the first cell of the dHAP4_Fstat column, type
=((18-5)/5)*(<dHAP4_ss_HO>-<dHAP4_SS_full>)/<dHAP4_SS_full>
and hit enter.- Copy to the whole column.
- In the first cell below the dHAP4_p-value header, type
=FDIST(<dHAP4_Fstat>,5,18-5)
replacing the phrase <dHAP4_Fstat> . Copy to the whole column. - Before we move on to the next step, we will perform a quick sanity check to see if we did all of these computations correctly.
- Click on cell A1 and click on the Data tab. Select the Filter icon (looks like a funnel). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
- Click on the drop-down arrow on your dHAP4_p-value column. Select "Number Filters". In the window that appears, set a criterion that will filter your data so that the p value has to be less than 0.05.
- Excel will now only display the rows that correspond to data meeting that filtering criterion. A number will appear in the lower left hand corner of the window giving you the number of rows that meet that criterion. We will check our results with each other to make sure that the computations were performed correctly.
Calculate the Bonferroni and p value Correction
- Now we will perform adjustments to the p value to correct for the multiple testing problem. Label the next two columns to the right with the same label, dHAP4_Bonferroni_p-value.
- Type the equation
=<dHAP4_p-value>*6189
, Upon completion of this single computation, use the Step (10) trick to copy the formula throughout the column. - Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second dHAP4_Bonferroni_p-value header:
=IF(r2>1,1,r2)
. Use the Step (10) trick to copy the formula throughout the column.
Calculate the Benjamini & Hochberg p value Correction
- Insert a new worksheet named "dHAP4_B&H".
- Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from your previous worksheet into the first two columns of the new worksheet.
- For the following, use Paste special > Paste values. Copy your unadjusted p values from your ANOVA worksheet and paste it into Column D.
- Select all of columns A, B, C, and D. Sort by ascending values on Column D. Click the sort button from A to Z on the toolbar, in the window that appears, sort by column C, smallest to largest.
- Type the header "Rank" in cell E1. We will create a series of numbers in ascending order from 1 to 6189 in this column. This is the p value rank, smallest to largest. Type "1" into cell E2 and "2" into cell E3. Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
- Now you can calculate the Benjamini and Hochberg p value correction. Type dHAP4_B-H_p-value in cell F1. Type the following formula in cell F2:
=(D2*6189)/E2
and press enter. Copy that equation to the entire column. - Type "dHAP4_B-H_p-value" into cell G1.
- Type the following formula into cell G2:
=IF(F2>1,1,F2)
and press enter. Copy that equation to the entire column. - Select columns A through G. Now sort them by your MasterIndex in Column A in ascending order.
- Copy column G and use Paste special > Paste values to paste it into the next column on the right of your ANOVA sheet.
Sanity Check: Number of genes significantly changed
Before we move on to further analysis of the data, we want to perform a more extensive sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs.
- Go to your dHAP4_ANOVA worksheet.
- Select row 1 (the row with your column headers) and select the menu item Data > Filter > Autofilter (The funnel icon on the Data tab). Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
- Click on the drop-down arrow for the unadjusted p value. Set a criterion that will filter your data so that the p value has to be less than 0.05.
- How many genes have p < 0.05? and what is the percentage (out of 6189)?
- 2387 (38.6%)
- How many genes have p < 0.01? and what is the percentage (out of 6189)?
- 1489 (24.1%)
- How many genes have p < 0.001? and what is the percentage (out of 6189)?
- 679 (11.0%)
- How many genes have p < 0.0001? and what is the percentage (out of 6189)?
- 240 (3.88%)
- How many genes have p < 0.05? and what is the percentage (out of 6189)?
- The 0.05 p-value must be narrowed down. To apply a more stringent filter and account for the multiple testing problem, we will apply the Bonferroni (multiply p-value by 6189) and B&H (same as Bonferroni, but stringency decreases as p-value rank decreases)corrections.
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
- 61 (0.986%)
- How many genes are p < 0.05 for the Benjamini and Hochberg-corrected p value? and what is the percentage (out of 6189)?
- 1615 (26.1%)
- How many genes are p < 0.05 for the Bonferroni-corrected p value? and what is the percentage (out of 6189)?
- Results of dHAP4 deletion compared to wt Media:DHAP1 p-value slide 20150518 GJ.pptx
- In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a moveable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off.
- Comparing results with known data: the expression of the gene NSR1 (ID: YGR159C)is known to be induced by cold shock. Find NSR1 in your dataset. What is its unadjusted, Bonferroni-corrected, and B-H-corrected p values? What is its average Log fold change at each of the timepoints in the experiment?
- unadjusted p-value:0.01505
- Bonferroni-corrected: 93.11
- B&H corrected: 0.05539
- Log fold changes:
- t15: 2.6095
- t30: 3.1899
- t60: 3.4241
- t90: -1.1533
- t120: -1.7281
May 19, 2015
Step 6: Statistical Analysis
Modified t-test for each timepoint
- purpose of t-test: is there a significant change from zero at each time point?
- analysis performed in same Excel workbook as above
- Insert a new worksheet into your Excel workbook and name it "dHAP4_ttest"
- Go back to the "Master_Sheet" worksheet for your strain. Copy the first three columns containing the "MasterIndex", "ID", and "Standard Name" from the "Master_Sheet" worksheet for your strain and paste it into your new worksheet. Copy the columns containing the data for dHAP4 and paste it into your new worksheet.
- Go to the empty columns to the right on your worksheet (columns V through Z). Create new column headings in the top cells to label the average log fold changes that you will compute. Name them with the pattern <dHAP4>_<AvgLogFC>_<tx> where x is the time. For example, "dHAP4_AvgLogFC_t15".
- Compute the average log fold change for the replicates for each timepoint by typing the equation:
=AVERAGE(range of cells in the row for that timepoint)
into the second cell below the column heading. For example, the equation for t15 reads
=AVERAGE(D2:G2)
Copy this equation and paste it into the rest of the column.
- Create the equation for the rest of the timepoints and paste it into their respective columns.
- Go to the empty columns to the right on your worksheet (AA through AE). Create new column headings in the top cells to label the T statistic that you will compute. Name them with the pattern <dHAP4>_<Tstat>_<tx> where x is the time. For example, "dHAP4_Tstat_t15". You will now compute a T statistic that tells you whether the normalized average log fold change is significantly different than 0 (no change in expression). Enter the equation into cell AA2 for the t15 time points:
=AVERAGE(D2:G2)/(STDEV(D2:G2)/SQRT(4))
(NOTE: in this case the number of replicates is 4. For dHAP4 strain, t90 and t120 only have 3 replicates. Be careful that you are using the correct number of parentheses.) Copy the equation and paste it into all rows in that column. Create the equation for the rest of the timepoints and paste it into their respective columns. Note that you can save yourself some time by completing the first equation for all of the T statistics and then copy and paste all the columns at once.
- Go to the empty columns to the right on your worksheet (AF through AJ). Create new column headings in the top cells to label the P value that you will compute. Name them with the pattern <dHAP4>_<Pval>_<tx> where x is the time. For example, "dHAP4_Pval_t15". In the cell AF2 for t15, enter the equation:
=TDIST(ABS(AA2),3,2)
Where 3 is degrees of freedom (4 replicates -1), and 2 is the number of tails on the t-test. Note that t90 and t120 will have 2 degrees of freedom. Copy the equation and paste it into all rows in that column.
- As with the ANOVA, we encounter the multiple testing problem here as well.
Bonferroni Correction
- Now we will perform adjustments to the p value to correct for the multiple testing problem. Label the columns to the right (AK through AT) with the label, dHAP4_Bonferroni-Pval_tx (do this twice in a row).
- Into cell AK2 type the equation
=<AF2>*6189
, Upon completion of this single computation, use the trick to copy the formula throughout the column. - Replace any corrected p value that is greater than 1 by the number 1 by typing the following formula into the first cell below the second dHAP2_Bonferroni-Pval_t15 header (cell AL2):
=IF(AK2>1,1,AK2)
. Use the trick to copy the formula throughout the column. - Repeat this for all five time points.
Benjamini & Hochberg Correction
- Insert a new worksheet named "dHAP_ttest_BH". You will need to perform the procedure below for the p values for each timepoint. Do them individually one at a time to avoid confusion.
- Copy and paste the "MasterIndex", "ID", and "Standard Name" columns from your previous worksheet into the first two columns of the new worksheet.
- For the following, use Paste special > Paste values. Copy your unadjusted p values from the first timepoint from your ttest worksheet and paste it into Column D.
- Select all of columns A, B, C, and D. Sort by ascending values on Column D. Click the sort button from A to Z on the toolbar, in the window that appears, sort by column D, smallest to largest.
- Type the header "Rank" in cell E1. We will create a series of numbers in ascending order from 1 to 6189 in this column. This is the p value rank, smallest to largest. Type "1" into cell E2 and "2" into cell E3. Select both cells E2 and E3. Double-click on the plus sign on the lower right-hand corner of your selection to fill the column with a series of numbers from 1 to 6189.
- Now you can calculate the Benjamini and Hochberg p value correction. Type dHAP4_BH_Pval_t15 in cell F1. Type the following formula in cell F2:
=(D2*6189)/E2
and press enter. Copy that equation to the entire column. - Type "dHAP4_B-H_Pval_t15" into cell G1 for the greater than 1 logic test.
- Type the following formula into cell G2:
=IF(F2>1,1,F2)
and press enter. Copy that equation to the entire column. - Select columns A through G. Now sort them by your MasterIndex in Column A in ascending order.
- Copy column G and use Paste special > Paste values to paste it into the next column on the right of your ttest sheet.
- Repeat for the other four time points.
Sanity Check
- We will also perform the "sanity check" as follows:
- Determine how many genes have a p value < 0.05 at each timepoint.
- t15: 1197
- t30: 1772
- t60: 2006
- t90: 234
- t120: 515
- Keeping the "Pval" filter at p < 0.05, How many have an average log fold change of > 0.25 and p < 0.05 at each timepoint?'
- t15: 690
- t30: 947
- t60: 1028
- t90: 141
- t120: 289
- How many have an average log fold change of < -0.25 and p < 0.05 at each timepoint? (These log fold change cut-offs represent about a 20% fold change in expression.)
- t15: 501
- t30: 814
- t60: 967
- t90: 83
- t120: 216
- How many genes have B&H corrected p < 0.05?
- t15: 2
- t30: 108
- t60: 241
- t90: 0
- t120: 0
- How many genes have a Bonferroni corrected p < 0.05?
- t15: 2
- t30: 0
- t60: 1
- t90: 0
- t120: 0
- Determine how many genes have a p value < 0.05 at each timepoint.
Summarized results of modified t-test (and withing strain ANOVA from May 18) can be found on this document: Media:DHAP1 p-value slide 20150518 GJ.pptx
Between-Strain ANOVA
The detailed description of how this is done can be found on this page. A brief version of the protocol appears below.
- All two strain comparisons were performed in MATLAB using the script Two_strain_compare_corrected_20140813_3pm.zip (within a zip file):
- Download the zipped script file, extract it to the folder that contains your Excel file with the worksheet named "Master_Sheet". (The script and Excel file must be in the same folder to work.)
- Launch MATLAB version 2014b.
- In MATLAB, you will need to navigate to the folder containing the script and the Excel file.
- Near the top of the page, you will see a a field that contains the path to the working directory. Just to the left of it, there is an icon that looks like a folder opening with a green down arrow. Click on this icon to open a dialog box where you can choose your folder containing the script and Excel file.
- Once you have selected your folder, the left-hand pane should display the contents of that folder. To open the MATLAB script, you can double-click on it from that pane. The code for the script will appear in the center pane.
- You will need to make a few edits to the code, depending on which strain comparison you want to make.
- For the first block of code, the user must input the name of the Excel file to be imported as the variable "filename", the sheet from which the data will be imported as the variable "sheetname", and the two strains that will be compared as the variables "strain1" and "strain2", in this case "wt" and "dHAP4"
- MATLAB will read either .xlsx or .xls
- Also note that this script will not work for any comparison involving dSWI4 because it has been hard-coded to expect 5 timepoints instead of 4.
- For the first block of code, the user must input the name of the Excel file to be imported as the variable "filename", the sheet from which the data will be imported as the variable "sheetname", and the two strains that will be compared as the variables "strain1" and "strain2", in this case "wt" and "dHAP4"
%% User must input filename, sheetname, and strains for comparison filename = 'dHAP4_2StrainANOVA_20150519_GJ.xlsx'; % Name of input file sheetname = 'Master_Sheet'; % Name of sheet in input file containing data to analyze % % If one of the two strains you are working on is the wildtype, keep that % % wildtype as strain 1. strain1 = 'wt'; %Here should be wt, dCIN5, dGLN3, dHAP4, dHMO1, dZAP1, or Spar % % Select strain 2 to be one of the other strains you would like to % % compare with the first strain. strain2 = 'dHAP4';
- The user does not have to modify any of the code from here on.
- The next two lines of code ask the user whether or not they would like to see plots for each gene with an unadjusted p-value < 0.05. If the user does want to see these plots, they enter "1". If they would not like to see these plots, the user enters "0". When prompted, enter a "1" to see the plots displayed.
disp('Do you want to view plots for each gene with an unadjusted p-value < 0.05?') graph = input('If yes, enter "1". If no, enter "0". ');
- Output files saved:
- wt_dHAP4_ANOVA_out_data.xls
- wt_dHAP4_out_data.mat
- Output p-values
- Number of genes with uncorrected p<0.05 for wt vs dHAP4(from column S of wt_dHAP4_ANOVA_out_data.xls): 640
- Number of genes with B&H p<0.05 for wt vs dHAP4 (from column V of wt_dHAP4_ANOVA_out_data.xls): 23
Step 7-8: Clustering and GO Term Enrichment with stem
- Prepare your microarray data file for loading into STEM.
- Insert a new worksheet into your Excel workbook, and name it "dHAP4_stem".
- Select all of the data from your "dHAP4_ANOVA" worksheet and Paste special > paste values into your "dHAP4_stem" worksheet.
- Your leftmost column should have the column header "MasterIndex". Rename this column to "SPOT". Column B should be named "ID". Rename this column to "Gene Symbol". Delete the column named "StandardName".
- Filter the data on the B-H corrected p value to be > 0.05 (that's greater than in this case).
- Once the data has been filtered, select all of the rows (except for your header row) and delete the rows by right-clicking and choosing "Delete Row" from the context menu. Undo the filter. This ensures that we will cluster only the genes with a "significant" change in expression and not the noise.
- Delete all of the data columns EXCEPT for the Average Log Fold change columns for each timepoint (for example, wt_AvgLogFC_t15, etc.).
- Rename the data columns with just the time and units (for example, 15m, 30m, etc.).
- Save your work. Then use Save As to save this spreadsheet as Text (Tab-delimited) (*.txt). Click OK to the warnings and close your file.
- Note that you should turn on the file extensions if you have not already done so.
- Now download and extract the STEM software. Click here to go to the STEM web site.
- Click on the download link, register, and download the
stem.zip
file to your Desktop. - Unzip the file. Right click on the file icon and select the menu item 7-zip > Extract Here.
- This will create a folder called
stem
. Inside the folder, double-click on thestem.jar
to launch the STEM program.
- Click on the download link, register, and download the
- Running STEM
- In section 1 (Expression Data Info) of the the main STEM interface window, click on the Browse... button to navigate to and select your file.
- Click on the radio button No normalization/add 0.
- Check the box next to Spot IDs included in the data file.
- In section 2 (Gene Info) of the main STEM interface window, select Saccharomyces cerevisiae (SGD), from the drop-down menu for Gene Annotation Source. Select No cross references, from the Cross Reference Source drop-down menu. Select No Gene Locations from the Gene Location Source drop-down menu.
- In section 3 (Options) of the main STEM interface window, make sure that the Clustering Method says "STEM Clustering Method" and do not change the defaults for Maximum Number of Model Profiles or Maximum Unit Change in Model Profiles between Time Points.
- In section 4 (Execute) click on the yellow Execute button to run STEM.
- In section 1 (Expression Data Info) of the the main STEM interface window, click on the Browse... button to navigate to and select your file.
- Viewing and Saving STEM Results
- A new window will open called "All STEM Profiles (1)". Each box corresponds to a model expression profile. Colored profiles have a statistically significant number of genes assigned; they are arranged in order from most to least significant p value. Profiles with the same color belong to the same cluster of profiles. The number in each box is simply an ID number for the profile.
- Click on the button that says "Interface Options...". At the bottom of the Interface Options window that appears below where it says "X-axis scale should be:", click on the radio button that says "Based on real time". Then close the Interface Options window.
- Take a screenshot of this window (on a PC, simultaneously press the
Alt
andPrintScreen
buttons to save the view in the active window to the clipboard) and paste it into a PowerPoint presentation to save your figures.
- Click on each of the SIGNIFICANT profiles (the colored ones) to open a window showing a more detailed plot containing all of the genes in that profile.
- Take a screenshot of each of the individual profile windows and save the images in your PowerPoint presentation.
- At the bottom of each profile window, there are two yellow buttons "Profile Gene Table" and "Profile GO Table". For each of the profiles, click on the "Profile Gene Table" button to see the list of genes belonging to the profile. In the window that appears, click on the "Save Table" button and save the file to your desktop. Make your filename descriptive of the contents, e.g. "wt_profile#_genelist.txt", where you replace the number symbol with the actual profile number.
- Zipped files uploaded LionShare with e-mail a link to Dr. Dahlquist.
- For each of the significant profiles, click on the "Profile GO Table" to see the list of Gene Ontology terms belonging to the profile. In the window that appears, click on the "Save Table" button and save the file to your desktop. Make your filename descriptive of the contents, e.g. "dHAP4_profile#_GOlist.txt", where you replace the number symbol with the actual profile number. At this point you have saved all of the primary data from the STEM software and it's time to interpret the results!
- Zipped files uploaded to LionShare with e-mail a link to Dr. Dahlquist.
- A new window will open called "All STEM Profiles (1)". Each box corresponds to a model expression profile. Colored profiles have a statistically significant number of genes assigned; they are arranged in order from most to least significant p value. Profiles with the same color belong to the same cluster of profiles. The number in each box is simply an ID number for the profile.
May 20, 2015
Step 9: GenMAPP & MAPPFinder
Preparing the Input File for GenMAPP
- Make a new workbook (the large Master workbook is probably getting to large) and name it dHAP4_GenMAPP.
- Go back to the "ANOVA" worksheet for your strain and Select All and Copy.
- Go to your new sheet and click on cell A1 and select Paste Special, click on the Values radio button, and click OK.
- Delete the columns containing the "ss" calculations, just retaining the individual log fold change data, the average log fold change data, and the p values. For the Bonferroni and B&H p values, just keep the one column where we replaced all values > 1 with 1.
- Now go to your "_ttest" worksheet. Copy just the columns containing the P values for the individual timepoints and Paste special > Paste values into your GenMAPP worksheet to the right of the previous data. For the Bonferroni and B&H p values, just keep one column where we replaced all values > 1 with 1.
- It will be useful if we arrange the columns in a slightly different order: all individual log fold change data, then the ANOVA p values, then the AvgLogFC and p values for the individual timepoints clustered together (e.g., all t15 data together).
- Go to the Excel file that was produced by MATLAB with the between-strain ANOVA results.
- Copy and paste Column S (p value) and Column V into the next columns to the right of your GenMAPP worksheet. Rename the columns "wt_v_dHAP4_Pval" and "wt_v_dHAP4_B-H-Pval".
- Select all of the columns containing Fold Changes. Select the menu item Format > Cells. Under the number tab, select 2 decimal places. Click OK.
- Select all of the columns containing T statistics or P values. Select the menu item Format > Cells. Under the number tab, select 4 decimal places. Click OK.
- We will now format this file for use with GenMAPP.
- Currently, the "MasterIndex" column is the first column in the worksheet. We need the "ID" column to be the first column. Select Column B and Cut. Right-click on Cell A1 and select "Insert cut cells". This will reverse the position of the columns.
- Insert a new empty column in Column B. Type "SystemCode" in the first cell and "D" in the second cell of this column. Use our trick to fill this entire column with "D".
- Reading left to right, the columns should be ID, System Code, Master Index, individual LogFC data, within strain ANOVA p-values (uncorrected, Bonferroni, B-H), then t tests for individual time points (e.g. AvgLogFC t15, pval t15, Bonferroni t15, B-H t15, then the same set for t30.... etc.). Finally, the two columns containing data from the MATLAB output of the two strain ANOVA
- Make sure to save this work as your .xlsx file. Now save this worksheet as a tab-delimited text file for use with GenMAPP in the next section.
Running GenMAPP
Each time you launch GenMAPP, you need to make sure that the correct Gene Database (.gdb) is loaded.
- Look in the lower left-hand corner of the window to see which Gene Database has been selected.
- If you need to change the Gene Database, select Data > Choose Gene Database. Navigate to the directory C:\GenMAPP 2 Data\Gene Databases and choose the correct one for your species.
- For the exercise today, if the yeast Gene Database is not present on your computer, you will need to download it. Click this link to download the yeast Gene Database.
- Unzip the file and save it, Sc-Std_20060526.gdb, to the folder C:\GenMAPP 2 Data\Gene Databases.
GenMAPP Expression Dataset Manager Procedure
- Launch the GenMAPP Program. Check to make sure the correct Gene Database is loaded.
- Select the Data menu from the main Drafting Board window and choose Expression Dataset Manager from the drop-down list. The Expression Dataset Manager window will open.
- Select New Dataset from the Expression Datasets menu. Select the tab-delimited text file that you formatted for GenMAPP (.txt) in the procedure above from the file dialog box that appears.
- The Data Type Specification window will appear. GenMAPP is expecting that you are providing numerical data. If any of your columns has text (character) data, check the box next to the field (column) name.
- The column StandardName has text data in it, but none of the rest do.
- Allow the Expression Dataset Manager to convert your data.
- This may take a few minutes depending on the size of the dataset and the computer’s memory and processor speed. When the process is complete, the converted dataset will be active in the Expression Dataset Manager window and the file will be saved in the same folder the raw data file was in, named the same except with a .gex extension; for example, MyExperiment.gex.
- A message may appear saying that the Expression Dataset Manager could not convert one or more lines of data. Lines that generate an error during the conversion of a raw data file are not added to the Expression Dataset. Instead, an exception file is created. The exception file is given the same name as your raw data file with .EX before the extension (e.g., MyExperiment.EX.txt). The exception file will contain all of your raw data, with the addition of a column named ~Error~. This column contains either error messages or, if the program finds no errors, a single space character.
- Note: 97 errors found - database from 2006, these errors are just unrecognized gene ID's
- Customize the new Expression Dataset by creating new Color Sets which contain the instructions to GenMAPP for displaying data on MAPPs.
- Color Sets contain the instructions to GenMAPP for displaying data from an Expression Dataset on MAPPs. Create a Color Set by filling in the following different fields in the Color Set area of the Expression Dataset Manager: a name for the Color Set, the gene value, and the criteria that determine how a gene object is colored on the MAPP. Enter a name in the Color Set Name field that is 20 characters or fewer. You will have one Color Set per strain per time point (name them t15, t30...)
- The Gene Value is the data displayed next to the gene box on a MAPP. Select the column of data to be used as the Gene Value from the drop down list. We will use "Avg_LogFC_" for the the appropriate time point.
- Activate the Criteria Builder by clicking the New button.
- Enter a name for the criterion in the Label in Legend field (Increased or Decreased).
- Choose a color for the criterion by left-clicking on the Color box (Pink for Increased and cyan for Decreased).
- State the criterion for color-coding a gene in the Criterion field.
- A criterion is stated with relationships such as "this column greater than this value" or "that column less than or equal to that value". Individual relationships can be combined using as many ANDs and ORs as needed. A typical relationship is
[ColumnName] RelationalOperator Value
with the column name always enclosed in brackets and character values enclosed in single quotes. For example:
[Fold Change] >= 2 [p value] < 0.05 [Quality] = 'high'
This is the equivalent to queries that you performed on the command line when working with the PostgreSQL movie database. GenMAPP is using a graphical user interface (GUI) to help the user format the queries correctly. The easiest and safest way to create criteria is by choosing items from the Columns and Ops (operators) lists shown in the Criteria Builder. The Columns list contains all of the column headings from your Expression Dataset. To choose a column from the list, click on the column heading. It will appear at the location of the cursor in the Criterion box. The Criteria Builder surrounds the column names with brackets.
The Ops (operators) list contains the relational operators that may be used in the criteria: equals ( = ) greater than ( > ), less than ( < ), greater than or equal to ( >= ), less than or equal to ( <= ), is not equal to ( <> ). To choose an operator from the list, click on the symbol. It will appear at the location of the insertion bar (cursor) in the Criterion box. The Criteria Builder automatically surrounds the operators with spaces. The Ops list also contains the conjunctions AND and OR, which may be used to make compound criteria. For example:
[Fold Change] > 1.2 AND [p value] <= 0.05
Parentheses control the order of evaluation. Anything in parentheses is evaluated first. Parentheses may be nested. For example:
[Control Average] = 100 AND ([Exp1 Average] > 100 OR [Exp2 Average] > 100)
Column names may be used anywhere a value can, for example:
[Control Average] < [Experiment Average]
- After completing a new criterion, add the criterion entry (label, criterion, and color) to the Criteria List by clicking the Add button.
- For the yeast dataset, you will create two criterion for each Color Set. "Increased" will be [dHAP4_Avg_LogFC_<timepoint>] > 0.25 AND [dHAP4_Pval_<timepoint>] < 0.05 and "Decreased will be [dHAP4_Avg_LogFC_<timepoint>] < -0.25 AND [dHAP4_Pval_<timepoint>] < 0.05. Make sure that the increased and decreased average log fold change values match the timepoint of the Color Set.
- You may continue to add criteria to the Color Set by using the previous steps.
- The buttons to the right of the list represent actions that can be performed on individual criteria. To modify a criterion label, color, or the criterion itself, first select the criterion in the list by left-clicking on it, and then click the Edit button. This puts the selected criterion into the Criteria Builder to be modified. Click the Save button to save changes to the modified criterion; click the Add button to add it to the list as a separate criterion. To remove a criterion from the list, left-click on the criterion to select it, and then click on the Delete button. The order of Criteria in the list has significance to GenMAPP. When applying an Expression Dataset and Color Set to a MAPP, GenMAPP examines the expression data for a particular gene object and applies the color for the first criterion in the list that is true. Therefore, it is imperative that when criteria overlap the user put the most important or least inclusive criteria in the list first. To change the order of the criteria in the list, left-click on the criterion to select it and then click the Move Up or Move Down buttons. No criteria met and Not found are always the last two positions in the list.
- You will also create a ColorSets to view the within-strain ANOVA p values for your strain, with criteria for viewing the unadjusted (yellow), Bonferroni-corrected (red), and B&H (orange) corrected p values.
- Finally, you will create a ColorSet to view the between-strain ANOVA p values for your wt v. dHAP4 comparison (orange for B-H corrected pval and yellow for unadjusted pval).
- Save the entire Expression Dataset by selecting Save from the Expression Dataset menu. Changes made to a Color Set are not saved until you do this.
- Exit the Expression Dataset Manager to view the Color Sets on a MAPP. Choose Exit from the Expression Dataset menu or click the close box in the upper right hand corner of the window.
- Upload your .gex file to Lionshare and share it with Dr. Dahlquist. E-mail the link to the file to Dr. Dahlquist.
- Click here to download a zipped set of MAPPs with which to view your Expression Dataset.
Data Exploration
The purpose of this section is to explore which transcription factors have interesting or significant expression patterns. We will attempt to decide (1) which strain (of dINO2, dOPI1, or dYAP1) to perform microarray hybridization and (2) which strains (of dARG80, dRSF2, dRTG3, dTBF1, dYHP1, dYOX1, dPHD1, dNRG1) to test for growth impairment in cold.
Recall that the expression and ANOVA significance are specific for the dHAP4 strain.
- Potential strains for microarray tests (cold shock RNA for these three strains are currently in our possession)
- dINO2
- Within-Strain ANOVA: B-H significance
- Between-Strain ANOVA: no significance
- Increased expression at t30 and t60
- Conclusion: most interesting pattern according to dHAP4 data
- dOPI1
- Within-Strain ANOVA: unadjusted p-value significance
- Between-Strain ANOVA: no significance
- Increased expression at t60
- Conclusion:some change, not as interesting as dINO2
- dYAP1
- Within-Strain ANOVA: unadjusted p-value significance
- Between-Strain ANOVA: no significance
- No significant changes in expression
- Conclusion: least interesting
- dINO2
- Potential strains for growth impairment testing (i.e. early exploration phase):
- ARG80
- Within-Strain ANOVA: no significance
- Between-Strain ANOVA: no significance
- No significant change in expression
- Conclusion: not interesting, do not explore first
- dRSF2
- Within-Strain ANOVA: B-H significance
- Between-Strain ANOVA: B-H significance
- Decreased expression at t15 and t30
- Conclusion: very interesting, potentially explore
- dRTG3
- Within-Strain ANOVA: no significance
- Between-Strain ANOVA: unadjusted significance
- No significant change in expression
- Conclusion: not interesting, do not explore first
- dTBF1
- Within-Strain ANOVA: no significance
- Between-Strain ANOVA: no significance
- No significant change in expression
- Conclusion: not interesting, do not explore first
- dYHP1
- Within-Strain ANOVA: B-H significance
- Between-Strain ANOVA: no significance
- Decreased at t15, increased at t60 and t120
- Conclusion: interesting, potentially explore
- dYOX1
- Within-Strain ANOVA: B-H significance
- Between-Strain ANOVA: no significance
- Decreased at t15
- Conclusion: sort of interesting
- dPHD1
- Within-Strain ANOVA: unadjusted significance
- Between-Strain ANOVA: unadjusted significance
- Increased at t30 and t60
- Conclusion: sort of interesting
- dNRG1
- Within-Strain ANOVA: B-H significance
- Between-Strain ANOVA: unadjusted significance
- Decreased at t15 and t30
- Conclusion: interesting, potentially explore
- ARG80
"Voting" using p-value significance for the dHAP4 strain was added to the SURP 2015 schedule page. The purpose of this system is to compile data from all the strains and make an informed decision on which deletion strains to test in the future.
Step 10: YEASTRACT
Using YEASTRACT to Infer which Transcription Factors Regulate a Cluster of Genes
In the previous analysis using STEM, we found a number of gene expression profiles (aka clusters) which grouped genes based on similarity of gene expression changes over time. The implication is that these genes share the same expression pattern because they are regulated by the same (or the same set) of transcription factors. We will explore this using the YEASTRACT database.
- Open the gene list in Excel for the one of the significant profiles from your stem analysis. Choose a cluster with a clear cold shock/recovery up/down or down/up pattern. You should also choose one of the largest clusters.
- Copy the list of gene IDs onto your clipboard.
- Launch a web browser and go to the YEASTRACT database.
- On the left panel of the window, click on the link to Rank by TF.
- Paste your list of genes from your cluster into the box labeled ORFs/Genes.
- In addition, paste lists of genes with p-values less than 0.05 (and 0.01) from the within-strain ANOVA test (in the giant spreadsheet) and the between-strain ANOVA (from the MATLAB output sheet)
- Check the box for Check for all TFs.
- Accept the defaults for the Regulations Filter (Documented, DNA binding plus expression evidence)
- Do not apply a filter for "Filter Documented Regulations by environmental condition".
- Rank genes by TF using: The % of genes in the list and in YEASTRACT regulated by each TF.
- Click the Search button.
- Answer the following questions:
- In the results window that appears, the p values colored green are considered "significant", the ones colored yellow are considered "borderline significant" and the ones colored pink are considered "not significant". How many transcription factors are green or "significant"?
- Profile 45: 25
- Profile 9: 20
- Profile 48: 7
- Profile 22: 57
- Profile 40: 0
- Profile 16: 10
- between-strain ANOVA p<0.05: 63
- between-strain ANOVA p<0.01: 26
- within-strain ANOVA p<0.05: 103
- within-strain ANOVA p<0.01: 70
- In the results window that appears, the p values colored green are considered "significant", the ones colored yellow are considered "borderline significant" and the ones colored pink are considered "not significant". How many transcription factors are green or "significant"?
- List the "significant" transcription factors on your wiki page, along with the corresponding "% in user set", "% in YEASTRACT", and "p value".
- These lists can be found on this document :
- Are CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 on the list?
- Media:DHAP4 YEASTRACT profiles.xlsx
- For the mathematical model that we will build, we need to define a gene regulatory network of transcription factors that regulate other transcription factors. We can use YEASTRACT to assist us with creating the network. We want to generate a network with approximately 15-30 transcription factors in it.
- You need to select from this list of "significant" transcription factors, which ones you will use to run the model. You will use these transcription factors and add CIN5, GLN3, HAP4, HMO1, SWI4, and ZAP1 if they are not in your list. Explain in your electronic notebook how you decided on which transcription factors to include. Record the list and your justification in your electronic lab notebook.
- Go back to the YEASTRACT database and follow the link to Generate Regulation Matrix.
- Copy and paste the list of transcription factors you identified (plus CIN5, HAP4, GLN3, HMO1, SWI4, and ZAP1) into both the "Transcription factors" field and the "Target ORF/Genes" field.
- We are going to generate several regulation matrices, with different "Regulations Filter" options.
- For the first one, accept the defaults: "Documented", "DNA binding plus expression evidence"
- Click the "Generate" button.
- In the results window that appears, click on the link to the "Regulation matrix (Semicolon Separated Values (CSV) file)" that appears and save it to your Desktop. Rename this file with a meaningful name so that you can distinguish it from the other files you will generate.
- Repeat these steps to generate a second regulation matrix, this time applying the Regulations Filter "Documented", "Only DNA binding evidence".
- Repeat these steps a third time to generate a third regulation matrix, this time applying the Regulations Filter "Documented", DNA binding and expression evidence".
- Note: we will really only use "DNA binding evidence only" for a medium-connected network
- Profile chosen to continue to step 11: Profile 16 for DNA plus Binding evidence: 15 genes with 46 edges
- Genes in data matrix must be alphabetized. Sort column alphabetically, transpose into a new sheet, sort the column there, then transpose back into the original sheet and delete the "Sheet 1" in which you transposed
Visualizing Your Gene Regulatory Networks with GRNsight
We will analyze the regulatory matrix files you generated above in Microsoft Excel and visualize them using GRNsight to determine which one will be appropriate to pursue further in the modeling.
- First we need to properly format the output files from YEASTRACT. You will repeat these steps for each of the three files you generated above.
- Open the file in Excel. It will not open properly in Excel because a semicolon was used as the column delimiter instead of a comma. To fix this, Select the entire Column A. Then go to the "Data" tab and select "Text to columns". In the Wizard that appears, select "Delimited" and click "Next". In the next window, select "Semicolon", and click "Next". In the next window, leave the data format at "General", and click "Finish". This should now look like a table with the names of the transcription factors across the top and down the first column and all of the zeros and ones distributed throughout the rows and columns. This is called an "adjacency matrix." If there is a "1" in the cell, that means there is a connection between the trancription factor in that row with that column.
- Save this file in Microsoft Excel workbook format (.xlsx).
- Check to see that all of the transcription factors in the matrix are connected to at least one of the other transcription factors by making sure that there is at least one "1" in a row or column for that transcription factor. If a factor is not connected to any other factor, delete its row and column from the matrix. Make sure that you still have somewhere between 15 and 30 transcription factors in your network after this pruning.
- Only delete the transcription factor if there are all zeros in its column AND all zeros in its row. You may find visualizing the matrix in GRNsight (below) can help you find these easily.
- For this adjacency matrix to be usable in GRNmap (the modeling software) and GRNsight (the visualization software), we need to transpose the matrix. Insert a new worksheet into your Excel file and name it "network". Go back to the previous sheet and select the entire matrix and copy it. Go to you new worksheet and click on the A1 cell in the upper left. Select "Paste special" from the "Home" tab. In the window that appears, check the box for "Transpose". This will paste your data with the columns transposed to rows and vice versa. This is necessary because we want the transcription factors that are the "regulatORS" across the top and the "regulatEES" along the side.
- The labels for the genes in the columns and rows need to match. Thus, delete the "p" from each of the gene names in the columns. Adjust the case of the labels to make them all upper case.
- In cell A1, copy and paste the text "rows genes affected/cols genes controlling".
- Now we will visualize what these gene regulatory networks look like with the GRNsight software.
- Go to the GRNsight home page (you can either use the version on the home page or the beta version.
- Select the menu item File > Open and select one of the regulation matrix .xlsx file that has the "network" worksheet in it that you formatted above. If the file has been formatted properly, GRNsight should automatically create a graph of your network. Move the nodes (genes) around until you get a layout that you like and take a screenshot of the results. Paste it into your PowerPoint presentation. Repeat with the other two regulation matrix files. You will want to arrange the genes in the same order for each screenshot so that the graphs can be easily compared.
May 26, 2015
Step 11: GRNmap
Create the Input Excel Workbook for the Model
- Your file will be similar to the file "21-genes_50-edges_Dahlquist-data_Sigmoid_estimation.xls", but with your expression data and network. You should download this file, change the name, and edit it to include your data. Make sure to give it a meaningful filename that includes your last name or initials. Click this link to download the sample file from the GRNmap GitHub repository.)
- The first thing you need to do is determine the transcription factors that you are including in your network. You are going to use the "transposed" Regulation Matrix that you generated from YEASTRACT in the previous section.
- Copy the transposed matrix from your "network" sheet and paste it into the worksheets called "network" and "network_weights".
- Note that the transcription factor names have to be in the same order and same format across the top row and first column. CIN5 does not match Cin5p, so the latter will need to be changed to CIN5 if you have not already done so.
- It may be easier for you if you put the transcription factors in alphabetical order (using the sort feature in Excel), but whether you leave your list the same as it is from the YEASTRACT assignment or in alphabetical order, make sure it is the same order for all of the worksheets.
- The next worksheet to edit is the one called "degradation_rates".
- Paste your list of transcription factors from your "network" sheet into the column named "StandardName". You will need to look up the "SystematicName" of your genes. YEASTRACT has a feature that will allow you to paste your list of standard names in to retrieve the systematic names here.
- Next, you will need to look up the degradation rates for your list of transcription factors. These rates have been calculated from protein half-life data from a paper by Belle et al. (2006). Look up the rates for your transcription factors from this file and include them in your "degradation_rates" worksheet.
- If a transcription factor does not appear in the file above, use the value "0.027182242" for the degradation rate.
- The next worksheet to edit is the one called "production_rates".
- Paste the "SystematicName" and "StandardName" columns from your "degradation_rates" sheet into the "production_rates" sheet.
- The initial guesses for the production rates we are using for the model are two times the degradation rate. Compute these values from your degradation rates and paste the values into the column titled "ProductionRate".
- Next you will input the expression data for the wild type strain and one other strain (dHAP4). You need to include only the data for the genes in your network (15 genes), in the same order as they appear in the other worksheets.
- Put the wild type data in the sheet called "wt".
- The sample spreadsheet has a worksheet named "dcin5". Change this name to match the strain you are using (dHAP4). The instructions below should be followed for each strain sheet.
- Paste the SystematicName and StandardName columns from one of your previous sheets into this one.
- This data in this sheet is the Log Fold Changes for each replicate and each timepoint from the "Rounded_Normalized_Data" worksheet from the big Excel workbook in which you computed the statistics. We are only going to use the cold shock timepoints for the modeling. Thus your column headings for the data should be "15", "30", and "60". There will be multiple columns for each timepoint (typically 4) to represent the replicate data, but they will all have the same name. For example, you may have four columns with the header "15".
- Copy and paste the data from your spreadsheet into this one. You need to include only the data for the genes in your network. Make sure that the genes are in the same order as in the other sheets.
- Delete unnecessary sheets (dhmo1, dzap1, etc)
- The "optimization_parameters" worksheet should have the following values:
- alpha should be 0.01
- kk_max should be 1
- MaxIter should be 1e08 (one hundred million in plain English)
- TolFun should be 1e-6
- MaxFunEval should be 1e08 (one hundred million in plain English)
- TolX should be 1e-6
- Sigmoid should be 1
- estimateParams should be 1
- makeGraphs should be 1
- fix_P should be 0
- fix_b should be 1
- For the parameter "time" (Cell A13), we should have "15", "30", and "60", since these are the timepoints we have in our data.
- For the parameter "Strain" (Cell A14), replace "dcin5" with the name of the second strain you are using, making sure that the capitalizaiton and spelling is the same as what you named the worksheet containing that strain's expression data. We are only going to compare two strains, so you can delete the other strain information.
- For the parameter "Sheet" (Cell A15), give the number of the worksheet from left to right that your "Strain" log2 expression data is in. Wt is 3 and dhap4 is 4. Delete any extra numbers because we are only comparing two strains.
- For the parameter "Deletion", leave the zero in cell B15 (corresponding to wt). In cell C15, put a number corresponding to the position in the list of gene names that the gene that was deleted appears. (hap4 is number 5)
- For the parameter, "simtime", you perform the forward simulation of the expression in five minute increments from 0 to 60 minutes. Thus, this row should read: simtime should be 0, 5, <...fill by steps of 5...>, 60, each number in a different cell.
- The last sheet you will need to modify is called "network_b".
- Paste in the list of standard names for your transcription factors from one of your previous sheets. Note that this sheet does not have a column for the Systematic Name.
- Cell A1 in the sample files has the text "rows genes affected/cols genes controlling". I believe you can either have this text in cell A1 or "StandardName".
- The "threshold" value for each gene should be "0".
- When you have completed the modifications to your file, upload it to LionShare and send Dr. Dahlquist an e-mail with a link to the file.
Appendix: Full explanation of the "optimization_parameters" sheet
alpha
: Penalty term weighting (from an L-curve analysis)kk_max
: Number of times to re-run the optimization loop: in some cases re-starting the optimization loop can improve performance of the estimation.MaxIter
: Number of times MATLAB iterates through the optimization scheme. If this is set too low, MATLAB will stop before the parameters are optimized.TolFun
: How different two least squares evaluations should be before it says it's not making any improvementMaxFunEval
: maximum number of times it will evaluate the least squares costTolX
: How close successive least squares cost evaluations should be before MATLAB determines that it is not making any improvement.Sigmoid
:=1
if sigmoidal model,=0
if Michaelis-Menten modelestimateParams
:=1
if want to estimate parameters and=0
if the user wants to do just one forward runmakeGraphs
:=1
to output graphs;=0
to not output graphsfix_P
:=1
if the user does not want to estimate the production rate, P, parameter, use initial guess and never change;=0
to estimatefix_b
:=1
if the user does not want to estimate the b parameter, use initial guess and never change;=0
to estimatetime
: A row containing a list of the time points when the data was collected experimentally. Should correspond to the timepoint column headers in the expression sheets.Strain
: A row containing a list of all of the strains for which there is expression data in the workbook. Should correspond to the names of the sheets for each strain.Sheet
: A row where each entry is the order number of the sheet (left to right) that corresponds to the list of strains above.Deletion
: Gives the index of the gene in the network sheet that has been deleted in each strain listed above. For example, if data has been provided for the CIN5 deletion strain, then give the index number from the network sheet corresponding to CIN5.simtime
: A list of times for which the forward simulation should be evaluated.
Running GRNmap
You will now finally run the GRNmap model on the input workbook you created above. You will run the optimization twice; once where the threshold parameters, b, are not estimated and once where the threshold parameters 'are estimated. You will compare the estimated weight and production rate parameters outputted by these two runs with each other.
- Download the current version of GRNmap from GitHub. Version 1.0.6 can be downloaded by following this link.
- For the sake of organization, save it into a new folder called "GRNmap" either on your Desktop or within your "Microarray Analysis" folder.
- Unzip the file by right-clicking on it and choosing 7-zip > Extract here.
- Open the "GRNmap-1.0.6" folder and open the "matlab" subfolder. Double-click on the file "GRNmodel.m" to open GRNmap in MATLAB 2014b.
- Click on the green triangle "Run" button to run the model.
- You will be prompted by an Open dialog to find your input file that you created in the previous section. Browse and select this input file and click OK.
- Note that the Open dialog will default to show files of
*.xlsx
only. If your file is saved as*.xls
, you will need to select the drop-down menu to show all files. - A window called "Figure 1" will appear. The counter is showing the number of iterations of the least squares optimization algorithm. The top plot is showing the values of all the parameters being estimated. You should see some movement of the diamonds each time the counter iterates.
- Once the model has completed its run, plots showing the expression over time for all of the genes in the network will appear. Since we selected "makeGraphs = 1" these will automatically be saved as
*.jpg
files in the same folder as your input file. Compile the figures into a single PowerPoint file. Please label things clearly, placing an appropriate number of graphs on each page for a readable visual. Take some care to make sure that the graphs are the same size and the aspect ratio has not been changed.- Note: GRNmap run, but jpeg files not saved automatically. Checked file, save graphs setting was "1" as desired. Run repeated.
- Same problem encountered on the same computer
- Dr. Dahlquist ran same input file on her computer, and images were saved correctly.
- Files obtained on another computer.
- Problem: GRNmap files cannot be saved in a folder named with space characters! Images were able to be saved on another computer because the folder was named without space characters
- Note: GRNmap run, but jpeg files not saved automatically. Checked file, save graphs setting was "1" as desired. Run repeated.
- Create a new workbook for analyzing the weight data. In this workbook, create a new sheet: call it estimated_weights. In this new worksheet, create a column of labels of the form ControllerGeneA -> TargetGeneB, replacing these generic names with the standard gene names for each regulatory pair in your network. Remember that columns represent Controllers and rows represent Targets in your network and network_weights sheets.
- Extract the non-zero optimized weights from their worksheet and put them in a single column next to the corresponding ControllerGeneA -> TargetGeneB label.
- There should be 46 items (corresponding to 46 edges)
- Now we will run the model a second time, this time estimating the threshold parameters, b. Save the input workbook that you previously created as a new file with a meaningful name (e.g. append "estimate-b" to the previous filename), and change fix_b to 0 in the "optimization_parameters" worksheet, so that the thresholds will be estimated. Rerun GRNmodel with the new input sheet.
- Repeat Parts (4) through (6) with the new output.
- Create an empty excel workbook, and copy both sets of weights into a worksheet.
- Create a bar chart in order to compare the "fixed b" and "estimated b" weights.
- Create bar charts to compare the production rates from each run.
- Copy the two bar charts into your powerpoint.
- Visualize the output of each of your model runs with GRNsight.
- In order for this to work, you need to alter your output workbook slightly. You need to change the name of the sheet called "out_network_optimized_weights" to "network_optimized_weights"; i.e., delete the "out_" from that sheet name.
- Arrange the genes in the same order you used to display them previously when you visualized the networks from YEASTRACT for both of your model output runs. Take a screenshot of each of the results and paste it into your PowerPoint presentation. Clearly label which screenshot belongs to which run.
- Note that GRNsight will display differently now that you have estimated the weights. For positive weights > 0, the edge will be given a regular (pointy) arrowhead to indicate an activation relationship between the two nodes. For negative weights < 0, the edge will be given a blunt arrowhead (a line segment perpendicular to the edge direction) to indicate a repression relationship between the two nodes. The thickness of the edge will vary based on the magnitude of the absolute value of the weight. Larger magnitudes will have thicker edges and smaller magnitudes will have thinner edges. The way that GRNsight determines the edge thickness is as follows. GRNsight divides all weight values by the absolute value of the maximum weight in the matrix to normalize all the values to between zero and 1. GRNsight then adjusts the thickness of the lines to vary continuously from the minimum thickness (for normalized weights near zero) to maximum thickness (normalized weights of 1). The color of the edge also imparts information about the regulatory relationship. Edges with positive normalized weight values from 0.05 to 1 are colored magenta; edges with negative normalized weight values from -0.05 to -1 are colored cyan. Edges with normalized weight values between -0.05 and 0.05 are colored grey to emphasize that their normalized magnitude is near zero and that they have a weak influence on the target gene.
- Upload your PowerPoint, your two input workbooks, and your two output workbooks and link to them in your individual journal. Also upload the workbook where you made the bar charts comparing the weights from both runs.
- Powerpoint containing images
- Fixed b input sheet
- Fixed b output sheet
- Estimated b input sheet
- Estimated b output sheet
- Workbook comparing weight outputs
- Interpret the results of the model simulation.
- Examine the graphs that were output by each of the runs. Which genes in the model have the closest fit between the model data and actual data? Which genes have the worst fit between the model and actual data? Why do you think that is? (Hint: how many inputs do these genes have?) How does this help you to interpret the microarray data?
- Best fits: CIN5, CSE2
- Poor fits: CBF1, HSF1, HMO1
- Reasons for fit accuracy seem inconclusive. It might be an issue of precision of microarray data points, but then again that could just be viewer bias. In terms of number of inputs, the "good" and "poor" fits seem to have a relatively even distribution of number inputs, so this is inconclusive as well.
- Which genes showed the largest dynamics over the timecourse? In other words, which genes had a log fold change that is different than zero at one or more timepoints. The p values from the Week 11 ANOVA analysis are informative here. Does this seem to have an effect on the goodness of fit (see question above)?
- CIN5, significant BH p-value, good fit
- HMO1, not a significant BH p-value, not a great fit (data is fairly spread out)
- HSF1, significant BH p-value, fairly good fit
- YHP1, significant BH p-value, decent fit, but very dynamic (i.e. more data points needed to tell)
- YOX1, significant BH p-value, fairly good fit
- ZAP1, not a significant BH p-value, fairly good fit (data is fairly spread out)
- Which genes showed differences in dynamics between the wild type and the dHAP4 deletion strain? Does the model adequately capture these differences? Given the connections in your network (see the visualization in GRNsight), does this make sense? Why or why not?
- CIN5, over-expressed in deletion compared to wt. CIN5 is indirectly connected to HAP4, so it is difficult to see the reason (though possibly through decreased repression from YOX1)
- HAP4, zero expression as expected
- HSF1, over-expressed in deletion compared to wt at the later time points. Indirectly connected to HAP4.
- MGA2, under-expressed in deletion compared to wt. MGA2 is directly activated by HAP4, so it makes sense that deleting HAP4 would cause decreased expression in MGA2.
- NDT80, under-expressed in deletion compared to wt. NDT80 is directly activated by HAP4, so it makes sense that deleting HAP4 would cause decreased expression in NDT80.
- SWI4, slightly over-expressed in deletion compared to wt. SWI4 is directly repressed by HAP4, so deleting HAP4 would caused increased expression in SWI4.
- YHP1, slightly over-expressed in deletion compared to wt. YHP1 is directly activated by HAP4, but there are other influences that could have caused the increased expression. The model fit might also be slightly skewed.
- YOX1, under-expressed in deletion compared to wt. YOX1 is directly activated by HAP4, so it makes sense that deleting HAP4 would caused decreased expression of YOX1.
- ZAP1, under-expressed in deletion compared to wt at the later time points. ZAP1 is indirectly connected to HAP4, so relations are inconclusive.
- Examine the bar charts comparing the weights and production rates between the two runs. Were there any major differences between the two runs? Why do you think that was? Given the connections in your network (see the visualization in GRNsight), does this make sense? Why or why not?
- The one noticeable trend is that the fixed b tends to give larger magnitude weights on the repression side. This could be because the strongest weights in the network are repressive.
- Finally, based on the results of your entire project, which transcription factors are most likely to regulate the cold shock response and why?
- We are on the right track with the deletion strains (HMO1, CIN5, HAP4, SWI4). In addition, HSF1, YOX1, and YHP1 seem to be good candidates.
- Examine the graphs that were output by each of the runs. Which genes in the model have the closest fit between the model data and actual data? Which genes have the worst fit between the model and actual data? Why do you think that is? (Hint: how many inputs do these genes have?) How does this help you to interpret the microarray data?
- Based on these results, what future directions do you want to take?
- Investigate HSF1, YOX1, and YHP1 further to evaluate their role in the cold shock response
May 27, 2015
Beginning GRNmap code testing
- The purpose of today's experiments is to test the functionality of the code when run with several permutations of strain inputs (i.e. all strains at once, each strain individually, wt vs. deletion strain, three strains, etc.). The lab report can be found here: GRNmap Testing Report: Strain Run Comparisons 2015-05-27
May 28, 2015
Continuing GRNmap code testing
- Today, the experiments from yesterday (detailed in GRNmap Testing Report: Strain Run Comparisons 2015-05-27) will be completed.
In addition, we attended a workshop on bibliographies and databases, including information on Web of Science, Google Scholar, and LMU Lib Guides. I searched Web of Science for articles that cited the Vu and Vohradsky article (the original math model paper from 2007), and found an interesting paper that Tessa, Trixie and I will cover in journal club in a couple weeks: Yip et al. 2010. It seems to use the same or similar math models as our research.
June 1, 2015
Continuing GRNmap code testing
- Begun testing runs where initial weight guesses are 0, 3 or randomly distributed between -1 and 1 instead of 1: GRNmap Testing Report: Non-1 Initial Weight Guesses 2015-05-28
- Computed "ideal" LSE for wt alone and all strains
- In the GRNmap code, the LSE is computed as such: sum(data-model)^2 / total number of data points
- the sum is over flasks (replicates), times, and genes
- LSE would be minimized when the output value from the model is the average of the microarray data points (from each flask) at a certain time point. This value can be manually computed in Excel.
- Copy and paste logFC data for each strain you desire (wt, dCIN5, dGLN3, dHMO1, and/or dZAP1) from the input MATLAB sheet into an new workbook. Carry over the first Standard Name column.
- Below the t15 columns, create a column header "Var(t15)" and compute the variance for each gene. The variance gives us sum(data-avg)^2 / n-1 , for n replicates
- We want the sum of squares. Create a new column next to "Var(t15)" and call it "sum sq". Multiply the neighboring variance by n-1 to obtain the sum sq value for each gene.
- Below the sum sq column, sum the values over all 21 genes. This represents the sum over the time point for each gene.
- Repeat these steps for each time point on all the strains you desire.
- To obtain the overall LSE, add up the sums at each time point for each strain, and divide by the total number of data points.
- Ideal LSE for wt alone: 0.4875
- Ideal LSE for all strains: 0.5520
- Sheet containing LSE calculations: Media:Ideal LSE.xlsx
- In the GRNmap code, the LSE is computed as such: sum(data-model)^2 / total number of data points
June 2, 2015
Continuing GRNmap code testing
- Continued testing runs where initial weight guesses are -1, -3, 10 or randomly distributed between -1 and 1 or between -3 and 3. Notes can be found on this page: GRNmap Testing Report: Non-1 Initial Weight Guesses 2015-05-28
June 3, 2015
Continuing GRNmap code testing
- Formatting/updating the notes pages GRNmap Testing Report: Non-1 Initial Weight Guesses 2015-05-28 and GRNmap Testing Report: Strain Run Comparisons 2015-05-27
- Making a powerpoint presentation for general meeting this afternoon
June 8, 2015
Continuing GRNmap code testing
- Attempting to interpret results from GRNmap Testing Report: Non-1 Initial Weight Guesses 2015-05-28 and GRNmap Testing Report: Strain Run Comparisons 2015-05-27
June 9, 2015
- Analysis of GRNmap Testing Report: Strain Run Comparisons 2015-05-27
- Notes can be found here: Media:No-input genes investigation.docx
June 10, 2015
- Continuing analysis of GRNmap Testing Report: Strain Run Comparisons 2015-05-27
- Looking at within-strain ANOVA p-values from each deletion strain for genes that have no inputs or are autoregulatory
- Note: Poor fits (like SKN7, whose model does not fit the t60 data points) are, more likely than not, indicative of an incorrectly wired network.
- Presented at journal club on the article Yip et al. 2010. Powerpoint presentation is here: Media:Journal Club Yip et al. (2010).pptx
June 15, 2015
- Today I will tackle two issues on GitHub:
- Issue #111: Testing the new beta version to see that estimated b's for genes with no inputs are indeed 0
- Model run with wt only. Alpha set to the new value of 0.001, which is the better value after the corrected LSE scaling by Dr. Fitzpatrick.
- Note input file changes for v1.0.10 - 'wt' sheet should be changed to 'wt_log2_expression' and 'network_b' should be changed to 'threshold_b'. In the 'optimized_parameters' sheet, 'iestimate' should be changed to 'estimatedParams' and 'igraphs should be changed to 'makeGraphs'
- Initial weights 1 run:
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model wt alone alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model wt alone alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model wt alone alpha0.001 output.zip
- Zipped images: Media:B estimation no input genes w1 figures.zip
- LSE: 0.507517
- min LSE: 0.487463
- Penalty term: 3.178094
- Counter: 421949 - 2.5 hours
- Initial weights 0 run:
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model wt alone w0 alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model wt alone w0 alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model wt alone w0 alpha0.001 output.zip
- Zipped images: Media:B estimation no input genes w0 figures.zip
- LSE: 0.507516
- min LSE: 0.487463
- Penalty term: 3.178966
- Counter: 383031 - 2.25 hours
- Initial weights 1 run:
- Conclusion: Estimated b's for FHL1, SKO1, and SWI6 (genes with no inputs) are 0. New version of the code has fixed the bug that gave genes with no inputs non-zero estimated b's. Issue can be closed on GitHub.
- Issue #96: Comparing the ideal LSE (calculated with Excel and sum of squares) with the output SSE in the new diagnostic worksheet of the beta code to check if the code is calculating correctly
- Sheet containing manual calculations of ideal LSE (sum of squares) for wt alone run: Media:Wt alone idealLSE.xlsx
- Ideal LSE for wt alone (sum of squares over all time points and genes divided by the total number of data points): 0.487463
- This exact number is calculated by the beta code output (see section above)
- Conclusion: New version of the code correctly calculates the min LSE
- Note: The old version of the code incorrectly calculated the LSE (it did not divide by the total number of data points). Hopefully the SSE and the LSE will at least be on the same order of magnitude now.
- For the weight 1 run above, the LSE and minLSE are quite close. When looking at the ouput graphs, this makes sense as the trendlines seem to have a bit more flexibility and are fitting the data well. However, the run did take around 2.5 hours. This could be connected to the new alpha value (0.001) and subsequent flexibility.
- Sheet containing manual calculations of ideal LSE (sum of squares) for wt alone run: Media:Wt alone idealLSE.xlsx
- Issue #111: Testing the new beta version to see that estimated b's for genes with no inputs are indeed 0
- With a smaller alpha value of 0.001, the model can do more crazy things to try to fit the data. While the LSE is small and the fits (for the most part) are pretty good, other estimated parameters may have some crazy values. Estimated weights, production rates and b's vary widely between this run of wt alone and the run from the previous version of the code (see these bar graphs: Media:Code version wt alone estimated params comparison.xlsx)
- GRNSight Network, v.2015 (old version where alpha is 0.01). Largest value is 1.44
- GRNSight Network, v1.0.10 (new version where alpha is 0.001). Largest value is 5.73
June 16, 2015
- Reviewing Natalie's work and making a presentation for meeting tomorrow: Media:GRNmap_Testing_June17.pptx
- Re-running the single gene estimations using the latest beta release (v1.0.10). Alpha is set to 0.001 (due to new LSE scaling) and initial weight guesses are set to 0.
- dCIN5, alpha 0.001
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model dCIN5 w0 alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model dCIN5 w0 alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model dCIN5 w0 alpha0.001 output.zip
- Zipped images: Media:GJ dCIN5 w0 alpha0.001 figures.zip
- LSE: 0.5583
- min LSE: 0.501094
- Penalty term: 10.07612
- Counter: 80872 - 30 minutes
- dGLN3, alpha 0.001
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model dGLN3 w0 alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model dGLN3 w0 alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model dGLN3 w0 alpha0.001 output.zip
- Zipped images: Media:GJ dGLN3 w0 alpha0.001 figures.zip
- LSE: 0.750947
- min LSE: 0.608963
- Penalty term: 6.564411
- Counter: 103853 - 1 hour
- dHMO1, alpha 0.001
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model dHMO1 w0 alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model dHMO1 w0 alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model dHMO1 w0 alpha0.001 output.zip
- Zipped images: Media:GJ dHMO1 w0 alpha0.001 figures.zip
- LSE: 0.583598
- min LSE: 0.532734
- Penalty term: 4.126982
- Counter: 59660 - 1.5 hours
- dZAP1, alpha 0.001
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model dZAP1 w0 alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model dZAP1 w0 alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model dZAP1 w0 alpha0.001 output.zip
- Zipped images: Media:GJ dZAP1 w0 alpha0.001 figures.zip
- LSE: 0.758833
- min LSE: 0.635062
- Penalty term: 7.084318
- Counter: 54870 - 1 hour
- dCIN5, alpha 0.001
June 17, 2015
- Powerpoint containing side by side output graphs with corresponding B&H p-values for wt and dCIN5 runs: Media:Wt dCIN5 outputs pvals.pptx
June 22, 2015
- Today I am working on analyzing each gene in the network for its performance and validity of its connections.
- Genes are categorized based on connectivity (Number of inputs, self regulation). Each gene is then analyzed based on the following attributes from wt data and runs:
- Fit of model to data (Currently, this is based on visual cues. With runs on updated model, relative SSE's will give us a better clue to how the model fits the data)
- Dynamics of the gene (indicated by the B&H p-value from within strain ANOVA)
- Dynamics of the gene's regulators (indicated by the B&H p-value from within strain ANOVA)
- These will give an idea as to whether the gene is wired correctly.
- Presentation containing summary work so far: Media:SURP 2015 Presentation draft.pptx
- Genes are categorized based on connectivity (Number of inputs, self regulation). Each gene is then analyzed based on the following attributes from wt data and runs:
- Re-run of all-strains estimation using new beta code (v1.0.10)
- all-strains, alpha 0.001
- Input file: Media:GJ Input 21 Gene Network Sigmoid Model all strains w0 alpha0.001.xlsx
- Output file: Media:GJ Input 21 Gene Network Sigmoid Model all strains w0 alpha0.001 output.xlsx
- Output mat file: Media:GJ Input 21 Gene Network Sigmoid Model all strains w0 alpha0.001 output.zip
- Zipped images: Media:All strains w0 alpha 0.001.zip
- LSE: 0.73119
- min LSE: 0.551988
- Penalty term: 6.348851
- Counter: 41531 - 1 hour
- all-strains, alpha 0.001
June 23, 2015
- Today I am continuing to work on analyzing each gene (as per the outline above). Summary containing in-depth analysis of each gene can be found here: Media:Individual gene full analysis.pptx
- I am also working on a short joint presentation with other GRNmap and GRNsight members to give to President Snyder when he visits our lab meeting tomorrow. Our part of the presentation can be found here: Media:GRNmap GRNsight Short.pptx
June 24, 2013
- General lab meeting including concluding remarks from all groups (wet lab, data analysis and modeling, network visualization)
- My summary powerpoint given today can be found here: Media:SURP 2015 Final Presentation.pptx. The lab group's presentation can be found on the SURP 2015 page.
- Edited version of the short presentation given to President Snyder can be found here: Media:GRNmap GRNsight Short.pptx
- Our (Natalie and Grace) SURP 2015 abstract can be found here: https://github.com/kdahlquist/GRNmap/wiki/SURP-2015-Abstract-Grace-Johnson-and-Natalie-Williams
Fall 2015
September 9, 2015
- Document with notes on and links to data sets for transcription and degradation rates of yeast mRNA: Media:MRNARatesBibliography.docx
- Plan for next week: running new normalization procedure on the deletion data
- New workflow will be on Dahlquist page of open wet ware. R scripts will be available on this page. Data will be on lionshare. Completed Excel sheets should be shared back on lionshare.
- Natalie and I will run the normalization on wt, dHMO1, and Spar strains beginning with the Within-Strain ANOVA
- Split statistical analysis Excel sheets by strain so the files don't get too big
September 16, 2015
- Began work on updated Microarray data analysis workflow: Dahlquist:Microarray Data Analysis Workflow
- Completed steps 4-5, running new normalization script in R using data from lionshare (links in the workflow page.
- Began step 6, Statistical Analysis, by creating the Excel sheet. Workflow was followed up until heading "Within-Strain ANOVA." Excel sheet was saved on Google Drive
- There were 41,998 deletions of #VALUE! in the Excel sheet.
September 23, 2015
- Check everyone's master sheets to ensure there is zero difference after steps 1-2 and the first part of step 6 (Statistical analysis)
- Complete Within-Strain ANOVA section of Statistical analysis section (step 6) for the strains wt, Spar, and dHMO1
- File named 150916_GJ_ANOVA_Compiled_Normalized_Data.xlsx uploaded to lionshare and Google Drive
- Sanity Check: Number of genes significantly changed
- wt Sanity Check
P-Value Criteria Number of Genes % out of 6189 p<0.05 2600 42.0 p>0.01 1727 27.9 p>0.001 1015 16.4 p<0.0001 574 9.27 p>0.05 for Bonferroni corrected 302 4.88 p>0.05 for Benjamini and Hochberg-corrected 1936 31.28
- Spar Sanity Check
P-Value Criteria Number of Genes % out of 6189 p<0.05 2513 40.6 p>0.01 1637 26.45 p>0.001 852 13.77 p<0.0001 433 7.00 p>0.05 for Bonferroni corrected 232 3.75 p>0.05 for Benjamini and Hochberg-corrected 1791 28.9
- dHMO1 Sanity Check
P-Value Criteria Number of Genes % out of 6189 p<0.05 1019 16.46 p>0.01 432 6.98 p>0.001 119 1.92 p<0.0001 46 0.74 p>0.05 for Bonferroni corrected 25 0.404 p>0.05 for Benjamini and Hochberg-corrected 114 1.84
- Results of ANOVA were compared for each strain in Google doc shared between Kayla, Kirsten, Natalie, Tessa, and me.
October 7, 2015
- Note: LMU has transitioned away from lionshare. Links do not exist but lionshare can be access by the link lionshare.lmu.edu.
- Plan for abstract: each person generate 5 hypothesis networks. We will have 25 networks to compare to eachother.
- List Within Strain B&H significant genes - put into YEASTRACT (note data and version of database). We won't work with clustering at this point. Networks built simply on which TF's regulate genes that have a B&H p-value
- Different networks - how large of a network do we want? Medium connectivity. Fully document every choice to make this network.
- Don't worry about T test and Between stain ANOVA at this point. We'll just focus on network generation.
- Tessa: GLN3 and CIN5
- Natalie: HMO1 and Spar
- Grace: HAP4 and wt
- Kristen: ZAP1 and SWI4
- I will do Within-strain ANOVA for HAP4
- Completed and uploaded to Google Drive (Compiled Normalized Data doc) on October 7.
October 14, 2015
- Strains for network families (using "DNA binding only" in YEASTRACT)
- wt: Natalie
- dHAP4: Grace
- dCIN5: Kayla
- dGLN3: Tessa
- dZAP1: Kristen
- Pare down networks:
- First, eliminate genes that are unconnected
- Then, systematically pare down by p-value, one by one. For each elimination, check and get rid of unconnected genes. We want a gene list of less than 35 transcription factors. We want 15 - 35 range for our family of networks.
- For now, we will focus on wild type and deletion strains (not Spar)
- Two types of pare downs:
- First, just the genes from YEASTRACT
- Later, potentially adding in our deletion strain genes. Careful with elimination of TF's - deletion strain genes must stay in.
October 21, 2015
- Working on dHAP4 network family.
- Subamily with no addition of deletion strains
- 97 significant genes
- First run produces adjacency matrix with no unconnected genes.
- 96-62 genes: no unconnected genes
- 61 genes: Eliminate OPI1 and THI3 (unconnected)
- 59-55 genes: no unconnected genes
- 54 genes: Eliminate SPT10 (unconnected)
- 52-36 genes: no unconnected genes
- 35 genes: Eliminate BAS1 and RLM1 (unconnected)
- 33 genes: Visualize, see MIG1 and SPT20 are connected only to eachother. Eliminate.
- 30 gene network
- 29 genes: Eliminate MGA2, CST6, and PHO2
- 26 gene network
- 25 gene network
- 24 gene network
- 23 gene network
- 22 gene network
- 21 gene network
- 20 gene network
- 19 gene network
- 19 genes: Eliminate ACE2
- 17 gene network
- 16 gene network
- 15 gene network
- Total of 13 networks in this subfamily
- Subamily with addition of deletion strains
- CIN5, HAP4, SWI4, and ZAP1 were already on YEASTRACT's list of significant regulators. GLN3 and HMO1 were added, and all 6 deletion strain genes maintained throughout the pare-down.
- 99 "significant" genes
- 99-66 genes: no unconnected genes
- 65 genes: When INO4 is deleted, this eliminates OPI1 and THI3 (unconnected)
- 52-59 genes: no unconnected genes
- 58 genes: When STE12 is deleted, this eliminates SPT10 (unconnected)
- 56-41 genes: no unconnected genes
- 40 genes: When SPT23 is deleted, this eliminates BAS1 (unconnected)
- 39 genes: no unconnected genes
- 38 genes: when ABF1 is deleted, this eliminates CST6 and PHO2 (unconnected)
- 34 gene network
- 33 gene network
- 32 gene network
- 31 gene network
- 30 gene network
- 29 gene network
- 28 gene network
- 27 gene network Note: ASH1 must be kept to keep ACE2, ZAP1's only connection
- 26-15 gene network
- Total of 19 networks in this subfamily
October 28, 2015
- Number of networks in the subfamilies for all the strains
- dHAP4 (Grace)
- Regardless of deletion strains: 13
- Preserving deletion strains:19
- wt (Natalie)
- Regardless of deletion strains: 14
- Preserving deletion strains:19
- dGLN3 (Tessa)
- Regardless of deletion strains: 16
- Preserving deletion strains: *
- dZAP1 (Kristen)
- Regardless of deletion strains: 16
- Preserving deletion strains: 17
- dCIN5 (Kayla)
- Regardless of deletion strains:**
- Preserving deletion strains: 2*
- dHAP4 (Grace)
- Abstract for April Conference
- Ground in findings from summer - use abstract from the end of SURP. Introduce bio problem, introduce modeling, discuss results from summer. Introduce future work (what we're doing right now)
- Put on GitHub Wiki
- Looking at papers for mRNA degradation rates
- Wang (Grace)
- Info on the 200 transcription factors. Do all the TF's have corresponding data? Use Microsoft Access
- Turn half life into degradation rate -- look from Belle Spreadsheet (Dr. Dahlquist will provide)
- Belle_PNAS_06_SuppDataSet.xls: use this file to see how protein degradation rates were computed. — Kam D. Dahlquist 19:56, 28 October 2015 (EDT)
- Appears (by inspection) that most of the data is present in the Wang data set. Cannot format for conversion to degradation without the use of Microsoft Access.
- Maybe compare with protein degradation rates.
- Shalem (Natalie)
- Wang (Grace)
November 4, 2015
- Input sheets
- Use truncated normalized data
- Most protocol should be up for everything except degradation and production rates sheets
- Using Microsoft Access:
- Create an Excel spreadsheet with tabs containing expression for each strain (i.e. have one tab named "wt_expression", etc.).
- Use rounded normalized data for each strain
- Use only data from the 15, 30, and 60 timepoints --> use the LFC and not the average
- Find either a space or the #VALUE!, and replace with nothing (do not type anything in this space)
- Open Access, select External Data tab, then import Excel spreadsheet
- One tab is already available upon opening the program. Importing the sheet of interest creates a new tab instead of renaming the original tab (named Table1).
- To create a new table, go to the Create tab and select the Table button
- Select the file you wish to upload. Then select the tab that you want to build a database for.
- Choose my own primary key and select ID --> Systematic name of TFs; the systematic name is the primary key because it is unique to a specific transcription factor
- To rename a table, right click on the tab. Select Design View
- Name the new sheet whatever network you are creating it for (i.e. dHAP_35_network)
- Populate ID colony with genes from desired network (i.e. CIN5, GLN3). You can copy and paste these values.
- Select Query Design and select the tables of interest (Expression table and Network table)
- Both windows will show up
- Select ID from Network and Click, Drag, and Drop onto the ID of Expression Table
- Right click on connection formed and select Join Properties
- Make selection of Include ALL records from 'network'
- Drag Network down to the first Field and drag all records of Expression into the fields right of Network record (first field)
- Select make table and then name your table.
- You are now ready. Hit RUN.
- One tab is already available upon opening the program. Importing the sheet of interest creates a new tab instead of renaming the original tab (named Table1).
- Create an Excel spreadsheet with tabs containing expression for each strain (i.e. have one tab named "wt_expression", etc.).
- Wang mRNA degradation rate extraction from halflives
- Wang data has two trials of halflives. Microsoft access was used to extract the two half lives for each of the 203 transcription factors in the Harbison list (see Excel sheet provided by Dr. Dahlquist above).
- Nonzero data found for only 193 of the 203 transcription factors.
- Range of average: 6-80 minutes. Median of average: 16 minutes. Most standard deviations are under 10.
- Half lives for deletion strain genes:
- CIN5: 77, one data point
- GLN3: 17, st dev = 1.41
- HAP4: no data
- SWI4: 16, one data point
- HMO1: no data
- ZAP1: 23.5, st dev = 2.12
- Degradation rate calculated from half life by the equation =(LN(0.5))/"cell containing half life"
- Data sheet saved as "Wang_halflife_TF" and uploaded to Google Drive. Also available here: Media:Wang halflife TF.xlsx
- Begin generation of input sheets for network family testing
- Microsoft Access used to extract logFC data for t15, 30, and 60 for each strain (wt, dCIN5, dGLN3, dHAP4, dHMO1, dSWI4, and dZAP1) for:
- The largest network for the dHAP4, deletion strains added family (34 TF's)
- The largest network for the dHAP4, no deletion strains added family (30 TF's)
- Files uploaded to Google Drive.
- Microsoft Access used to extract logFC data for t15, 30, and 60 for each strain (wt, dCIN5, dGLN3, dHAP4, dHMO1, dSWI4, and dZAP1) for:
November 18, 2015
- Beginning to format input sheet for GRNmap for dHAP4 network family.
- Subfamily of dHAP4 family where deletion strains were added:
- STRAIN_log2_expression sheets created for the following strains:
- wt
- dCIN5
- dGLN3
- dHMO1
- dZAP1
- dHAP4
- For now, production and degradation rates will be taken from Belle (protein)
- Optimization parameters set according to standard used over the summer
- Network_weights contains network with initial weight guesses of 0 (from summer).
- STRAIN_log2_expression sheets created for the following strains:
December 1, 2015
- Input sheet (34_deletion_added) run on beta branch of GRNmap code.
- Bug found: Matlab calls cells with no expression data (blank cells are present because of unreliable data. This is a direct result of the new statistical analysis protocol for Fall 2015).
- New plan: run at least one sheet. Edit one input sheet to delete genes that have missing data. Start from largest network.
- Sheets formatted according to GitHub wiki instructions. (Instructions are for beta branch).
December 9, 2015
- Input sheet from last week successfully formatted to run on GRNmap v1.2. However, once graphs were created, the images were not saved and an output file was not created. This is due to an error in the code - the code can only read 5 expression sheets (because of output coloring), but my data file had 6.
- To end the semester, all files were saved on a CD with a directory and given to Dr. Dahlquist.
Spring 2016
January 16, 2016
- GRNmap beta code currently has three main bugs
- Mean square error is being compounded for each timepoint
- GRNmap cannot handle empty cell, which were present from elimination of unreliable
- Max number of strains is 5 in the code - only five color are available
- When reporting findings, be sure so specify:
- Branch of code used
- Date and time code was downloaded
- Post files on open wet ware with link to download. Name files appropriately. When reporting a bug, tag bug, functionality, no higher than 0.5
- Only tag data analysis if it is a purely data analysis problem (i.e. not if code bug problem)
- Tasks:
- Alphabetize all family sheets. This will make them more readable.
- easy for list sheets, just sort
- For network sheets (adjacency matrix), select entire matrix, custom sort for first column. Copy and paste special transposed (into new sheet). Then sort again for first column. Finally, transpose back.
- Populate with production rates from Belle data. (Dahlquist does not have time to look at mRNA data). Link to data is on GitHub Wiki.
- Starting with largest model, replace missing data with the average. Highlight cells that were populated with average. Don't leave equation in cell - paste over with past special, values.
- In largest network, have all strains. But eventually we will need to delete one of the strains (b/c we have 6 but GRNmap can only handle 5)
- Alphabetize all family sheets. This will make them more readable.
- All tasks completed for dHAP4 family according to GitHub issue #166.
- Family formatted for current beta code (using sheet called "How to format the input sheet for GRNmap (beta)")
- I edited the deletion-added subfamily (20 networks), while giving instructions to Maggie O'Neil about formatting the no-deletion subfamily (12 networks)
January 22, 2016
- Test run on GRNmap beta code using 15 gene deletion added network, which can be downloaded here: Media:GJ dHAP4 15 gene network deletion added input.xlsx
- Beta code is not ready yet - L-cure analysis run several times. Coders will fix this by next week
- Test run successful, meaning input sheets have been formatted correctly
- Maggie's file (no deletion added family) also runs successfully
- Purpose of network families: determine whether larger or smaller networks are modeled better
January 29, 2016
- Tasks for this week include:
- Editing abstract submitted to ASBMB to fit the 250 word list. Abstract in progress can be found here: Johnson and O'Neil LMU Symposium 2016
- Reformatting input sheets
- Running L-curve analysis on larges and smallest networks for both sub families of dHAP4 network family (4 networks total - each network GRNmap run will output 16 files for 16 alpha value iterations)
- Make 4 plots of LSE vs. penalty, plotting each of 16 alpha values
February 1, 2016
- Largest (34) and smallest (15) networks for the deletion added subfamily are being run on GRNmap beta code with the following parameter settings:
- L-curve = 1
- estimate_params = 1
- make graphs = 0 (making graphs caused the code to fail)
- production_function = sigmoid
- fix_b = 0
- fix_P = 0
- Runs begun at 1:20 pm on Monday, February 1, 2016
February 5, 2016
- Runs from February 1st still not completed on MATLAB
- 15 gene network is on alpha value run 13
- 34 gene network is on alpha value run 15 (and has been for 4 days)
- Last alpha values are not necessary for L-curve analysis, so L-curve analysis completed with the 12 (for 15 gene network) and 14 (for 34 gene network) output that were created.
- L-curve analysis
- LSE, penalty, minLSE, and iteration count were compiled for each alpha value. Each set came from a different output sheet (each sheet corresponds to a different alpha value.
- Compiled data file for both networks can be found here: Media:GJ L curves dHAP4 deletion added.xlsx
- Brandon Klein created an R script to create the L-curves from the alpha, penalty, and LSE columns from the sheets.
- Output L-curves plot LSE vs Penalty, labeling each alpha value point
- L-curve of 34 gene network: File:Deletion added dHAP4 34 genes plot.pdf
- L-curve of 15 gene network: File:Deletion added dHAP4 15 genes plot.pdf
- dHAP4 deletion added network subfamily
- Largest network: 34 genes, 102 edges: Media:DHAP4 34 gene network deletion added input GJ.xlsx
- Smallest network: 15 genes, 28 edges: Media:DHAP4 15 gene network deletion added input GJ.xlsx
February 7, 2016
- MATLAB run for largest (34-gene) network has completed.
- Alpha iterations 13-16 were added to the spreadsheet, but the L-curve image was not updated because it is clear the numbers follow the same trend (i.e. approaching the x-axis asymptotically).
- By similar reasoning (i.e. trendline is clear), the MATLAB run for the smallest network of the deletion added subfamily was aborted on February 11. (It has been running on alpha value 15 since 2/2/16. Counter was up to 10,780,000).
- All output files were saved on a flash drive with folder titles: GJ_deletionadded_largestnetwork_GRNmap
February 11, 2016
- Task for this week: Create parameter bar graphs as per GitHub issue #172 for w, p, and b for alphas = 0.01, 0.008, 0.005, 0.002, 0.001.
- Also note: Coders are up to date. Use the latest version ("latest release") to run code - for now v1.4.2.
- For this version, you can now make graphs for 6+ strains
- Experimental questions:
- What effect does the size of the network have on the performance/fit of the model?
- Evaluated by overall LSE's (LSE to min theoretical LSE) and individual gene MSE's. (MSE's tell us how well individual genes are modeled).
- Eventually, pick the best network (a particular size) and compare to 20+ random networks.
- What effect does the size of the network have on the performance/fit of the model?
February 14, 2016
- Reorganizing files of dHAP4 strains added subfamily (i.e. deletion strains added) with standardized naming convention as in Github issue #176 [[1]]
- 20 files:
- 15 nodes, 28 edges
- 16 nodes, 32 edges
- 17 nodes, 36 edges
- 18 nodes, 39 edges
- 19 nodes, 41 edges
- 20 nodes, 46 edges
- 21 nodes, 48 edges
- 22 nodes, 59 edges
- 23 nodes, 61 edges
- 24 nodes, 62 edges
- 25 nodes, 68 edges
- 26 nodes, 69 edges
- 27 nodes, 70 edges
- 28 nodes, 75 edges
- 29 nodes, 80 edges
- 30 nodes, 90 edges
- 31 nodes, 91 edges
- 32 nodes, 93 edges
- 33 nodes, 99 edges
- 34 nodes, 102 edges
- Parameter comparison: comparing optimized weights, production rates, and b's for the output files (from L-curve analysis) of alpha's 0.01, 0.008, 0.005, 0.002, and 0.001 (output file numbers 8, 9, 10, 11, and 12, respectively). As mentioned in GitHub issue #176 [[2]], Dr. Fitzpatrick notes that a=0.002 is probably the right choice. (It is slightly to the right of the "elbow" in the L-curve). This parameter comparison will serve as a check on the selection of this alpha value - we can make sure that 0.002 doesn't give us any crazy estimated parameters relative to neighboring alphas.
- Excel file with bar graph comparison for the smallest network of the dHAP4_strains-added subfamily can be found here: Media:15-genes 28-edges GJ-dHAP4-fam strains-added L-curve parameter-comparison.xlsx
February 19, 2016
- To finish ASAP
- Honors ambassadorial grant for Experimental Biology Conference in April
- Output parameter comparison for largest strains_added network (34 genes) for alpha = 0.01, 0.008, 0.005, 0.002, and 0.001
- To complete for the poster (two weeks after spring break)
- Stick with strains_added subfamily
- "Production runs:" Evaluate (with graphs) the networks of 15, 34, 25, 20, and 30 genes. (In that order, that way if we don't finish 20 or 30 we still have a good spread of data)
- Must use beta as it has the bug fix of being able to make graphs for 6+ strains
- "Production runs:" Evaluate (with graphs) the networks of 15, 34, 25, 20, and 30 genes. (In that order, that way if we don't finish 20 or 30 we still have a good spread of data)
February 23, 2016
- Used Brandon Klein's R script to convert the weights adjacency matrix to a list. Instructions can be found here: [[3]]
- Parameter comparison for the largest dHAP4 strains_added network containing 34 genes and 102 edges. Excel file with bar charts comparing w, p, and b optimized parameters for alphas = 0.01, 0.008, 0.005, 0.002, and 0.001 can be found here: Media:34-genes 102-edges GJ-dHAP4-fam strains-added L-curve parameter-comparison.xlsx
- As Dr. Fitzpatrick had mentioned, he thought 0.002 was an appropriate choice for the alpha based on its location in the L-curve.
- In this sheet, w and b parameters are fairly consistent across the five alpha values. For p, however, the optimized values for a=0.002 are consistently, significantly higher than other alphas. I am unsure of whether this is problematic if we choose a=0.002 moving forward.
February 26, 2016
- Confirm from GitHub issue #172 and #185: even when input sheets are formatted correctly for L-curve analysis, output files has the following mistakes:
- optimized_threshold_b sheet: places three zeros in cells C25, C28, and C35.
- Generated input sheets (i.e. filename_#) have the following mistakes:
- production_rates sheet: values shifted over one column to the right
- threshold_b sheet: values shifted over one column to the right
- This error may be connected to the strange weight parameters output in the file on Feb 23
- Redo estimation of 34 gene run 11 (run with alpha = 0.002).
- Master code
- Reformat input sheet (previously: Media:DHAP4 34 gene network deletion added input GJ 11.xlsx) to fix threshold_b and production_rates sheets. Set L-curve to 0, estimate_params to 1, make_graphs to 0, fix_b and fix_p to 0.
- input sheet for this run: Media:Edited dHAP4 34 gene network deletion added input GJ 11.xlsx
February 27, 2016
- I went to the S120 computer lab to check on Grace's run. It had completed with 200700 iterations at 11:51 PM. — Kam D. Dahlquist 17:14, 27 February 2016 (EST)
- Here is the input file, output files, and optimization diagnostic zipped together.
- In the future, you should run things from the T: drive. Stuff on the T: drive will not be deleted upon restarting the computer.
- I have compared the output file posted above by Dr. Dahlquist to the outputs of the L-curve analysis and found 0 difference in the optimized parameters for w, p, and b's.
- The output file has the same issues as the L-curve output files - the "optimized_threshold_b" sheet has added zeros in column C.
March 2, 2016
- Compare LSE to min LSE by dividing LSE/minLSE -- in Excel doc titled "GJ_dHAP4-fam_strains-added_LSE-compare.xlsx". Ask Dahlquist if this is how the comparison should be run.
March 6, 2016
- Verify with Dr. Dahlquist that GRNsight reads "network_optimized_weights" sheet in GRNmap output file
- Production runs began as per GitHub issue #187 [[4]]
- 15_gene run finished after 2.5 hours
- 20_gene run finished after 4 hours
- 25_gene run finished after 11 hours
- 30_gene run finished after 8 hours
- 34_gene run finished after 18 hours
- Files saved on Thaw space of each Seaver 120 computer and my flash drive
- Began poster formatting. Rough draft here: Media:GJ MO GRNmap Spring2015 dHAP4 poster draft1.pptx
- Background section nearly complete
- Set up formatting for data that will be compiled once production runs complete tomorrow.
March 7, 2016
- Another updated version of the poster here: Media:GJ MO GRNmap Spring2015 dHAP4 poster draft2.pptx
- Added LSE table and GRNsight visualizations
March 10, 2016
- Updated version of the poster here: Media:GJ MO GRNmap Spring2015 dHAP4 poster draft3.pptx
- With the help of Maggie O'Neil, added graph outputs for interesting genes, parameter comparison, and MSE/ANOVA values
- Reran 20_gene runs as there was an error in the first set
March 27, 2016
- Updating poster for ASBMB Conference
- Changing title, rewording a few things. Draft found here: Media:GJ MO GRNmap Spring2015 dHAP4 poster ASBMB draft.pptx
- Running three random 15 gene dHAP4 networks through GRNmap (i.e. networks containing the same nodes and the same number of edges, but with different connectivity). Run time for all three was just under 2 hours.
- Relative to the network given by YEASTRACT (and used in our poster) LSE's were higher in one random network (Rand2) but smaller in the other two random networks (Rand1 and Rand3).
- No discernible pattern found when comparing optimized production rates and threshold b's across the YEASTRACT network and the three random networks, except perhaps that estimated production rates tend to be higher for the random networks.
- Dynamics of output graphs are distinct across all four networks.
- GRNmap output files of the random production runs can be found here:
- Excel file containing LSE, production rate, and threshold b comparisons can be found here: Media:GJ-dHAP4-fam strains-added Sigmoid estimation Poster-analysis (1).xlsx
April 15, 2015
- In the last few weeks of the semester, we are compiling data and making charts to wrap up our analysis of dHAP4 (also in comparison to Tessa's dGLN3 and Kristen's dZAP1 families). In a powerpoint, the following information will be compiled
- dHAP4 family LSE/minLSE table
- LSE vs. total parameters plot for dHAP4 family (15, 20, 25, 30, and 34 nodes)
- LSE/minLSE ratios of 13 random 15 gene 28 edge networks normalized to LSE of YEASTRACT network
- Degree distribution charts of dHAP4 family
- Degree distribution charts of 13 random networks
- Any similarities of rand3, rand6, rand9, rand11, and rand13 with YEASTRACT network? Evaluate by visual inspection (GRNsight networks) and matrix comparison.
- Comparison of genes in smallest and largest networks across the three strains (dHAP4, dGLN3, and dZAP1).