Alyssa N Gomes Week 12 Journal
This week we are working with YEASTRACT. Our experiment analyzes the 6000 genes in yeast (targets) and looks more directly at the 250 that code for transcription factors (regulators). We'll be looking at three kinds of evidence: binding studies (Lee et al. 2002), expression studies, and the location of DNA binding sites of transcription factors. All three have potential for false positives, and binding studies also have a chance of false negatives. What we need is a list of transcription factors from which we can generate a gene regulatory network.
Methods
- Turn on the file extensions
- Download your selected gene profile from LionShare and open it in Excel
- On the YEASTRACT website, click "Rank by TF" on the left-hand side
- Paste your list of Gene Symbols from your selected gene profile into the Target Genes window
- Check the box "Check for all TFs" and make sure all are set on the default (DNA binding plus expression evidence, TF acting as activator or inhibitor) and click Search
- The p values colored green are considered "significant", the ones colored yellow are considered "borderline significant" and the ones colored pink are considered "not significant"
- How many transcription factors are green or "significant"? 24
List the "significant" transcription factors on your wiki page, along with the corresponding "% in user set", "% in YEASTRACT", and "p value".
Are CIN5, GLN3, HMO1, and ZAP1 on the list? The significant factors are:
- There are 24 transcription factors that are green or "significant"
- "Significant" transcription factors:
- Sfp1p
- % in user set: 79.53%
- % in Yeastract: 9.41%
- p-value: 0
- Fkh2p
- % in user set: 21.44%
- % in Yeastract: 15.76%
- p-value: 0
- Yhp1p
- % in user set: 38.60%
- % in Yeastract: 15.38%
- p-value: 0
- Yox1p
- % in user set: 41.13%
- % in Yeastract: 14.78%
- p-value: 0
- Cyc8p
- % in user set: 0.39%
- % in Yeastract: 100.00%
- p-value: 0
- YLR278C
- % in user set: 14.62%
- % in Yeastract: 17.65%
- p-value: 2.9E-14
- Ace2p
- % in user set: 81.29%
- % in Yeastract: 8.73%
- p-value: 6.4E-14
- Rif1p
- % in user set: 12.87%
- % in Yeastract: 18.44%
- p-value: 1.25E-13
- Msn2p
- % in user set: 63.35%
- % in Yeastract: 9.53%
- p-value: 1.69E-13
- Cse2p
- % in user set: 21.25%
- % in Yeastract: 14.05%
- p-value: 4.67E-13
- Stb5p
- % in user set: 27.88%
- % in Yeastract: 12.31%
- p-value: 2.672E-12
- Ndt80p
- % in user set: 15.59%
- % in Yeastract: 14.08%
- p-value: 7.8902E-10
- Asg1p
- % in user set: 8.77%
- % in Yeastract: 17.58%
- p-value: 4.5755E-09
- Msn4p
- % in user set: 47.95%
- % in Yeastract: 9.59%
- p-value: 4.6451E-09
- Mig2p
- % in user set: 9.75%
- % in Yeastract: 16.29%
- p-value: 1.0326E-08
- Snf2p
- % in user set: 40.35%
- % in Yeastract: 9.95%
- p-value: 1.0656E-08
- Swi5p
- % in user set: 38.21%
- % in Yeastract: 10.08%
- p-value: 1.1467E-08
- Spt20p
- % in user set: 38.01%
- % in Yeastract: 10.07%
- p-value: 1.4665E-08
- Snf6p
- % in user set: 46.98%
- % in Yeastract: 9.13%
- p-value: 9.9913E-07
- Pdr1p
- % in user set: 28.46%
- % in Yeastract: 10.15%
- p-value: 2.4577E-06
- Gcr2p
- % in user set: 25.73%
- % in Yeastract: 10.09%
- p-value: 7.7693E-06
- Gat3p
- % in user set: 10.92%
- % in Yeastract: 12.56%
- p-value: 1.1840E-05
- Mcm1p
- % in user set: 31.19%
- % in Yeastract: 9.58%
- p-value: 1.4349E-05
- Pop2p
- % in user set: 5.46%
- % in Yeastract: 15.64%
- p-value: 2.8483E-05
- CIN5, GLN3, HMO1, and ZAP1 are not on the significant list; they are listed among the pink ("not significant") transcription factors
- 22 transcription factors appeared on both the Wild Type and GLN3 lists. They were:
- Ace2p
- Asg1p
- Cse2p
- Cyc8p
- Fkh2p
- Gcr2p
- Mcm1p
- Mig2p
- Msn2p
- Msn4p
- Ndt80p
- Pdr1p
- Rif1p
- Sfp1p
- Snf2p
- Snf6p
- Spt20p
- Stb5p
- Swi5p
- Yhp1p
- YLR278C
- Yox1p
- Copy and paste the transcription factors you identified (plus CIN5, GLN3, HMO1, and ZAP1) into both the "Transcription factors" field and the "Target ORF/Genes" field of YEASTRACT
- Select "DNA binding plus expression evidence" and generate.
- Repeat twice more, checking "Only DNA binding evidence" and then "DNA binding and expression evidence" instead
- In the Excel file for each regulatory gene network, delete any rows/columns that consist only of 0's and save in Excel .xlsx format.
- Make sure there are still 15-30 transcription factors in each matrix after pruning.
- Insert a new sheet named "network" into each Excel file, then copy and paste the matrix into it. Use the transpose option on the whole matrix and, afterward, on the set of factors in column A in order to organize the sheet.
- Create a new sheet called "Degree".
- In the first empty cell of column A (below the list of factors), type "Out-Degree" and get the sum of each column in that row by typing "=SUM()", selecting all of the values above inside the parentheses.
- In the first empty cell of the top row (to the right of the factors), type "In-Degree" and repeat the same step, except that here a sum is calculated for each row.
- Make sure to include the bottom-right sum, which is the total of the in-degree values (and equally of the out-degree values). This is the number of edges.
- Now we will look at some of the network properties. Again, repeat these steps for each of the three gene regulatory matrices you generated above. See this file for an example of how to do the following instructions.
- Create three columns to the right of everything, labeled "Frequency", "In-degree total", and "Out-degree total". Find the highest value of either the in-degree or out-degree, and under "Frequency" list the values from 1 to that number. Then, in the other two columns, list the number of times each value has shown up among the in-degrees and the out-degrees, respectively.
- Make a bar graph comparing the frequencies of values for In and Out degree totals. Put these graphs into your PPT.
- In GRNSight, upload each Excel sheet. If the sheet is done right, GRNSight will output a graph of the network. Screenshot these and put them into your PowerPoint. Do these steps for all three networks.
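The degree steps above can also be sketched outside of Excel. This is only a minimal illustration on a made-up 4x4 matrix, not data from my networks. It assumes the layout after transposing: columns are the regulators and rows are the targets, so a column sum is an out-degree and a row sum is an in-degree. The names `factors` and `network` are hypothetical.

```python
from collections import Counter

# Hypothetical 4x4 network (NOT data from this journal).
# Assumed layout: network[row][col] == 1 means the column factor
# regulates the row factor (columns = regulators, rows = targets).
factors = ["A", "B", "C", "D"]
network = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]

# Out-degree: sum down each column. In-degree: sum across each row.
out_degree = [sum(col) for col in zip(*network)]
in_degree = [sum(row) for row in network]

# Each edge is counted once in either total, so the two sums agree.
# This shared total is the "bottom-right" cell: the number of edges.
edges = sum(out_degree)
assert edges == sum(in_degree)

# Degree distribution: how many factors have each degree value.
in_counts = Counter(in_degree)
out_counts = Counter(out_degree)
highest = max(in_degree + out_degree)

print("Frequency", "In-degree total", "Out-degree total")
for k in range(1, highest + 1):
    print(k, in_counts.get(k, 0), out_counts.get(k, 0))
```

If the matrix has not been transposed (rows as regulators), the two sums simply swap roles.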
- Discuss
- Both Tessa and I had 15-30 transcription factors so we did not undergo the procedure of adding more
- The Excel sheet with DNA binding plus expression had 153 edges.
- The Excel Sheet with DNA binding had 33 edges.
- The DNA Binding and Expression had 10 factors.
- The DNA Binding and Expression had 8 edges.
- PPT: Tessa Morris and Alyssa Gomes PPT
- Excel Sheet of DNA binding plus expression: I
- Excel Sheet of DNA binding Only: II
- Excel Sheet of DNA Binding and Expression: III
- Write a paragraph discussing and explaining the results of each aspect of today's work.
- Determining candidate transcription factors that regulate a cluster of genes from your dataset: To determine the candidate transcription factors that regulate the cluster of genes from my dataset, we entered specific parameters into YEASTRACT, which ranked the transcription factors by significance (smallest p-value first). Comparing GLN3 and Wild Type, both for Profile 45, 22 of the 24 Wild Type transcription factors were in common. This suggests a link between GLN3 and Wild Type transcription for the selected cluster of genes. In the Wild Type genes, none of CIN5, GLN3, HMO1, and ZAP1, which were expected to show up as significant, actually did. This makes me curious to see what the other gene profiles list for the significance of these factors.
- Creating three candidate gene regulatory networks: When generating the gene regulatory networks, we separated them by DNA binding PLUS expression, Only DNA binding, and DNA binding AND expression. To be honest, I'm not quite sure what the difference between these three gene regulatory networks is in terms of how the genes have been sorted, but in the order listed, the number of factors in each set went down significantly from one to the next.
- Determining the total number of edges and degree distribution of your three gene regulatory networks: As the number of factors went down from Plus expression to Only DNA binding to DNA binding and expression, the number of edges decreased as well. Too many edges make the model dense, with too many parameters, so the PLUS expression network, with 133 edges, may be too dense. The Only DNA binding network had 33 edges, which was the closest to the target amount of 40-50, and its GRNSight graph displayed very clearly. The DNA binding and Expression network had fewer than the expected 15-30 transcription factors, so it had only 8 edges. This possibly makes it unusable for research purposes, as it may not give enough information.
- Visualizing the networks: Looking at the bar graphs, we see the varying frequencies and the differences between the In and Out degree totals across the board. Refer to the PPT for further analysis of this. Looking at the GRNSight output, it is hard to say what these graphs mean, but the networks become simpler as fewer transcription factors are included.
- Choosing a particular gene regulatory network to pursue for the modeling: I have assumed that the preferred model will be the DNA binding ONLY one, as stated above. I am not sure what occurs next, but its 33 edges are close enough to the preferred 40-50 edges to move on. This network has enough factors to give some information, yet few enough that it won't over-clutter the model and confuse the assumptions to be made.
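The 22-of-24 overlap mentioned in the discussion can be double-checked with a set comparison. This is a small sketch that uses only the two lists written out in this journal, assuming the 24 significant factors are the Wild Type list. The full GLN3 list is not recorded here, so the sketch starts from the shared set instead; with both full lists in hand, the shared set would be `wild_type & gln3` (where `gln3` is a hypothetical name for the second list).

```python
# The 24 significant (green) factors listed in this journal.
wild_type = {
    "Sfp1p", "Fkh2p", "Yhp1p", "Yox1p", "Cyc8p", "YLR278C",
    "Ace2p", "Rif1p", "Msn2p", "Cse2p", "Stb5p", "Ndt80p",
    "Asg1p", "Msn4p", "Mig2p", "Snf2p", "Swi5p", "Spt20p",
    "Snf6p", "Pdr1p", "Gcr2p", "Gat3p", "Mcm1p", "Pop2p",
}

# The 22 factors recorded in this journal as appearing on both lists.
shared = {
    "Ace2p", "Asg1p", "Cse2p", "Cyc8p", "Fkh2p", "Gcr2p",
    "Mcm1p", "Mig2p", "Msn2p", "Msn4p", "Ndt80p", "Pdr1p",
    "Rif1p", "Sfp1p", "Snf2p", "Snf6p", "Spt20p", "Stb5p",
    "Swi5p", "Yhp1p", "YLR278C", "Yox1p",
}

# Every shared factor should also be on the significant list.
assert shared <= wild_type

# Factors that are significant here but not shared.
only_wild_type = sorted(wild_type - shared)
print(len(wild_type), len(shared), only_wild_type)
# prints: 24 22 ['Gat3p', 'Pop2p']
```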
Revisions
- After discovering that I deleted the columns/rows incorrectly when creating my network, we were told to go back and redo all of our last homework assignment.
- We were only supposed to delete a row and column when both had only 0's in them.
- We also found out that both Tessa and I must have the same network with the same genes.
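The corrected pruning rule can be sketched as follows. This reflects my reading of the revision (a factor is removed only when its row and its column are both all 0's), and the matrix is made up for illustration. The function name `prune` is hypothetical.

```python
def prune(factors, matrix):
    """Drop a factor only when its row AND its column are all zeros.

    Rows and columns are removed together, so the matrix stays
    square with the same factors on both axes.
    """
    n = len(factors)
    keep = [
        i for i in range(n)
        if any(matrix[i]) or any(matrix[r][i] for r in range(n))
    ]
    kept_factors = [factors[i] for i in keep]
    kept_matrix = [[matrix[r][c] for c in keep] for r in keep]
    return kept_factors, kept_matrix


# Hypothetical example (NOT data from this journal).
factors = ["A", "B", "C", "D"]
matrix = [
    [0, 1, 0, 0],  # A: row has a 1, so A is kept
    [0, 0, 0, 0],  # B: row is all zeros, but B's column has a 1, so kept
    [0, 0, 0, 0],  # C: row and column are both all zeros, so dropped
    [1, 0, 0, 0],  # D: row has a 1, so D is kept
]

kept_factors, kept_matrix = prune(factors, matrix)
print(kept_factors)
for row in kept_matrix:
    print(row)
```

Deleting every all-zero row or all-zero column on its own, as I did the first time, would have removed B here even though it still takes part in an edge.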