Final Presentation Link
Complete Microarray Data Analysis
- Here is the file containing all of the relevant data: All Data
- In the data set that I used, from the Fontan et al. article (ArrayExpress), the gene IDs that correspond to each Reporter Identifier were not listed. As a result, it was impossible to confirm that the Gene IDs were assigned correctly. Because of this, the results I obtain in the sanity check will likely differ from those in the paper. Without this information from the authors of the paper, the discrepancy cannot be resolved.
- For the data, I calculated the meta fold change, which is the fold change of the sigB mutant divided by the fold change of the wild type.
- Sanity check methods from Week 13.
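The meta fold change described above can be sketched in a few lines of Python. This is only an illustration of the ratio, not the spreadsheet formula actually used: the function names and the example numbers are hypothetical, and the base-2 logarithm is an assumption (it is consistent with a fold change of 1.8 corresponding to a log fold change of about 0.85).

```python
import math


def meta_fold_change(mutant_fc: float, wild_type_fc: float) -> float:
    """Fold change of the sigB mutant divided by the fold change of the wild type."""
    if wild_type_fc == 0:
        raise ValueError("wild-type fold change must be non-zero")
    return mutant_fc / wild_type_fc


def log2_meta_fold_change(mutant_fc: float, wild_type_fc: float) -> float:
    """Base-2 log of the meta fold change (base 2 is an assumption)."""
    return math.log2(meta_fold_change(mutant_fc, wild_type_fc))


# Hypothetical values for a single reporter, chosen only for illustration.
example_ratio = meta_fold_change(3.6, 2.0)
example_log_ratio = log2_meta_fold_change(3.6, 2.0)
print(round(example_ratio, 2), round(example_log_ratio, 2))
```

A ratio above 1 (log ratio above 0) means the mutant responded more strongly than the wild type; a ratio below 1 means it responded less strongly.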
Sanity Check: Number of genes significantly changed
Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.
- Open your spreadsheet and go to the "forGenMAPP_norm_KD" tab.
- Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
- Click on the drop-down arrow on your "Pval_SigB-vs-wt" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
- How many genes have p value < 0.05?
- What about p < 0.01?
- What about p < 0.001?
- What about p < 0.0001?
- When we use a p value cut-off of p < 0.05, what we are saying is that, if there were no real change, we would see a gene expression change that deviates this far from zero less than 5% of the time by chance.
- We have just performed 3924 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 196 times (0.05 × 3924 ≈ 196). If we have more than 196 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know which ones.
- The "AvgLogFC_SigB-vs-WT" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
- Keeping the "Pvalue" filter at p < 0.05, filter the "AvgLogFC_SigB-vs-WT" column to show all genes with an average log fold change greater than zero. How many are there?
- Keeping the "Pvalue" filter at p < 0.05, filter the "AvgLogFC_SigB-vs-WT" column to show all genes with an average log fold change less than zero. How many are there?
- What about an average log fold change of > 0.25 and p < 0.05?
- Or an average log fold change of < -0.25 and p < 0.05? (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
|p < 0.05
|p < 0.01
|p < 0.001
|B-H p < 0.05
|Bon p < 0.05
- Table 1: The number of genes at various p value cutoffs. At each cutoff, we want the percentage of genes that pass to be larger than the cutoff itself. In the case of the unadjusted p values below 0.05, 0.01, and 0.001, the percentage of genes is, in fact, larger, indicating that many of those genes are significant. However, the B-H and Bonferroni corrected p values are far stricter: neither comes close to 5% of the genes, and only 2 of the 3,924 genes have a Bonferroni p value of less than 0.05.
- In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a movable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off. For the GenMAPP analysis below, we will use the fold change cut-off of greater than 0.25 or less than -0.25 and the p value cut off of p < 0.05 for our analysis because we want to include several hundred genes in our analysis.
- What criteria did your paper use to determine a significant gene expression change? How does it compare to our method?
- The paper used a false discovery rate of less than 0.02 and a meta fold change of at least 1.8 to determine whether a gene expression change was significant. This is similar in structure to our method, but stricter. Their fold change cut-off is much higher, at 1.8 instead of 0.25. They also use the regular fold change rather than the log fold change. With the log taken, their cut-off is about 0.85, still higher than ours. Also, they use the false discovery rate instead of the unadjusted p value. The false discovery rate is related to the p value, but it corrects for multiple testing by controlling the expected proportion of false positives among the genes called significant. The maximum false discovery rate they chose was 2%, which is stricter than our p value cut-off of 0.05. The false discovery rate was calculated by the Significance Analysis of Microarrays (SAM) program (Totallab, University of Pennsylvania).
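The counting logic behind this sanity check can be sketched in Python. This is a minimal illustration, not the spreadsheet procedure itself: all function names are hypothetical, the Benjamini-Hochberg function is the standard step-up adjustment (the spreadsheet may have computed its B-H column differently), and the fold change conversions assume base-2 logs.

```python
import math

N_TESTS = 3924  # number of t tests reported above


def count_below(p_values, cutoff):
    """Number of genes whose p value is below the cutoff."""
    return sum(1 for p in p_values if p < cutoff)


def expected_by_chance(n_tests, cutoff):
    """Number of tests expected to pass the cutoff if nothing truly changed."""
    return n_tests * cutoff


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p value by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Standard Benjamini-Hochberg step-up adjusted p values, in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of chance hits at each unadjusted cutoff.
for cutoff in (0.05, 0.01, 0.001, 0.0001):
    print(cutoff, round(expected_by_chance(N_TESTS, cutoff), 1))

# Converting between the two fold change scales (base-2 logs assumed).
paper_cutoff_as_log = math.log2(1.8)  # roughly 0.85
our_cutoff_as_ratio = 2 ** 0.25       # roughly 1.19, i.e. about a 20% change
```

Comparing `count_below(...)` on the real p value column with `expected_by_chance(...)` at the same cutoff reproduces the reasoning in Table 1: a count well above the chance expectation suggests that some of those genes are truly changed.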
Sanity Check: Compare individual genes with known data
- Look in your paper for genes that are specifically mentioned. What are their fold changes and p values in the paper? Are they significantly changed in your analysis?
- The genes below are only those mentioned in the paper's discussion of gene expression changes during SDS stress. Genes mentioned in regard to oxidative or diamide stress are not included. The p values and fold changes are not given. To illustrate the difference in fold changes between the paper and my own analysis, my fold changes are also given.
- The information below comes from the supplemental materials located here, specifically from table 2S.
||Paper Fold Change
||Paper Log Fold Change
||My Fold Change
||Significant in my analysis?
||No, p = 0.2199
||Yes, p = 0.0138
||No, p = 0.0596
||No, p = 0.3792
||No, p = 0.3634
||No, p = 0.8018
||No, p = 0.1409
- Table 2: The p-values and fold changes for the genes discussed in the paper. The extreme discrepancy is likely due to our having to guess which gene corresponded to each reporter ID. Note that only a single gene has a significant p value.
Complete Microarray Data Analysis
- First, I launched GenMAPP. I selected Data > Choose Gene Database, and selected the M. tuberculosis database file. I selected "Expression Dataset Manager" and loaded the txt file containing the forGenMAPP_norm_KD sheet.
- GenMAPP detected 45 errors during conversion.
- I created a new Color Set named "SigB_vs_WT_SDS" and used the Gene Value of "AvgLogFC_SigB-vs-WT". I created four criteria, each one having a different color.
- The first, IncreasedBH, had the following criteria: [AvgLogFC_SigB-vs-WT] > 0.25 AND [B-H_Pval_SigB-vs-WT] < 0.05
- The second, DecreasedBH, had the following criteria: [AvgLogFC_SigB-vs-WT] < -0.25 AND [B-H_Pval_SigB-vs-WT] < 0.05
- The third, Increased, had the following criteria: [AvgLogFC_SigB-vs-WT] > 0.25 AND [Pval_SigB-vs-wt] < 0.05
- The fourth, Decreased, had the following criteria: [AvgLogFC_SigB-vs-WT] < -0.25 AND [Pval_SigB-vs-wt] < 0.05
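The four criteria above can be written out as simple predicates. This sketch only restates the logic in Python; it is not how GenMAPP evaluates a Color Set internally. The column names come from the criteria above, while the `CRITERIA` table, the `matching_criteria` helper, and the example gene are hypothetical.

```python
FC = "AvgLogFC_SigB-vs-WT"
PVAL = "Pval_SigB-vs-wt"
BH_PVAL = "B-H_Pval_SigB-vs-WT"

# Each criterion maps a name to a test on one gene's row (a dict of column values).
CRITERIA = {
    "IncreasedBH": lambda g: g[FC] > 0.25 and g[BH_PVAL] < 0.05,
    "DecreasedBH": lambda g: g[FC] < -0.25 and g[BH_PVAL] < 0.05,
    "Increased": lambda g: g[FC] > 0.25 and g[PVAL] < 0.05,
    "Decreased": lambda g: g[FC] < -0.25 and g[PVAL] < 0.05,
}


def matching_criteria(gene):
    """Return the names of every criterion the gene satisfies, in the listed order."""
    return [name for name, test in CRITERIA.items() if test(gene)]


# Hypothetical gene row, for illustration only.
example_gene = {FC: 0.40, PVAL: 0.01, BH_PVAL: 0.03}
print(matching_criteria(example_gene))
```

Because the B-H criteria are stricter versions of the unadjusted ones, a gene can satisfy both "IncreasedBH" and "Increased" (or both "DecreasedBH" and "Decreased") at the same time.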
- Afterward, I selected Tools > MAPPFinder, then "Calculate New Results". Then I selected "Ok". After a time, the MAPP loaded.
- After examining the results, I decided to use the sheet labeled "MAPPFinder_Results_Fontan_NA-Criterion2-GO.txt", which corresponded to the "Increased" criterion, and the sheet labeled "MAPPFinder_Results_Fontan_NA-Criterion3-GO.txt", which corresponded to the "Decreased" criterion. Both of these files are available in the "All Data" file at the top of the page.
- The two txt files above were opened in Excel. Then, the following filters were applied:
- Z score > 2
- PermuteP < 0.05
- Number changed ≥ 2
|pyrimidine nucleoside triphosphate metabolic process
|dUTP diphosphatase activity
|vitamin biosynthetic process
|avoidance of defenses of other organism involved in symbiotic interaction
|water-soluble vitamin metabolic process
|phosphoric diester hydrolase activity
|guanyl ribonucleotide binding
|cobalamin metabolic process
- Table 3: The MAPPFinder results for increased expression in 10 GO terms.
|nucleic acid binding
|ether hydrolase activity
|macromolecule metabolic process
|cellular response to oxygen levels
|pseudouridine synthase activity
|zinc ion binding
- Table 4: The MAPPFinder results for decreased expression in 10 GO terms.
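The three filters applied to the MAPPFinder results (Z score > 2, PermuteP < 0.05, Number changed ≥ 2) can be sketched as a single predicate. The dictionary keys below are hypothetical stand-ins for the columns of the results file, and the example rows are made up purely to exercise the function; they are not results from this analysis.

```python
def passes_filters(row, min_z=2.0, max_permute_p=0.05, min_changed=2):
    """True if a GO term row passes all three filters used above."""
    return (
        row["z_score"] > min_z
        and row["permute_p"] < max_permute_p
        and row["number_changed"] >= min_changed
    )


def filter_go_terms(rows):
    """Keep only the GO term rows that pass every filter."""
    return [row for row in rows if passes_filters(row)]


# Hypothetical rows, for illustration only.
example_rows = [
    {"go_name": "term A", "z_score": 3.1, "permute_p": 0.01, "number_changed": 4},
    {"go_name": "term B", "z_score": 2.5, "permute_p": 0.20, "number_changed": 5},
    {"go_name": "term C", "z_score": 4.0, "permute_p": 0.00, "number_changed": 1},
]
kept = [row["go_name"] for row in filter_go_terms(example_rows)]
print(kept)
```

Requiring at least two changed genes removes terms whose high Z score rests on a single gene.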
- Using the data, I sorted the genes from the Fontan Dataset by p-value and found the functions of the top 10 most significant genes.
||Putative Protein Function
||Phosphate import ATP-binding protein pstB 1
||Part of the ABC transporter complex pstSACB involved in phosphate import. Responsible for energy coupling to the transport system.
||Possible membrane acyltransferase
||transferase activity, transferring acyl groups other than amino-acyl groups
||Probable conserved integral membrane protein
||cell wall, plasma membrane
||Uncharacterized protein Rv1708/MT1749
||May play a role in septum formation.
||AsnC family transcriptional regulator
||sequence-specific DNA binding transcription factor activity
||Isocitrate dehydrogenase [NADP]
||isocitrate dehydrogenase (NADP+) activity, protein homodimerization activity
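Selecting the most significant genes amounts to sorting by p value and keeping the first ten rows. A minimal sketch follows, assuming each gene is a dictionary keyed by the spreadsheet column names; the `top_genes_by_p` helper and the example rows are hypothetical.

```python
def top_genes_by_p(rows, n=10, p_key="Pval_SigB-vs-wt"):
    """Return the n rows with the smallest p values, most significant first."""
    return sorted(rows, key=lambda row: row[p_key])[:n]


# Hypothetical rows, for illustration only.
example_rows = [
    {"ID": "gene1", "Pval_SigB-vs-wt": 0.20},
    {"ID": "gene2", "Pval_SigB-vs-wt": 0.001},
    {"ID": "gene3", "Pval_SigB-vs-wt": 0.03},
]
top_two = [row["ID"] for row in top_genes_by_p(example_rows, n=2)]
print(top_two)
```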
BIOL 368, Fall 2014