BIOL368/F14:Nicole Anguiano Week 15

From OpenWetWare

Final Presentation Link

Final Presentation

Complete Microarray Data Analysis

  • Here is the file containing all of the relevant data: All Data
  • In the data set that I used, from the Fontan et al. article (ArrayExpress), the gene IDs that correspond to each Reporter Identifier were not listed, so it was impossible to confirm that the gene IDs were assigned correctly. Because of this, the results of my sanity check will likely not match the data in the paper. Without further information from the authors of the paper, this cannot be resolved.
  • For the data, I calculated the meta fold change, which is the fold change of the sigB mutant divided by the fold change of the wild type.
  • Sanity check methods from Week 13.
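
The meta fold change described above is a ratio of two fold changes, so on a log2 scale the division becomes a subtraction. A minimal sketch, assuming both inputs are regular (non-log) fold changes; the function names and the example numbers are hypothetical and are not taken from the data set:

```python
import math

def meta_fold_change(fc_sigb_mutant, fc_wild_type):
    """Fold change of the sigB mutant divided by the fold change of the wild type."""
    if fc_wild_type == 0:
        raise ValueError("wild-type fold change must be non-zero")
    return fc_sigb_mutant / fc_wild_type

def log2_meta_fold_change(fc_sigb_mutant, fc_wild_type):
    """Same quantity on a log2 scale: log2(a / b) = log2(a) - log2(b)."""
    return math.log2(fc_sigb_mutant) - math.log2(fc_wild_type)

# Hypothetical example values, for illustration only
mfc = meta_fold_change(3.0, 1.5)
log_mfc = log2_meta_fold_change(3.0, 1.5)
```

A meta fold change above 1 (log2 above 0) means the gene responded more strongly in the sigB mutant than in the wild type; below 1 means it responded less.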

Sanity Check: Number of genes significantly changed

Before we move on to the GenMAPP/MAPPFinder analysis, we want to perform a sanity check to make sure that we performed our data analysis correctly. We are going to find out the number of genes that are significantly changed at various p value cut-offs and also compare our data analysis with the published results.

  • Open your spreadsheet and go to the "forGenMAPP_norm_KD" tab.
  • Click on cell A1 and select the menu item Data > Filter > Autofilter. Little drop-down arrows should appear at the top of each column. This will enable us to filter the data according to criteria we set.
  • Click on the drop-down arrow on your "Pval_SigB-vs-wt" column. Select "Custom". In the window that appears, set a criterion that will filter your data so that the Pvalue has to be less than 0.05.
    • How many genes have p value < 0.05?
      • 655 of 3924 (16.69%).
    • What about p < 0.01?
      • 252 of 3924 (6.42%).
    • What about p < 0.001?
      • 70 of 3924 (1.78%).
    • What about p < 0.0001?
      • 15 of 3924 (0.38%).
  • When we use a p value cut-off of p < 0.05, what we are saying is that, if there were no real change, we would have seen a gene expression change that deviates this far from zero less than 5% of the time.
  • We have just performed 3924 t tests for significance. Another way to state what we are seeing with p < 0.05 is that we would expect to see this magnitude of a gene expression change by chance in about 5% of our t tests, or about 196 times (0.05 × 3924). If we have more than 196 genes that pass this cut-off, we know that some genes are significantly changed. However, we don't know which ones.
  • The "AvgLogFC_SigB-vs-WT" tells us the size of the gene expression change and in which direction. Positive values are increases relative to the control; negative values are decreases relative to the control.
    • Keeping the "Pvalue" filter at p < 0.05, filter the "AvgLogFC_SigB-vs-WT" column to show all genes with an average log fold change greater than zero. How many are there?
      • 405 genes.
    • Keeping the "Pvalue" filter at p < 0.05, filter the "AvgLogFC_SigB-vs-WT" column to show all genes with an average log fold change less than zero. How many are there?
      • 250 genes.
    • What about an average log fold change of > 0.25 and p < 0.05?
      • 319 genes.
    • Or an average log fold change of < -0.25 and p < 0.05? (These are more realistic values for the fold change cut-offs because they represent about a 20% change in expression, which is about the level of detection of this technology.)
      • 181 genes.
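
The spreadsheet filters above can also be sketched in code. This is only an illustration, assuming each gene is a dictionary keyed by the column names used in the spreadsheet; the function names and the four example rows are made up and are not part of the data set:

```python
def count_genes(rows, p_cutoff, min_logfc=None, max_logfc=None):
    """Count genes with p < p_cutoff and, optionally, an average log fold
    change strictly above min_logfc or strictly below max_logfc."""
    count = 0
    for row in rows:
        p = row["Pval_SigB-vs-wt"]
        logfc = row["AvgLogFC_SigB-vs-WT"]
        if p >= p_cutoff:
            continue
        if min_logfc is not None and not logfc > min_logfc:
            continue
        if max_logfc is not None and not logfc < max_logfc:
            continue
        count += 1
    return count

def expected_by_chance(n_tests, p_cutoff):
    """Number of tests expected to pass the cut-off if no gene truly changed."""
    return n_tests * p_cutoff

# Made-up rows, for illustration only
rows = [
    {"Pval_SigB-vs-wt": 0.01, "AvgLogFC_SigB-vs-WT": 0.60},
    {"Pval_SigB-vs-wt": 0.03, "AvgLogFC_SigB-vs-WT": -0.40},
    {"Pval_SigB-vs-wt": 0.04, "AvgLogFC_SigB-vs-WT": 0.10},
    {"Pval_SigB-vs-wt": 0.20, "AvgLogFC_SigB-vs-WT": 1.20},
]
n_sig = count_genes(rows, 0.05)
n_up = count_genes(rows, 0.05, min_logfc=0.25)
n_down = count_genes(rows, 0.05, max_logfc=-0.25)
chance = expected_by_chance(3924, 0.05)
```

Comparing the observed count at a cut-off with `expected_by_chance` for the same cut-off is the comparison the sanity check asks for.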
P-Value Cutoff | # Genes | % Genes
p < 0.05 | 655 | 16.69%
p < 0.01 | 252 | 6.42%
p < 0.001 | 70 | 1.78%
B-H p < 0.05 | 56 | 1.43%
Bon p < 0.05 | 2 | 0.051%
  • Table 1: The number of genes at various p value cut-offs. At each unadjusted cut-off, we want the percentage of genes that pass to be larger than the cut-off itself, because that is the percentage expected by chance alone. For p < 0.05, 0.01, and 0.001, the percentage of genes is, in fact, larger, indicating that many of those genes are significant. The B-H and Bonferroni corrected p values are far stricter: only 56 genes (1.43%) have a B-H p value below 0.05, and only 2 of the 3,924 genes (0.051%) have a Bonferroni p value below 0.05.
  • In summary, the p value cut-off should not be thought of as some magical number at which data becomes "significant". Instead, it is a movable confidence level. If we want to be very confident of our data, use a small p value cut-off. If we are OK with being less confident about a gene expression change and want to include more genes in our analysis, we can use a larger p value cut-off. For the GenMAPP analysis below, we will use the fold change cut-off of greater than 0.25 or less than -0.25 and the p value cut off of p < 0.05 for our analysis because we want to include several hundred genes in our analysis.
  • What criteria did your paper use to determine a significant gene expression change? How does it compare to our method?
      • The paper used a false discovery rate of less than 0.02 and a meta fold change of at least 1.8 to determine whether a gene expression change was significant. This is similar in structure to our method (a significance cut-off combined with a fold change cut-off), but stricter. Their fold change cut-off is a regular fold change of 1.8 rather than a log fold change of 0.25. With the log (base 2) taken, 1.8 corresponds to a log fold change of about 0.85, still higher than our cut-off. They also use a false discovery rate instead of an unadjusted p value. The false discovery rate accounts for multiple testing by estimating the proportion of false positives among the genes called significant, so it is a stricter measure than the unadjusted p value. The maximum false discovery rate they chose was 2%, which is also lower than our p value cut-off of 0.05. The false discovery rate was calculated by the Significance Analysis of Microarrays (SAM) program (Totallab, University of Pennsylvania).
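
To make the corrected p values in Table 1 more concrete, here is a minimal sketch of the textbook Bonferroni and Benjamini-Hochberg (B-H) adjustments, plus the log2 conversion of the paper's fold change cut-off. It assumes the standard definitions; the Week 13 spreadsheet may compute them slightly differently, and the false discovery rate that SAM reports in the paper is estimated by a different procedure. The p values in the example are made up:

```python
import math

def bonferroni(pvalues):
    """Multiply each p value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

def benjamini_hochberg(pvalues):
    """B-H adjusted p values: p * n / rank, made monotone from the largest
    p value down so that a smaller p never gets a larger adjusted value."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p values, for illustration only
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
bon = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

# The paper's fold change cut-off of 1.8 on a log2 scale
log_cutoff = math.log2(1.8)
```

Bonferroni is the stricter of the two corrections, which is consistent with far fewer genes passing the Bonferroni cut-off than the B-H cut-off in Table 1.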

Sanity Check: Compare individual genes with known data

  • Look in your paper for genes that are specifically mentioned. What are their fold changes and p values in the paper? Are they significantly changed in your analysis?
  • The genes below are only those mentioned in the discussion of gene expression changes during SDS stress. Genes mentioned with regard to oxidative or diamide stress are not included. The p values and fold changes are not given in the text of the paper. To illustrate the difference in fold changes between the paper and my own analysis, my fold changes are also given.
  • The information below comes from the supplemental materials located here, specifically from table 2S.
Gene Name | Paper Fold Change | Paper Log Fold Change | My Fold Change | Significant in my analysis?
RV0465C | 0.43 | -1.23 | -2.54 | No, p = 0.2199
ideR (RV2711) | 0.27 | -1.87 | 0.21 | Yes, p = 0.0138
mbtH (Rv2377C) | 2.88 | 1.53 | 1.46 | No, p = 0.0596
mbtF (Rv2379C) | 2.14 | 1.09 | 1.42 | No, p = 0.3792
mbtE (Rv2380C) | 1.82 | 0.86 | -1.16 | No, p = 0.3634
mbtD (Rv2381C) | 2.48 | 1.31 | 0.91 | No, p = 0.8018
mbtB (Rv2383C) | 2.00 | 1.00 | 0.39 | No, p = 0.1409
  • Table 2: The p values and fold changes for the genes discussed in the paper. The large discrepancies are likely due to our having to guess which gene corresponded to each Reporter Identifier. Note that only a single gene (ideR) has a significant p value.
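
The "Paper Log Fold Change" column in Table 2 can be re-derived from the "Paper Fold Change" column. A short sketch, assuming the log is base 2 (which matches the table values to within rounding); the dictionary simply copies the fold changes from Table 2:

```python
import math

# Paper fold changes copied from Table 2
paper_fold_change = {
    "RV0465C": 0.43,
    "ideR (RV2711)": 0.27,
    "mbtH (Rv2377C)": 2.88,
    "mbtF (Rv2379C)": 2.14,
    "mbtE (Rv2380C)": 1.82,
    "mbtD (Rv2381C)": 2.48,
    "mbtB (Rv2383C)": 2.00,
}

# log2 of each fold change: negative means decreased, positive means increased
paper_log_fold_change = {
    gene: math.log2(fc) for gene, fc in paper_fold_change.items()
}

for gene, log_fc in paper_log_fold_change.items():
    print(f"{gene}: {log_fc:.2f}")
```

Values recomputed this way can differ from the table in the second decimal place, presumably because the fold changes listed here are already rounded to two decimals.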

Complete Microarray Data Analysis

  • First, I launched GenMAPP. I selected Data > Choose Gene Database, and selected the M. tuberculosis database file. I selected "Expression Dataset Manager" and loaded the txt file containing the forGenMAPP_norm_KD sheet.
    • GenMAPP detected 45 errors during conversion.
  • I created a new Color Set named "SigB_vs_WT_SDS" and used the Gene Value of "AvgLogFC_SigB-vs-WT". I created four criteria, each one having a different color.
    1. The first, IncreasedBH, had the following criteria: [AvgLogFC_SigB-vs-WT] > 0.25 AND [B-H_Pval_SigB-vs-WT] < 0.05
    2. The second, DecreasedBH, had the following criteria: [AvgLogFC_SigB-vs-WT] < -0.25 AND [B-H_Pval_SigB-vs-WT] < 0.05
    3. The third, Increased, had the following criteria: [AvgLogFC_SigB-vs-WT] > 0.25 AND [Pval_SigB-vs-wt] < 0.05
    4. The fourth, Decreased, had the following criteria: [AvgLogFC_SigB-vs-WT] < -0.25 AND [Pval_SigB-vs-wt] < 0.05
  • Afterward, I selected Tools > MAPPFinder, then "Calculate New Results", and then "Ok". After a time, the MAPP loaded.
  • After examining the results, I decided to use the sheet labeled "MAPPFinder_Results_Fontan_NA-Criterion2-GO.txt", which corresponded to the "Increased" criterion, and the sheet labeled "MAPPFinder_Results_Fontan_NA-Criterion3-GO.txt", which corresponded to the "Decreased" criterion. Both of these files are available in the "All Data" file at the top of the page.
  • The two txt files above were opened in Excel. Then, the following filters were applied:
    • Z score > 2
    • PermuteP < 0.05
    • Number changed ≥ 2
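
The four Color Set criteria and the three result filters above are simple boolean tests, which can be sketched as follows. This is an illustration only and not GenMAPP or MAPPFinder code; the function names are hypothetical, and the thresholds are the ones listed above:

```python
# Each criterion takes (average log fold change, p value, B-H p value)
CRITERIA = [
    ("IncreasedBH", lambda fc, p, bh: fc > 0.25 and bh < 0.05),
    ("DecreasedBH", lambda fc, p, bh: fc < -0.25 and bh < 0.05),
    ("Increased", lambda fc, p, bh: fc > 0.25 and p < 0.05),
    ("Decreased", lambda fc, p, bh: fc < -0.25 and p < 0.05),
]

def criteria_met(logfc, pval, bh_pval):
    """Return the names of every criterion a gene satisfies."""
    return [name for name, test in CRITERIA if test(logfc, pval, bh_pval)]

def keep_go_term(z_score, permute_p, number_changed):
    """Apply the three filters used on the MAPPFinder result files."""
    return z_score > 2 and permute_p < 0.05 and number_changed >= 2

# Made-up example values, for illustration only
up_strict = criteria_met(0.6, 0.001, 0.03)   # meets both increased criteria
down_only = criteria_met(-0.4, 0.02, 0.30)   # meets "Decreased" only
unchanged = criteria_met(0.1, 0.001, 0.01)   # fold change too small
kept = keep_go_term(2.5, 0.015, 10)
dropped = keep_go_term(2.5, 0.015, 1)
```

Because a B-H p value is never smaller than the unadjusted p value, any gene that meets a BH criterion also meets the matching unadjusted criterion.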
GO Term | # Changed | % Changed | P-value
pyrimidine nucleoside triphosphate metabolic process | 3 | 100% | 0.002
ADP binding | 2 | 100% | 0.004
dUTP diphosphatase activity | 2 | 100% | 0.007
vitamin biosynthetic process | 10 | 18.52% | 0.015
avoidance of defenses of other organism involved in symbiotic interaction | 6 | 22.22% | 0.02
water-soluble vitamin metabolic process | 9 | 18.75% | 0.025
GTP binding | 7 | 19.44% | 0.034
phosphoric diester hydrolase activity | 3 | 30% | 0.041
guanyl ribonucleotide binding | 7 | 18.92% | 0.041
cobalamin metabolic process | 4 | 25% | 0.049
  • Table 3: The MAPPFinder results for increased expression in 10 GO terms.
GO Term | # Changed | % Changed | P-value
nucleic acid binding | 32 | 7.62% | 0.001
transaminase activity | 5 | 20.83% | 0.002
ether hydrolase activity | 2 | 66.67% | 0.005
lipid glycosylation | 2 | 50% | 0.008
macromolecule metabolic process | 44 | 6.4% | 0.013
intracellular organelle | 9 | 11.11% | 0.014
RNA modification | 3 | 27.27% | 0.017
cellular response to oxygen levels | 2 | 33.33% | 0.025
pseudouridine synthase activity | 2 | 40% | 0.028
zinc ion binding | 7 | 10.61% | 0.037
  • Table 4: The MAPPFinder results for decreased expression in 10 GO terms.
  • Using the data, I sorted the genes from the Fontan Dataset by p-value and found the functions of the top 10 most significant genes.
Gene Name | P-value | Protein Name | Putative Protein Function
RV0820 | 0 | Phosphate import ATP-binding protein pstB 1 | Part of the ABC transporter complex pstSACB involved in phosphate import. Responsible for energy coupling to the transport system.
RV1770 | 0 | Conserved Protein | plasma membrane
RV0517 | 0 | Possible membrane acyltransferase | transferase activity, transferring acyl groups other than amino-acyl groups
RV0541C | 0 | Probable conserved integral membrane protein | growth
RV3269 | 0 | Conserved Protein | cell wall, plasma membrane
RV1708 | 0.0001 | Uncharacterized protein Rv1708/MT1749 | May play a role in septum formation.
RV2779C | 0.0001 | AsnC family transcriptional regulator | sequence-specific DNA binding transcription factor activity
RV2409C | 0.0001 | Conserved protein | plasma membrane
RV0066C | 0.0001 | Isocitrate dehydrogenase [NADP] | isocitrate dehydrogenase (NADP+) activity, protein homodimerization activity
RV3373 | 0.0001 | Enoyl-CoA hydratase | catalytic activity
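
Picking the most significant genes amounts to sorting by p value and keeping the first ten rows. A minimal sketch, assuming each gene is a dictionary with hypothetical "gene" and "p_value" keys; the four example rows reuse gene names and p values from the table above:

```python
def most_significant(genes, n=10):
    """Return the n genes with the smallest p values.
    sorted() is stable, so genes with tied p values keep their input order."""
    return sorted(genes, key=lambda g: g["p_value"])[:n]

# A few rows from the table above, deliberately out of order
genes = [
    {"gene": "RV1708", "p_value": 0.0001},
    {"gene": "RV0820", "p_value": 0.0},
    {"gene": "RV3373", "p_value": 0.0001},
    {"gene": "RV1770", "p_value": 0.0},
]
top = most_significant(genes, n=3)
top_names = [g["gene"] for g in top]
```

Several genes share the same p value at this precision, so the ranking within a tie depends only on the order of the rows in the input.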


Nicole Anguiano
BIOL 368, Fall 2014
