Andrew Forney Week 8
Author: Andrew Forney
Assignment: Individual Journal 8
Electronic Lab Notebook
Normalization of Log Ratios
This portion was fairly straightforward and mostly a matter of rote repetition. Being a fan of algorithms, and therefore of efficiency, I found that the fastest method was to set up the appropriate columns first (insert, then title). Because the titles were so similar, the pattern buffer (clipboard) could be used effectively: only a few changes were needed between the many copies and pastes.
Once the columns were set, copying and pasting the pseudo-dynamic formula went quickly, because the formulas were syntactically so similar that each copy needed only small changes.
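The normalization formula itself is not recorded in this entry. As a rough sketch only, assuming the step was the usual "scale and center" operation (subtract each column's mean, then divide by its sample standard deviation), the per-column calculation might look like this in Python. The function name and the sample values are made up for illustration.

```python
from statistics import mean, stdev

def scale_and_center(log_ratios):
    """Center a column of log ratios on 0 and scale it to unit standard deviation.

    ASSUMPTION: the notebook does not state the formula; this is the common
    "scale and center" normalization, using the sample standard deviation
    (the same quantity Excel's STDEV returns).
    """
    mu = mean(log_ratios)
    sigma = stdev(log_ratios)
    return [(x - mu) / sigma for x in log_ratios]

# Hypothetical log ratios for one sample column (not real data)
column = [0.5, -1.0, 1.5, 0.0, -1.0]
normalized = scale_and_center(column)
```

After this step the column has a mean of 0 and a standard deviation of 1, which is what makes separate samples comparable.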
The statistical analysis had a couple of rough patches but was still pretty straightforward. The first hiccup was the instruction: "Go to a new column on the right of your worksheet. Type the header "Avg_LogFC_A", "Avg_LogFC_B", and "Avg_LogFC_C" into the top cell of the next three columns." I was unsure what "on the right of your worksheet" really meant: the right edge of the part of the worksheet currently on screen, or just the right-most unoccupied column? I read a little further and decided that, since each Avg_LogFC column corresponded to its own group of four individual samples, I would place each one next to the samples it summarized. This seemed the most logical choice because it mimicked the format of the first part of the instructions, but I then realized that the remainder of the statistical analysis section assumed that the three Avg_LogFC columns were directly adjacent to one another. As such, I needed to adapt the cell references in a few of the formulas to fit my layout. In the end the results were the same, so the difference should have made no change to the "forGenMapp" sheet.
The second minor hiccup was the formula "=TDIST(ABS(R2),degrees of freedom,2)", which gave me a #NAME error when I tried to enter it as written. I looked up this error in the help documentation and facepalmed when I saw that "degrees of freedom" was a placeholder that should be replaced by an integer value (in our case, 2); Excel had been trying to read the phrase as a name it did not recognize.
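For reference, TDIST(x, df, 2) returns the two-tailed probability of Student's t distribution, and it requires a non-negative x, which is why the formula wraps the T statistic in ABS. As a sketch only, for the special case of 2 degrees of freedom (the value used above) the t distribution has a simple closed form, so the same quantity can be computed without a statistics library. The function name below is hypothetical, and the closed form does not apply to other degrees of freedom.

```python
import math

def tdist_two_tailed_df2(t_stat):
    """Two-tailed p-value of Student's t distribution with exactly 2 degrees
    of freedom, mirroring Excel's =TDIST(ABS(t), 2, 2).

    Uses the closed-form CDF for df = 2: F(t) = 1/2 + t / (2 * sqrt(2 + t^2)),
    so the two-tailed p-value is 1 - |t| / sqrt(2 + t^2).
    This shortcut is valid ONLY for df = 2.
    """
    t = abs(t_stat)  # same role as ABS() in the spreadsheet formula
    return 1.0 - t / math.sqrt(2.0 + t * t)

# A T statistic of about 4.303 sits right at the p = 0.05 boundary for df = 2
p_boundary = tdist_two_tailed_df2(4.302653)
```

With so few degrees of freedom the T statistic has to be quite large (about 4.3 in absolute value) before p drops below 0.05.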
As far as tips and tricks went for this section, I discovered that the copy-paste process could be sped up by CTRL-selecting columns that were not necessarily adjacent, which then allowed me to paste them in their proper order where needed. Additionally, I had always known that "Paste Special..." existed but had never really used it. This exercise, pasting the values of the copied cells over the formulas that produced them, gave me the gist of its purpose. Other than these two tricks, plus those I noted for the normalization step, there wasn't a whole lot to say about the process. The instructions were easy to follow aside from the couple of small issues I initially had, and I'm confident that my end result is accurate.
Sanity Check: Significant Differences
- How many genes have p value < 0.05? The filter result found 948 records.
- What about p < 0.01? The filter result found 235 records.
- What about p < 0.001? The filter result found 24 records.
- What about p < 0.0001? The filter result found 2 records!
- Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change greater than zero. How many are there? The filter result found 352 records.
- Keeping the "Pvalue" filter at p < 0.05, filter the "Avg_LogFC_all" column to show all genes with an average log fold change less than zero. How many are there? The filter result found 596 records.
- What about an average log fold change of > 0.25 or < -0.25? (This is a more realistic value for the fold change cut-off because it represents about a 20% fold change which is about the level of detection of this technology.) The filter result found 918 records.
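The filtering steps above can be sketched outside of Excel as well. The rows below are made-up placeholders, not values from the worksheet; only the column names "Avg_LogFC_all" and "Pvalue" come from the exercise, and the helper name is hypothetical.

```python
# Hypothetical rows: (gene ID, Avg_LogFC_all, Pvalue). Values are invented
# for illustration and do not come from the actual worksheet.
rows = [
    ("geneA", 0.80, 0.010),
    ("geneB", -0.40, 0.030),
    ("geneC", 0.10, 0.040),
    ("geneD", -1.20, 0.200),
    ("geneE", -0.30, 0.004),
]

def count_where(records, predicate):
    """Count the records for which predicate(record) is true."""
    return sum(1 for record in records if predicate(record))

# Keep the "Pvalue" filter at p < 0.05, as in the sanity check
significant = [r for r in rows if r[2] < 0.05]

n_significant = len(significant)
n_up = count_where(significant, lambda r: r[1] > 0)      # log fold change > 0
n_down = count_where(significant, lambda r: r[1] < 0)    # log fold change < 0
n_cutoff = count_where(significant, lambda r: abs(r[1]) > 0.25)  # > 0.25 or < -0.25
```

As long as no gene has a log fold change of exactly zero, the up and down counts add up to the p < 0.05 count, which matches the worksheet results above: 352 + 596 = 948.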
Merrell et al (2002) Comparison
The methods used by Merrell et al (2002) to analyze the microarray data are explained by the following excerpt from the article:
"Data were quantified, normalized and corrected to yield the relative transcript abundance of each gene as an intensity ratio with respect to that of the reference signal. These intensity ratios were then used to identify statistically significant differences in gene expression using the Statistical Analysis for Microarrays (SAM) program6. A two-class SAM analysis was conducted using the strain grown in vitro as class I, and each individual stool sample as class II. Genes with statistically significant changes in the level of expression—at least a twofold change—in each patient sample were chosen, and the derived data from individual stool samples were collapsed to identify genes that were differentially regulated in all three samples. According to these criteria, 237 genes were differentially regulated: 44 genes were induced and 193 genes were repressed in human-shed V. cholerae..." (Merrell et al, 2002).
As per this report, their methods were similar to ours (assuming that Excel and SAM give comparable results) except for one key point: they decided that a statistically significant change in the level of expression would have to be "at least a twofold change." While our filter kept the values x in Avg_LogFC_all such that x > 0.25 or x < -0.25, the criterion used by Merrell et al corresponds to a more stringent cut-off: x > 1 or x < -1.
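To make the comparison of the two cut-offs concrete, here is a small sketch that converts a log fold change back into a plain fold change. It assumes the log ratios are base 2, which is consistent with 0.25 corresponding to roughly a 20% change; the function name is hypothetical.

```python
def fold_change(log2_ratio):
    """Convert a log fold change to a plain fold change.

    ASSUMPTION: the log ratios are base 2.
    """
    return 2 ** log2_ratio

our_cutoff = fold_change(0.25)      # about 1.19, i.e. roughly a 20% increase
merrell_cutoff = fold_change(1.0)   # exactly 2, i.e. a twofold increase

# The negative cut-offs are the reciprocals: -1 corresponds to a halving
merrell_cutoff_down = fold_change(-1.0)  # 0.5
```

Seen this way, a cut-off of 1 on the log scale is exactly the "at least a twofold change" of Merrell et al, while 0.25 admits much smaller changes.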
Sanity Check: Individual Gene Comparisons
|ID||Log Fold Change||P-Value||Significantly Different?|
|VC0028||1.27||0.0692||No, p > 0.05|
|VC0941||-0.28||0.1636||No, p > 0.05|
|VC0869||1.59||0.0463||Yes, p < 0.05|
|VC0051||1.89||0.0160||Yes, p < 0.05|
|VC0647||-1.05||0.0051||Yes, p < 0.05|
|VC0468||-0.17||0.3350||No, p > 0.05|
|VC2350||-2.40||0.0130||Yes, p < 0.05|
|VCA0583||1.06||0.1011||No, p > 0.05|
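The "Significantly Different?" column follows mechanically from the p-values, so it can be double-checked with a few lines of code. The gene IDs, fold change values, and p-values below are copied from the table above; the variable names are only for illustration.

```python
# (fold change value, p-value) for each gene, copied from the table above
genes = {
    "VC0028": (1.27, 0.0692),
    "VC0941": (-0.28, 0.1636),
    "VC0869": (1.59, 0.0463),
    "VC0051": (1.89, 0.0160),
    "VC0647": (-1.05, 0.0051),
    "VC0468": (-0.17, 0.3350),
    "VC2350": (-2.40, 0.0130),
    "VCA0583": (1.06, 0.1011),
}

ALPHA = 0.05  # significance threshold used in the table

# Genes whose p-value falls below the threshold, in table order
significant = [gene_id for gene_id, (_, p) in genes.items() if p < ALPHA]

for gene_id, (change, p) in genes.items():
    verdict = "Yes, p < 0.05" if p < ALPHA else "No, p > 0.05"
    print(f"{gene_id}\t{change:+.2f}\t{p:.4f}\t{verdict}")
```

Four of the eight genes (VC0869, VC0051, VC0647, and VC2350) come out as significantly different, in agreement with the table.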