20.109(S09): Microarray data analysis (Day8)
Sony PlayStation or Microsoft Xbox? Boxers or briefs? Coke or Pepsi? We all know that taste can’t be mandated, but then how do standards arise? Standards are a fundamental and required aspect of engineering. Without them, machines can’t talk to each other, hardware is difficult to repair, and profits disappear (try to estimate the recent earnings by Betamax). In many cases, standards are government mandated, e.g. the US public school curriculum, cell phone technology in Europe, internet protocols worldwide. On other occasions, external events or pressures influence standards. Sweet’N Low was essentially the only artificial sweetener on the market until the saccharin it contained was “shown” to cause cancer in lab rats. On rare occasions, standards arise through extreme behavior. In 1888, Thomas Edison wanted to demonstrate the superior safety of direct current (the technology his company marketed), so he publicly electrocuted dogs with 1000 volts of alternating current, the technology his competitor, Westinghouse, was marketing for use in homes.
How do standards arise when there is no traditional financial market for them? In the case of BioBricks, a repository of DNA 'parts' that perform specific functions, the Registry of Standard Biological Parts is relying on the goodwill of the community to contribute standard parts that conform to the Registry’s rules. The payoff isn’t market share of the biological parts market, but rather the establishment of a shared resource that is reliable, reusable and useful. Community compliance with standards for microarray experiments and data analysis is similarly driven. Despite disagreement within the scientific community about how to collect meaningful microarray data, a “Minimum Information About a Microarray Experiment” (MIAME) checklist has been generated and is largely adhered to. “Minimum information” means only that the microarray data can be examined and interpreted by others. That is not a high bar for publication standards, but it is difficult to achieve because the arrays themselves are provided by different commercial vendors who disclose different amounts of information about their arrays. Moreover, the effort required to annotate MIAME data is significant, and authors vary in their compliance.
Corroboration of published microarray data is further complicated by a lack of standards surrounding the data analysis itself. Processing the raw data mixes art and science. The algorithms used vary dramatically, and a single data set can appear compelling or noisy, depending on the analysis choices made by the investigator. For example, Cy3 and Cy5 are commonly used fluorescent probes, but other dyes can be used and may be processed with different background correction and normalization factors. Not surprisingly, experimental protocols make a difference too. Researchers who label indirectly may find different outcomes than researchers who perform the same experiment but directly incorporate fluorescent dyes into their RNA. Also worth noting is human error, since microarray experiments require many steps over several days. There are even stories of people scanning their slides backwards and consequently mis-identifying every spot on the array.
This lack of consensus should be both liberating for you today and also burdensome. You will have great freedom in how to analyze and interpret your data. Some initial steps are suggested but then you’re free to try different approaches that you are interested in and that make sense to you. You will need to carefully annotate and justify the choices you make, to allow others to understand and critique your approach. Good luck and have fun!
Here is a rough outline of the steps you can take to examine your microarray data. There are many variations on this that are acceptable and that may be more interesting or appropriate for you. You should explore the data as you see fit.
- Open your TXT file as a tab-delimited file in Excel.
- Delete the top 9 rows.
- Label a new worksheet for working with your data.
- Copy the columns for: GeneName, SystematicName, Description, gMedianSignal, rMedianSignal, gBGMedianSignal, rBGMedianSignal (and/or Mean Signals, if desired), gBGSDUsed, rBGSDUsed (these last two are for the optional Rosetta analysis).
- Format the numerical cells as numbers with no decimal place.
- Consider median and/or mean intensities and background, and correct as you see fit. Be sure you keep track of your analytical decisions in your notebook or in the XLS file.
- Consider the overall red/green ratios, and perform a normalization as you see fit.
- With your corrected data, you can now proceed with either taking intensity ratios, calculating the Rosetta X-scores, or both.
- For intensity ratios, just do what you did with the practice data last time: calculate red intensity/green intensity ratios and resulting log (base 2) values. For Rosetta scores, see below. Either way, the next step will be to sort your data according to either the ratios or the scores.
- Save your Excel file just before sorting your data. When sorting, it is essential that you select all the data in every column (click on the diamond in the corner), and sort by the appropriate column (either log2 ratios or Rosetta p-values). Otherwise you will be analyzing the wrong genes in the steps ahead!
- What do you see? Are the replicates of a specific gene in agreement? Are there particular genes you would expect NOT to change under any circumstances, and are they in fact expressed at equivalent levels in both samples? These are essentially questions about controls. Of course, you also want to see if the specific gene you targeted was down-regulated! When you are satisfied with your data manipulations and have a handle on your controls, proceed to investigating the genes that are differentially expressed.
- Ultimately, you should save your processed data as an XLS worksheet or workbook, and post it to today's Talk page. Your peers may want to compare their results to yours.
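For those who prefer a script to a spreadsheet, the outline above can be sketched in a few lines of Python. This is only one of many acceptable variations, not the required method. The column names come from the list above. The rest are assumptions made for illustration: the floor of 1.0 applied after background subtraction, the choice of a global median normalization, and the tiny made-up data set.

```python
import csv
import io
import math
import statistics


def load_rows(handle, skip_rows=9):
    """Read a tab-delimited export, discarding the leading header rows."""
    for _ in range(skip_rows):
        next(handle)
    return list(csv.DictReader(handle, delimiter="\t"))


def log2_ratios(rows, floor=1.0):
    """Background-subtract, normalize, and return rows sorted by log2(red/green)."""
    # Subtract the local background; the floor (an assumption) avoids
    # zero or negative intensities, which have no logarithm.
    red = [max(float(r["rMedianSignal"]) - float(r["rBGMedianSignal"]), floor)
           for r in rows]
    green = [max(float(r["gMedianSignal"]) - float(r["gBGMedianSignal"]), floor)
             for r in rows]
    # Global median normalization (an assumption): rescale so that the
    # median red/green ratio across the array equals 1.
    scale = statistics.median(r / g for r, g in zip(red, green))
    table = [
        {"SystematicName": row["SystematicName"],
         "log2_ratio": math.log2(r / (g * scale))}
        for row, r, g in zip(rows, red, green)
    ]
    # Sorting whole records keeps each gene attached to its own value.
    return sorted(table, key=lambda d: d["log2_ratio"])


# Made-up data: nine junk header lines, then a column header and three spots.
demo = "\n".join(
    ["scanner header line"] * 9
    + [
        "SystematicName\tgMedianSignal\trMedianSignal\tgBGMedianSignal\trBGMedianSignal",
        "geneA\t110\t210\t10\t10",
        "geneB\t110\t110\t10\t10",
        "geneC\t110\t60\t10\t10",
    ]
)
result = log2_ratios(load_rows(io.StringIO(demo)))
for entry in result:
    print(entry["SystematicName"], round(entry["log2_ratio"], 3))
```

Because each gene name and its ratio travel together in one record, the sort cannot separate them, which is the same concern behind selecting every column before sorting in Excel.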
Optional: Rosetta score calculation
- The Rosetta X-score is described here.
- In the equation, a2 and a1 are the red and green intensities (corrected for background and/or normalized of course).
- The values s2 and s1 are found in the rBGSDUsed and gBGSDUsed columns - the standard deviations of the background intensities.
- The value f will be 0.3 for our purposes, although it can be solved for iteratively if desired.
- Once you have calculated your X-scores, you must translate them into p-values, which you recall indicate statistical significance. In order to do this, you must first calculate the mean and standard deviation of all the X-scores. They should form an approximately normal distribution.
- Now use the NORMDIST function in Excel, giving it the X-score, the mean, and the standard deviation as the first three arguments and setting the final argument to TRUE so that it returns the cumulative probability. For outputs less than 0.5, the output of NORMDIST is the p-value directly. For example, an output of 0.04 has a p-value of 0.04, or 96% confidence. For outputs greater than 0.5, you must subtract them from 1. For example, an output of 0.97 has a p-value of 1-0.97 = 0.03, or 97% confidence.
- Because microarray data has so many data points, you should use a very strict p-value, such as 0.001, for determining which genes are up- and down-regulated.
- Sort your data according to p-values, following the same precautions described in the sorting step of the outline above. Now you can easily select the differentially regulated genes to analyze below.
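The optional calculation can also be sketched in Python. This sketch assumes the X-score takes the commonly cited Rosetta error-model form, X = (a2 - a1) / sqrt(s1^2 + s2^2 + f^2 (a1^2 + a2^2)); check it against the linked description before relying on it. The function names and the example scores are made up, and `NormalDist(...).cdf` stands in for Excel's NORMDIST with the final argument set to TRUE.

```python
import math
from statistics import NormalDist, mean, stdev


def x_score(a1, a2, s1, s2, f=0.3):
    """X-score for one spot (assumed form, see the note above).

    a1, a2: corrected green and red intensities
    s1, s2: background standard deviations (gBGSDUsed, rBGSDUsed)
    f:      fractional error term, 0.3 as suggested in the steps above
    """
    return (a2 - a1) / math.sqrt(s1**2 + s2**2 + f**2 * (a1**2 + a2**2))


def p_values(scores):
    """Turn X-scores into p-values using a normal fit to all the scores."""
    fitted = NormalDist(mean(scores), stdev(scores))
    result = []
    for x in scores:
        cumulative = fitted.cdf(x)
        # Below 0.5 the cumulative value is the p-value;
        # above 0.5 it is subtracted from 1.
        result.append(min(cumulative, 1 - cumulative))
    return result


# Made-up scores, symmetric around zero, for illustration only.
example_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
example_p = p_values(example_scores)
print([round(p, 3) for p in example_p])
```

Note that `stdev` is the sample standard deviation, which corresponds to Excel's STDEV rather than STDEVP; with thousands of spots the difference is negligible.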
Now that you’ve decided which genes you believe are differentially expressed between your two samples, you need some way of making sense of your data. In particular, you might like to see if some of the genes with apparent increased or decreased expression share common features.
As a first approach, you might manually list the ten genes most highly upregulated in your sample as compared to your control, then learn a little about each one using a gene ontology website such as this one. Do the ten genes come from a common family? What about the ten most downregulated genes?
Of course, you have data on tens of thousands of genes, and manual analysis is not the most practical way to mine your data. Instead, you can use freely available databases and programs to find groups of genes that are statistically over- or under-represented compared to their expected value. For example, say you are analyzing a pool of 15,500 genes, and 3600 of them (23%) are up-regulated in your microarray experiment. If 1100 genes exist in a particular family (such as developmental genes), then you would expect 23% of them, or ~250, to be up-regulated, assuming that the up-regulated genes are randomly distributed. If instead 700 of them are up-regulated, that family is over-represented. The steps for this analysis are outlined below.
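To make the arithmetic in this example concrete, here is a minimal Python sketch. The expected count follows directly from the numbers above. The hypergeometric upper tail is one common way to attach a p-value to over-representation; it is shown as an illustration and is not necessarily the exact test that any particular tool applies. The function names are hypothetical.

```python
from math import comb


def expected_hits(total, regulated, family):
    """Family members expected in the regulated list if membership is random."""
    return family * regulated / total


def hypergeom_upper_tail(total, regulated, family, observed):
    """Probability of seeing at least `observed` family members by chance.

    Models the regulated list as `regulated` genes drawn without
    replacement from `total` genes, `family` of which belong to the family.
    """
    denominator = comb(total, regulated)
    most_possible = min(family, regulated)
    numerator = sum(
        comb(family, k) * comb(total - family, regulated - k)
        for k in range(observed, most_possible + 1)
    )
    # Exact integer arithmetic, converted to a float only at the end.
    return numerator / denominator


expected = expected_hits(15500, 3600, 1100)
p_over = hypergeom_upper_tail(15500, 3600, 1100, 700)
print(f"expected about {expected:.0f} up-regulated family members")
print(f"chance of 700 or more: {p_over:.3g}")
```

The expected count comes out close to the ~250 quoted above, and 700 observed members lies so far beyond it that the chance probability is vanishingly small.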
- Go to GOstat. Note the appropriate way to cite this tool at the bottom of the page, then proceed to the search form.
- Choose the appropriate database for mouse, which is MGI. (You can click on Details to confirm this.)
- Choose a maximum p-value to display, such as 0.01.
- Change Cluster GOs from -1 to 0. This will display your results as clusters of related gene families.
- Note that if a specific gene family is over-represented among your up-regulated genes, such as the one consisting of genes for neuronal development, that alone might make the higher-level family of all developmental genes appear statistically over-represented as well.
- Change the display output to HTML, GO Stats only.
- Leave Multiple Testing as the default option. Each of these choices attempts to correct for the sheer number of statistical tests you are doing. Keep in mind that if you make 10,000 comparisons, a simple p = 0.05 cutoff would give about 500 positive results just by chance!
- In the first box, Group IDs, submit the systematic names of either your up-regulated or your down-regulated gene list. (Not both at the same time! The two lists must be submitted and analyzed separately.)
- For the second box, download and select the file 20109_mouse-array-genes.txt, which contains the full list of genes on your mouse microarray. Note that the MGI database may not be able to recognize all the genes, and that this also informs the statistical analysis.
- Finally, click Submit and wait for your results to appear.
- On the left-hand side, you should see GO groups listed in blue, followed by all the differentially expressed genes in that group. Further to the right, you see the number of genes expected to be differentially regulated for your sample and the number actually differentially regulated. There is an associated p-value with this over- or under-representation.
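The multiple-testing point in the steps above can be illustrated with a short Python sketch. It shows the expected number of chance positives and two standard corrections, Bonferroni and Benjamini-Hochberg. These are offered as general examples only; no claim is made here about which correction the tool applies by default. The function names and p-values are made up.

```python
def expected_false_positives(n_tests, alpha):
    """Positives expected by chance alone when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the chance of any false positive near alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests kept at false discovery rate alpha (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, index in enumerate(order, start=1):
        # Find the largest rank whose p-value is within its allowance.
        if p_values[index] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


chance_hits = expected_false_positives(10000, 0.05)
strict_cutoff = bonferroni_threshold(10000)
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
kept = benjamini_hochberg(example_p)
print(chance_hits, strict_cutoff, kept)
```

In the made-up list, five p-values fall below 0.05, but only the two smallest survive the Benjamini-Hochberg correction.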
Take your time studying this list, noting which p-values are the lowest, and which GO groups appear. Are any of these groups associated with systems that you would expect to be affected by the presence of an siRNA (if your control sample had no siRNA), or by the presence of a working versus scrambled siRNA (if that was your control)? If you compared your experimental siRNA to the validated siRNA, does one have fewer or less significant off-target effects?
For next time
The first draft of your lab report is due by 11 AM on the next day that you have lab.