User:Matthew Whiteside/Notebook/Malaria Microarray/2009/03/04
LIMMA notes

T tests

2 sample

$T = \dfrac{\bar{x} - \bar{y} - (\mu_{1} - \mu_{2})}{\sqrt{S_{1}^{2}/m + S_{2}^{2}/n}}$

Assumptions: the rv's X & Y are independent.

Paired 2 sample

Assumptions: the rv's X & Y are not independent. The 2 observations in a pair are dependent, but each pair is independent of the other pairs. So the rv D = X − Y is independent from pair to pair and

$T = \dfrac{\bar{d} - \Delta_{0}}{S_{d}/\sqrt{n}}$

where $S_{d}$ is the sample standard deviation of the n differences. This is similar to the one sample test, where the null hypothesis is $H_{0}: \mu_{d} = \Delta_{0}$ ($\Delta_{0}$ is the hypothesized mean difference, usually 0).

Considerations

Paired data have less variation (because the observations within a pair are correlated), so the 2 sample test may miss results, but the paired t test has fewer degrees of freedom. Guidelines:
1) If the paired observations are very heterogeneous (vary greatly from pair to pair), then the paired test is better suited.
2) If there is little correlation within pairs, the increase in precision from pairing is outweighed by the increase in df in the unpaired test, so the unpaired test is better suited.

LIMMA

Linear models based on categorical data

Each gene's expression (Y) is fitted with a linear model: $Y = \beta + \varepsilon$, where ε is a rv with E[ε] = 0 and V[ε] = σ^{2}. When least squares fitting is performed, β is the mean expression of the gene across independent biological replicates (samples). ε captures the random error of the measurement.

When 2 or more conditions are being tested, the gene's expression will be altered by a coefficient representing that condition's effect. The conditions (categories) are modeled in the linear model using indicator variables: $Y = \beta_{0} + \beta_{1} x_{1} + \varepsilon$, where x_{1} = 1 for arrays testing condition 1 and x_{1} = 0 otherwise. Coefficient β_{1} captures the effect of condition 1 for that gene.

Note: robust regression is also available in the fitting function lmFit. It gives less weight to outlier expression values from biological replicates when computing the coefficients.

Hypothesis Testing

To test if coefficient β_{i} has a statistically significant effect on a particular gene's expression, a modified T test is used. The modified T test in LIMMA is implemented in the eBayes function in Bioconductor.
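For comparison with the modified test, the indicator-variable model above can be fitted for a single gene in base R. This is only a sketch: the expression values and variable names are made up for illustration and are not data from this project. It shows that β_{0} is the baseline mean, β_{1} is the shift due to condition 1, and that the ordinary t statistic for β_{1} matches a pooled-variance 2 sample t test.

```r
# One gene measured on four arrays: two baseline, two under condition 1.
# The expression values are made up for illustration.
y  <- c(7.9, 8.1, 9.4, 9.6)
x1 <- c(0, 0, 1, 1)            # indicator: 1 for arrays testing condition 1

# Least-squares fit of Y = beta0 + beta1 * x1 + eps
fit  <- lm(y ~ x1)
beta <- coef(fit)

# beta0 should equal the baseline mean, beta1 the difference in group means.
baseline_mean   <- mean(y[x1 == 0])
condition_shift <- mean(y[x1 == 1]) - mean(y[x1 == 0])

# Ordinary (unmodified) t statistic for beta1: uses this gene's variance only.
t_beta1 <- summary(fit)$coefficients["x1", "t value"]

# The same statistic from a pooled-variance 2 sample t test.
t_two_sample <- unname(t.test(y[x1 == 1], y[x1 == 0],
                              var.equal = TRUE)$statistic)

print(beta)
print(c(lm = t_beta1, two_sample = t_two_sample))
```

With only two arrays per group, this per-gene variance estimate rests on 2 residual degrees of freedom, which is why a test that uses more information is attractive.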
The modified T test borrows information from the variation estimated across all genes on the chip to moderate the estimate of variation for each individual gene. Use the topTable and decideTests functions to extract genes with significant coefficients.

Design Matrices

As input the LIMMA lmFit function requires a design matrix. A design matrix is a matrix of indicator variables used to represent the categorical linear model (see above). In the design matrix, each column represents a coefficient that will capture a particular condition. Each row represents one array (or sample). Call this the "treatment parameterization" method:

          (Intercept)  Condition1
Array1    1            0
Array2    1            0
Array3    1            1
Array4    1            1

This is just one example of many possible design matrices. If we assume single-channel Affy data, in this example there are two conditions: Arrays 1 & 2 are the baseline and Arrays 3 & 4 are Condition1. Coefficient (Intercept) captures the effect of the baseline on gene expression. Coefficient Condition1 captures the effect of the modified variable. After fitting, if the coefficient Condition1 is statistically significant for an individual gene, then that gene's expression was altered in response to Condition1.

Note that technical replicates (e.g. dye swaps) are often correlated and are not independent, so that correlation should be modeled (i.e. they should have a common coefficient capturing the effect resulting from their common biological source). The function duplicateCorrelation can help with that.

If, to answer your hypothesis, you want to compare two coefficients, then a contrast matrix is needed. For simple designs, the design matrix can often include a coefficient that directly represents the condition of interest, and then the hypothesis is $H_{0}: \beta = 0$. The design matrix above is an example of that.
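The treatment parameterization above could be set up and tested in R roughly as follows. This is a minimal sketch on simulated data: the expression matrix, the gene names and the 10 spiked-in genes are assumptions made for illustration only. lmFit, eBayes, topTable and decideTests are the limma functions named in these notes.

```r
library(limma)  # Bioconductor package, assumed to be installed

set.seed(1)
n_genes <- 200

# Simulated log-expression matrix: rows are genes, columns are arrays.
expr <- matrix(rnorm(n_genes * 4, mean = 8, sd = 0.3), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("Array", 1:4)))

# Hypothetical effect: the first 10 genes are raised on the Condition1 arrays.
expr[1:10, 3:4] <- expr[1:10, 3:4] + 2

# Treatment parameterization: intercept plus an indicator for Condition1.
# Levels are given explicitly so that "baseline" is the reference level.
condition <- factor(c("baseline", "baseline", "Condition1", "Condition1"),
                    levels = c("baseline", "Condition1"))
design <- model.matrix(~ condition)
colnames(design) <- c("(Intercept)", "Condition1")
rownames(design) <- colnames(expr)
print(design)

fit <- lmFit(expr, design)   # one linear model per gene
fit <- eBayes(fit)           # modified (moderated) t statistics

# Genes ranked by evidence that the Condition1 coefficient is non-zero.
top <- topTable(fit, coef = "Condition1", number = 10)
print(top)

# Per-gene, per-coefficient calls: -1 (down), 0 (not significant), 1 (up).
calls <- decideTests(fit)
print(summary(calls))
```

Only the Condition1 column is of interest here; the (Intercept) column just tests whether baseline expression differs from zero.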
The following is an equivalent test, but requires a contrast matrix:

          Condition0  Condition1
Array1    1           0
Array2    1           0
Array3    0           1
Array4    0           1

Then a contrast matrix could be created using contrast.m <- makeContrasts(C1vsC0 = Condition1 - Condition0, levels = design), where design is the design matrix above. A call to contrasts.fit is then needed to fit this new model (call this approach "contrasts").

Design Matrices

IMPORTANT NOTE: the significance of differential expression changes depending on how many and which arrays are used in the fitting (more arrays give more residual degrees of freedom to estimate each gene's variability and perform the moderated T test). E.g. this was noticed when I used a subset of the arrays (those corresponding to the groups of interest that I wanted to compare), fit only that subset, and used the "treatment parameterization" to compare the groups (see above). When I instead used all arrays in the fitting and used "contrasts" to select the comparison of interest (see above), I obtained more DE genes (although I lost a few) compared to the other approach. I will employ the second approach.

IMPORTANT NOTE2: in the contrast matrix, when wanting to obtain the difference between two conditions (R1, R2) when there exists a reference control, the comparison R1 − R2 is equivalent to (R1 − control) − (R2 − control). These are mathematically equivalent because the control term cancels, and as expected they produce equivalent results in LIMMA.
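A sketch of the "contrasts" approach with a reference control, again on simulated data. The group labels control, R1 and R2 follow the note above; the array layout (three arrays per group) and the expression values are assumptions for illustration. All arrays are used in the fit, and the two ways of writing the R1 versus R2 comparison are checked against each other.

```r
library(limma)  # Bioconductor package, assumed to be installed

set.seed(2)

# Hypothetical layout: three arrays each for control, R1 and R2.
group <- factor(rep(c("control", "R1", "R2"), each = 3),
                levels = c("control", "R1", "R2"))
expr <- matrix(rnorm(100 * 9, mean = 8, sd = 0.3), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("Array", 1:9)))

# One coefficient per group (no intercept); comparisons come from contrasts.
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

fit <- lmFit(expr, design)   # the fit uses all nine arrays

# The same comparison written directly and via the reference control.
contrast.m <- makeContrasts(
  direct     = R1 - R2,
  viaControl = (R1 - control) - (R2 - control),
  levels = design
)
print(contrast.m)            # both columns put weight 0 on control

fit2 <- eBayes(contrasts.fit(fit, contrast.m))
top  <- topTable(fit2, coef = "direct", number = 5)
print(top)

# The modified t statistics agree for the two formulations.
print(all.equal(fit2$t[, "direct"], fit2$t[, "viaControl"]))
```

Because the control arrays still contribute to the per-gene variance estimate, dropping them from the fit would change the statistics even though their contrast weight is 0.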
