User:Matthew Whiteside/Notebook/Malaria Microarray/2009/03/04
$$T = \frac{\bar{x} - \bar{y} - (\mu_1 - \mu_2)}{\sqrt{S_1^2/m + S_2^2/n}}$$
Assumptions (unpaired 2 sample): the random variables X and Y are independent.
Paired 2 sample
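Assumptions: the random variables X and Y are not independent. The two observations within a pair are dependent, but the pairs are independent of one another, so the differences D = X - Y are independent, and the test statistic is
$$T = \frac{\bar{d} - \Delta_0}{S / \sqrt{n}}$$
where S is the sample standard deviation of the differences and n is the number of pairs. This is the same as the one-sample test, with null hypothesis H0: μd = Δ0 (Δ0 is the hypothesized mean difference, usually zero).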
Paired differences have less variation (because the two observations in a pair are correlated), so the unpaired 2 sample test may miss real differences. However, the paired t-test has fewer degrees of freedom.
Guidelines: 1) if the paired units are very heterogeneous (they vary greatly from one pair to the next), the paired test is better suited. 2) if there is little correlation within pairs, the gain in precision from pairing is outweighed by the gain in df in the unpaired test, so the unpaired test is better suited.
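The trade-off in the guidelines can be seen on a small example. The sketch below computes both statistics with the Python standard library. The data are hypothetical (four very different subjects measured before and after a treatment), and the function names are made up for this illustration.

```python
import math
import statistics


def unpaired_t(x, y, delta0=0.0):
    """Unpaired T: (xbar - ybar - delta0) / sqrt(S1^2/m + S2^2/n)."""
    m, n = len(x), len(y)
    se = math.sqrt(statistics.variance(x) / m + statistics.variance(y) / n)
    return (statistics.mean(x) - statistics.mean(y) - delta0) / se


def paired_t(x, y, delta0=0.0):
    """Paired T: (dbar - delta0) / (S / sqrt(n)), computed on the differences."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return (statistics.mean(d) - delta0) / (statistics.stdev(d) / math.sqrt(n))


# Hypothetical data: subjects differ a lot from each other,
# but each subject changes only slightly after treatment.
before = [10.0, 20.0, 30.0, 40.0]
after = [11.0, 22.0, 31.0, 42.0]

t_unpaired = unpaired_t(after, before)
t_paired = paired_t(after, before)
print(f"unpaired T = {t_unpaired:.3f}, paired T = {t_paired:.3f}")
```

With heterogeneous subjects the between-subject variance swamps the unpaired statistic, while the paired statistic only sees the spread of the differences.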
Linear models based on categorical data
Each gene's expression (Y) is fitted with a linear model:
Y = β + ε
where ε is a random variable with E[ε] = 0 and V[ε] = σ².
When least-squares fitting is performed, β is the mean expression of the gene across independent biological replicates (samples). ε captures the random error of the array measurements around that mean (it has mean zero).
When 2 or more conditions are being tested, the gene's expression will be altered by a coefficient, representing that condition's effect. The conditions (categories) are modeled in the linear model using indicator variables:
Y = β0 + β1 × x1 + ε
where x1 = 1 for arrays testing condition 1, otherwise = 0.
Coefficient β1 captures the effect of condition 1 for that gene.
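As a quick check of what the coefficients mean, here is a minimal sketch of the least-squares fit for a single 0/1 indicator, using only the Python standard library. The expression values and the function name are hypothetical. This is the textbook simple-regression formula, not the lmFit code.

```python
import statistics


def fit_indicator_model(y, x1):
    """Least-squares fit of Y = b0 + b1 * x1 for one 0/1 indicator x1."""
    x_mean = statistics.mean(x1)
    y_mean = statistics.mean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x1, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x1)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical log-expression values for one gene on four arrays.
expression = [7.9, 8.1, 9.4, 9.6]
x1 = [0, 0, 1, 1]  # 1 = array tests condition 1, 0 = baseline

b0, b1 = fit_indicator_model(expression, x1)
baseline_mean = statistics.mean(expression[:2])
condition_mean = statistics.mean(expression[2:])
print(f"b0 = {b0:.2f}, baseline mean = {baseline_mean:.2f}")
print(f"b1 = {b1:.2f}, difference in means = {condition_mean - baseline_mean:.2f}")
```

The fitted β0 equals the baseline group mean and β1 equals the difference between the two group means.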
Note: robust regression is also available in the fitting function lmFit (method = "robust"). It gives less weight to outlier expression values from biological replicates when computing coefficients.
To test whether coefficient βi has a statistically significant effect on a particular gene's expression, a moderated T test is used.
The moderated T test is implemented in the eBayes function of LIMMA (a Bioconductor package).
The moderated T test borrows information from the variance estimates of all genes on the chip to moderate the variance estimate used for each individual gene.
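The borrowing of information can be sketched as a weighted average of the gene's own variance and a chip-wide prior variance. The code below is only an illustration of that shrinkage formula. In LIMMA the prior variance and prior degrees of freedom are estimated from all genes by eBayes, whereas here they are hard-coded hypothetical numbers, and the function name is made up.

```python
import math


def moderated_t(beta, s2_gene, df_gene, unscaled_var, s2_prior, df_prior):
    """Shrink the gene-wise variance toward a prior value, then form a T statistic."""
    # Weighted average of prior and gene-wise variance, weighted by degrees of freedom.
    s2_post = (df_prior * s2_prior + df_gene * s2_gene) / (df_prior + df_gene)
    t_mod = beta / math.sqrt(s2_post * unscaled_var)
    return t_mod, s2_post, df_prior + df_gene


# Hypothetical gene: two arrays per group, so the unscaled variance of the
# difference in means is 1/2 + 1/2 = 1 and the residual df is 4 - 2 = 2.
beta = 1.5           # estimated coefficient (difference in means)
s2_gene = 0.01       # gene-wise residual variance (suspiciously small)
df_gene = 2
unscaled_var = 1.0

# Hypothetical chip-wide prior (assumed values, not estimated here).
s2_prior = 0.05
df_prior = 4

t_ordinary = beta / math.sqrt(s2_gene * unscaled_var)
t_mod, s2_post, df_total = moderated_t(
    beta, s2_gene, df_gene, unscaled_var, s2_prior, df_prior
)
print(f"ordinary T = {t_ordinary:.2f} on {df_gene} df")
print(f"moderated T = {t_mod:.2f} on {df_total} df (variance {s2_post:.4f})")
```

A gene with an unusually small variance gets a less extreme statistic after moderation, and the test gains the prior degrees of freedom.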
Use the topTable and decideTests functions to extract genes with significant coefficients.
As input, the LIMMA lmFit function requires a design matrix. A design matrix is a matrix of indicator variables used to represent the categorical linear model (see above).
In the design matrix, each column represents a coefficient that will capture a particular condition. Each row represents one array (or sample). Call this the "treatment parameterization" method:
        (Intercept)  Condition1
Array1  1            0
Array2  1            0
Array3  1            1
Array4  1            1
This is just one example of many possible design matrices. If we assume single-channel affy data, in this example there are two conditions: Arrays 1 & 2 are the baseline and Arrays 3 & 4 are Condition1. Coefficient (Intercept) captures the effect of the baseline on gene expression. Coefficient Condition1 captures the effect of the modified variable. After fitting, if the coefficient Condition1 is statistically significant for an individual gene, then that gene's expression was altered in response to Condition1.
Note that technical replicates (e.g. dye-swaps) are often correlated and are not independent, so that correlation should be modeled (i.e. they should share a common coefficient capturing the effect of their common biological source). The function duplicateCorrelation can help with that.
If answering your hypothesis requires comparing two coefficients, then a contrast matrix is needed. For simple designs, the design matrix can often include a coefficient that directly represents the comparison of interest, and the hypothesis is then H0: β = Δ0 (Δ0 is the null value, usually zero). The design matrix above is an example of that. The following design gives an equivalent test, but requires a contrast matrix:
        Condition0  Condition1
Array1  1           0
Array2  1           0
Array3  0           1
Array4  0           1
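then a contrast matrix could be created using (where design is the design matrix above, with columns named Condition0 and Condition1):
contrast.m <- makeContrasts(C1vsC0 = Condition1 - Condition0, levels = design)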
then a call to contrasts.fit is needed to fit this new model (call this approach "contrasts").
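To see why the two approaches test the same thing, the sketch below fits one hypothetical gene with both design matrices and applies the contrast by hand. It uses plain Python and a small normal-equations solver written for this example (least_squares_2col is a made-up helper, limited to two-column designs), so it only mirrors the arithmetic, not the LIMMA functions.

```python
def least_squares_2col(design, y):
    """Solve the normal equations (X'X) b = X'y for a two-column design matrix."""
    a = sum(r[0] * r[0] for r in design)
    b = sum(r[0] * r[1] for r in design)
    d = sum(r[1] * r[1] for r in design)
    p = sum(r[0] * yi for r, yi in zip(design, y))
    q = sum(r[1] * yi for r, yi in zip(design, y))
    det = a * d - b * b
    # Inverse of the 2x2 matrix [[a, b], [b, d]] applied to (p, q).
    return ((d * p - b * q) / det, (a * q - b * p) / det)


# Hypothetical log-expression values for one gene on Array1..Array4.
expression = [7.9, 8.1, 9.4, 9.6]

# "Treatment parameterization": columns (Intercept), Condition1.
treatment_design = [[1, 0], [1, 0], [1, 1], [1, 1]]
intercept, condition1_effect = least_squares_2col(treatment_design, expression)

# Group-means design: columns Condition0, Condition1.
group_design = [[1, 0], [1, 0], [0, 1], [0, 1]]
mean0, mean1 = least_squares_2col(group_design, expression)

# Contrast Condition1 - Condition0, i.e. weights (-1, +1) on the two coefficients.
contrast_weights = (-1, 1)
contrast_estimate = contrast_weights[0] * mean0 + contrast_weights[1] * mean1

print(f"treatment parameterization: Condition1 = {condition1_effect:.2f}")
print(f"contrasts: Condition1 - Condition0 = {contrast_estimate:.2f}")
```

Both routes give the same estimate of the Condition1 effect. Only the way the comparison is specified differs.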
IMPORTANT NOTE: the significance of differential expression changes depending on which arrays are used in the fit (using more arrays gives more data for estimating variability and performing the moderated T test). E.g. I noticed this when I fitted only a subset of the arrays (those corresponding to the groups I wanted to compare) and used the "treatment parameterization" to compare the groups (see above). When I instead used all arrays in the fit and used "contrasts" to select the comparison of interest (see above), I obtained more DE genes (although I lost a few) than with the first approach. I will employ the second approach.
IMPORTANT NOTE2: in the Contrasts matrix, when wanting to obtain the difference between two conditions (R1,R2) when there exists a reference control, the comparison R1-R2 is equivalent to (R1-control) - (R2-control). These are mathematically equivalent and as expected, produce equivalent results in LIMMA.
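The equivalence can be checked by writing each comparison as a vector of weights on the coefficients. This is a small sketch in plain Python. The coefficient order and the helper names are assumptions made for the illustration.

```python
# Assumed coefficient order in a group-means design (hypothetical names).
coefficients = ["R1", "R2", "control"]


def contrast(**weights):
    """Build a contrast vector with the given weights, zero for unnamed coefficients."""
    return [weights.get(name, 0) for name in coefficients]


def subtract(u, v):
    """Element-wise difference of two contrast vectors."""
    return [a - b for a, b in zip(u, v)]


r1_vs_control = contrast(R1=1, control=-1)   # R1 - control
r2_vs_control = contrast(R2=1, control=-1)   # R2 - control

direct = contrast(R1=1, R2=-1)                        # R1 - R2
via_control = subtract(r1_vs_control, r2_vs_control)  # (R1 - control) - (R2 - control)

print("R1 - R2:                        ", direct)
print("(R1 - control) - (R2 - control):", via_control)
```

The control weight cancels, so both comparisons reduce to the same contrast vector and therefore the same test.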