User:Timothee Flutre/Notebook/Postdoc/2011/06/28: Difference between revisions
(→Linear regression by ordinary least squares: improve notation) |
|||
Line 73: | Line 73: | ||
We can also benefit from this for the standard error of the parameters: | We can also benefit from this for the standard error of the parameters: | ||
<math>V(\hat{B}) = \hat{\sigma}^2 (X^T X)^{-1}</math> | <math>V(\hat{B}) = (X^T X)^{-1} X^T V(\vec{y}) X (X^T X)^{-1} = \hat{\sigma}^2 (X^T X)^{-1}</math> | ||
<math>V(\hat{B}) = \hat{\sigma}^2 \frac{1}{N \vec{g}^T \vec{g} - N^2 \bar{g}^2} | <math>V(\hat{B}) = \hat{\sigma}^2 \frac{1}{N \vec{g}^T \vec{g} - N^2 \bar{g}^2} |
Revision as of 09:38, 25 November 2013
Project name | <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> |
Linear regression by ordinary least squares
[math]\displaystyle{ \forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta g_i + \epsilon_i \text{ with } \epsilon_i \sim \mathcal{N}(0,\sigma^2) }[/math] In vector-matrix notation: [math]\displaystyle{ \vec{y} = X B + \vec{e} }[/math] with [math]\displaystyle{ \vec{e} \sim \mathcal{N}_N(0,\sigma^2 I_N) }[/math] and [math]\displaystyle{ B^T = [\mu \beta] }[/math] The parameters of the model are: [math]\displaystyle{ \Theta = \{\mu, \beta, \sigma\} }[/math]
The well-known ordinary-least-square (OLS) estimator of [math]\displaystyle{ B }[/math] is: [math]\displaystyle{ \hat{B} = (X^T X)^{-1} X^T \vec{y} }[/math] [math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \left( \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} 1 & g_1 \\ \vdots & \vdots \\ 1 & g_N \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} }[/math] [math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} N & \sum_i g_i \\ \sum_i g_i & \sum_i g_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix} }[/math] [math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2} \begin{bmatrix} \sum_i g_i^2 & - \sum_i g_i \\ - \sum_i g_i & N \end{bmatrix} \begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix} }[/math] [math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2} \begin{bmatrix} \sum_i g_i^2 \sum_i y_i - \sum_i g_i \sum_i g_i y_i \\ - \sum_i g_i \sum_i y_i + N \sum_i g_i y_i \end{bmatrix} }[/math] Let's now define 4 summary statistics, very easy to compute: [math]\displaystyle{ \bar{y} = \frac{1}{N} \sum_{i=1}^N y_i }[/math] [math]\displaystyle{ \bar{g} = \frac{1}{N} \sum_{i=1}^N g_i }[/math] [math]\displaystyle{ \vec{g}^T \vec{g} = \sum_{i=1}^N g_i^2 }[/math] [math]\displaystyle{ \vec{g}^T \vec{y} = \sum_{i=1}^N g_i y_i }[/math] This allows to obtain the estimate of the effect size only by having the summary statistics available: [math]\displaystyle{ \hat{\beta} = \frac{\vec{g}^T \vec{y} - N \bar{g} \bar{y}}{\vec{g}^T \vec{g} - N \bar{g}^2} }[/math] By the way, in this case (i.e. simple linear regression, a single predictor), it's easy to see that: [math]\displaystyle{ \hat{\beta} = \frac{Cov[\vec{y},\vec{g}]}{Var[\vec{g}]} }[/math] The same works for the estimate of the standard deviation of the errors: [math]\displaystyle{ \hat{\sigma}^2 = \frac{1}{N-r}(\vec{y} - X\hat{B})^T(\vec{y} - X\hat{B}) }[/math] We can also benefit from this for the standard error of the parameters: [math]\displaystyle{ V(\hat{B}) = (X^T X)^{-1} X^T V(\vec{y}) X (X^T X)^{-1} = \hat{\sigma}^2 (X^T X)^{-1} }[/math] [math]\displaystyle{ V(\hat{B}) = \hat{\sigma}^2 \frac{1}{N \vec{g}^T \vec{g} - N^2 \bar{g}^2} \begin{bmatrix} \vec{g}^T\vec{g} & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix} }[/math] [math]\displaystyle{ V(\hat{\beta}) = \frac{\hat{\sigma}^2}{\vec{g}^T\vec{g} - N\bar{g}^2} }[/math] which corresponds to: [math]\displaystyle{ V(\hat{\beta}) = \frac{1}{N} \frac{Var[\vec{e}]}{Var[\vec{g}]} }[/math]
[math]\displaystyle{ V(y) = V(\mu + \beta g + \epsilon) = V(\mu) + V(\beta g) + V(\epsilon) = \beta^2 V(g) + \sigma^2 }[/math] The most intuitive way to simulate data is therefore to fix the proportion of variance in [math]\displaystyle{ y }[/math] explained by the genotype, for instance [math]\displaystyle{ PVE=60% }[/math], as well as the standard deviation of the errors, typically [math]\displaystyle{ \sigma=1 }[/math]. From this, we can calculate the corresponding effect size [math]\displaystyle{ \beta }[/math] of the genotype: [math]\displaystyle{ PVE = \frac{V(\beta g)}{V(y)} }[/math] Therefore: [math]\displaystyle{ \beta = \pm \sigma \sqrt{\frac{PVE}{(1 - PVE) * V(g)}} }[/math] Note that [math]\displaystyle{ g }[/math] is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency [math]\displaystyle{ f }[/math] (eg. [math]\displaystyle{ f=0.3 }[/math]) and we will assume Hardy-Weinberg equilibrium. Then [math]\displaystyle{ g }[/math] is distributed according to a binomial distribution with 2 trials for which the probability of success is [math]\displaystyle{ f }[/math]. As a consequence, its variance is [math]\displaystyle{ V(g)=2f(1-f) }[/math]. Here is some R code implementing all this: set.seed(1859) N <- 100 # sample size mu <- 4 pve <- 0.6 sigma <- 1 maf <- 0.3 # minor allele frequency beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.89 g <- rbinom(n=N, size=2, prob=maf) # assuming Hardy-Weinberg equilibrium y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ g) summary(ols) # muhat=4.1+-0.13, betahat=1.6+-0.16, R2=0.49 sqrt((1/(N-2) * sum(ols$residuals^2))) # sigmahat=0.99 plot(x=0, type="n", xlim=range(g), ylim=range(y), xlab="genotypes (allele counts)", ylab="phenotypes", main="Simple linear regression") for(i in unique(g)) points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
As above, we want [math]\displaystyle{ \hat{B} }[/math], [math]\displaystyle{ \hat{\sigma} }[/math] and [math]\displaystyle{ V(\hat{B}) }[/math]. To efficiently get them, we start with the singular value decomposition of X: [math]\displaystyle{ X = U D V^T }[/math] This allows us to get the Moore-Penrose pseudoinverse matrix of X: [math]\displaystyle{ X^+ = (X^TX)^{-1}X^T = V D^{-1} U^T }[/math] From this, we get the OLS estimate of the effect sizes: [math]\displaystyle{ \hat{B} = X^+ \vec{y} }[/math] Then it's straightforward to get the residuals: [math]\displaystyle{ \hat{\vec{e}} = \vec{y} - X \hat{B} }[/math] With them we can calculate the estimate of the error variance: [math]\displaystyle{ \hat{\sigma}^2 = \frac{1}{N-3} \hat{\vec{e}}^T \hat{\vec{e}} }[/math] And finally the standard errors of the estimates of the effect sizes: [math]\displaystyle{ V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T }[/math] We can check this with some R code: ## simulate the data set.seed(1859) N <- 100 mu <- 4 pve.g <- 0.4 # genotype pve.c <- 0.2 # other covariate, eg. gender sigma <- 1 maf <- 0.3 sex.ratio <- 0.5 beta.g <- sigma * sqrt((1 / (2 * maf * (1 - maf))) * (pve.g / (1 - pve.g - pve.c))) # 1.543 beta.c <- beta.g * sqrt((pve.c / pve.g) * (2 * maf * (1 - maf) / sex.ratio * (1 - sex.ratio))) # 0.707 x.g <- rbinom(n=N, size=2, prob=maf) x.c <- rbinom(n=N, size=1, prob=sex.ratio) y <- mu + beta.g * x.g + beta.c * x.c + rnorm(n=N, mean=0, sd=sigma) ols <- lm(y ~ x.g + x.c) summary(ols) # muhat=3.9+-0.17, beta.g.hat=1.6+-0.17, beta.c.hat=0.58+-0.21, R2=0.51 sqrt((1/(N-3)) * sum(ols$residuals^2)) # sigma.hat = 1.058 ## perform the OLS analysis with the SVD of X X <- cbind(rep(1,N), x.g, x.c) Xp <- svd(x=X) B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y E.hat <- y - X %*% B.hat sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 1.058 var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v) sqrt(diag(var.theta.hat)) # 0.168 0.175 0.212 Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (here). |