User:Timothee Flutre/Notebook/Postdoc/2011/06/28
==Linear regression by ordinary least squares==
* '''Data''': let's assume we obtained data from <math>N</math> individuals. We denote by <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. the expression level of a given gene), and by <math>g_1,\ldots,g_N</math> the genotypes at a given SNP, encoded as allele dose (0, 1 or 2). We want to assess their linear relationship.
* '''Model''': to start with, we use a simple linear regression (univariate phenotype, single predictor).

<math>\forall n \in \{1,\ldots,N\}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)</math>
In matrix notation:

<math>y = X \theta + \epsilon</math> with <math>\epsilon \sim N_N(0,\sigma^2 I_N)</math> and <math>\theta^T = (\mu, \beta)</math>

Here is the ordinary least squares (OLS) estimator of <math>\theta</math>:

<math>\hat{\theta} = (X^T X)^{-1} X^T y</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \left( \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} 1 & g_1 \\ \vdots & \vdots \\ 1 & g_N \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} N & \sum_n g_n \\ \sum_n g_n & \sum_n g_n^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2} \begin{bmatrix} \sum_n g_n^2 & - \sum_n g_n \\ - \sum_n g_n & N \end{bmatrix} \begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2} \begin{bmatrix} \sum_n g_n^2 \sum_n y_n - \sum_n g_n \sum_n g_n y_n \\ - \sum_n g_n \sum_n y_n + N \sum_n g_n y_n \end{bmatrix}</math>

Let's now define four summary statistics, all very easy to compute:

<math>\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n</math>

<math>\bar{g} = \frac{1}{N} \sum_{n=1}^N g_n</math>

<math>g^T g = \sum_{n=1}^N g_n^2</math>

<math>g^T y = \sum_{n=1}^N g_n y_n</math>

They suffice to obtain the estimate of the effect size:

<math>\hat{\beta} = \frac{g^T y - N \bar{g} \bar{y}}{g^T g - N \bar{g}^2}</math>

The same works for the estimate of the error variance, where <math>r=2</math> is the rank of <math>X</math>:

<math>\hat{\sigma}^2 = \frac{1}{N-r}(y - X\hat{\theta})^T(y - X\hat{\theta})</math>

And also for the standard errors of the parameter estimates:

<math>V(\hat{\theta}) = \hat{\sigma}^2 (X^T X)^{-1}</math>

<math>V(\hat{\theta}) = \hat{\sigma}^2 \frac{1}{N g^Tg - N^2 \bar{g}^2} \begin{bmatrix} g^Tg & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix}</math>

In particular, for the genotypic effect:
<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>
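To check these closed forms, here is a minimal R sketch (the simulated values, variable names and the comparison with <math>lm</math> are mine) computing <math>\hat{\beta}</math> and its standard error from the four summary statistics only:

<nowiki>
## sketch: OLS from the four summary statistics only
set.seed(1859)
N <- 100
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2))
y <- 5 + 0.5 * g + rnorm(n=N, mean=0, sd=0.2)
## the four summary statistics
y.bar <- mean(y)
g.bar <- mean(g)
gtg <- sum(g^2)
gty <- sum(g * y)
## estimates from the summary statistics
beta.hat <- (gty - N * g.bar * y.bar) / (gtg - N * g.bar^2)
mu.hat <- y.bar - beta.hat * g.bar
sigma2.hat <- sum((y - mu.hat - beta.hat * g)^2) / (N - 2) # r=2
se.beta.hat <- sqrt(sigma2.hat / (gtg - N * g.bar^2))
c(beta.hat, se.beta.hat)
## compare with: summary(lm(y ~ g))$coefficients["g",]
</nowiki>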
* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: <math>y = \mu + \beta g + \epsilon</math>), it is common to fix the proportion of variance in <math>y</math> explained by <math>\beta g</math>:
<math>PVE = \frac{V(\beta g)}{V(y)} = \frac{V(\beta g)}{V(\beta g) + V(\epsilon)}</math> with <math>V(\beta g) = \frac{1}{N}\sum_{n=1}^N (\beta g_n - \overline{\beta g})^2</math> and <math>V(\epsilon) = \sigma^2</math>

This way, by also fixing <math>\beta</math>, it is easy to calculate the corresponding <math>\sigma</math>:

<math>\sigma = \sqrt{\frac{1}{N}\sum_{n=1}^N (\beta g_n - \overline{\beta g})^2 \; \frac{1 - PVE}{PVE}}</math>

Here is some R code implementing this:

<nowiki>
set.seed(1859)
N <- 100
mu <- 5
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes
beta <- 0.5
pve <- 0.8
beta.g.bar <- mean(beta * g)
sigma <- sqrt((1/N) * sum((beta * g - beta.g.bar)^2) * (1-pve) / pve) # 0.18
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
plot(x=0, type="n", xlim=range(g), ylim=range(y),
     xlab="genotypes (allele dose)", ylab="phenotypes",
     main="Simple linear regression")
for(i in unique(g))
  points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
ols <- lm(y ~ g)
summary(ols) # muhat=5.01, betahat=0.46, R2=0.779
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
</nowiki>
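As a quick numerical check (a sketch reusing the objects defined in the code just above), the chosen <math>\sigma</math> does give the target PVE:

<nowiki>
## empirical check of the target PVE (run after the simulation above)
v.bg <- mean((beta * g - beta.g.bar)^2) # V(beta g), with the 1/N convention
v.bg / (v.bg + sigma^2) # 0.8, i.e. the target PVE
</nowiki>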
* '''Several predictors''': let's now imagine that we also know the gender of the <math>N</math> sampled individuals, and we want to account for it in our estimate of the genotypic effect. In matrix notation, we still have the same model, <math>Y = XB + E</math>, with <math>Y</math> an <math>N \times 1</math> vector, <math>X</math> an <math>N \times 3</math> matrix with 1's in the first column, the genotypes in the second and the genders in the third, <math>B</math> a <math>3 \times 1</math> vector, and <math>E</math> an <math>N \times 1</math> vector following a multivariate Normal distribution centered on 0 and with covariance matrix <math>\sigma^2 I_N</math>.
As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To obtain them efficiently, we start with the [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] of <math>X</math>:
<math>X = U D V^T</math>

This allows us to get the [http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse Moore-Penrose pseudoinverse] matrix of <math>X</math>:

<math>X^+ = (X^TX)^{-1}X^T</math>

<math>X^+ = V D^{-1} U^T</math>

where the second equality follows from <math>X^TX = V D^2 V^T</math> and <math>U^TU = I</math>.
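Here is a quick, self-contained R check (the toy <math>X</math> is mine) that both expressions for <math>X^+</math> agree:

<nowiki>
## check that (X^T X)^{-1} X^T equals V D^{-1} U^T on a toy full-rank X
set.seed(1859)
X <- cbind(1, rnorm(10), rnorm(10))
s <- svd(X)
Xp1 <- solve(t(X) %*% X) %*% t(X)
Xp2 <- s$v %*% diag(1/s$d) %*% t(s$u)
max(abs(Xp1 - Xp2)) # ~1e-15
</nowiki>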
From this, we get the OLS estimate of the effect sizes:

<math>\hat{B} = X^+ Y</math>

Then it's straightforward to get the residuals:

<math>\hat{E} = Y - X \hat{B}</math>

With them, we can calculate the estimate of the error standard deviation (<math>N-3</math> because <math>X</math> has rank 3):

<math>\hat{\sigma} = \sqrt{\frac{1}{N-3} \hat{E}^T \hat{E}}</math>

And finally the covariance matrix of the estimates of the effect sizes, whose diagonal gives their squared standard errors:

<math>V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T</math>

We can check all this with some R code:
<nowiki>
## simulate the data
set.seed(1859)
N <- 100
mu <- 5
Xg <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes
beta.g <- 0.5
Xc <- sample(x=0:1, size=N, replace=TRUE, prob=c(0.7, 0.3)) # gender
beta.c <- 0.3
pve <- 0.8
betas.gc.bar <- mean(beta.g * Xg + beta.c * Xc) # 0.405
sigma <- sqrt((1/N) * sum((beta.g * Xg + beta.c * Xc - betas.gc.bar)^2) *
              (1-pve) / pve) # 0.2
y <- mu + beta.g * Xg + beta.c * Xc + rnorm(n=N, mean=0, sd=sigma)

## perform the OLS analysis with the SVD of X
X <- cbind(rep(1,N), Xg, Xc)
Xp <- svd(x=X)
B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y
E.hat <- y - X %*% B.hat
sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 0.211
var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v)
sqrt(diag(var.theta.hat)) # 0.0304 0.0290 0.0463

## check all this with lm()
ols <- lm(y ~ Xg + Xc)
summary(ols) # muhat=4.99+-0.030, beta.g.hat=0.52+-0.029, beta.c.hat=0.24+-0.046, R2=0.789
</nowiki>
Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (see its manual section on [http://www.gnu.org/software/gsl/manual/html_node/Multi_002dparameter-fitting.html multi-parameter fitting]).