User:Timothee Flutre/Notebook/Postdoc/2011/06/28: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
(→‎Simple linear regression: add regression with multiple predictors)
Line 6: Line 6:
| colspan="2"|
| colspan="2"|
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### -->
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### -->
==Simple linear regression==
==Linear regression by ordinary least squares==


* '''Data''': Let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (eg. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.
* '''Data''': let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (eg. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.


* '''Model''': for this we use a simple linear regression (univariate phenotype, single predictor).
* '''Model''': to start with, we use a simple linear regression (univariate phenotype, single predictor).


<math>\forall n \in {1,\ldots,N}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)</math>
<math>\forall n \in {1,\ldots,N}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)</math>
Line 74: Line 74:


<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>
<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>


* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: <math>y = \mu + \beta g + \epsilon</math>), it is frequent to fix the proportion of variance in <math>y</math> explained by <math>\beta g</math>:
* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: <math>y = \mu + \beta g + \epsilon</math>), it is frequent to fix the proportion of variance in <math>y</math> explained by <math>\beta g</math>:
Line 105: Line 106:
</nowiki>
</nowiki>


* '''Several predictors''': let's now imagine that we also know the gender of the N sampled individuals. We hence want to account for that in our estimate of the genotypic effect. In matrix notation, we still have the same model Y = XB + E with Y an Nx1 vector, X an Nx3 matrix with 1's in the first column, the genotypes in the second and the genders in the third, B a 3x1 vector and E an Nx1 vector following a multivariate Normal distribution centered on 0 and with covariance matrix <math>\sigma^2 I_N</math>.
As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To efficiently get them, we start with the [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] of X:
<math>X = U D V^T</math>
This allows us to get the [http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse Moore-Penrose pseudoinverse] matrix of X:
<math>X^+ = (X^TX)^{-1}X^T</math>
<math>X^+ = V D^{-1} U^T</math>
From this, we get the OLS estimate of the effect sizes:
<math>\hat{B} = X^+ Y</math>
Then it's straightforward to get the residuals:
<math>\hat{E} = Y - X \hat{B}</math>
With them we can calculate the estimate of the error variance:
<math>\hat{\sigma} = \sqrt{\frac{1}{N-3} \hat{E}^T \hat{E}}</math>
And finally the standard errors of the estimates of the effect sizes:
<math>V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T</math>
We can check this with some R code:
<nowiki>
## simulate the data
set.seed(1859)
N <- 100
mu <- 5
Xg <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes
beta.g <- 0.5
Xc <- sample(x=0:1, size=N, replace=TRUE, prob=c(0.7, 0.3)) # gender
beta.c <- 0.3
pve <- 0.8
betas.gc.bar <- mean(beta.g * Xg + beta.c * Xc) # 0.405
sigma <- sqrt((1/N) * sum((beta.g * Xg + beta.c * Xc - betas.gc.bar)^2) *
              (1-pve) / pve) # 0.2
y <- mu + beta.g * Xg + beta.c * Xc + rnorm(n=N, mean=0, sd=sigma)
## perform the OLS analysis with the SVD of X
X <- cbind(rep(1,N), Xg, Xc)
Xp <- svd(x=X)
B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y
E.hat <- y - X %*% B.hat
sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 0.211
var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v)
sqrt(diag(var.theta.hat)) # 0.0304 0.0290 0.0463
## check all this
ols <- lm(y ~ Xg + Xc)
summary(ols) # muhat=4.99+-0.03, beta.g.hat=0.52+--.-29, beta.c.hat=0.24+-0.046, R2=0.789
</nowiki>
Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL ([http://www.gnu.org/software/gsl/manual/html_node/Multi_002dparameter-fitting.html here]).


<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->

Revision as of 12:35, 30 July 2012

Project name <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page
Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html>

Linear regression by ordinary least squares

  • Data: let's assume that we obtained data from [math]\displaystyle{ N }[/math] individuals. We note [math]\displaystyle{ y_1,\ldots,y_N }[/math] the (quantitative) phenotypes (eg. expression level at a given gene), and [math]\displaystyle{ g_1,\ldots,g_N }[/math] the genotypes at a given SNP. We want to assess their linear relationship.
  • Model: to start with, we use a simple linear regression (univariate phenotype, single predictor).

[math]\displaystyle{ \forall n \in {1,\ldots,N}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2) }[/math]

In matrix notation:

[math]\displaystyle{ y = X \theta + \epsilon }[/math] with [math]\displaystyle{ \epsilon \sim N_N(0,\sigma^2 I_N) }[/math] and [math]\displaystyle{ \theta^T = (\mu, \beta) }[/math]

  • Use only summary statistics: most importantly, we want the following estimates: [math]\displaystyle{ \hat{\beta} }[/math], [math]\displaystyle{ se(\hat{\beta}) }[/math] (its standard error) and [math]\displaystyle{ \hat{\sigma} }[/math]. In the case where we don't have access to the original data (eg. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.

Here is the ordinary-least-square (OLS) estimator of [math]\displaystyle{ \theta }[/math]:

[math]\displaystyle{ \hat{\theta} = (X^T X)^{-1} X^T Y }[/math]

[math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \left( \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} 1 & g_1 \\ \vdots & \vdots \\ 1 & g_N \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} }[/math]

[math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} N & \sum_n g_n \\ \sum_n g_n & \sum_n g_n^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix} }[/math]

[math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2} \begin{bmatrix} \sum_n g_n^2 & - \sum_n g_n \\ - \sum_n g_n & N \end{bmatrix} \begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix} }[/math]

[math]\displaystyle{ \begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2} \begin{bmatrix} \sum_n g_n^2 \sum_n y_n - \sum_n g_n \sum_n g_n y_n \\ - \sum_n g_n \sum_n y_n + N \sum_n g_n y_n \end{bmatrix} }[/math]

Let's now define 4 summary statistics, very easy to compute:

[math]\displaystyle{ \bar{y} = \frac{1}{N} \sum_{n=1}^N y_n }[/math]

[math]\displaystyle{ \bar{g} = \frac{1}{N} \sum_{n=1}^N g_n }[/math]

[math]\displaystyle{ g^T g = \sum_{n=1}^N g_n^2 }[/math]

[math]\displaystyle{ g^T y = \sum_{n=1}^N g_n y_n }[/math]

This allows to obtain the estimate of the effect size only by having the summary statistics available:

[math]\displaystyle{ \hat{\beta} = \frac{g^T y - N \bar{g} \bar{y}}{g^T g - N \bar{g}^2} }[/math]

The same works for the estimate of the standard deviation of the errors:

[math]\displaystyle{ \hat{\sigma}^2 = \frac{1}{N-r}(y - X\hat{\theta})^T(y - X\hat{\theta}) }[/math]

We can also benefit from this for the standard error of the parameters:

[math]\displaystyle{ V(\hat{\theta}) = \hat{\sigma}^2 (X^T X)^{-1} }[/math]

[math]\displaystyle{ V(\hat{\theta}) = \hat{\sigma}^2 \frac{1}{N g^T g - N^2 \bar{g}^2} \begin{bmatrix} g^Tg & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix} }[/math]

[math]\displaystyle{ V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2} }[/math]


  • Simulation with a given PVE: when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: [math]\displaystyle{ y = \mu + \beta g + \epsilon }[/math]), it is frequent to fix the proportion of variance in [math]\displaystyle{ y }[/math] explained by [math]\displaystyle{ \beta g }[/math]:

[math]\displaystyle{ PVE = \frac{V(\beta g)}{V(y)} = \frac{V(\beta g)}{V(\beta g) + V(\epsilon)} }[/math] with [math]\displaystyle{ V(\beta g) = \frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2 }[/math] and [math]\displaystyle{ V(\epsilon) = \sigma^2 }[/math]

This way, by also fixing [math]\displaystyle{ \beta }[/math], it is easy to calculate the corresponding [math]\displaystyle{ \sigma }[/math]:

[math]\displaystyle{ \sigma = \sqrt{\frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2 \frac{1 - PVE}{PVE}} }[/math]

Here is some R code implementing this:

set.seed(1859)
N <- 100
mu <- 5
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # MAF=0.2
beta <- 0.5
pve <- 0.8
beta.g.bar <- mean(beta * g)
sigma <- sqrt((1/N) * sum((beta * g - beta.g.bar)^2) * (1-pve) / pve) # 0.18
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
plot(x=0, type="n", xlim=range(g), ylim=range(y),
     xlab="genotypes (allele dose)", ylab="phenotypes",
     main="Simple linear regression")
for(i in unique(g))
  points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
ols <- lm(y ~ g)
summary(ols) # muhat=5.01, betahat=0.46, R2=0.779
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])


  • Several predictors: let's now imagine that we also know the gender of the N sampled individuals. We hence want to account for that in our estimate of the genotypic effect. In matrix notation, we still have the same model Y = XB + E with Y an Nx1 vector, X an Nx3 matrix with 1's in the first column, the genotypes in the second and the genders in the third, B a 3x1 vector and E an Nx1 vector following a multivariate Normal distribution centered on 0 and with covariance matrix [math]\displaystyle{ \sigma^2 I_N }[/math].

As above, we want [math]\displaystyle{ \hat{B} }[/math], [math]\displaystyle{ \hat{\sigma} }[/math] and [math]\displaystyle{ V(\hat{B}) }[/math]. To efficiently get them, we start with the singular value decomposition of X:

[math]\displaystyle{ X = U D V^T }[/math]

This allows us to get the Moore-Penrose pseudoinverse matrix of X:

[math]\displaystyle{ X^+ = (X^TX)^{-1}X^T }[/math]

[math]\displaystyle{ X^+ = V D^{-1} U^T }[/math]

From this, we get the OLS estimate of the effect sizes:

[math]\displaystyle{ \hat{B} = X^+ Y }[/math]

Then it's straightforward to get the residuals:

[math]\displaystyle{ \hat{E} = Y - X \hat{B} }[/math]

With them we can calculate the estimate of the error variance:

[math]\displaystyle{ \hat{\sigma} = \sqrt{\frac{1}{N-3} \hat{E}^T \hat{E}} }[/math]

And finally the standard errors of the estimates of the effect sizes:

[math]\displaystyle{ V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T }[/math]

We can check this with some R code:

## simulate the data
set.seed(1859)
N <- 100
mu <- 5
Xg <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes
beta.g <- 0.5
Xc <- sample(x=0:1, size=N, replace=TRUE, prob=c(0.7, 0.3)) # gender
beta.c <- 0.3
pve <- 0.8
betas.gc.bar <- mean(beta.g * Xg + beta.c * Xc) # 0.405
sigma <- sqrt((1/N) * sum((beta.g * Xg + beta.c * Xc - betas.gc.bar)^2) *
              (1-pve) / pve) # 0.2
y <- mu + beta.g * Xg + beta.c * Xc + rnorm(n=N, mean=0, sd=sigma)

## perform the OLS analysis with the SVD of X
X <- cbind(rep(1,N), Xg, Xc)
Xp <- svd(x=X)
B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y
E.hat <- y - X %*% B.hat
sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 0.211
var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v)
sqrt(diag(var.theta.hat)) # 0.0304 0.0290 0.0463

## check all this
ols <- lm(y ~ Xg + Xc)
summary(ols) # muhat=4.99+-0.03, beta.g.hat=0.52+--.-29, beta.c.hat=0.24+-0.046, R2=0.789

Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (here).