User:Timothee Flutre/Notebook/Postdoc/2011/06/28

From OpenWetWare

(Difference between revisions)
Jump to: navigation, search
m (Calculate OLS estimates with summary statistics for simple linear regression)
(Simple linear regression: add simulation with PVE and R code)
Line 18: Line 18:
<math>y = X \theta + \epsilon</math> with <math>\epsilon \sim N_N(0,\sigma^2 I_N)</math> and <math>\theta^T = (\mu, \beta)</math>
<math>y = X \theta + \epsilon</math> with <math>\epsilon \sim N_N(0,\sigma^2 I_N)</math> and <math>\theta^T = (\mu, \beta)</math>
-
Most importantly, we want the following estimates: <math>\hat{\beta}</math>, <math>se(\hat{\beta})</math> (its standard error) and <math>\hat{\sigma}</math>. In the case where we don't have access to the original data (eg. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.
+
* '''Use only summary statistics''': most importantly, we want the following estimates: <math>\hat{\beta}</math>, <math>se(\hat{\beta})</math> (its standard error) and <math>\hat{\sigma}</math>. In the case where we don't have access to the original data (eg. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.
Here is the ordinary-least-square (OLS) estimator of <math>\theta</math>:
Here is the ordinary-least-square (OLS) estimator of <math>\theta</math>:
Line 74: Line 74:
<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>
<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>
 +
 +
* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: <math>y = \mu + \beta g + \epsilon</math>), it is frequent to fix the proportion of variance in <math>y</math> explained by <math>\beta g</math>:
 +
 +
<math>PVE = \frac{V(\beta g)}{V(y)} = \frac{V(\beta g)}{V(\beta g) + V(\epsilon)}</math> with <math>V(\beta g) = \frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2</math> and <math>V(\epsilon) = \sigma^2</math>
 +
 +
This way, by also fixing <math>\beta</math>, it is easy to calculate the corresponding <math>\sigma</math>:
 +
 +
<math>\sigma = \sqrt{\frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2 \frac{1 - PVE}{PVE}}</math>
 +
 +
Here is some R code implementing this:
 +
 +
<nowiki>
 +
set.seed(1859)
 +
N <- 100
 +
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # MAF=0.2
 +
mu <- 5
 +
beta <- 0.5
 +
pve <- 0.8
 +
beta.g.bar <- mean(beta * g)
 +
sigma <- sqrt((1/N) * sum((beta * g - beta.g.bar)^2) * (1-pve) / pve) # 0.778
 +
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
 +
plot(x=0, type="n", xlim=range(g), ylim=range(y),
 +
    xlab="genotypes (allele dose)", ylab="phenotypes",
 +
    main="Simple linear regression")
 +
for(i in unique(g))
 +
  points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
 +
ols <- lm(y ~ g)
 +
summary(ols) # muhat=5.01, betahat=0.46, R2=0.779
 +
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
 +
</nowiki>
 +
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->

Revision as of 14:19, 19 June 2012

Project name Main project page
Next entry

Simple linear regression

  • Data: Let's assume that we obtained data from N individuals. We note y_1,\ldots,y_N the (quantitative) phenotypes (eg. expression level at a given gene), and g_1,\ldots,g_N the genotypes at a given SNP. We want to assess their linear relationship.
  • Model: for this we use a simple linear regression (univariate phenotype, single predictor).

\forall n \in {1,\ldots,N}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)

In matrix notation:

y = Xθ + ε with ε˜NN(0,σ2IN) and θT = (μ,β)

  • Use only summary statistics: most importantly, we want the following estimates: \hat{\beta}, se(\hat{\beta}) (its standard error) and \hat{\sigma}. In the case where we don't have access to the original data (eg. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.

Here is the ordinary-least-square (OLS) estimator of θ:

\hat{\theta} = (X^T X)^{-1} X^T Y

\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\left( \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix}
\begin{bmatrix} 1 & g_1 \\ \vdots & \vdots \\ 1 & g_N \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}

\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\begin{bmatrix} N & \sum_n g_n \\ \sum_n g_n & \sum_n g_n^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}

\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2}
\begin{bmatrix} \sum_n g_n^2 & - \sum_n g_n \\ - \sum_n g_n & N \end{bmatrix}
\begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}

\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2}
\begin{bmatrix} \sum_n g_n^2 \sum_n y_n - \sum_n g_n \sum_n g_n y_n \\ - \sum_n g_n \sum_n y_n + N \sum_n g_n y_n \end{bmatrix}

Let's now define 4 summary statistics, very easy to compute:

\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n

\bar{g} = \frac{1}{N} \sum_{n=1}^N g_n

g^T g = \sum_{n=1}^N g_n^2

g^T y = \sum_{n=1}^N g_n y_n

This allows to obtain the estimate of the effect size only by having the summary statistics available:

\hat{\beta} = \frac{g^T y - N \bar{g} \bar{y}}{g^T g - N \bar{g}^2}

The same works for the estimate of the standard deviation of the errors:

\hat{\sigma}^2 = \frac{1}{N-r}(y - X\hat{\theta})^T(y - X\hat{\theta})

We can also benefit from this for the standard error of the parameters:

V(\hat{\theta}) = \hat{\sigma}^2 (X^T X)^{-1}

V(\hat{\theta}) = \hat{\sigma}^2 \frac{1}{N g^T g - N^2 \bar{g}^2}
\begin{bmatrix} g^Tg & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix}

V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}

  • Simulation with a given PVE: when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: y = μ + βg + ε), it is frequent to fix the proportion of variance in y explained by βg:

PVE = \frac{V(\beta g)}{V(y)} = \frac{V(\beta g)}{V(\beta g) + V(\epsilon)} with V(\beta g) = \frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2 and V(ε) = σ2

This way, by also fixing β, it is easy to calculate the corresponding σ:

\sigma = \sqrt{\frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2 \frac{1 - PVE}{PVE}}

Here is some R code implementing this:

set.seed(1859)
N <- 100
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # MAF=0.2
mu <- 5
beta <- 0.5
pve <- 0.8
beta.g.bar <- mean(beta * g)
sigma <- sqrt((1/N) * sum((beta * g - beta.g.bar)^2) * (1-pve) / pve) # 0.778
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
plot(x=0, type="n", xlim=range(g), ylim=range(y),
     xlab="genotypes (allele dose)", ylab="phenotypes",
     main="Simple linear regression")
for(i in unique(g))
  points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
ols <- lm(y ~ g)
summary(ols) # muhat=5.01, betahat=0.46, R2=0.779
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])



Personal tools