Difference between revisions of "User:Timothee Flutre/Notebook/Postdoc/2011/06/28"

m (Simple linear regression)
(Linear regression by ordinary least squares: improve notation)
(3 intermediate revisions by the same user not shown)
Line 6: Line 6:
 
| colspan="2"|
 
| colspan="2"|
 
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### -->
 
<!-- ##### DO NOT edit above this line unless you know what you are doing. ##### -->
==Simple linear regression==
+
==Linear regression by ordinary least squares==
  
* '''Data''': Let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (eg. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.
+
* '''Data''': let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.
  
* '''Model''': for this we use a simple linear regression (univariate phenotype, single predictor).
+
* '''Model''': to start with, we use a simple linear regression (univariate phenotype, single predictor).
  
<math>\forall n \in {1,\ldots,N}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)</math>
+
<math>\forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta g_i + \epsilon_i \text{ with } \epsilon_i \sim \mathcal{N}(0,\sigma^2)</math>
  
In matrix notation:
+
In vector-matrix notation:
  
<math>y = X \theta + \epsilon</math> with <math>\epsilon \sim N_N(0,\sigma^2 I_N)</math> and <math>\theta^T = (\mu, \beta)</math>
+
<math>\vec{y} = X B + \vec{e}</math> with <math>\vec{e} \sim \mathcal{N}_N(0,\sigma^2 I_N)</math> and <math>B^T = [\mu \beta]</math>
  
* '''Use only summary statistics''': most importantly, we want the following estimates: <math>\hat{\beta}</math>, <math>se(\hat{\beta})</math> (its standard error) and <math>\hat{\sigma}</math>. In the case where we don't have access to the original data (eg. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.
+
The parameters of the model are: <math>\Theta = \{\mu, \beta, \sigma\}</math>
  
Here is the ordinary-least-square (OLS) estimator of <math>\theta</math>:
+
* '''Use only summary statistics''': most importantly, we want the following estimates: <math>\hat{\beta}</math>, <math>\hat{\sigma}</math> and <math>se(\hat{\beta})</math> (its standard error). In the case where we don't have access to the original data (e.g. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.
  
<math>\hat{\theta} = (X^T X)^{-1} X^T Y</math>
+
The well-known ordinary-least-square ([http://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation OLS]) estimator of <math>B</math> is:
 +
 
 +
<math>\hat{B} = (X^T X)^{-1} X^T \vec{y}</math>
  
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
Line 32: Line 34:
  
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\begin{bmatrix} N & \sum_n g_n \\ \sum_n g_n & \sum_n g_n^2 \end{bmatrix}^{-1}
+
\begin{bmatrix} N & \sum_i g_i \\ \sum_i g_i & \sum_i g_i^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}
+
\begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix}
 
</math>
 
</math>
  
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2}
+
\frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2}
\begin{bmatrix} \sum_n g_n^2 & - \sum_n g_n \\ - \sum_n g_n & N \end{bmatrix}
+
\begin{bmatrix} \sum_i g_i^2 & - \sum_i g_i \\ - \sum_i g_i & N \end{bmatrix}
\begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}
+
\begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix}
 
</math>
 
</math>
  
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
 
<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2}
+
\frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2}
\begin{bmatrix} \sum_n g_n^2 \sum_n y_n - \sum_n g_n \sum_n g_n y_n \\ - \sum_n g_n \sum_n y_n + N \sum_n g_n y_n \end{bmatrix}
+
\begin{bmatrix} \sum_i g_i^2 \sum_i y_i - \sum_i g_i \sum_i g_i y_i \\ - \sum_i g_i \sum_i y_i + N \sum_i g_i y_i \end{bmatrix}
 
</math>
 
</math>
  
Let's now define 4 summary statistics, very easy to compute:
+
Let's now define 4 '''summary statistics''', very easy to compute:
  
<math>\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n</math>
+
<math>\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i</math>
  
<math>\bar{g} = \frac{1}{N} \sum_{n=1}^N g_n</math>
+
<math>\bar{g} = \frac{1}{N} \sum_{i=1}^N g_i</math>
  
<math>g^T g = \sum_{n=1}^N g_n^2</math>
+
<math>\vec{g}^T \vec{g} = \sum_{i=1}^N g_i^2</math>
  
<math>g^T y = \sum_{n=1}^N g_n y_n</math>
+
<math>\vec{g}^T \vec{y} = \sum_{i=1}^N g_i y_i</math>
  
 
This allows to obtain the estimate of the effect size only by having the summary statistics available:
 
This allows to obtain the estimate of the effect size only by having the summary statistics available:
  
<math>\hat{\beta} = \frac{g^T y - N \bar{g} \bar{y}}{g^T g - N \bar{g}^2}</math>
+
<math>\hat{\beta} = \frac{\vec{g}^T \vec{y} - N \bar{g} \bar{y}}{\vec{g}^T \vec{g} - N \bar{g}^2}</math>
 +
 
 +
By the way, in this case (i.e. simple linear regression, a single predictor), it's easy to see that:
 +
 
 +
<math>\hat{\beta} = \frac{Cov[\vec{y},\vec{g}]}{Var[\vec{g}]}</math>
  
 
The same works for the estimate of the standard deviation of the errors:
 
The same works for the estimate of the standard deviation of the errors:
  
<math>\hat{\sigma}^2 = \frac{1}{N-r}(y - X\hat{\theta})^T(y - X\hat{\theta})</math>
+
<math>\hat{\sigma}^2 = \frac{1}{N-r}(\vec{y} - X\hat{B})^T(\vec{y} - X\hat{B})</math>
  
 
We can also benefit from this for the standard error of the parameters:
 
We can also benefit from this for the standard error of the parameters:
  
<math>V(\hat{\theta}) = \hat{\sigma}^2 (X^T X)^{-1}</math>
+
<math>V(\hat{B}) = \hat{\sigma}^2 (X^T X)^{-1}</math>
  
<math>V(\hat{\theta}) = \hat{\sigma}^2 \frac{1}{N g^T g - N^2 \bar{g}^2}
+
<math>V(\hat{B}) = \hat{\sigma}^2 \frac{1}{N \vec{g}^T \vec{g} - N^2 \bar{g}^2}
\begin{bmatrix} g^Tg & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix}
+
\begin{bmatrix} \vec{g}^T\vec{g} & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix}
 
</math>
 
</math>
  
<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>
+
<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{\vec{g}^T\vec{g} - N\bar{g}^2}</math>
 +
 
 +
which corresponds to:
 +
 
 +
<math>V(\hat{\beta}) = \frac{1}{N} \frac{Var[\vec{e}]}{Var[\vec{g}]}</math>
 +
 
 +
 
 +
* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case, the model is <math>y = \mu + \beta g + \epsilon</math>. Therefore, the variance of <math>y</math> can be decomposed like this:
 +
 
 +
<math>V(y) = V(\mu + \beta g + \epsilon) = V(\mu) + V(\beta g) + V(\epsilon) = \beta^2 V(g) + \sigma^2</math>
  
* '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case (linear regression: <math>y = \mu + \beta g + \epsilon</math>), it is frequent to fix the proportion of variance in <math>y</math> explained by <math>\beta g</math>:
+
The most intuitive way to simulate data is therefore to fix the proportion of variance in <math>y</math> explained by the genotype, for instance <math>PVE=60%</math>, as well as the standard deviation of the errors, typically <math>\sigma=1</math>. From this, we can calculate the corresponding effect size <math>\beta</math> of the genotype:
  
<math>PVE = \frac{V(\beta g)}{V(y)} = \frac{V(\beta g)}{V(\beta g) + V(\epsilon)}</math> with <math>V(\beta g) = \frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2</math> and <math>V(\epsilon) = \sigma^2</math>
+
<math>PVE = \frac{V(\beta g)}{V(y)}</math>
  
This way, by also fixing <math>\beta</math>, it is easy to calculate the corresponding <math>\sigma</math>:
+
Therefore:
 +
<math>\beta = \pm \sigma \sqrt{\frac{PVE}{(1 - PVE) * V(g)}}</math>
  
<math>\sigma = \sqrt{\frac{1}{N}\sum_{n=1}^N (\beta g_n - \bar{\beta g})^2 \frac{1 - PVE}{PVE}}</math>
+
Note that <math>g</math> is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency <math>f</math> (eg. <math>f=0.3</math>) and we will assume Hardy-Weinberg equilibrium. Then <math>g</math> is distributed according to a binomial distribution with 2 trials for which the probability of success is <math>f</math>. As a consequence, its variance is <math>V(g)=2f(1-f)</math>.
  
Here is some R code implementing this:
+
Here is some R code implementing all this:
  
 
  <nowiki>
 
  <nowiki>
 
set.seed(1859)
 
set.seed(1859)
N <- 100
+
N <- 100 # sample size
mu <- 5
+
mu <- 4
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # MAF=0.2
+
pve <- 0.6
beta <- 0.5
+
sigma <- 1
pve <- 0.8
+
maf <- 0.3 # minor allele frequency
beta.g.bar <- mean(beta * g)
+
beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.89
sigma <- sqrt((1/N) * sum((beta * g - beta.g.bar)^2) * (1-pve) / pve) # 0.18
+
g <- rbinom(n=N, size=2, prob=maf) # assuming Hardy-Weinberg equilibrium
 
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
 
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
 +
ols <- lm(y ~ g)
 +
summary(ols) # muhat=4.1+-0.13, betahat=1.6+-0.16, R2=0.49
 +
sqrt((1/(N-2) * sum(ols$residuals^2))) # sigmahat=0.99
 
plot(x=0, type="n", xlim=range(g), ylim=range(y),
 
plot(x=0, type="n", xlim=range(g), ylim=range(y),
     xlab="genotypes (allele dose)", ylab="phenotypes",
+
     xlab="genotypes (allele counts)", ylab="phenotypes",
 
     main="Simple linear regression")
 
     main="Simple linear regression")
 
for(i in unique(g))
 
for(i in unique(g))
 
   points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
 
   points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
ols <- lm(y ~ g)
 
summary(ols) # muhat=5.01, betahat=0.46, R2=0.779
 
 
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
 
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
 
</nowiki>
 
</nowiki>
  
 +
 +
* '''Several predictors''': let's now imagine that we also know the gender of the N sampled individuals. We hence want to account for that in our estimate of the effect of the genotype. In matrix notation, we still have the same model, <math>\vec{y} = XB + \vec{e}</math> with <math>\vec{y}</math> an Nx1 vector, X an Nx3 matrix with 1's in the first column, the genotypes in the second and the genders in the third, B a 3x1 vector and <math>\vec{e}</math> an Nx1 vector following a multivariate Normal distribution centered on 0 and with covariance matrix <math>\sigma^2 I_N</math>.
 +
 +
As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To efficiently get them, we start with the [http://en.wikipedia.org/wiki/Singular_value_decomposition singular value decomposition] of X:
 +
 +
<math>X = U D V^T</math>
 +
 +
This allows us to get the [http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse Moore-Penrose pseudoinverse] matrix of X:
 +
 +
<math>X^+ = (X^TX)^{-1}X^T = V D^{-1} U^T</math>
 +
 +
From this, we get the OLS estimate of the effect sizes:
 +
 +
<math>\hat{B} = X^+ \vec{y}</math>
 +
 +
Then it's straightforward to get the residuals:
 +
 +
<math>\hat{\vec{e}} = \vec{y} - X \hat{B}</math>
 +
 +
With them we can calculate the estimate of the error variance:
 +
 +
<math>\hat{\sigma}^2 = \frac{1}{N-3} \hat{\vec{e}}^T \hat{\vec{e}}</math>
 +
 +
And finally the standard errors of the estimates of the effect sizes:
 +
 +
<math>V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T</math>
 +
 +
We can check this with some R code:
 +
 +
<nowiki>
 +
## simulate the data
 +
set.seed(1859)
 +
N <- 100
 +
mu <- 4
 +
pve.g <- 0.4 # genotype
 +
pve.c <- 0.2 # other covariate, eg. gender
 +
sigma <- 1
 +
maf <- 0.3
 +
sex.ratio <- 0.5
 +
beta.g <- sigma * sqrt((1 / (2 * maf * (1 - maf))) * (pve.g / (1 - pve.g - pve.c))) # 1.543
 +
beta.c <- beta.g * sqrt((pve.c / pve.g) * (2 * maf * (1 - maf) / sex.ratio * (1 - sex.ratio))) # 0.707
 +
x.g <- rbinom(n=N, size=2, prob=maf)
 +
x.c <- rbinom(n=N, size=1, prob=sex.ratio)
 +
y <- mu + beta.g * x.g + beta.c * x.c + rnorm(n=N, mean=0, sd=sigma)
 +
ols <- lm(y ~ x.g + x.c)
 +
summary(ols) # muhat=3.9+-0.17, beta.g.hat=1.6+-0.17, beta.c.hat=0.58+-0.21, R2=0.51
 +
sqrt((1/(N-3)) * sum(ols$residuals^2)) # sigma.hat = 1.058
 +
 +
## perform the OLS analysis with the SVD of X
 +
X <- cbind(rep(1,N), x.g, x.c)
 +
Xp <- svd(x=X)
 +
B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y
 +
E.hat <- y - X %*% B.hat
 +
sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 1.058
 +
var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v)
 +
sqrt(diag(var.theta.hat)) # 0.168 0.175 0.212
 +
</nowiki>
 +
 +
Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL ([http://www.gnu.org/software/gsl/manual/html_node/Multi_002dparameter-fitting.html here]).
  
 
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
 
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->

Revision as of 07:03, 21 November 2012


Linear regression by ordinary least squares

  • Data: let's assume that we obtained data from <math>N</math> individuals. We denote by <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression level at a given gene), and by <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.
  • Model: to start with, we use a simple linear regression (univariate phenotype, single predictor).

<math>\forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta g_i + \epsilon_i \text{ with } \epsilon_i \sim \mathcal{N}(0,\sigma^2)</math>

In vector-matrix notation, with <math>X</math> the <math>N \times 2</math> design matrix (a column of 1's and the column of genotypes):

<math>\vec{y} = X B + \vec{e}</math> with <math>\vec{e} \sim \mathcal{N}_N(0,\sigma^2 I_N)</math> and <math>B^T = [\mu \; \beta]</math>

The parameters of the model are: <math>\Theta = \{\mu, \beta, \sigma\}</math>

  • Use only summary statistics: most importantly, we want the following estimates: <math>\hat{\beta}</math>, <math>\hat{\sigma}</math> and <math>se(\hat{\beta})</math> (the standard error of <math>\hat{\beta}</math>). In the case where we don't have access to the original data (e.g. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.

The well-known ordinary least squares (OLS) estimator of <math>B</math> is:

<math>\hat{B} = (X^T X)^{-1} X^T \vec{y}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \left( \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} 1 & g_1 \\ \vdots & \vdots \\ 1 & g_N \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} N & \sum_i g_i \\ \sum_i g_i & \sum_i g_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2} \begin{bmatrix} \sum_i g_i^2 & - \sum_i g_i \\ - \sum_i g_i & N \end{bmatrix} \begin{bmatrix} \sum_i y_i \\ \sum_i g_i y_i \end{bmatrix}</math>

<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} = \frac{1}{N \sum_i g_i^2 - (\sum_i g_i)^2} \begin{bmatrix} \sum_i g_i^2 \sum_i y_i - \sum_i g_i \sum_i g_i y_i \\ - \sum_i g_i \sum_i y_i + N \sum_i g_i y_i \end{bmatrix}</math>
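
As a sanity check, the closed-form expression written with sums can be compared with a direct numerical solution of the normal equations. This is only a minimal sketch: the simulated values (seed, sample size, effect sizes) are arbitrary assumptions chosen for the check, not values from the analysis.

```r
## illustrative data (arbitrary values, only for this check)
set.seed(1)
N <- 50
g <- rbinom(n=N, size=2, prob=0.3)
y <- 4 + 1.5 * g + rnorm(n=N, mean=0, sd=1)

## direct solution of the normal equations (X^T X) B = X^T y
X <- cbind(1, g)
B.direct <- solve(crossprod(X), crossprod(X, y))

## closed form written with sums only
denom <- N * sum(g^2) - sum(g)^2
mu.hat <- (sum(g^2) * sum(y) - sum(g) * sum(g * y)) / denom
beta.hat <- (N * sum(g * y) - sum(g) * sum(y)) / denom

## both routes should agree up to numerical tolerance
all.equal(as.numeric(B.direct), c(mu.hat, beta.hat))
```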

Let's now define four summary statistics that are very easy to compute:

<math>\bar{y} = \frac{1}{N} \sum_{i=1}^N y_i</math>

<math>\bar{g} = \frac{1}{N} \sum_{i=1}^N g_i</math>

<math>\vec{g}^T \vec{g} = \sum_{i=1}^N g_i^2</math>

<math>\vec{g}^T \vec{y} = \sum_{i=1}^N g_i y_i</math>

This allows us to obtain the estimate of the effect size from the summary statistics alone:

<math>\hat{\beta} = \frac{\vec{g}^T \vec{y} - N \bar{g} \bar{y}}{\vec{g}^T \vec{g} - N \bar{g}^2}</math>

By the way, in this case (i.e. simple linear regression, a single predictor), it's easy to see that:

<math>\hat{\beta} = \frac{Cov[\vec{y},\vec{g}]}{Var[\vec{g}]}</math>
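
The summary-statistics route can be sketched in R as follows. The data are simulated with arbitrary, assumed values. The intercept line is an addition that follows from the first normal equation and is not part of the text above.

```r
## illustrative data (arbitrary values, only for this check)
set.seed(1)
N <- 50
g <- rbinom(n=N, size=2, prob=0.3)
y <- 4 + 1.5 * g + rnorm(n=N, mean=0, sd=1)

## the four summary statistics
y.bar <- mean(y)
g.bar <- mean(g)
gtg <- sum(g^2)
gty <- sum(g * y)

## effect size from the summary statistics only
beta.hat <- (gty - N * g.bar * y.bar) / (gtg - N * g.bar^2)
mu.hat <- y.bar - beta.hat * g.bar # intercept, from the first normal equation

## same estimates from the raw data
beta.cov <- cov(y, g) / var(g)
ols <- lm(y ~ g)
all.equal(c(mu.hat, beta.hat), as.numeric(coefficients(ols)))
all.equal(beta.hat, beta.cov)
```

The ratio of covariance to variance does not depend on whether both are computed with denominator N or N-1, which is why R's `cov` and `var` can be used directly.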

The same works for the estimate of the variance of the errors, where <math>r</math> is the rank of <math>X</math> (here <math>r=2</math>), provided that <math>\vec{y}^T \vec{y} = \sum_{i=1}^N y_i^2</math> is also available:

<math>\hat{\sigma}^2 = \frac{1}{N-r}(\vec{y} - X\hat{B})^T(\vec{y} - X\hat{B})</math>

We can also benefit from this for the variance of the estimates, whose square roots are the standard errors:

<math>V(\hat{B}) = \hat{\sigma}^2 (X^T X)^{-1}</math>

<math>V(\hat{B}) = \hat{\sigma}^2 \frac{1}{N \vec{g}^T \vec{g} - N^2 \bar{g}^2} \begin{bmatrix} \vec{g}^T\vec{g} & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix}</math>

<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{\vec{g}^T\vec{g} - N\bar{g}^2}</math>

which corresponds to the following, with <math>Var[\vec{e}]</math> estimated by <math>\hat{\sigma}^2</math> and <math>Var[\vec{g}]</math> the sample variance of the genotypes computed with denominator <math>N</math>:

<math>V(\hat{\beta}) = \frac{1}{N} \frac{Var[\vec{e}]}{Var[\vec{g}]}</math>
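
A minimal sketch of how the error standard deviation and the standard error of the effect size could be obtained without the raw data. It assumes that one extra summary statistic, the sum of squared phenotypes (called `yty` here, a name chosen for this sketch), is available in addition to the four defined above. It uses the fact that the residual sum of squares equals the centered sum of squares of the phenotypes minus the squared effect size times the centered sum of squares of the genotypes. The simulated values are arbitrary.

```r
## illustrative data (arbitrary values, only for this check)
set.seed(1)
N <- 50
g <- rbinom(n=N, size=2, prob=0.3)
y <- 4 + 1.5 * g + rnorm(n=N, mean=0, sd=1)

## summary statistics (yty is the extra one assumed in this sketch)
y.bar <- mean(y)
g.bar <- mean(g)
gtg <- sum(g^2)
gty <- sum(g * y)
yty <- sum(y^2)

## estimates computed from the summary statistics only
s.gg <- gtg - N * g.bar^2                        # centered sum of squares of g
beta.hat <- (gty - N * g.bar * y.bar) / s.gg
rss <- (yty - N * y.bar^2) - beta.hat^2 * s.gg   # residual sum of squares
sigma.hat <- sqrt(rss / (N - 2))
se.beta.hat <- sigma.hat / sqrt(s.gg)

## comparison with the fit on the raw data
ols <- summary(lm(y ~ g))
all.equal(sigma.hat, ols$sigma)
all.equal(se.beta.hat, ols$coefficients["g", "Std. Error"])
```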


  • Simulation with a given PVE: when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case, the model is <math>y = \mu + \beta g + \epsilon</math>. Since <math>\mu</math> is a constant and <math>g</math> and <math>\epsilon</math> are independent, the variance of <math>y</math> can be decomposed like this:

<math>V(y) = V(\mu + \beta g + \epsilon) = V(\mu) + V(\beta g) + V(\epsilon) = \beta^2 V(g) + \sigma^2</math>

The most intuitive way to simulate data is therefore to fix the proportion of variance in <math>y</math> explained by the genotype (PVE), for instance <math>PVE=60\%</math>, as well as the standard deviation of the errors, typically <math>\sigma=1</math>. From this, we can calculate the corresponding effect size <math>\beta</math> of the genotype:

<math>PVE = \frac{V(\beta g)}{V(y)}</math>

Therefore: <math>\beta = \pm \sigma \sqrt{\frac{PVE}{(1 - PVE) \, V(g)}}</math>
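
For completeness, the intermediate steps can be spelled out as follows, assuming <math>\sigma > 0</math> and <math>0 < PVE < 1</math> and using the variance decomposition of <math>y</math> given above:

<math>PVE = \frac{\beta^2 V(g)}{\beta^2 V(g) + \sigma^2} \; \Leftrightarrow \; \beta^2 V(g) \, (1 - PVE) = PVE \, \sigma^2 \; \Leftrightarrow \; \beta^2 = \frac{\sigma^2 \, PVE}{(1 - PVE) \, V(g)}</math>

Taking the square root gives the two solutions, one for each sign of the effect.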

Note that <math>g</math> is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency <math>f</math> (e.g. <math>f=0.3</math>) and we will assume Hardy-Weinberg equilibrium. Then <math>g</math> is distributed according to a binomial distribution with 2 trials for which the probability of success is <math>f</math>. As a consequence, its variance is <math>V(g)=2f(1-f)</math>.

Here is some R code implementing all this:

set.seed(1859)
N <- 100 # sample size
mu <- 4
pve <- 0.6
sigma <- 1
maf <- 0.3 # minor allele frequency
beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.89
g <- rbinom(n=N, size=2, prob=maf) # assuming Hardy-Weinberg equilibrium
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
ols <- lm(y ~ g)
summary(ols) # muhat=4.1+-0.13, betahat=1.6+-0.16, R2=0.49
sqrt((1/(N-2) * sum(ols$residuals^2))) # sigmahat=0.99
plot(x=0, type="n", xlim=range(g), ylim=range(y),
     xlab="genotypes (allele counts)", ylab="phenotypes",
     main="Simple linear regression")
for(i in unique(g))
  points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
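
With only 100 individuals, the realized proportion of variance explained can differ noticeably from the target. The sketch below repeats the simulation with a much larger sample, checking that the chosen effect size gives a PVE close to the target. The sample size of 100000 is an arbitrary assumption.

```r
set.seed(1859)
N <- 100000 # large sample, only to check the target PVE
pve <- 0.6
sigma <- 1
maf <- 0.3
beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf)))
g <- rbinom(n=N, size=2, prob=maf) # assuming Hardy-Weinberg equilibrium
y <- 4 + beta * g + rnorm(n=N, mean=0, sd=sigma)

var.g <- var(g)                    # should be close to 2 * maf * (1 - maf) = 0.42
pve.emp <- var(beta * g) / var(y)  # empirical PVE, should be close to 0.6
c(var.g=var.g, pve.emp=pve.emp)
```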


  • Several predictors: let's now imagine that we also know the gender of the <math>N</math> sampled individuals. We hence want to account for that in our estimate of the effect of the genotype. In matrix notation, we still have the same model, <math>\vec{y} = XB + \vec{e}</math> with <math>\vec{y}</math> an <math>N \times 1</math> vector, <math>X</math> an <math>N \times 3</math> matrix with 1's in the first column, the genotypes in the second and the genders in the third, <math>B</math> a <math>3 \times 1</math> vector and <math>\vec{e}</math> an <math>N \times 1</math> vector following a multivariate Normal distribution centered on 0 and with covariance matrix <math>\sigma^2 I_N</math>.

As above, we want <math>\hat{B}</math>, <math>\hat{\sigma}</math> and <math>V(\hat{B})</math>. To efficiently get them, we start with the singular value decomposition of <math>X</math>:

<math>X = U D V^T</math>

This allows us to get the Moore-Penrose pseudoinverse matrix of <math>X</math> (assuming that <math>X</math> has full column rank):

<math>X^+ = (X^TX)^{-1}X^T = V D^{-1} U^T</math>
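
The identity between the two expressions of the pseudoinverse can be checked numerically. This is a small sketch on an arbitrary simulated design matrix, assumed to have full column rank. R's `svd` returns the thin decomposition, with `u` of size N x 3, `d` of length 3 and `v` of size 3 x 3.

```r
## illustrative design matrix (arbitrary values, only for this check)
set.seed(1)
N <- 50
X <- cbind(1, rbinom(n=N, size=2, prob=0.3), rbinom(n=N, size=1, prob=0.5))

## pseudoinverse via the SVD
s <- svd(x=X)
X.pinv.svd <- s$v %*% diag(1 / s$d) %*% t(s$u)

## pseudoinverse via the normal equations
X.pinv.ne <- solve(crossprod(X)) %*% t(X)

all.equal(X.pinv.svd, X.pinv.ne, check.attributes=FALSE)
```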

From this, we get the OLS estimate of the effect sizes:

<math>\hat{B} = X^+ \vec{y}</math>

Then it's straightforward to get the residuals:

<math>\hat{\vec{e}} = \vec{y} - X \hat{B}</math>

With them we can calculate the estimate of the error variance:

<math>\hat{\sigma}^2 = \frac{1}{N-3} \hat{\vec{e}}^T \hat{\vec{e}}</math>

And finally the covariance matrix of the estimates of the effect sizes, the standard errors being the square roots of its diagonal elements:

<math>V(\hat{B}) = \hat{\sigma}^2 V D^{-2} V^T</math>
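
This last expression follows from the general formula <math>V(\hat{B}) = \hat{\sigma}^2 (X^T X)^{-1}</math> once the decomposition of <math>X</math> is plugged in, using <math>U^T U = I</math>, <math>V^T V = V V^T = I</math> and the fact that <math>D</math> is diagonal with non-zero entries when <math>X</math> has full column rank:

<math>X^T X = V D U^T U D V^T = V D^2 V^T \; \Rightarrow \; (X^T X)^{-1} = V D^{-2} V^T</math>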

We can check this with some R code:

## simulate the data
set.seed(1859)
N <- 100
mu <- 4
pve.g <- 0.4 # genotype
pve.c <- 0.2 # other covariate, e.g. gender
sigma <- 1
maf <- 0.3
sex.ratio <- 0.5
beta.g <- sigma * sqrt((1 / (2 * maf * (1 - maf))) * (pve.g / (1 - pve.g - pve.c))) # 1.543
beta.c <- beta.g * sqrt((pve.c / pve.g) * (2 * maf * (1 - maf)) / (sex.ratio * (1 - sex.ratio))) # 1.414
x.g <- rbinom(n=N, size=2, prob=maf)
x.c <- rbinom(n=N, size=1, prob=sex.ratio)
y <- mu + beta.g * x.g + beta.c * x.c + rnorm(n=N, mean=0, sd=sigma)
ols <- lm(y ~ x.g + x.c)
summary(ols) # muhat=3.9+-0.17, beta.g.hat=1.6+-0.17, beta.c.hat=1.3+-0.21
sqrt((1/(N-3)) * sum(ols$residuals^2)) # sigma.hat = 1.058

## perform the OLS analysis with the SVD of X
X <- cbind(rep(1,N), x.g, x.c)
Xp <- svd(x=X)
B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y
E.hat <- y - X %*% B.hat
sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 1.058
var.B.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v)
sqrt(diag(var.B.hat)) # 0.168 0.175 0.212
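
The SVD-based quantities can also be compared programmatically with what `lm` returns, rather than by eye. The sketch below is self-contained and uses arbitrary, assumed simulation values. `vcov` is the standard R accessor for the estimated covariance matrix of the coefficients.

```r
## illustrative data (arbitrary values, only for this check)
set.seed(1)
N <- 100
x.g <- rbinom(n=N, size=2, prob=0.3)
x.c <- rbinom(n=N, size=1, prob=0.5)
y <- 4 + 1.5 * x.g + 0.7 * x.c + rnorm(n=N, mean=0, sd=1)

## OLS via the SVD of X
X <- cbind(1, x.g, x.c)
s <- svd(x=X)
B.hat <- s$v %*% diag(1 / s$d) %*% t(s$u) %*% y
E.hat <- y - X %*% B.hat
sigma2.hat <- sum(E.hat^2) / (N - ncol(X))
var.B.hat <- sigma2.hat * s$v %*% diag(1 / s$d^2) %*% t(s$v)

## comparison with lm
ols <- lm(y ~ x.g + x.c)
all.equal(as.numeric(B.hat), as.numeric(coefficients(ols)))
all.equal(var.B.hat, vcov(ols), check.attributes=FALSE)
```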

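Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (see the multi-parameter fitting section of its manual: http://www.gnu.org/software/gsl/manual/html_node/Multi_002dparameter-fitting.html).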