User:Timothee Flutre/Notebook/Postdoc/2011/06/28
From OpenWetWare
(→Simple linear regression: add regression with multiple predictors) |
(→Linear regression by ordinary least squares: improv simul with PVE + expand variance geno) |
||
| Line 76: | Line 76: | ||
| - | * '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case | + | * '''Simulation with a given PVE''': when testing an inference model, the first step is usually to simulate data. However, how do we choose the parameters? In our case, the model is <math>y = \mu + \beta g + \epsilon</math>). Therefore, the variance of <math>y</math> can be decomposed like this: |
| - | <math> | + | <math>V(y) = V(\mu + \beta g + \epsilon) = V(\mu) + V(\beta g) + V(\epsilon) = \beta^2 V(g) + \sigma^2</math> |
| - | + | The most intuitive way to simulate data is therefore to fix the proportion of variance in <math>y</math> explained by the genotype, for instance <math>PVE=60%</math>, as well as the standard deviation of the errors, typically <math>\sigma=1</math>. From this, we can calculate the corresponding effect size <math>\beta</math> of the genotype: | |
| - | <math> | + | <math>PVE = \frac{V(\beta g)}{V(y)}</math> |
| - | Here is some R code implementing this: | + | Therefore: |
| + | <math>\beta = \pm \sigma \sqrt{\frac{PVE}{(1 - PVE) * V(g)}}</math> | ||
| + | |||
| + | Note that <math>g</math> is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency <math>f</math> (eg. <math>f=0.3</math>) and we will assume Hardy-Weinberg equilibrium. Then <math>g</math> is distributed according to a binomial distribution with 2 trials for which the probability of success is <math>f</math>. As a consequence, its variance is <math>V(g)=2f(1-f)</math>. | ||
| + | |||
| + | Here is some R code implementing all this: | ||
<nowiki> | <nowiki> | ||
set.seed(1859) | set.seed(1859) | ||
| - | N <- 100 | + | N <- 100 # sample size |
| - | mu <- | + | mu <- 4 |
| - | g <- sample(x=0:2, size=N, replace=TRUE, prob=c( | + | pve <- 0.6 |
| - | + | sigma <- 1 | |
| - | + | maf <- 0.3 | |
| - | + | beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.88 | |
| - | + | g <- sample(x=0:2, size=N, replace=TRUE, prob=c(maf^2, 2*maf*(1-maf), (1-maf)^2)) | |
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) | y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma) | ||
| + | ols <- lm(y ~ g) | ||
| + | summary(ols) # muhat=3.5, betahat=2.1, R2=0.64 | ||
| + | sqrt(mean(ols$residuals^2)) # sigmahat = 0.98 | ||
plot(x=0, type="n", xlim=range(g), ylim=range(y), | plot(x=0, type="n", xlim=range(g), ylim=range(y), | ||
| - | xlab="genotypes | + | xlab="genotypes", ylab="phenotypes", |
main="Simple linear regression") | main="Simple linear regression") | ||
for(i in unique(g)) | for(i in unique(g)) | ||
points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) | points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19) | ||
| - | |||
| - | |||
abline(a=coefficients(ols)[1], b=coefficients(ols)[2]) | abline(a=coefficients(ols)[1], b=coefficients(ols)[2]) | ||
</nowiki> | </nowiki> | ||
Revision as of 21:57, 14 August 2012
Main project pageNext entry
| |
Linear regression by ordinary least squares
In matrix notation: y = Xθ + ε with ε˜NN(0,σ2IN) and θT = (μ,β)
Here is the ordinary-least-square (OLS) estimator of θ:
Let's now define 4 summary statistics, very easy to compute:
This allows to obtain the estimate of the effect size only by having the summary statistics available:
The same works for the estimate of the standard deviation of the errors:
We can also benefit from this for the standard error of the parameters:
V(y) = V(μ + βg + ε) = V(μ) + V(βg) + V(ε) = β2V(g) + σ2 The most intuitive way to simulate data is therefore to fix the proportion of variance in y explained by the genotype, for instance PVE = 60%, as well as the standard deviation of the errors, typically σ = 1. From this, we can calculate the corresponding effect size β of the genotype:
Therefore:
Note that g is the random variable corresponding to the genotype encoded in allele dose, such that it is equal to 0, 1 or 2 copies of the minor allele. For our simulation, we will fix the minor allele frequency f (eg. f = 0.3) and we will assume Hardy-Weinberg equilibrium. Then g is distributed according to a binomial distribution with 2 trials for which the probability of success is f. As a consequence, its variance is V(g) = 2f(1 − f). Here is some R code implementing all this:
set.seed(1859)
N <- 100 # sample size
mu <- 4
pve <- 0.6
sigma <- 1
maf <- 0.3
beta <- sigma * sqrt(pve / ((1 - pve) * 2 * maf * (1 - maf))) # 1.88
g <- sample(x=0:2, size=N, replace=TRUE, prob=c(maf^2, 2*maf*(1-maf), (1-maf)^2))
y <- mu + beta * g + rnorm(n=N, mean=0, sd=sigma)
ols <- lm(y ~ g)
summary(ols) # muhat=3.5, betahat=2.1, R2=0.64
sqrt(mean(ols$residuals^2)) # sigmahat = 0.98
plot(x=0, type="n", xlim=range(g), ylim=range(y),
xlab="genotypes", ylab="phenotypes",
main="Simple linear regression")
for(i in unique(g))
points(x=jitter(g[g == i]), y=y[g == i], col=i+1, pch=19)
abline(a=coefficients(ols)[1], b=coefficients(ols)[2])
As above, we want X = UDVT This allows us to get the Moore-Penrose pseudoinverse matrix of X: X + = (XTX) − 1XT X + = VD − 1UT From this, we get the OLS estimate of the effect sizes:
Then it's straightforward to get the residuals:
With them we can calculate the estimate of the error variance:
And finally the standard errors of the estimates of the effect sizes:
We can check this with some R code:
## simulate the data
set.seed(1859)
N <- 100
mu <- 5
Xg <- sample(x=0:2, size=N, replace=TRUE, prob=c(0.5, 0.3, 0.2)) # genotypes
beta.g <- 0.5
Xc <- sample(x=0:1, size=N, replace=TRUE, prob=c(0.7, 0.3)) # gender
beta.c <- 0.3
pve <- 0.8
betas.gc.bar <- mean(beta.g * Xg + beta.c * Xc) # 0.405
sigma <- sqrt((1/N) * sum((beta.g * Xg + beta.c * Xc - betas.gc.bar)^2) *
(1-pve) / pve) # 0.2
y <- mu + beta.g * Xg + beta.c * Xc + rnorm(n=N, mean=0, sd=sigma)
## perform the OLS analysis with the SVD of X
X <- cbind(rep(1,N), Xg, Xc)
Xp <- svd(x=X)
B.hat <- Xp$v %*% diag(1/Xp$d) %*% t(Xp$u) %*% y
E.hat <- y - X %*% B.hat
sigma.hat <- as.numeric(sqrt((1/(N-3)) * t(E.hat) %*% E.hat)) # 0.211
var.theta.hat <- sigma.hat^2 * Xp$v %*% diag((1/Xp$d)^2) %*% t(Xp$v)
sqrt(diag(var.theta.hat)) # 0.0304 0.0290 0.0463
## check all this
ols <- lm(y ~ Xg + Xc)
summary(ols) # muhat=4.99+-0.03, beta.g.hat=0.52+--.-29, beta.c.hat=0.24+-0.046, R2=0.789
Such an analysis can also be done easily in a custom C/C++ program thanks to the GSL (here). | |
the (quantitative) phenotypes (eg. expression level at a given gene), and
the genotypes at a given SNP. We want to assess their linear relationship.
,
(its standard error) and
. In the case where we don't have access to the original data (eg. because genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate the estimates.
,
. To efficiently get them, we start with the


