User:Timothee Flutre/Notebook/Postdoc/2011/11/10: Difference between revisions

Revision as of 10:02, 24 July 2014

Project name

<html><img src="/images/9/94/Report.png" border="0" /></html> Main project page
<html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html>      </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html>

Bayesian model of univariate linear regression for QTL detection

This page aims at helping people like me, interested in quantitative genetics, to get a better understanding of some Bayesian models, most importantly the impact of the modeling assumptions as well as the underlying maths. It starts with a simple model, and gradually increases the scope to relax assumptions. See references to scientific articles at the end.

Data: let's assume that we obtained data from N individuals. We note [math]\displaystyle{ y_1,\ldots,y_N }[/math] the (quantitative) phenotypes (e.g. expression levels at a given gene), and [math]\displaystyle{ g_1,\ldots,g_N }[/math] the genotypes at a given SNP (encoded as allele dose: 0, 1 or 2).

Goal: we want to assess the evidence in the data for an effect of the genotype on the phenotype.

Assumptions: the relationship between genotype and phenotype is linear; the individuals are not genetically related; there is no hidden confounding factors in the phenotypes.

Likelihood: we start by writing the usual linear regression for one individual

[math]\displaystyle{ \forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta_1 g_i + \beta_2 \mathbf{1}_{g_i=1} + \epsilon_i \; \text{ with } \; \epsilon_i \; \overset{i.i.d}{\sim} \; \mathcal{N}(0,\tau^{-1}) }[/math]

where [math]\displaystyle{ \beta_1 }[/math] is in fact the additive effect of the SNP, noted [math]\displaystyle{ a }[/math] from now on, and [math]\displaystyle{ \beta_2 }[/math] is the dominance effect of the SNP, [math]\displaystyle{ d = a k }[/math].

Let's now write the model in vector-matrix notation:

[math]\displaystyle{ Y = X B + E \text{ where } B = [ \mu \; a \; d ]^T }[/math]

This gives the following multivariate Normal distribution for the phenotypes:

[math]\displaystyle{ Y | X, \tau, B \sim \mathcal{N}(XB, \tau^{-1} I_N) }[/math]

Even though we can write the likelihood as a multivariate Normal, I still keep the term "univariate" in the title because the regression has a single response, [math]\displaystyle{ Y }[/math]. It is usual to keep the term "multivariate" for the case where there is a matrix of responses (i.e. multiple phenotypes).

The likelihood of the parameters given the data is therefore:

[math]\displaystyle{ \mathcal{L}(\tau, B) = \mathsf{P}(Y | X, \tau, B) }[/math]

[math]\displaystyle{ \mathcal{L}(\tau, B) = \left(\frac{\tau}{2 \pi}\right)^{\frac{N}{2}} exp \left( -\frac{\tau}{2} (Y - XB)^T (Y - XB) \right) }[/math]

Priors: we use the usual conjugate prior

[math]\displaystyle{ \mathsf{P}(\tau, B) = \mathsf{P}(\tau) \mathsf{P}(B | \tau) }[/math]

A Gamma distribution for [math]\displaystyle{ \tau }[/math]:

[math]\displaystyle{ \tau \sim \Gamma(\kappa/2, \, \lambda/2) }[/math]

which means:

[math]\displaystyle{ \mathsf{P}(\tau) = \frac{\frac{\lambda}{2}^{\kappa/2}}{\Gamma(\frac{\kappa}{2})} \tau^{\frac{\kappa}{2}-1} e^{-\frac{\lambda}{2} \tau} }[/math]

And a multivariate Normal distribution for [math]\displaystyle{ B }[/math]:

[math]\displaystyle{ B | \tau \sim \mathcal{N}(\vec{0}, \, \tau^{-1} \Sigma_B) \text{ with } \Sigma_B = diag(\sigma_{\mu}^2, \sigma_a^2, \sigma_d^2) }[/math]

which means:

[math]\displaystyle{ \mathsf{P}(B | \tau) = \left(\frac{\tau}{2 \pi}\right)^{\frac{3}{2}} |\Sigma_B|^{-\frac{1}{2}} exp \left(-\frac{\tau}{2} B^T \Sigma_B^{-1} B \right) }[/math]

Joint posterior (1):

[math]\displaystyle{ \mathsf{P}(\tau, B | Y, X) = \mathsf{P}(\tau | Y, X) \mathsf{P}(B | Y, X, \tau) }[/math]

Conditional posterior of B:

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) = \frac{\mathsf{P}(B, Y | X, \tau)}{\mathsf{P}(Y | X, \tau)} }[/math]

Let's neglect the normalization constant for now:

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) \propto \mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B) }[/math]

Similarly, let's keep only the terms in [math]\displaystyle{ B }[/math] for the moment:

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) \propto exp(B^T \Sigma_B^{-1} B) exp((Y-XB)^T(Y-XB)) }[/math]

We expand:

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) \propto exp(B^T \Sigma_B^{-1} B - Y^TXB -B^TX^TY + B^TX^TXB) }[/math]

We factorize some terms:

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) \propto exp(B^T (\Sigma_B^{-1} + X^TX) B - Y^TXB -B^TX^TY) }[/math]

Importantly, let's define:

[math]\displaystyle{ \Omega = (\Sigma_B^{-1} + X^TX)^{-1} }[/math]

We can see that [math]\displaystyle{ \Omega^T=\Omega }[/math], which means that [math]\displaystyle{ \Omega }[/math] is a symmetric matrix. This is particularly useful here because we can use the following equality: [math]\displaystyle{ \Omega^{-1}\Omega^T=I }[/math].

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) \propto exp(B^T \Omega^{-1} B - (X^TY)^T\Omega^{-1}\Omega^TB -B^T\Omega^{-1}\Omega^TX^TY) }[/math]

This now becomes easy to factorizes totally (remember to add the term in [math]\displaystyle{ Y }[/math] to complete the square):

[math]\displaystyle{ \mathsf{P}(B | Y, X, \tau) \propto exp((B - \Omega X^TY)^T\Omega^{-1}(B - \Omega X^TY)) }[/math]

We recognize the kernel of a Normal distribution, allowing us to write the conditional posterior as:

[math]\displaystyle{ B | Y, X, \tau \sim \mathcal{N}(\Omega X^TY, \tau^{-1} \Omega) }[/math]

Posterior of [math]\displaystyle{ \tau }[/math]:

Similarly to the equations above:

[math]\displaystyle{ \mathsf{P}(\tau | Y, X) \propto \mathsf{P}(\tau) \mathsf{P}(Y | X, \tau) }[/math]

But now, to handle the second term, we need to integrate over [math]\displaystyle{ B }[/math], thus effectively taking into account the uncertainty in [math]\displaystyle{ B }[/math]:

[math]\displaystyle{ \mathsf{P}(\tau | Y, X) \propto \mathsf{P}(\tau) \int \mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B) \mathsf{d}B }[/math]

Again, we use the priors and likelihoods specified above (but everything inside the integral is kept inside it, even if it doesn't depend on [math]\displaystyle{ B }[/math]!):

[math]\displaystyle{ \mathsf{P}(\tau | Y, X) \propto \tau^{\frac{\kappa}{2} - 1} e^{-\frac{\lambda}{2} \tau} \int \tau^{3/2} \tau^{N/2} exp(-\frac{\tau}{2} B^T \Sigma_B^{-1} B) exp(-\frac{\tau}{2} (Y - XB)^T (Y - XB)) \mathsf{d}B }[/math]

As we used a conjugate prior for [math]\displaystyle{ \tau }[/math], we know that we expect a Gamma distribution for the posterior. Therefore, we can take [math]\displaystyle{ \tau^{N/2} }[/math] out of the integral and start guessing what looks like a Gamma distribution. We also factorize inside the exponential:

[math]\displaystyle{ \mathsf{P}(\tau | Y, X) \propto \tau^{\frac{N+\kappa}{2} - 1} e^{-\frac{\lambda}{2} \tau} \int \tau^{3/2} exp \left[-\frac{\tau}{2} \left( (B - \Omega X^T Y)^T \Omega^{-1} (B - \Omega X^T Y) - Y^T X \Omega X^T Y + Y^T Y \right) \right] \mathsf{d}B }[/math]

We recognize the conditional posterior of [math]\displaystyle{ B }[/math]. This allows us to use the fact that the pdf of the Normal distribution integrates to one:

[math]\displaystyle{ \mathsf{P}(\tau | Y, X) \propto \tau^{\frac{N+\kappa}{2} - 1} e^{-\frac{\lambda}{2} \tau} exp\left[-\frac{\tau}{2} (Y^T Y - Y^T X \Omega X^T Y) \right] }[/math]

We finally recognize a Gamma distribution, allowing us to write the posterior as:

[math]\displaystyle{ \tau | Y, X \sim \Gamma \left( \frac{N+\kappa}{2}, \; \frac{1}{2} (Y^T Y - Y^T X \Omega X^T Y + \lambda) \right) }[/math]

Joint posterior (2): sometimes it is said that the joint posterior follows a Normal Inverse Gamma distribution:

[math]\displaystyle{ B, \tau | Y, X \sim \mathcal{N}IG(\Omega X^TY, \; \tau^{-1}\Omega, \; \frac{N+\kappa}{2}, \; \frac{\lambda^\ast}{2}) }[/math]

where [math]\displaystyle{ \lambda^\ast = Y^T Y - Y^T X \Omega X^T Y + \lambda }[/math]

Marginal posterior of B: we can now integrate out [math]\displaystyle{ \tau }[/math]:

[math]\displaystyle{ \mathsf{P}(B | Y, X) = \int \mathsf{P}(\tau) \mathsf{P}(B | Y, X, \tau) \mathsf{d}\tau }[/math]

[math]\displaystyle{ \mathsf{P}(B | Y, X) = \frac{\frac{\lambda^\ast}{2}^{\frac{N+\kappa}{2}}}{(2\pi)^\frac{3}{2} |\Omega|^{\frac{1}{2}} \Gamma(\frac{N+\kappa}{2})} \int \tau^{\frac{N+\kappa+3}{2}-1} exp \left[-\tau \left( \frac{\lambda^\ast}{2} + (B - \Omega X^TY)^T \Omega^{-1} (B - \Omega X^TY) \right) \right] \mathsf{d}\tau }[/math]

Here we recognize the formula to integrate the Gamma function:

[math]\displaystyle{ \mathsf{P}(B | Y, X) = \frac{\frac{\lambda^\ast}{2}^{\frac{N+\kappa}{2}} \Gamma(\frac{N+\kappa+3}{2})}{(2\pi)^\frac{3}{2} |\Omega|^{\frac{1}{2}} \Gamma(\frac{N+\kappa}{2})} \left( \frac{\lambda^\ast}{2} + (B - \Omega X^TY)^T \Omega^{-1} (B - \Omega X^TY) \right)^{-\frac{N+\kappa+3}{2}} }[/math]

And we now recognize a multivariate Student's t-distribution:

[math]\displaystyle{ \mathsf{P}(B | Y, X) = \frac{\Gamma(\frac{N+\kappa+3}{2})}{\Gamma(\frac{N+\kappa}{2}) \pi^\frac{3}{2} |\lambda^\ast \Omega|^{\frac{1}{2}} } \left( 1 + \frac{(B - \Omega X^TY)^T \Omega^{-1} (B - \Omega X^TY)}{\lambda^\ast} \right)^{-\frac{N+\kappa+3}{2}} }[/math]

We hence can write:

[math]\displaystyle{ B | Y, X \sim \mathcal{S}_{N+\kappa}(\Omega X^TY, \; (Y^T Y - Y^T X \Omega X^T Y + \lambda) \Omega) }[/math]

Bayes Factor: one way to answer our goal above ("is there an effect of the genotype on the phenotype?") is to do hypothesis testing.

We want to test the following null hypothesis:

[math]\displaystyle{ H_0: \; a = d = 0 }[/math]

In Bayesian modeling, hypothesis testing is performed with a Bayes factor, which in our case can be written as:

[math]\displaystyle{ \mathrm{BF} = \frac{\mathsf{P}(Y | X, a \neq 0, d \neq 0)}{\mathsf{P}(Y | X, a = 0, d = 0)} }[/math]

We can shorten this into:

[math]\displaystyle{ \mathrm{BF} = \frac{\mathsf{P}(Y | X)}{\mathsf{P}_0(Y)} }[/math]

Note that, compare to frequentist hypothesis testing which focuses on the null, the Bayes factor requires to explicitly model the data under the alternative. This makes a big difference when interpreting the results (see below).

Let's start with the numerator:

[math]\displaystyle{ \mathsf{P}(Y | X) = \int \mathsf{P}(\tau) \mathsf{P}(Y | X, \tau) \mathsf{d}\tau }[/math]

First, let's calculate what is inside the integral:

[math]\displaystyle{ \mathsf{P}(Y | X, \tau) = \frac{\mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B)}{\mathsf{P}(B | Y, X, \tau)} }[/math]

Using the formula obtained previously and doing some algebra gives:

[math]\displaystyle{ \mathsf{P}(Y | X, \tau) = \left( \frac{\tau}{2 \pi} \right)^{\frac{N}{2}} \left( \frac{|\Omega|}{|\Sigma_B|} \right)^{\frac{1}{2}} exp\left( -\frac{\tau}{2} (Y^TY - Y^TX\Omega X^TY) \right) }[/math]

Now we can integrate out [math]\displaystyle{ \tau }[/math] (note the small typo in equation 9 of supplementary text S1 of Servin & Stephens):

[math]\displaystyle{ \mathsf{P}(Y | X) = (2\pi)^{-\frac{N}{2}} \left( \frac{|\Omega|}{|\Sigma_B|} \right)^{\frac{1}{2}} \frac{\frac{\lambda}{2}^{\frac{\kappa}{2}}}{\Gamma(\frac{\kappa}{2})} \int \tau^{\frac{N+\kappa}{2}-1} exp \left( -\frac{\tau}{2} (Y^TY - Y^TX\Omega X^TY + \lambda) \right) \mathsf{d}\tau }[/math]

Inside the integral, we recognize the almost-complete pdf of a Gamma distribution. As it has to integrate to one, we get:

[math]\displaystyle{ \mathsf{P}(Y | X) = (2\pi)^{-\frac{N}{2}} \left( \frac{|\Omega|}{|\Sigma_B|} \right)^{\frac{1}{2}} \left( \frac{\lambda}{2} \right)^{\frac{\kappa}{2}} \frac{\Gamma(\frac{N+\kappa}{2})}{\Gamma(\frac{\kappa}{2})} \left( \frac{Y^TY - Y^TX\Omega X^TY + \lambda}{2} \right)^{-\frac{N+\kappa}{2}} }[/math]

We can use this expression also under the null. In this case, as we need neither [math]\displaystyle{ a }[/math] nor [math]\displaystyle{ d }[/math], [math]\displaystyle{ B }[/math] is simply [math]\displaystyle{ \mu }[/math], [math]\displaystyle{ \Sigma_B }[/math] is [math]\displaystyle{ \sigma_{\mu}^2 }[/math] and [math]\displaystyle{ X }[/math] is a vector of 1's. We can also defines [math]\displaystyle{ \Omega_0 = ((\sigma_{\mu}^2)^{-1} + N)^{-1} }[/math]. In the end, this gives:

[math]\displaystyle{ \mathsf{P}_0(Y) = (2\pi)^{-\frac{N}{2}} \frac{|\Omega_0|^{\frac{1}{2}}}{\sigma_{\mu}} \left( \frac{\lambda}{2} \right)^{\frac{\kappa}{2}} \frac{\Gamma(\frac{N+\kappa}{2})}{\Gamma(\frac{\kappa}{2})} \left( \frac{Y^TY - \Omega_0 N^2 \bar{Y}^2 + \lambda}{2} \right)^{-\frac{N+\kappa}{2}} }[/math]

We can therefore write the Bayes factor:

[math]\displaystyle{ \mathrm{BF} = \left( \frac{|\Omega|}{\Omega_0} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY - Y^TX\Omega X^TY + \lambda}{Y^TY - \Omega_0 N^2 \bar{Y}^2 + \lambda} \right)^{-\frac{N+\kappa}{2}} }[/math]

When the Bayes factor is large, we say that there is enough evidence in the data to support the alternative. Indeed, the Bayesian testing procedure corresponds to measuring support for the specific alternative hypothesis compared to the null hypothesis. Importantly, note that, for a frequentist testing procedure, we would say that there is enough evidence in the data to reject the null. However we wouldn't say anything about the alternative as we don't model it.

The threshold to say that a Bayes factor is large depends on the field. It is possible to use the Bayes factor as a test statistic when doing permutation testing, and then control the false discovery rate. This can give an idea of a reasonable threshold.

Hyperparameters: the model has 5 hyperparameters, [math]\displaystyle{ \{\kappa, \, \lambda, \, \sigma_{\mu}, \, \sigma_a, \, \sigma_d\} }[/math]. How should we choose them?

Such a question is never easy to answer. But note that all hyperparameters are not that important, especially in typical quantitative genetics applications. For instance, we are mostly interested in those that determine the magnitude of the effects, [math]\displaystyle{ \sigma_a }[/math] and [math]\displaystyle{ \sigma_d }[/math], so let's deal with the others first.

As explained in Servin & Stephens, the posteriors for [math]\displaystyle{ \tau }[/math] and [math]\displaystyle{ B }[/math] change appropriately with shifts ([math]\displaystyle{ y+c }[/math]) and scaling ([math]\displaystyle{ y \times c }[/math]) in the phenotype when taking their limits. This also gives us a new Bayes factor, the one used in practice (see Guan & Stephens, 2008):

[math]\displaystyle{ \mathrm{lim}_{\sigma_{\mu} \rightarrow \infty \; ; \; \lambda \rightarrow 0 \; ; \; \kappa \rightarrow 0 } \; \mathrm{BF} = \left( \frac{N}{|\Sigma_B^{-1} + X^TX|} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY - Y^TX (\Sigma_B^{-1} + X^TX)^{-1} X^TY}{Y^TY - N \bar{Y}^2} \right)^{-\frac{N}{2}} }[/math]

Now, for the important hyperparameters, [math]\displaystyle{ \sigma_a }[/math] and [math]\displaystyle{ \sigma_d }[/math], it is usual to specify a grid of values, i.e. [math]\displaystyle{ M }[/math] pairs [math]\displaystyle{ (\sigma_a, \sigma_d) }[/math]. For instance, Guan & Stephens used the following grid:

[math]\displaystyle{ M=4 \; ; \; \sigma_a \in \{0.05, 0.1, 0.2, 0.4\} \; ; \; \sigma_d = \frac{\sigma_a}{4} }[/math]

Then, we can average the Bayes factors obtained over the grid using, as a first approximation, equal weights:

[math]\displaystyle{ \mathrm{BF} = \sum_{m \, \in \, \text{grid}} \frac{1}{M} \, \mathrm{BF}(\sigma_a^{(m)}, \sigma_d^{(m)}) }[/math]

In eQTL studies, the weights can be estimated from the data using a hierarchical model (see below), by pooling all genes together as in Veyrieras et al (PLoS Genetics, 2010).

Implementation: the following R function is adapted from Servin & Stephens supplementary text 1.

BF <- function(G=NULL, Y=NULL, sigma.a=NULL, sigma.d=NULL, get.log10=TRUE){
  stopifnot(! is.null(G), ! is.null(Y), ! is.null(sigma.a), ! is.null(sigma.d))
  subset <- complete.cases(Y) & complete.cases(G)
  Y <- Y[subset]
  G <- G[subset]
  stopifnot(length(Y) == length(G))
  N <- length(G)
  X <- cbind(rep(1,N), G, G == 1)
  inv.Sigma.B <- diag(c(0, 1/sigma.a^2, 1/sigma.d^2))
  inv.Omega <- inv.Sigma.B + t(X) %*% X
  inv.Omega0 <- N
  tY.Y <- t(Y) %*% Y
  log10.BF <- as.numeric(0.5 * log10(inv.Omega0) -
                         0.5 * log10(det(inv.Omega)) -
                         log10(sigma.a) - log10(sigma.d) -
                         (N/2) * (log10(tY.Y - t(Y) %*% X %*% solve(inv.Omega)
                                        %*% t(X) %*% cbind(Y)) -
                                  log10(tY.Y - N*mean(Y)^2)))
  if(get.log10)
    return(log10.BF)
  else
    return(10^log10.BF)
}

In the same vein as what is explained here, we can simulate data under different scenarios and check the BFs:

N <- 300    # play with it
PVE <- 0.1  # play with it
grid <- c(0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2)
MAF <- 0.3
G <- rbinom(n=N, size=2, prob=MAF)
tau <- 1
a <- sqrt((2/5) * (PVE / (tau * MAF * (1-MAF) * (1-PVE))))
d <- a / 2
mu <- rnorm(n=1, mean=0, sd=10)
Y <- mu + a * G + d * (G == 1) + rnorm(n=N, mean=0, sd=tau)
for(m in 1:length(grid))
  print(BF(G, Y, grid[m], grid[m]/4))

Binary phenotype: using a similar notation, we model case-control studies with a logistic regression where the probability to be a case is [math]\displaystyle{ \mathsf{P}(y_i = 1) = p_i }[/math].

There are many equivalent ways to write the likelihood, the usual one being:

[math]\displaystyle{ y_i | p_i \; \overset{i.i.d}{\sim} \; \mathrm{Binomial}(1,p_i) }[/math] with the log-odds (logit function) being [math]\displaystyle{ \mathrm{ln} \frac{p_i}{1 - p_i} = \mu + a \, g_i + d \, \mathbf{1}_{g_i=1} }[/math]

Let's use [math]\displaystyle{ X_i^T=(1 \; g_i \; \mathbf{1}_{g_i=1}) }[/math] to denote the [math]\displaystyle{ i }[/math]-th row of the design matrix [math]\displaystyle{ X }[/math]. We can also keep the same definition as above for [math]\displaystyle{ B=(\mu \; a \; d)^T }[/math]. Thus we have:

[math]\displaystyle{ p_i = \frac{e^{X_i^TB}}{1 + e^{X_i^TB}} }[/math]

As the [math]\displaystyle{ y_i }[/math]'s can only take [math]\displaystyle{ 0 }[/math] and [math]\displaystyle{ 1 }[/math] as values, the likelihood can be written as:

[math]\displaystyle{ \mathcal{L}(B) = \mathsf{P}(Y | X, B) = \prod_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i} }[/math]

We still use the same prior as above for [math]\displaystyle{ B }[/math] (but there is no [math]\displaystyle{ \tau }[/math] anymore), so that:

[math]\displaystyle{ B | \Sigma_B \sim \mathcal{N}_3(0, \Sigma_B) }[/math]

where [math]\displaystyle{ \Sigma_B }[/math] is a 3 x 3 matrix with [math]\displaystyle{ (\sigma_\mu^2 \; \sigma_a^2 \; \sigma_d^2) }[/math] on the diagonal and 0 elsewhere.

As above, the Bayes factor is used to compare the two models:

[math]\displaystyle{ \mathrm{BF} = \frac{\mathsf{P}(Y | X, M1)}{\mathsf{P}(Y | X, M0)} = \frac{\mathsf{P}(Y | X, a \neq 0, d \neq 0)}{\mathsf{P}(Y | X, a=0, d=0)} = \frac{\int \mathsf{P}(B) \mathsf{P}(Y | X, B) \mathrm{d}B}{\int \mathsf{P}(\mu) \mathsf{P}(Y | X, \mu) \mathrm{d}\mu} }[/math]

The interesting point here is that there is no way to analytically calculate these integrals (marginal likelihoods). Therefore, we will use Laplace's method to approximate them, as in Guan & Stephens (2008).

Starting with the numerator:

[math]\displaystyle{ \mathsf{P}(Y|X,M1) = \int \exp \left[ N \left( \frac{1}{N} \mathrm{ln} \, \mathsf{P}(B) + \frac{1}{N} \mathrm{ln} \, \mathsf{P}(Y | X, B) \right) \right] \mathsf{d}B }[/math]

Let's use [math]\displaystyle{ f }[/math] to denote the function inside the exponential:

[math]\displaystyle{ \mathsf{P}(Y|X,M1) = \int \exp \left( N \; f(B) \right) \mathsf{d}B }[/math]

The function [math]\displaystyle{ f }[/math] is defined by:

[math]\displaystyle{ f: \mathbb{R}^3 \rightarrow \mathbb{R} }[/math]

[math]\displaystyle{ f(B) = \frac{1}{N} \left( -\frac{3}{2} \mathrm{ln}(2 \pi) - \frac{1}{2} \mathrm{ln}(|\Sigma_B|) - \frac{1}{2}(B^T \Sigma_B^{-1} B) \right) + \frac{1}{N} \sum_{i=1}^N \left( y_i \mathrm{ln}(p_i) + (1-y_i) \mathrm{ln}(1 - p_i) \right) }[/math]

This function will then be used to approximate the integral, like this:

[math]\displaystyle{ \mathsf{P}(Y|X,M1) \approx N^{-3/2} (2 \pi)^{3/2} |H(B^\star)|^{-1/2} e^{N f(B^\star)} }[/math]

where [math]\displaystyle{ H }[/math] is the Hessian of [math]\displaystyle{ f }[/math] and [math]\displaystyle{ B^\star = (\mu^\star a^\star d^\star)^T }[/math] is the point at which [math]\displaystyle{ f }[/math] is maximized.

We therefore need two things: [math]\displaystyle{ H }[/math] and [math]\displaystyle{ B^\star }[/math]. Note that for both we need to calculate the first derivatives of [math]\displaystyle{ f }[/math]. As [math]\displaystyle{ f }[/math] is multi-dimensional (it takes values in [math]\displaystyle{ \mathbb{R}^3 }[/math]), we need to calculate its gradient.

In the following, some formulas from matrix calculus are sometimes required. In such cases, I will use the numerator layout.

[math]\displaystyle{ \nabla f = \frac{\partial f}{\partial B} = \left( \frac{\partial f}{\partial \mu} \; \frac{\partial f}{\partial a} \; \frac{\partial f}{\partial d} \right) }[/math]

[math]\displaystyle{ \nabla f = - \frac{1}{2N} \frac{\partial B^T\Sigma_B^{-1}B}{\partial B} + \frac{1}{N} \sum_i \left( y_i \frac{\partial \mathrm{ln}(p_i)}{\partial B} + (1-y_i) \frac{\partial \mathrm{ln}(1-p_i)}{\partial B} \right) }[/math]

[math]\displaystyle{ \nabla f = - \frac{1}{N} B^T\Sigma_B^{-1} + \frac{1}{N} \sum_i \left( \frac{y_i}{p_i} - \frac{1-y_i}{1-p_i} \right) \frac{\partial p_i}{\partial B} }[/math]

A simple form for the first derivatives of [math]\displaystyle{ p_i }[/math] also exists when writing [math]\displaystyle{ p_i = e^{X_i^tB} (1 + e^{X_i^tB})^{-1} }[/math]:

[math]\displaystyle{ \frac{\partial p_i}{\partial B} = \frac{\partial e^{X_i^TB}}{\partial B} (1 + e^{X_i^TB})^{-1} + e^{X_i^TB} \frac{\partial (1 + e^{X_i^TB})^{-1}}{\partial B} }[/math]

[math]\displaystyle{ \frac{\partial p_i}{\partial B} = e^{X_i^TB} \frac{\partial X_i^TB}{\partial B} (1 + e^{X_i^TB})^{-1} - e^{X_i^TB} (1 + e^{X_i^TB})^{-2} \frac{\partial (1 + e^{X_i^TB})}{\partial B} }[/math]

[math]\displaystyle{ \frac{\partial p_i}{\partial B} = p_i X_i^T - p_i (1 + e^{X_i^TB})^{-1} e^{X_i^TB} \frac{\partial X_i^TB}{\partial B} }[/math]

[math]\displaystyle{ \frac{\partial p_i}{\partial B} = p_i (1 - p_i) X_i^T }[/math]

This simplifies the gradient of [math]\displaystyle{ f }[/math] into:

[math]\displaystyle{ \nabla f = - \frac{1}{N} B^T\Sigma_B^{-1} + \frac{1}{N} \sum_i (y_i - p_i) X_i^T }[/math]

To find [math]\displaystyle{ B^\star }[/math], we set [math]\displaystyle{ \nabla f(B^\star) = 0 }[/math]. However, in this equation, [math]\displaystyle{ B^\star }[/math] is present not only alone but also in the [math]\displaystyle{ p_i }[/math]'s. As [math]\displaystyle{ p_i }[/math] is a non-linear function of [math]\displaystyle{ B }[/math], the equation can't be solved directly but an iterative procedure is required, typically a conjugate gradient method (as in Guan & Stephens) or Newton's method. The former only requires [math]\displaystyle{ f }[/math] and [math]\displaystyle{ \nabla f }[/math] while the latter also requires [math]\displaystyle{ H }[/math]. Remember that, in any case, we need [math]\displaystyle{ H }[/math] for the Laplace approximation, so let's calculate it:

[math]\displaystyle{ H = - \frac{1}{N} \Sigma_B^{-1} - \frac{1}{N} \sum_i \frac{\partial p_i}{\partial B} X_i^T }[/math]

[math]\displaystyle{ H = - \frac{1}{N} \Sigma_B^{-1} - \frac{1}{N} \sum_i X_i^T p_i (1-p_i) X_i }[/math]

[math]\displaystyle{ H = - \frac{1}{N} (\Sigma_B^{-1} + X^T W X) }[/math]

where [math]\displaystyle{ W }[/math] is the N x N matrix with [math]\displaystyle{ p_i(1-p_i) }[/math] on the diagonal.

Note that all second derivatives of [math]\displaystyle{ f }[/math] are strictly negative. Therefore, [math]\displaystyle{ f }[/math] is globally convex, which means that it has a unique global maximum, at [math]\displaystyle{ B^\star }[/math]. As a consequence, we have the right to use Laplace's method to approximate the integral of [math]\displaystyle{ f }[/math] around its maximum.

implementation in R -> to do

finding the effect sizes and their std error: to do

Link between Bayes factor and P-value: see Wakefield (2008)

to do

Hierarchical model: pooling genes, learn weights for grid and genomic annotations, see Veyrieras et al (PLoS Genetics, 2010)

to do

Multiple SNPs with LD: joint analysis of multiple SNPs, handle correlation between them, see Guan & Stephens (Annals of Applied Statistics, 2011) for MCMC, see Carbonetto & Stephens (Bayesian Analysis, 2012) for Variational Bayes

to do

Confounders in phenotype: it is well known in molecular biology that any experiment involving several assays (e.g. measuring gene expression levels with a DNA microarray) suffers from "unknown confounders", the most (in)famous being the so-called "batch effects".

For instance, samples from individual 1 and 2 are correlated with each other because they were treated another day than all the other samples. Such a correlations has nothing to do with the genotype at a given SNP (in most cases). However, the core model, [math]\displaystyle{ y_i = \mu + \beta g_i + \epsilon_i }[/math] assumes that the errors are uncorrelated between individuals: [math]\displaystyle{ \epsilon_i \overset{\mathrm{i.i.d}}{\sim} \mathcal{N}(0,\tau^{-1}) }[/math]. If this is not the case, i.e. if the [math]\displaystyle{ y_i }[/math]'s are correlated but this correlation has nothing to do with the [math]\displaystyle{ g_i }[/math]'s, then more variance in the errors won't be accounted for, and we'll loose power when trying to detect weak, yet non-zero [math]\displaystyle{ \beta }[/math].

An intuitive way of removing these confounders is to realize that we can use all gene expression levels to try to identify them. Indeed, batch effects are very likely to affect all genes in a sample (though possibly at different magnitudes). As the effect of the confounders are, as a first approximation, typically much bigger than the effect of a SNP genotype, we can try to learn the confounders using all gene expression levels, and only them. So let's put all of them into a [math]\displaystyle{ G \times N }[/math] matrix [math]\displaystyle{ Y_1 }[/math] with genes in rows and individuals in columns.

For the moment, the data are expressed in the standard basis, i.e. the basis of the observations. But some confounders are present in these data, they contribute with noise and redundancy and hence dilute the signal. The idea is, first, to identify a new basis which will correspond to a "mix" of the original samples (e.g. one component of this "mix" may correspond to the day at which the samples were processed), and second, to remove these components from the data in order to only keep the signal.

to be continued

see also factor analysis, see Stegle et al (PLoS Computational Biology, 2010)

Confounders in genotype: mainly pop structure and genetic relatedness, linear mixed model (LMM), see Zhou & Stephens (Nature Genetics, 2012)

to do

Discrete phenotype: count data (e.g. from RNA-seq), Poisson-like likelihood, generalized linear model (GLM), see Sun (Biometrics, 2012)

to do

Multiple phenotypes: matrix-variate distributions, see Flutre et al (PLoS Genetics, 2013), Wen & Stephens (Annals of Applied Statistics, 2014), Wen (Biometrics, 2014)

to do

Non-independent genes: enrichment in known pathways, learn "modules", distributions on networks

to do

References:
- Servin & Stephens (PLoS Genetics, 2007)
- Guan & Stephens (PLoS Genetics, 2008)
- Stephens & Balding (Nature Reviews Genetics, 2009)

User:Timothee Flutre/Notebook/Postdoc/2011/11/10: Difference between revisions

Revision as of 10:02, 24 July 2014

Bayesian model of univariate linear regression for QTL detection

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools

@@ Line 9: / Line 9: @@
-''See Servin & Stephens (PLoS Genetics, 2007).''
+''This page aims at helping people like me, interested in quantitative genetics, to get a better understanding of some Bayesian models, most importantly the impact of the modeling assumptions as well as the underlying maths. It starts with a simple model, and gradually increases the scope to relax assumptions. See references to scientific articles at the end.''
-* '''Data''': let's assume that we obtained data from N individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression level at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP (as allele dose, 0, 1 or 2).
+* '''Data''': let's assume that we obtained data from N individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression levels at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP (encoded as allele dose: 0, 1 or 2).
@@ Line 21: / Line 21: @@
-* '''Likelihood''':
+* '''Likelihood''': we start by writing the usual linear regression for one individual
-<math>\forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta_1 g_i + \beta_2 \mathbf{1}_{g_i=1} + \epsilon_i</math>
+<math>\forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta_1 g_i + \beta_2 \mathbf{1}_{g_i=1} + \epsilon_i \; \text{ with } \; \epsilon_i \; \overset{i.i.d}{\sim} \; \mathcal{N}(0,\tau^{-1})</math>
-with: <math>\epsilon_i \overset{i.i.d}{\sim} \mathcal{N}(0,\tau^{-1})</math>
+where <math>\beta_1</math> is in fact the additive effect of the SNP, noted <math>a</math> from now on, and <math>\beta_2</math> is the dominance effect of the SNP, <math>d = a k</math>.
+Let's now write the model in vector-matrix notation:
+<math>Y = X B + E \text{ where } B = [ \mu \; a \; d ]^T</math>
+This gives the following [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate Normal distribution] for the phenotypes:
+<math>Y | X, \tau, B \sim \mathcal{N}(XB, \tau^{-1} I_N)</math>
-where <math>\beta_1</math> is in fact the additive effect of the SNP, noted <math>a</math> from now on, and <math>\beta_2</math> is the dominance effect of the SNP, <math>d = a k</math>.
+Even though we can write the likelihood as a multivariate Normal, I still keep the term "univariate" in the title because the regression has a single response, <math>Y</math>.
+It is usual to keep the term "multivariate" for the case where there is a matrix of responses (i.e. multiple phenotypes).
-Let's now write in matrix notation:
+The likelihood of the parameters given the data is therefore:
-<math>Y = X B + E</math>
+<math>\mathcal{L}(\tau, B) = \mathsf{P}(Y | X, \tau, B)</math>
-where <math>B = [ \mu \; a \; d ]^T</math>
+<math>\mathcal{L}(\tau, B) = \left(\frac{\tau}{2 \pi}\right)^{\frac{N}{2}} exp \left( -\frac{\tau}{2} (Y - XB)^T (Y - XB) \right)</math>
-which gives the following conditional distribution for the phenotypes:
-<math>Y | X, B, \tau \sim \mathcal{N}(XB, \tau^{-1} I_N)</math>
+* '''Priors''': we use the usual [http://en.wikipedia.org/wiki/Conjugate_prior conjugate prior]
+<math>\mathsf{P}(\tau, B) = \mathsf{P}(\tau) \mathsf{P}(B | \tau)</math>
-* '''Priors''': conjugate
+A [http://en.wikipedia.org/wiki/Gamma_distribution Gamma distribution] for <math>\tau</math>:
 <math>\tau \sim \Gamma(\kappa/2, \, \lambda/2)</math>
+which means:
+<math>\mathsf{P}(\tau) = \frac{\frac{\lambda}{2}^{\kappa/2}}{\Gamma(\frac{\kappa}{2})} \tau^{\frac{\kappa}{2}-1} e^{-\frac{\lambda}{2} \tau}</math>
+And a multivariate Normal distribution for <math>B</math>:
 <math>B | \tau \sim \mathcal{N}(\vec{0}, \, \tau^{-1} \Sigma_B) \text{ with } \Sigma_B = diag(\sigma_{\mu}^2, \sigma_a^2, \sigma_d^2)</math>
+which means:
+<math>\mathsf{P}(B | \tau) = \left(\frac{\tau}{2 \pi}\right)^{\frac{3}{2}} |\Sigma_B|^{-\frac{1}{2}} exp \left(-\frac{\tau}{2} B^T \Sigma_B^{-1} B \right)</math>
-* '''Joint posterior''':
+* '''Joint posterior (1)''':
 <math>\mathsf{P}(\tau, B | Y, X) = \mathsf{P}(\tau | Y, X) \mathsf{P}(B | Y, X, \tau)</math>
@@ Line 53: / Line 72: @@
 * '''Conditional posterior of B''':
-<math>\mathsf{P}(B | Y, X, \tau) = \mathsf{P}(B, Y | X, \tau)</math>
 <math>\mathsf{P}(B | Y, X, \tau) = \frac{\mathsf{P}(B, Y | X, \tau)}{\mathsf{P}(Y | X, \tau)}</math>
-<math>\mathsf{P}(B | Y, X, \tau) = \frac{\mathsf{P}(B | \tau) \mathsf{P}(Y | X, B, \tau)}{\int \mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B) \mathsf{d}B}</math>
+Let's neglect the normalization constant for now:
-Here and in the following, we neglect all constants (e.g. normalization constant, <math>Y^TY</math>, etc):
 <math>\mathsf{P}(B | Y, X, \tau) \propto \mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B)</math>
-We use the prior and likelihood and keep only the terms in <math>B</math>:
+Similarly, let's keep only the terms in <math>B</math> for the moment:
 <math>\mathsf{P}(B | Y, X, \tau) \propto exp(B^T \Sigma_B^{-1} B) exp((Y-XB)^T(Y-XB))</math>
@@ Line 76: / Line 91: @@
 <math>\mathsf{P}(B | Y, X, \tau) \propto exp(B^T (\Sigma_B^{-1} + X^TX) B - Y^TXB -B^TX^TY)</math>
-Let's define <math>\Omega = (\Sigma_B^{-1} + X^TX)^{-1}</math>. We can see that <math>\Omega^T=\Omega</math>, which means that <math>\Omega</math> is a [http://en.wikipedia.org/wiki/Symmetric_matrix symmetric matrix].
+Importantly, let's define:
+<math>\Omega = (\Sigma_B^{-1} + X^TX)^{-1}</math>
+We can see that <math>\Omega^T=\Omega</math>, which means that <math>\Omega</math> is a [http://en.wikipedia.org/wiki/Symmetric_matrix symmetric matrix].
 This is particularly useful here because we can use the following equality: <math>\Omega^{-1}\Omega^T=I</math>.
 <math>\mathsf{P}(B | Y, X, \tau) \propto exp(B^T \Omega^{-1} B - (X^TY)^T\Omega^{-1}\Omega^TB -B^T\Omega^{-1}\Omega^TX^TY)</math>
-This now becomes easy to factorizes totally:
+This now becomes easy to factorizes totally (remember to add the term in <math>Y</math> to [https://learnbayes.org/index.php?option=com_content&view=article&id=77:completesquare&catid=83:reference&Itemid=479 complete the square]):
-<math>\mathsf{P}(B | Y, X, \tau) \propto exp((B^T - \Omega X^TY)^T\Omega^{-1}(B - \Omega X^TY))</math>
+<math>\mathsf{P}(B | Y, X, \tau) \propto exp((B - \Omega X^TY)^T\Omega^{-1}(B - \Omega X^TY))</math>
 We recognize the [http://en.wikipedia.org/wiki/Kernel_%28statistics%29 kernel] of a Normal distribution, allowing us to write the conditional posterior as:
 <math>B | Y, X, \tau \sim \mathcal{N}(\Omega X^TY, \tau^{-1} \Omega)</math>
+* '''Posterior of <math>\tau</math>''':
+Similarly to the equations above:
+<math>\mathsf{P}(\tau | Y, X) \propto \mathsf{P}(\tau) \mathsf{P}(Y | X, \tau)</math>
+But now, to handle the second term, we need to integrate over <math>B</math>, thus effectively taking into account the uncertainty in <math>B</math>:
+<math>\mathsf{P}(\tau | Y, X) \propto \mathsf{P}(\tau) \int \mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B) \mathsf{d}B</math>
+Again, we use the priors and likelihoods specified above (but everything inside the integral is kept inside it, even if it doesn't depend on <math>B</math>!):
+<math>\mathsf{P}(\tau | Y, X) \propto \tau^{\frac{\kappa}{2} - 1} e^{-\frac{\lambda}{2} \tau} \int \tau^{3/2} \tau^{N/2} exp(-\frac{\tau}{2} B^T \Sigma_B^{-1} B) exp(-\frac{\tau}{2} (Y - XB)^T (Y - XB)) \mathsf{d}B</math>
+As we used a conjugate prior for <math>\tau</math>, we know that we expect a Gamma distribution for the posterior.
+Therefore, we can take <math>\tau^{N/2}</math> out of the integral and start guessing what looks like a Gamma distribution.
+We also factorize inside the exponential:
+<math>\mathsf{P}(\tau | Y, X) \propto \tau^{\frac{N+\kappa}{2} - 1} e^{-\frac{\lambda}{2} \tau} \int \tau^{3/2} exp \left[-\frac{\tau}{2} \left( (B - \Omega X^T Y)^T \Omega^{-1} (B - \Omega X^T Y) - Y^T X \Omega X^T Y + Y^T Y \right) \right] \mathsf{d}B</math>
+We recognize the conditional posterior of <math>B</math>.
+This allows us to use the fact that the pdf of the Normal distribution integrates to one:
+<math>\mathsf{P}(\tau | Y, X) \propto \tau^{\frac{N+\kappa}{2} - 1} e^{-\frac{\lambda}{2} \tau} exp\left[-\frac{\tau}{2} (Y^T Y - Y^T X \Omega X^T Y) \right]</math>
+We finally recognize a Gamma distribution, allowing us to write the posterior as:
+<math>\tau | Y, X \sim \Gamma \left( \frac{N+\kappa}{2}, \; \frac{1}{2} (Y^T Y - Y^T X \Omega X^T Y + \lambda) \right)</math>
+* '''Joint posterior (2)''': sometimes it is said that the joint posterior follows a Normal Inverse Gamma distribution:
+<math>B, \tau | Y, X \sim \mathcal{N}IG(\Omega X^TY, \; \tau^{-1}\Omega, \; \frac{N+\kappa}{2}, \; \frac{\lambda^\ast}{2})</math>
+where <math>\lambda^\ast = Y^T Y - Y^T X \Omega X^T Y + \lambda</math>
+* '''Marginal posterior of B''': we can now integrate out <math>\tau</math>:
+<math>\mathsf{P}(B | Y, X) = \int \mathsf{P}(\tau) \mathsf{P}(B | Y, X, \tau) \mathsf{d}\tau</math>
+<math>\mathsf{P}(B | Y, X) = \frac{\frac{\lambda^\ast}{2}^{\frac{N+\kappa}{2}}}{(2\pi)^\frac{3}{2} |\Omega|^{\frac{1}{2}} \Gamma(\frac{N+\kappa}{2})} \int \tau^{\frac{N+\kappa+3}{2}-1} exp \left[-\tau \left( \frac{\lambda^\ast}{2} + (B - \Omega X^TY)^T \Omega^{-1} (B - \Omega X^TY) \right) \right] \mathsf{d}\tau</math>
+Here we recognize the formula to [http://en.wikipedia.org/wiki/Gamma_function#Integration_problems integrate the Gamma function]:
+<math>\mathsf{P}(B | Y, X) = \frac{\frac{\lambda^\ast}{2}^{\frac{N+\kappa}{2}} \Gamma(\frac{N+\kappa+3}{2})}{(2\pi)^\frac{3}{2} |\Omega|^{\frac{1}{2}} \Gamma(\frac{N+\kappa}{2})} \left( \frac{\lambda^\ast}{2} + (B - \Omega X^TY)^T \Omega^{-1} (B - \Omega X^TY) \right)^{-\frac{N+\kappa+3}{2}}</math>
+And we now recognize a [http://en.wikipedia.org/wiki/Multivariate_t-distribution multivariate Student's t-distribution]:
+<math>\mathsf{P}(B | Y, X) = \frac{\Gamma(\frac{N+\kappa+3}{2})}{\Gamma(\frac{N+\kappa}{2}) \pi^\frac{3}{2} |\lambda^\ast \Omega|^{\frac{1}{2}} } \left( 1 + \frac{(B - \Omega X^TY)^T \Omega^{-1} (B - \Omega X^TY)}{\lambda^\ast} \right)^{-\frac{N+\kappa+3}{2}}</math>
+We hence can write:
+<math>B | Y, X \sim \mathcal{S}_{N+\kappa}(\Omega X^TY, \; (Y^T Y - Y^T X \Omega X^T Y + \lambda) \Omega)</math>
+* '''Bayes Factor''': one way to answer our goal above ("is there an effect of the genotype on the phenotype?") is to do [http://en.wikipedia.org/wiki/Hypothesis_testing hypothesis testing].
+We want to test the following [http://en.wikipedia.org/wiki/Null_hypothesis null hypothesis]:
+<math>H_0: \; a = d = 0</math>
+In Bayesian modeling, hypothesis testing is performed with a [http://en.wikipedia.org/wiki/Bayes_factor Bayes factor], which in our case can be written as:
+<math>\mathrm{BF} = \frac{\mathsf{P}(Y | X, a \neq 0, d \neq 0)}{\mathsf{P}(Y | X, a = 0, d = 0)}</math>
+We can shorten this into:
+<math>\mathrm{BF} = \frac{\mathsf{P}(Y | X)}{\mathsf{P}_0(Y)}</math>
+Note that, compare to frequentist hypothesis testing which focuses on the null, the Bayes factor requires to explicitly model the data under the alternative.
+This makes a big difference when interpreting the results (see below).
+Let's start with the numerator:
+<math>\mathsf{P}(Y | X) = \int \mathsf{P}(\tau) \mathsf{P}(Y | X, \tau) \mathsf{d}\tau</math>
+First, let's calculate what is inside the integral:
+<math>\mathsf{P}(Y | X, \tau) = \frac{\mathsf{P}(B | \tau) \mathsf{P}(Y | X, \tau, B)}{\mathsf{P}(B | Y, X, \tau)}</math>
+Using the formula obtained previously and doing some algebra gives:
+<math>\mathsf{P}(Y | X, \tau) = \left( \frac{\tau}{2 \pi} \right)^{\frac{N}{2}} \left( \frac{|\Omega|}{|\Sigma_B|} \right)^{\frac{1}{2}} exp\left( -\frac{\tau}{2} (Y^TY - Y^TX\Omega X^TY) \right)</math>
+Now we can integrate out <math>\tau</math> (note the small typo in equation 9 of supplementary text S1 of Servin & Stephens):
+<math>\mathsf{P}(Y | X) = (2\pi)^{-\frac{N}{2}} \left( \frac{|\Omega|}{|\Sigma_B|} \right)^{\frac{1}{2}} \frac{\frac{\lambda}{2}^{\frac{\kappa}{2}}}{\Gamma(\frac{\kappa}{2})} \int \tau^{\frac{N+\kappa}{2}-1} exp \left( -\frac{\tau}{2} (Y^TY - Y^TX\Omega X^TY + \lambda) \right) \mathsf{d}\tau</math>
+Inside the integral, we recognize the almost-complete pdf of a Gamma distribution.
+As it has to integrate to one, we get:
+<math>\mathsf{P}(Y | X) = (2\pi)^{-\frac{N}{2}} \left( \frac{|\Omega|}{|\Sigma_B|} \right)^{\frac{1}{2}} \left( \frac{\lambda}{2} \right)^{\frac{\kappa}{2}} \frac{\Gamma(\frac{N+\kappa}{2})}{\Gamma(\frac{\kappa}{2})} \left( \frac{Y^TY - Y^TX\Omega X^TY + \lambda}{2} \right)^{-\frac{N+\kappa}{2}}</math>
+We can use this expression also under the null.
+In this case, as we need neither <math>a</math> nor <math>d</math>, <math>B</math> is simply <math>\mu</math>, <math>\Sigma_B</math> is <math>\sigma_{\mu}^2</math> and <math>X</math> is a vector of 1's.
+We can also defines <math>\Omega_0 = ((\sigma_{\mu}^2)^{-1} + N)^{-1}</math>.
+In the end, this gives:
+<math>\mathsf{P}_0(Y) = (2\pi)^{-\frac{N}{2}} \frac{|\Omega_0|^{\frac{1}{2}}}{\sigma_{\mu}} \left( \frac{\lambda}{2} \right)^{\frac{\kappa}{2}} \frac{\Gamma(\frac{N+\kappa}{2})}{\Gamma(\frac{\kappa}{2})} \left( \frac{Y^TY - \Omega_0 N^2 \bar{Y}^2 + \lambda}{2} \right)^{-\frac{N+\kappa}{2}}</math>
+We can therefore write the Bayes factor:
+<math>\mathrm{BF} = \left( \frac{|\Omega|}{\Omega_0} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY - Y^TX\Omega X^TY + \lambda}{Y^TY - \Omega_0 N^2 \bar{Y}^2 + \lambda} \right)^{-\frac{N+\kappa}{2}}</math>
+When the Bayes factor is large, we say that there is enough evidence in the data to ''support the alternative''.
+Indeed, the Bayesian testing procedure corresponds to measuring support for the specific alternative hypothesis compared to the null hypothesis.
+Importantly, note that, for a frequentist testing procedure, we would say that there is enough evidence in the data to ''reject the null''.
+However we wouldn't say anything about the alternative as we don't model it.
+The threshold to say that a Bayes factor is large depends on the field. It is possible to use the Bayes factor as a test statistic when doing permutation testing, and then control the false discovery rate. This can give an idea of a reasonable threshold.
+* '''Hyperparameters''': the model has 5 hyperparameters, <math>\{\kappa, \, \lambda, \, \sigma_{\mu}, \, \sigma_a, \, \sigma_d\}</math>. How should we choose them?
+Such a question is never easy to answer. But note that all hyperparameters are not that important, especially in typical quantitative genetics applications. For instance, we are mostly interested in those that determine the magnitude of the effects, <math>\sigma_a</math> and <math>\sigma_d</math>, so let's deal with the others first.
+As explained in Servin & Stephens, the posteriors for <math>\tau</math> and <math>B</math> change appropriately with shifts (<math>y+c</math>) and scaling (<math>y \times c</math>) in the phenotype when taking their limits.
+This also gives us a new Bayes factor, the one used in practice (see Guan & Stephens, 2008):
+<math>\mathrm{lim}_{\sigma_{\mu} \rightarrow \infty \; ; \; \lambda \rightarrow 0 \; ; \; \kappa \rightarrow 0 } \; \mathrm{BF} = \left( \frac{N}{|\Sigma_B^{-1} + X^TX|} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY - Y^TX (\Sigma_B^{-1} + X^TX)^{-1} X^TY}{Y^TY - N \bar{Y}^2} \right)^{-\frac{N}{2}}</math>
+Now, for the important hyperparameters, <math>\sigma_a</math> and <math>\sigma_d</math>, it is usual to specify a grid of values, i.e. <math>M</math> pairs <math>(\sigma_a, \sigma_d)</math>. For instance, Guan & Stephens used the following grid:
+<math>M=4 \; ; \; \sigma_a \in \{0.05, 0.1, 0.2, 0.4\} \; ; \; \sigma_d = \frac{\sigma_a}{4}</math>
+Then, we can average the Bayes factors obtained over the grid using, as a first approximation,  equal weights:
+<math>\mathrm{BF} = \sum_{m \, \in \, \text{grid}} \frac{1}{M} \, \mathrm{BF}(\sigma_a^{(m)}, \sigma_d^{(m)})</math>
+In eQTL studies, the weights can be estimated from the data using a hierarchical model (see below), by pooling all genes together as in Veyrieras ''et al'' (PLoS Genetics, 2010).
+* '''Implementation''': the following R function is adapted from Servin & Stephens supplementary text 1.
+ <nowiki>
+BF <- function(G=NULL, Y=NULL, sigma.a=NULL, sigma.d=NULL, get.log10=TRUE){
+  stopifnot(! is.null(G), ! is.null(Y), ! is.null(sigma.a), ! is.null(sigma.d))
+  subset <- complete.cases(Y) & complete.cases(G)
+  Y <- Y[subset]
+  G <- G[subset]
+  stopifnot(length(Y) == length(G))
+  N <- length(G)
+  X <- cbind(rep(1,N), G, G == 1)
+  inv.Sigma.B <- diag(c(0, 1/sigma.a^2, 1/sigma.d^2))
+  inv.Omega <- inv.Sigma.B + t(X) %*% X
+  inv.Omega0 <- N
+  tY.Y <- t(Y) %*% Y
+  log10.BF <- as.numeric(0.5 * log10(inv.Omega0) -
+.5 * log10(det(inv.Omega)) -
+                         log10(sigma.a) - log10(sigma.d) -
+                         (N/2) * (log10(tY.Y - t(Y) %*% X %*% solve(inv.Omega)
+                                        %*% t(X) %*% cbind(Y)) -
+                                  log10(tY.Y - N*mean(Y)^2)))
+  if(get.log10)
+    return(log10.BF)
+  else
+    return(10^log10.BF)
+}
+</nowiki>
+In the same vein as what is explained [http://openwetware.org/wiki/User:Timothee_Flutre/Notebook/Postdoc/2011/06/28 here], we can simulate data under different scenarios and check the BFs:
+ <nowiki>
+N <- 300    # play with it
+PVE <- 0.1  # play with it
+grid <- c(0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2)
+MAF <- 0.3
+G <- rbinom(n=N, size=2, prob=MAF)
+tau <- 1
+a <- sqrt((2/5) * (PVE / (tau * MAF * (1-MAF) * (1-PVE))))
+d <- a / 2
+mu <- rnorm(n=1, mean=0, sd=10)
+Y <- mu + a * G + d * (G == 1) + rnorm(n=N, mean=0, sd=tau)
+for(m in 1:length(grid))
+  print(BF(G, Y, grid[m], grid[m]/4))
+</nowiki>
+* '''Binary phenotype''': using a similar notation, we model case-control studies with a [http://en.wikipedia.org/wiki/Logistic_regression logistic regression] where the probability to be a case is <math>\mathsf{P}(y_i = 1) = p_i</math>.
+There are many equivalent ways to write the likelihood, the usual one being:
+<math>y_i | p_i \; \overset{i.i.d}{\sim} \; \mathrm{Binomial}(1,p_i)</math> with the [http://en.wikipedia.org/wiki/Log-odds log-odds] (logit function) being <math>\mathrm{ln} \frac{p_i}{1 - p_i} = \mu + a \, g_i + d \, \mathbf{1}_{g_i=1}</math>
+Let's use <math>X_i^T=(1 \; g_i \; \mathbf{1}_{g_i=1})</math> to denote the <math>i</math>-th row of the design matrix <math>X</math>. We can also keep the same definition as above for <math>B=(\mu \; a \; d)^T</math>. Thus we have:
+<math>p_i = \frac{e^{X_i^TB}}{1 + e^{X_i^TB}}</math>
+As the <math>y_i</math>'s can only take <math>0</math> and <math>1</math> as values, the likelihood can be written as:
+<math>\mathcal{L}(B) = \mathsf{P}(Y | X, B) = \prod_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i}</math>
+We still use the same prior as above for <math>B</math> (but there is no <math>\tau</math> anymore), so that:
+<math>B | \Sigma_B \sim \mathcal{N}_3(0, \Sigma_B)</math>
+where <math>\Sigma_B</math> is a 3 x 3 matrix with <math>(\sigma_\mu^2 \; \sigma_a^2 \; \sigma_d^2)</math> on the diagonal and 0 elsewhere.
+As above, the Bayes factor is used to compare the two models:
+<math>\mathrm{BF} = \frac{\mathsf{P}(Y | X, M1)}{\mathsf{P}(Y | X, M0)} = \frac{\mathsf{P}(Y | X, a \neq 0, d \neq 0)}{\mathsf{P}(Y | X, a=0, d=0)} = \frac{\int \mathsf{P}(B) \mathsf{P}(Y | X, B) \mathrm{d}B}{\int \mathsf{P}(\mu) \mathsf{P}(Y | X, \mu) \mathrm{d}\mu}</math>
+The interesting point here is that there is no way to analytically calculate these integrals (marginal likelihoods). Therefore, we will use [http://en.wikipedia.org/wiki/Laplace_approximation Laplace's method] to approximate them, as in Guan & Stephens (2008).
+Starting with the numerator:
+<math>\mathsf{P}(Y|X,M1) = \int \exp \left[ N \left( \frac{1}{N} \mathrm{ln} \, \mathsf{P}(B) + \frac{1}{N} \mathrm{ln} \, \mathsf{P}(Y | X, B) \right) \right] \mathsf{d}B</math>
+Let's use <math>f</math> to denote the function inside the exponential:
+<math>\mathsf{P}(Y|X,M1) = \int \exp \left( N \; f(B) \right) \mathsf{d}B</math>
+The function <math>f</math> is defined by:
+<math>f: \mathbb{R}^3 \rightarrow \mathbb{R}</math>
+<math>f(B) = \frac{1}{N} \left( -\frac{3}{2} \mathrm{ln}(2 \pi) - \frac{1}{2} \mathrm{ln}(|\Sigma_B|) - \frac{1}{2}(B^T \Sigma_B^{-1} B) \right) + \frac{1}{N} \sum_{i=1}^N \left( y_i \mathrm{ln}(p_i) + (1-y_i) \mathrm{ln}(1 - p_i) \right)</math>
+This function will then be used to approximate the integral, like this:
+<math>\mathsf{P}(Y|X,M1) \approx N^{-3/2} (2 \pi)^{3/2} |H(B^\star)|^{-1/2} e^{N f(B^\star)}</math>
+where <math>H</math> is the [http://en.wikipedia.org/wiki/Hessian_matrix Hessian] of <math>f</math> and <math>B^\star = (\mu^\star a^\star d^\star)^T</math> is the point at which <math>f</math> is maximized.
+We therefore need two things: <math>H</math> and <math>B^\star</math>. Note that for both we need to calculate the first derivatives of <math>f</math>. As <math>f</math> is multi-dimensional (it takes values in <math>\mathbb{R}^3</math>), we need to calculate its [http://en.wikipedia.org/wiki/Gradient gradient].
+In the following, some formulas from [http://en.wikipedia.org/wiki/Matrix_calculus matrix calculus] are sometimes required. In such cases, I will use the [http://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions numerator layout].
+<math>\nabla f = \frac{\partial f}{\partial B} = \left( \frac{\partial f}{\partial \mu} \; \frac{\partial f}{\partial a} \; \frac{\partial f}{\partial d} \right)</math>
+<math>\nabla f = - \frac{1}{2N} \frac{\partial B^T\Sigma_B^{-1}B}{\partial B} + \frac{1}{N} \sum_i \left( y_i \frac{\partial \mathrm{ln}(p_i)}{\partial B} + (1-y_i) \frac{\partial \mathrm{ln}(1-p_i)}{\partial B} \right)</math>
+<math>\nabla f = - \frac{1}{N} B^T\Sigma_B^{-1} + \frac{1}{N} \sum_i \left( \frac{y_i}{p_i} - \frac{1-y_i}{1-p_i} \right) \frac{\partial p_i}{\partial B}</math>
+A simple form for the first derivatives of <math>p_i</math> also exists when writing <math>p_i = e^{X_i^tB} (1 + e^{X_i^tB})^{-1}</math>:
+<math>\frac{\partial p_i}{\partial B} = \frac{\partial e^{X_i^TB}}{\partial B} (1 + e^{X_i^TB})^{-1} + e^{X_i^TB} \frac{\partial (1 + e^{X_i^TB})^{-1}}{\partial B}</math>
+<math>\frac{\partial p_i}{\partial B} = e^{X_i^TB} \frac{\partial X_i^TB}{\partial B} (1 + e^{X_i^TB})^{-1} - e^{X_i^TB} (1 + e^{X_i^TB})^{-2} \frac{\partial (1 + e^{X_i^TB})}{\partial B}</math>
+<math>\frac{\partial p_i}{\partial B} = p_i X_i^T - p_i (1 + e^{X_i^TB})^{-1} e^{X_i^TB} \frac{\partial X_i^TB}{\partial B}</math>
+<math>\frac{\partial p_i}{\partial B} = p_i (1 - p_i) X_i^T</math>
+This simplifies the gradient of <math>f</math> into:
+<math>\nabla f = - \frac{1}{N} B^T\Sigma_B^{-1} + \frac{1}{N} \sum_i (y_i - p_i) X_i^T</math>
+To find <math>B^\star</math>, we set <math>\nabla f(B^\star) = 0</math>. However, in this equation, <math>B^\star</math> is present not only alone but also in the <math>p_i</math>'s. As <math>p_i</math> is a non-linear function of <math>B</math>, the equation can't be solved directly but an iterative procedure is required, typically a [http://en.wikipedia.org/wiki/Nonlinear_conjugate_gradient_method conjugate gradient method] (as in Guan & Stephens) or [http://en.wikipedia.org/wiki/Newton_method_in_optimization Newton's method]. The former only requires <math>f</math> and <math>\nabla f</math> while the latter also requires <math>H</math>. Remember that, in any case, we need <math>H</math> for the Laplace approximation, so let's calculate it:
+<math>H = - \frac{1}{N} \Sigma_B^{-1} - \frac{1}{N} \sum_i \frac{\partial p_i}{\partial B} X_i^T</math>
+<math>H = - \frac{1}{N} \Sigma_B^{-1} - \frac{1}{N} \sum_i X_i^T p_i (1-p_i) X_i</math>
+<math>H = - \frac{1}{N} (\Sigma_B^{-1} + X^T W X)</math>
+where <math>W</math> is the N x N matrix with <math>p_i(1-p_i)</math> on the diagonal.
+Note that all second derivatives of <math>f</math> are strictly negative. Therefore, <math>f</math> is globally convex, which means that it has a unique global maximum, at <math>B^\star</math>. As a consequence, we have the right to use Laplace's method to approximate the integral of <math>f</math> around its maximum.
+implementation in R -> to do
+finding the effect sizes and their std error: to do
+* '''Link between Bayes factor and P-value''': see Wakefield (2008)
+to do
+* '''Hierarchical model''': pooling genes, learn weights for grid and genomic annotations, see Veyrieras ''et al'' (PLoS Genetics, 2010)
+to do
+* '''Multiple SNPs with LD''': joint analysis of multiple SNPs, handle correlation between them, see Guan & Stephens (Annals of Applied Statistics, 2011) for MCMC, see Carbonetto & Stephens (Bayesian Analysis, 2012) for Variational Bayes
+to do
+* '''Confounders in phenotype''': it is well known in molecular biology that any experiment involving several assays (e.g. measuring gene expression levels with a DNA microarray) suffers from "unknown confounders", the most (in)famous being the so-called "batch effects".
+For instance, samples from individual 1 and 2 are correlated with each other because they were treated another day than all the other samples. Such a correlations has nothing to do with the genotype at a given SNP (in most cases). However, the core model, <math>y_i = \mu + \beta g_i + \epsilon_i</math> assumes that the errors are uncorrelated between individuals: <math>\epsilon_i \overset{\mathrm{i.i.d}}{\sim} \mathcal{N}(0,\tau^{-1})</math>. If this is not the case, i.e. if the <math>y_i</math>'s are correlated but this correlation has nothing to do with the <math>g_i</math>'s, then more variance in the errors won't be accounted for, and we'll loose power when trying to detect weak, yet non-zero <math>\beta</math>.
+An intuitive way of removing these confounders is to realize that we can use all gene expression levels to try to identify them. Indeed, batch effects are very likely to affect all genes in a sample (though possibly at different magnitudes). As the effect of the confounders are, as a first approximation, typically much bigger than the effect of a SNP genotype, we can try to learn the confounders using all gene expression levels, and only them. So let's put all of them into a <math>G \times N</math> matrix <math>Y_1</math> with genes in rows and individuals in columns.
+For the moment, the data are expressed in the [http://en.wikipedia.org/wiki/Standard_basis standard basis], i.e. the basis of the observations. But some confounders are present in these data, they contribute with noise and redundancy and hence dilute the signal. The idea is, first, to identify a new basis which will correspond to a "mix" of the original samples (e.g. one component of this "mix" may correspond to the day at which the samples were processed), and second, to remove these components from the data in order to only keep the signal.
+to be continued
+see also factor analysis, see Stegle ''et al'' (PLoS Computational Biology, 2010)
+* '''Confounders in genotype''': mainly pop structure and genetic relatedness, linear mixed model (LMM), see Zhou & Stephens (Nature Genetics, 2012)
+to do
+* '''Discrete phenotype''': count data (e.g. from RNA-seq), Poisson-like likelihood, generalized linear model (GLM), see Sun (Biometrics, 2012)
+to do
+* '''Multiple phenotypes''': matrix-variate distributions, see Flutre et al (PLoS Genetics, 2013), Wen & Stephens (Annals of Applied Statistics, 2014), Wen (Biometrics, 2014)
+to do
+* '''Non-independent genes''': enrichment in known pathways, learn "modules", distributions on networks
+to do
+* '''References''':
+** Servin & Stephens (PLoS Genetics, 2007)
+** Guan & Stephens (PLoS Genetics, 2008)
+** Stephens & Balding (Nature Reviews Genetics, 2009)
 <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->