Difference between revisions of "User:Timothee Flutre/Notebook/Postdoc/2011/11/10"
(→Bayesian model of univariate linear regression for QTL detection: finish BF) 
(→Bayesian model of univariate linear regression for QTL detection: add choice of hyperparam + list of future points) 

Line 9:  Line 9:  
−  ''See  +  ''This page aims at helping people like me, interested in quantitative genetics, to get a better understanding of some Bayesian models, most importantly the impact of the modeling assumptions as well as the underlying maths. It starts with a simple model, and gradually increases the scope to relax assumptions. See references to scientific articles at the end.'' 
Line 172:  Line 172:  
In Bayesian modeling, hypothesis testing is performed with a [http://en.wikipedia.org/wiki/Bayes_factor Bayes factor], which in our case can be written as:  In Bayesian modeling, hypothesis testing is performed with a [http://en.wikipedia.org/wiki/Bayes_factor Bayes factor], which in our case can be written as:  
−  <math>BF = \frac{\mathsf{P}(Y  X, a \neq 0, d \neq 0)}{\mathsf{P}(Y  X, a = 0, d = 0)}</math>  +  <math>\mathrm{BF} = \frac{\mathsf{P}(Y  X, a \neq 0, d \neq 0)}{\mathsf{P}(Y  X, a = 0, d = 0)}</math> 
We can shorten this into:  We can shorten this into:  
−  <math>BF = \frac{\mathsf{P}(Y  X)}{\mathsf{P}_0(Y)}</math>  +  <math>\mathrm{BF} = \frac{\mathsf{P}(Y  X)}{\mathsf{P}_0(Y)}</math> 
Note that, compare to frequentist hypothesis testing which focuses on the null, the Bayes factor requires to explicitly model the data under the alternative.  Note that, compare to frequentist hypothesis testing which focuses on the null, the Bayes factor requires to explicitly model the data under the alternative.  
+  This makes a big difference when interpreting the results (see below).  
Let's start with the numerator:  Let's start with the numerator:  
Line 210:  Line 211:  
We can therefore write the Bayes factor:  We can therefore write the Bayes factor:  
−  <math>BF = \left( \frac{\Omega}{\Omega_0} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY  Y^TX\Omega X^TY + \lambda}{Y^TY  \Omega_0 N^2 \bar{Y}^2 + \lambda} \right)^{\frac{N+\kappa}{2}}</math>  +  <math>\mathrm{BF} = \left( \frac{\Omega}{\Omega_0} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY  Y^TX\Omega X^TY + \lambda}{Y^TY  \Omega_0 N^2 \bar{Y}^2 + \lambda} \right)^{\frac{N+\kappa}{2}}</math> 
+  In genetics, effect sizes are usually small, therefore difficult to detect.  
+  Hence it is usual to be interested in Bayes factors only when they exceed <math>10^4</math> for instance.  
+  In such cases, we say that there is enough evidence in the data to ''support the alternative''.  
+  Indeed, the Bayesian testing procedure corresponds to measuring support for the specific alternative hypothesis compared to the null hypothesis.  
+  Importantly, note that, for a frequentist testing procedure, we would say that there is enough evidence in the data to ''reject the null''.  
+  However we wouldn't say anything about the alternative as we don't model it.  
−  
−  +  * '''Hyperparameters''': the model has 5 hyperparameters, <math>\{\kappa, \, \lambda, \, \sigma_{\mu}, \, \sigma_a, \, \sigma_d\}</math>. How should we choose them?  
+  Such a question is never easy to answer. But note that all hyperparameters are not that important, especially in typical quantitative genetics applications. For instance, we are mostly interested in those that determine the magnitude of the effects, <math>\sigma_a</math> and <math>\sigma_d</math>, so let's deal with the others first.  
−  average BF  +  As explained in Servin & Stephens, the posteriors for <math>\tau</math> and <math>B</math> change appropriately with shifts (<math>y+c</math>) and scaling (<math>y \times c</math>) in the phenotype when taking their limits. 
+  This also gives us a new Bayes factor, the one used in practice (see Guan & Stephens, 2008):  
+  
+  <math>\mathrm{lim}_{\sigma_{\mu} \rightarrow \infty \; ; \; \lambda \rightarrow 0 \; ; \; \kappa \rightarrow 0 } \; \mathrm{BF} = \left( \frac{N}{\Sigma_B^{1} + X^TX} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY  Y^TX (\Sigma_B^{1} + X^TX)^{1} X^TY}{Y^TY  N \bar{Y}^2} \right)^{\frac{N}{2}}</math>  
+  
+  Now, for the important hyperparameters, <math>\sigma_a</math> and <math>\sigma_d</math>, it is usual to specify a grid of values, i.e. <math>M</math> pairs <math>(\sigma_a, \sigma_d)</math>. For instance, Guan & Stephens used the following grid:  
+  
+  <math>M=4 \; ; \; \sigma_a \in \{0.05, 0.1, 0.2, 0.4\} \; ; \; \sigma_d = \frac{\sigma_a}{4}</math>  
+  
+  Then, we can average the Bayes factors obtained over the grid using, as a first approximation, equal weights:  
+  
+  <math>\mathrm{BF} = \sum_{m \, \in \, \text{grid}} \frac{1}{M} \, \mathrm{BF}(\sigma_a^{(m)}, \sigma_d^{(m)})</math>  
Line 223:  Line 241:  
to do  to do  
+  
+  
+  * '''Binary phenotype''': casecontrol like in GWAS, logistic regression, see Guan & Stephens (2008) for Laplace approximation  
+  
+  to do  
+  
+  
+  * '''Link between Bayes factor and Pvalue''': see Wakeley (2008)  
+  
+  to do  
+  
+  
+  * '''Hierarchical model''': pooling genes, learn weights for grid and genomic annotations, see Veyrieras ''et al'' (PLoS Genetics, 2010)  
+  
+  to do  
+  
+  
+  * '''Multiple SNPs with LD''': joint analysis of multiple SNPs, handle correlation between them, see Guan & Stephens (Annals of Applied Statistics, 2011) for MCMC, see Carbonetto & Stephens (Bayesian Analysis, 2012) for Variational Bayes  
+  
+  to do  
+  
+  
+  * '''Confounding factors in phenotype''': factor analysis, see Stegle ''et al'' (PLoS Computational Biology, 2010)  
+  
+  to do  
+  
+  
+  * '''Genetic relatedness''': linear mixed model  
+  
+  to do  
+  
+  
+  * '''Discrete phenotype''': count data as from RNAseq, Poissonlike likelihood  
+  
+  to do  
+  
+  
+  * '''Multiple phenotypes''': matrixvariate distributions, tensors  
+  
+  to do  
+  
+  
+  * '''Nonindependent genes''': enrichment in known pathways, learn "modules"  
+  
+  to do  
+  
+  
+  * '''References''':  
+  ** Servin & Stephens (PLoS Genetics, 2007)  
+  ** Guan & Stephens (PLoS Genetics, 2008)  
+  ** Stephens & Balding (Nature Reviews Genetics, 2009)  
<! ##### DO NOT edit below this line unless you know what you are doing. ##### >  <! ##### DO NOT edit below this line unless you know what you are doing. ##### > 
Revision as of 17:47, 22 November 2012
Project name  <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page <html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html> </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> 
Bayesian model of univariate linear regression for QTL detectionThis page aims at helping people like me, interested in quantitative genetics, to get a better understanding of some Bayesian models, most importantly the impact of the modeling assumptions as well as the underlying maths. It starts with a simple model, and gradually increases the scope to relax assumptions. See references to scientific articles at the end.
[math]\forall i \in \{1,\ldots,N\}, \; y_i = \mu + \beta_1 g_i + \beta_2 \mathbf{1}_{g_i=1} + \epsilon_i \text{ with } \epsilon_i \overset{i.i.d}{\sim} \mathcal{N}(0,\tau^{1})[/math] where [math]\beta_1[/math] is in fact the additive effect of the SNP, noted [math]a[/math] from now on, and [math]\beta_2[/math] is the dominance effect of the SNP, [math]d = a k[/math]. Let's now write the model in matrix notation: [math]Y = X B + E \text{ where } B = [ \mu \; a \; d ]^T[/math] This gives the following multivariate Normal distribution for the phenotypes: [math]Y  X, \tau, B \sim \mathcal{N}(XB, \tau^{1} I_N)[/math] Even though we can write the likelihood as a multivariate Normal, I still keep the term "univariate" in the title because the regression has a single response, [math]Y[/math]. It is usual to keep the term "multivariate" for the case where there is a matrix of responses (i.e. multiple phenotypes). The likelihood of the parameters given the data is therefore: [math]\mathcal{L}(\tau, B) = \mathsf{P}(Y  X, \tau, B)[/math] [math]\mathcal{L}(\tau, B) = \left(\frac{\tau}{2 \pi}\right)^{\frac{N}{2}} exp \left( \frac{\tau}{2} (Y  XB)^T (Y  XB) \right)[/math]
[math]\mathsf{P}(\tau, B) = \mathsf{P}(\tau) \mathsf{P}(B  \tau)[/math] A Gamma distribution for [math]\tau[/math]: [math]\tau \sim \Gamma(\kappa/2, \, \lambda/2)[/math] which means: [math]\mathsf{P}(\tau) = \frac{\frac{\lambda}{2}^{\kappa/2}}{\Gamma(\frac{\kappa}{2})} \tau^{\frac{\kappa}{2}1} e^{\frac{\lambda}{2} \tau}[/math] And a multivariate Normal distribution for [math]B[/math]: [math]B  \tau \sim \mathcal{N}(\vec{0}, \, \tau^{1} \Sigma_B) \text{ with } \Sigma_B = diag(\sigma_{\mu}^2, \sigma_a^2, \sigma_d^2)[/math] which means: [math]\mathsf{P}(B  \tau) = \left(\frac{\tau}{2 \pi}\right)^{\frac{3}{2}} \Sigma_B^{\frac{1}{2}} exp \left(\frac{\tau}{2} B^T \Sigma_B^{1} B \right)[/math]
[math]\mathsf{P}(\tau, B  Y, X) = \mathsf{P}(\tau  Y, X) \mathsf{P}(B  Y, X, \tau)[/math]
[math]\mathsf{P}(B  Y, X, \tau) = \frac{\mathsf{P}(B, Y  X, \tau)}{\mathsf{P}(Y  X, \tau)}[/math] Let's neglect the normalization constant for now: [math]\mathsf{P}(B  Y, X, \tau) \propto \mathsf{P}(B  \tau) \mathsf{P}(Y  X, \tau, B)[/math] Similarly, let's keep only the terms in [math]B[/math] for the moment: [math]\mathsf{P}(B  Y, X, \tau) \propto exp(B^T \Sigma_B^{1} B) exp((YXB)^T(YXB))[/math] We expand: [math]\mathsf{P}(B  Y, X, \tau) \propto exp(B^T \Sigma_B^{1} B  Y^TXB B^TX^TY + B^TX^TXB)[/math] We factorize some terms: [math]\mathsf{P}(B  Y, X, \tau) \propto exp(B^T (\Sigma_B^{1} + X^TX) B  Y^TXB B^TX^TY)[/math] Importantly, let's define: [math]\Omega = (\Sigma_B^{1} + X^TX)^{1}[/math] We can see that [math]\Omega^T=\Omega[/math], which means that [math]\Omega[/math] is a symmetric matrix. This is particularly useful here because we can use the following equality: [math]\Omega^{1}\Omega^T=I[/math]. [math]\mathsf{P}(B  Y, X, \tau) \propto exp(B^T \Omega^{1} B  (X^TY)^T\Omega^{1}\Omega^TB B^T\Omega^{1}\Omega^TX^TY)[/math] This now becomes easy to factorizes totally: [math]\mathsf{P}(B  Y, X, \tau) \propto exp((B^T  \Omega X^TY)^T\Omega^{1}(B  \Omega X^TY))[/math] We recognize the kernel of a Normal distribution, allowing us to write the conditional posterior as: [math]B  Y, X, \tau \sim \mathcal{N}(\Omega X^TY, \tau^{1} \Omega)[/math]
Similarly to the equations above: [math]\mathsf{P}(\tau  Y, X) \propto \mathsf{P}(\tau) \mathsf{P}(Y  X, \tau)[/math] But now, to handle the second term, we need to integrate over [math]B[/math], thus effectively taking into account the uncertainty in [math]B[/math]: [math]\mathsf{P}(\tau  Y, X) \propto \mathsf{P}(\tau) \int \mathsf{P}(B  \tau) \mathsf{P}(Y  X, \tau, B) \mathsf{d}B[/math] Again, we use the priors and likelihoods specified above (but everything inside the integral is kept inside it, even if it doesn't depend on [math]B[/math]!): [math]\mathsf{P}(\tau  Y, X) \propto \tau^{\frac{\kappa}{2}  1} e^{\frac{\lambda}{2} \tau} \int \tau^{3/2} \tau^{N/2} exp(\frac{\tau}{2} B^T \Sigma_B^{1} B) exp(\frac{\tau}{2} (Y  XB)^T (Y  XB)) \mathsf{d}B[/math] As we used a conjugate prior for [math]\tau[/math], we know that we expect a Gamma distribution for the posterior. Therefore, we can take [math]\tau^{N/2}[/math] out of the integral and start guessing what looks like a Gamma distribution. We also factorize inside the exponential: [math]\mathsf{P}(\tau  Y, X) \propto \tau^{\frac{N+\kappa}{2}  1} e^{\frac{\lambda}{2} \tau} \int \tau^{3/2} exp \left[\frac{\tau}{2} \left( (B  \Omega X^T Y)^T \Omega^{1} (B  \Omega X^T Y)  Y^T X \Omega X^T Y + Y^T Y \right) \right] \mathsf{d}B[/math] We recognize the conditional posterior of [math]B[/math]. This allows us to use the fact that the pdf of the Normal distribution integrates to one: [math]\mathsf{P}(\tau  Y, X) \propto \tau^{\frac{N+\kappa}{2}  1} e^{\frac{\lambda}{2} \tau} exp\left[\frac{\tau}{2} (Y^T Y  Y^T X \Omega X^T Y) \right][/math] We finally recognize a Gamma distribution, allowing us to write the posterior as: [math]\tau  Y, X \sim \Gamma \left( \frac{N+\kappa}{2}, \; \frac{1}{2} (Y^T Y  Y^T X \Omega X^T Y + \lambda) \right)[/math]
[math]B, \tau  Y, X \sim \mathcal{N}IG(\Omega X^TY, \; \tau^{1}\Omega, \; \frac{N+\kappa}{2}, \; \frac{\lambda^\ast}{2})[/math] where [math]\lambda^\ast = Y^T Y  Y^T X \Omega X^T Y + \lambda[/math]
[math]\mathsf{P}(B  Y, X) = \int \mathsf{P}(\tau) \mathsf{P}(B  Y, X, \tau) \mathsf{d}\tau[/math] [math]\mathsf{P}(B  Y, X) = \frac{\frac{\lambda^\ast}{2}^{\frac{N+\kappa}{2}}}{(2\pi)^\frac{3}{2} \Omega^{\frac{1}{2}} \Gamma(\frac{N+\kappa}{2})} \int \tau^{\frac{N+\kappa+3}{2}1} exp \left[\tau \left( \frac{\lambda^\ast}{2} + (B  \Omega X^TY)^T \Omega^{1} (B  \Omega X^TY) \right) \right] \mathsf{d}\tau[/math] Here we recognize the formula to integrate the Gamma function: [math]\mathsf{P}(B  Y, X) = \frac{\frac{\lambda^\ast}{2}^{\frac{N+\kappa}{2}} \Gamma(\frac{N+\kappa+3}{2})}{(2\pi)^\frac{3}{2} \Omega^{\frac{1}{2}} \Gamma(\frac{N+\kappa}{2})} \left( \frac{\lambda^\ast}{2} + (B  \Omega X^TY)^T \Omega^{1} (B  \Omega X^TY) \right)^{\frac{N+\kappa+3}{2}}[/math] And we now recognize a multivariate Student's tdistribution: [math]\mathsf{P}(B  Y, X) = \frac{\Gamma(\frac{N+\kappa+3}{2})}{\Gamma(\frac{N+\kappa}{2}) \pi^\frac{3}{2} \lambda^\ast \Omega^{\frac{1}{2}} } \left( 1 + \frac{(B  \Omega X^TY)^T \Omega^{1} (B  \Omega X^TY)}{\lambda^\ast} \right)^{\frac{N+\kappa+3}{2}}[/math] We hence can write: [math]B  Y, X \sim \mathcal{S}_{N+\kappa}(\Omega X^TY, \; (Y^T Y  Y^T X \Omega X^T Y + \lambda) \Omega)[/math]
We want to test the following null hypothesis: [math]H_0: \; a = d = 0[/math] In Bayesian modeling, hypothesis testing is performed with a Bayes factor, which in our case can be written as: [math]\mathrm{BF} = \frac{\mathsf{P}(Y  X, a \neq 0, d \neq 0)}{\mathsf{P}(Y  X, a = 0, d = 0)}[/math] We can shorten this into: [math]\mathrm{BF} = \frac{\mathsf{P}(Y  X)}{\mathsf{P}_0(Y)}[/math] Note that, compare to frequentist hypothesis testing which focuses on the null, the Bayes factor requires to explicitly model the data under the alternative. This makes a big difference when interpreting the results (see below). Let's start with the numerator: [math]\mathsf{P}(Y  X) = \int \mathsf{P}(\tau) \mathsf{P}(Y  X, \tau) \mathsf{d}\tau[/math] First, let's calculate what is inside the integral: [math]\mathsf{P}(Y  X, \tau) = \frac{\mathsf{P}(B  \tau) \mathsf{P}(Y  X, \tau, B)}{\mathsf{P}(B  Y, X, \tau)}[/math] Using the formula obtained previously and doing some algebra gives: [math]\mathsf{P}(Y  X, \tau) = \left( \frac{\tau}{2 \pi} \right)^{\frac{N}{2}} \left( \frac{\Omega}{\Sigma_B} \right)^{\frac{1}{2}} exp\left( \frac{\tau}{2} (Y^TY  Y^TX\Omega X^TY) \right)[/math] Now we can integrate out [math]\tau[/math] (note the small typo in equation 9 of supplementary text S1 of Servin & Stephens): [math]\mathsf{P}(Y  X) = (2\pi)^{\frac{N}{2}} \left( \frac{\Omega}{\Sigma_B} \right)^{\frac{1}{2}} \frac{\frac{\lambda}{2}^{\frac{\kappa}{2}}}{\Gamma(\frac{\kappa}{2})} \int \tau^{\frac{N+\kappa}{2}1} exp \left( \frac{\tau}{2} (Y^TY  Y^TX\Omega X^TY + \lambda) \right)[/math] Inside the integral, we recognize the almostcomplete pdf of a Gamma distribution. As it has to integrate to one, we get: [math]\mathsf{P}(Y  X) = (2\pi)^{\frac{N}{2}} \left( \frac{\Omega}{\Sigma_B} \right)^{\frac{1}{2}} \left( \frac{\lambda}{2} \right)^{\frac{\kappa}{2}} \frac{\Gamma(\frac{N+\kappa}{2})}{\Gamma(\frac{\kappa}{2})} \left( \frac{Y^TY  Y^TX\Omega X^TY + \lambda}{2} \right)^{\frac{N+\kappa}{2}}[/math] We can use this expression also under the null. In this case, as we need neither [math]a[/math] nor [math]d[/math], [math]B[/math] is simply [math]\mu[/math], [math]\Sigma_B[/math] is [math]\sigma_{\mu}^2[/math] and [math]X[/math] is a vector of 1's. We can also defines [math]\Omega_0 = ((\sigma_{\mu}^2)^{1} + N)^{1}[/math]. In the end, this gives: [math]\mathsf{P}_0(Y) = (2\pi)^{\frac{N}{2}} \frac{\Omega_0^{\frac{1}{2}}}{\sigma_{\mu}} \left( \frac{\lambda}{2} \right)^{\frac{\kappa}{2}} \frac{\Gamma(\frac{N+\kappa}{2})}{\Gamma(\frac{\kappa}{2})} \left( \frac{Y^TY  \Omega_0 N^2 \bar{Y}^2 + \lambda}{2} \right)^{\frac{N+\kappa}{2}}[/math] We can therefore write the Bayes factor: [math]\mathrm{BF} = \left( \frac{\Omega}{\Omega_0} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY  Y^TX\Omega X^TY + \lambda}{Y^TY  \Omega_0 N^2 \bar{Y}^2 + \lambda} \right)^{\frac{N+\kappa}{2}}[/math] In genetics, effect sizes are usually small, therefore difficult to detect. Hence it is usual to be interested in Bayes factors only when they exceed [math]10^4[/math] for instance. In such cases, we say that there is enough evidence in the data to support the alternative. Indeed, the Bayesian testing procedure corresponds to measuring support for the specific alternative hypothesis compared to the null hypothesis. Importantly, note that, for a frequentist testing procedure, we would say that there is enough evidence in the data to reject the null. However we wouldn't say anything about the alternative as we don't model it.
Such a question is never easy to answer. But note that all hyperparameters are not that important, especially in typical quantitative genetics applications. For instance, we are mostly interested in those that determine the magnitude of the effects, [math]\sigma_a[/math] and [math]\sigma_d[/math], so let's deal with the others first. As explained in Servin & Stephens, the posteriors for [math]\tau[/math] and [math]B[/math] change appropriately with shifts ([math]y+c[/math]) and scaling ([math]y \times c[/math]) in the phenotype when taking their limits. This also gives us a new Bayes factor, the one used in practice (see Guan & Stephens, 2008): [math]\mathrm{lim}_{\sigma_{\mu} \rightarrow \infty \; ; \; \lambda \rightarrow 0 \; ; \; \kappa \rightarrow 0 } \; \mathrm{BF} = \left( \frac{N}{\Sigma_B^{1} + X^TX} \right)^{\frac{1}{2}} \frac{1}{\sigma_a \sigma_d} \left( \frac{Y^TY  Y^TX (\Sigma_B^{1} + X^TX)^{1} X^TY}{Y^TY  N \bar{Y}^2} \right)^{\frac{N}{2}}[/math] Now, for the important hyperparameters, [math]\sigma_a[/math] and [math]\sigma_d[/math], it is usual to specify a grid of values, i.e. [math]M[/math] pairs [math](\sigma_a, \sigma_d)[/math]. For instance, Guan & Stephens used the following grid: [math]M=4 \; ; \; \sigma_a \in \{0.05, 0.1, 0.2, 0.4\} \; ; \; \sigma_d = \frac{\sigma_a}{4}[/math] Then, we can average the Bayes factors obtained over the grid using, as a first approximation, equal weights: [math]\mathrm{BF} = \sum_{m \, \in \, \text{grid}} \frac{1}{M} \, \mathrm{BF}(\sigma_a^{(m)}, \sigma_d^{(m)})[/math]
to do
to do
to do
to do
to do
to do
to do
to do
to do
to do
