User:Timothee Flutre/Notebook/Postdoc/2011/06/28
==Simple linear regression==


* '''Data''': Let's assume that we obtained data from <math>N</math> individuals. We note <math>y_1,\ldots,y_N</math> the (quantitative) phenotypes (e.g. expression levels at a given gene), and <math>g_1,\ldots,g_N</math> the genotypes at a given SNP. We want to assess their linear relationship.


* '''Model''': to do this, we use a simple linear regression (univariate phenotype, single predictor).


<math>\forall n \in \{1,\ldots,N\}, \; y_n = \mu + \beta g_n + \epsilon_n \text{ with } \epsilon_n \sim N(0,\sigma^2)</math>


In matrix notation:


<math>y = X \theta + \epsilon</math> with <math>\epsilon \sim N_N(0,\sigma^2 I_N)</math> and <math>\theta^T = (\mu, \beta)</math>
 
Most importantly, we want the following estimates: <math>\hat{\beta}</math>, its standard error <math>se(\hat{\beta})</math>, and <math>\hat{\sigma}</math>. In the case where we don't have access to the original data (e.g. because the genotypes are confidential) but only to some summary statistics (see below), it is still possible to calculate these estimates.
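
To make the setting concrete, here is a minimal simulation sketch in Python (assuming numpy is available; the genotype coding and all parameter values below are arbitrary illustrative choices, not part of the derivation):

<syntaxhighlight lang="python">
import numpy as np

# simulate data under the model above (all values arbitrary)
rng = np.random.default_rng(1)
N = 100
g = rng.integers(0, 3, size=N).astype(float)  # genotypes coded as 0/1/2
mu, beta, sigma = 4.0, 0.5, 1.0               # "true" parameter values
y = mu + beta * g + rng.normal(0.0, sigma, size=N)

X = np.column_stack([np.ones(N), g])          # N x 2 design matrix
</syntaxhighlight>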


Here is the ordinary least squares (OLS) estimator of <math>\theta</math>:

<math>\hat{\theta} = (X^TX)^{-1} X^Ty</math>


<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\left( \begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix}
\begin{bmatrix} 1 & g_1 \\ \vdots & \vdots \\ 1 & g_N \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & \ldots & 1 \\ g_1 & \ldots & g_N \end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
</math>


<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\begin{bmatrix} N & \sum_n g_n \\ \sum_n g_n & \sum_n g_n^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}
</math>


<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2}
\begin{bmatrix} \sum_n g_n^2 & - \sum_n g_n \\ - \sum_n g_n & N \end{bmatrix}
\begin{bmatrix} \sum_n y_n \\ \sum_n g_n y_n \end{bmatrix}
</math>


<math>\begin{bmatrix} \hat{\mu} \\ \hat{\beta} \end{bmatrix} =
\frac{1}{N \sum_n g_n^2 - (\sum_n g_n)^2}
\begin{bmatrix} \sum_n g_n^2 \sum_n y_n - \sum_n g_n \sum_n g_n y_n \\ - \sum_n g_n \sum_n y_n + N \sum_n g_n y_n \end{bmatrix}
</math>
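
As a sanity check, this closed form can be compared against a generic least-squares solver. A minimal sketch, reusing the simulated data from the snippet above (repeated here so the block is self-contained):

<syntaxhighlight lang="python">
import numpy as np

# same simulated data as in the first snippet
rng = np.random.default_rng(1)
N = 100
g = rng.integers(0, 3, size=N).astype(float)
y = 4.0 + 0.5 * g + rng.normal(0.0, 1.0, size=N)

# closed-form OLS estimates from the last formula above
denom = N * np.sum(g**2) - np.sum(g)**2
mu_hat = (np.sum(g**2) * np.sum(y) - np.sum(g) * np.sum(g * y)) / denom
beta_hat = (N * np.sum(g * y) - np.sum(g) * np.sum(y)) / denom

# cross-check against a generic solver
X = np.column_stack([np.ones(N), g])
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose([mu_hat, beta_hat], theta_hat)
</syntaxhighlight>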


Let's now define four summary statistics, all very easy to compute:


<math>\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n</math>

<math>\bar{g} = \frac{1}{N} \sum_{n=1}^N g_n</math>

<math>g^T g = \sum_{n=1}^N g_n^2</math>

<math>g^T y = \sum_{n=1}^N g_n y_n</math>


This allows us to obtain the estimate of the effect size from the summary statistics alone:


<math>\hat{\beta} = \frac{g^T y - N \bar{g} \bar{y}}{g^T g - N \bar{g}^2}</math>
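
A sketch of this computation in Python: the individual-level <code>g</code> and <code>y</code> are used only to build the four statistics, which then suffice on their own (simulated data as before):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
N = 100
g = rng.integers(0, 3, size=N).astype(float)
y = 4.0 + 0.5 * g + rng.normal(0.0, 1.0, size=N)

# the four summary statistics
y_bar, g_bar = y.mean(), g.mean()
gtg, gty = g @ g, g @ y

# effect-size estimate from the summary statistics alone
beta_hat = (gty - N * g_bar * y_bar) / (gtg - N * g_bar**2)
</syntaxhighlight>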


The same works for the estimate of the standard deviation of the errors, where <math>r</math> denotes the rank of <math>X</math> (here <math>r=2</math>):

<math>\hat{\sigma}^2 = \frac{1}{N-r}(y - X\hat{\theta})^T(y - X\hat{\theta})</math>

Indeed, expanding the quadratic form and using the normal equations <math>X^TX\hat{\theta} = X^Ty</math> gives <math>\hat{\sigma}^2 = \frac{1}{N-r} \left( y^Ty - N\bar{y}\hat{\mu} - g^Ty\hat{\beta} \right)</math>, so a single extra summary statistic, <math>y^Ty</math>, is needed.
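
A sketch checking that the residual-based and summary-statistics versions of <math>\hat{\sigma}^2</math> agree (same simulated data as in the snippets above):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
N = 100
g = rng.integers(0, 3, size=N).astype(float)
y = 4.0 + 0.5 * g + rng.normal(0.0, 1.0, size=N)

X = np.column_stack([np.ones(N), g])
mu_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# residual-based estimate (needs the individual-level data)
resid = y - X @ np.array([mu_hat, beta_hat])
sigma2_hat = (resid @ resid) / (N - 2)

# summary-statistics version: y'y, y_bar and g'y are enough
sigma2_hat_ss = (y @ y - mu_hat * N * y.mean() - beta_hat * (g @ y)) / (N - 2)
assert np.isclose(sigma2_hat, sigma2_hat_ss)
</syntaxhighlight>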


We can also benefit from this for the standard errors of the parameters:

<math>V(\hat{\theta}) = \hat{\sigma}^2 (X^T X)^{-1}</math>


<math>V(\hat{\theta}) = \hat{\sigma}^2 \frac{1}{N g^T g - N^2 \bar{g}^2}
\begin{bmatrix} g^Tg & -N\bar{g} \\ -N\bar{g} & N \end{bmatrix}
</math>


<math>V(\hat{\beta}) = \frac{\hat{\sigma}^2}{g^Tg - N\bar{g}^2}</math>
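
Finally, a sketch computing <math>se(\hat{\beta}) = \sqrt{V(\hat{\beta})}</math> from the summary statistics and cross-checking it against the full covariance matrix (simulated data as before):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
N = 100
g = rng.integers(0, 3, size=N).astype(float)
y = 4.0 + 0.5 * g + rng.normal(0.0, 1.0, size=N)

g_bar, gtg = g.mean(), g @ g
beta_hat = (g @ y - N * g_bar * y.mean()) / (gtg - N * g_bar**2)

X = np.column_stack([np.ones(N), g])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = (resid @ resid) / (N - 2)

# standard error of beta_hat from summary statistics...
se_beta_hat = np.sqrt(sigma2_hat / (gtg - N * g_bar**2))

# ...agrees with the (2,2) entry of the full covariance matrix
V = sigma2_hat * np.linalg.inv(X.T @ X)
assert np.isclose(se_beta_hat, np.sqrt(V[1, 1]))
</syntaxhighlight>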

