# User:Hussein Alasadi/Notebook/stephens/2013/10/03


## Revision as of 22:04, 16 October 2013

Project: analyzing pooled sequenced data with selection

## Notes from Meeting

Consider a single lineage for now.

$X_j$ = frequency of the "1" allele at SNP $j$ in the pool (i.e. the true frequency of the 1 allele in the pool)

• Data:

$(n_j^0, n_j^1)$ = number of "0", "1" alleles at SNP j ($n_j = n_j^0 + n_j^1$)

• Normal approximation

$n_j^1$ ~ $Bin(n_j, X_j) \approx N(n_jX_j, n_jX_j(1-X_j))$ Normal approximation to binomial

$\frac{n_j^1}{n_j} \approx N(X_j, \frac{X_j(1-X_j)}{n_j})$ The variance of this distribution results from error due to binomial sampling.

To simplify, we just plug in $\hat{X_j} = \frac{n_j^1}{n_j}$ for $X_j$ in the variance

$\implies \frac{n_j^1}{n_j} | X_j \approx N(X_j, \frac{\hat{X_j}(1-\hat{X_j})}{n_j})$
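The plug-in approximation above can be sketched in a few lines. This is only an illustration: the function name `plug_in_normal_approx` and the example counts are made up here and are not part of the notes.

```python
import math

def plug_in_normal_approx(n0, n1):
    """Return (x_hat, variance) of the plug-in normal approximation at one SNP.

    n0, n1 are the counts of "0" and "1" alleles; x_hat = n1 / n is used both
    as the point estimate and inside the binomial sampling variance.
    """
    n = n0 + n1
    if n == 0:
        raise ValueError("no reads at this SNP")
    x_hat = n1 / n
    variance = x_hat * (1.0 - x_hat) / n
    return x_hat, variance

# Hypothetical counts: 70 reads of the "0" allele and 30 of the "1" allele.
x_hat, variance = plug_in_normal_approx(n0=70, n1=30)
print(x_hat, variance, math.sqrt(variance))
```

Note that the plug-in variance is zero whenever $\hat{X_j}$ is 0 or 1, so SNPs with no reads of one allele would need separate handling.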

• notation

$f_{i,k,j}$ = frequency of the reference allele in group $i$, replicate $k$, and SNP $j$.

$\vec{f_{i,k}} =$ vector of frequencies

Without loss of generality, we assume that the putative selected site is site j = 1

• Model

We assume a prior on our vector of frequencies based on our panel of SNPs $(M)$ of dimension $2m \times p$

$\vec{f_{i,k}}$ ~ $MVN(\mu, \Sigma)$

$\mu = (1-\theta)f^{panel} + \frac{\theta}{2} 1$

$\Sigma = (1-\theta)^2 S + \frac{\theta}{2}(1 - \frac{\theta}{2})I$

where $S_{i,j} = \Sigma_{i,j}^{panel}$ if $i = j$, and $S_{i,j} = e^{-\frac{\rho_{i,j}}{2m}} \Sigma_{i,j}^{panel}$ if $i \neq j$

$\theta = \frac{(\sum_{i=1}^{2m-1} \frac{1}{i})^{-1}}{2m + (\sum_{i=1}^{2m-1} \frac{1}{i})^{-1}}$
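A minimal sketch of how $\theta$, $\mu$ and $\Sigma$ could be computed from a panel. Several things here are assumptions rather than something the notes specify: the panel is taken to be a $2m \times p$ matrix of 0/1 haplotypes, $\Sigma^{panel}$ is taken to be its empirical covariance with a $1/(2m)$ normalization, the off-diagonal entries of $S$ are taken to be the panel covariances scaled by $e^{-\rho_{i,j}/(2m)}$, and `rho` is a hypothetical $p \times p$ matrix of pairwise recombination rates. The function names are illustrative.

```python
import numpy as np

def theta_hat(m):
    """theta from the notes, with h = (sum_{i=1}^{2m-1} 1/i)^{-1}: h / (2m + h)."""
    h = 1.0 / np.sum(1.0 / np.arange(1, 2 * m))
    return h / (2 * m + h)

def prior_moments(panel, rho):
    """Return (mu, Sigma) of the MVN prior for a 2m x p panel of 0/1 haplotypes."""
    two_m, p = panel.shape
    theta = theta_hat(two_m // 2)

    # Panel allele frequencies and (assumed) empirical panel covariance.
    f_panel = panel.mean(axis=0)
    centered = panel - f_panel
    sigma_panel = centered.T @ centered / two_m

    # Shrink off-diagonal covariances by exp(-rho / 2m); keep the diagonal as is.
    S = np.exp(-rho / two_m) * sigma_panel
    np.fill_diagonal(S, np.diag(sigma_panel))

    mu = (1 - theta) * f_panel + theta / 2
    Sigma = (1 - theta) ** 2 * S + (theta / 2) * (1 - theta / 2) * np.eye(p)
    return mu, Sigma

# Hypothetical toy panel: 8 haplotypes (m = 4), 3 SNPs, made-up map positions.
rng = np.random.default_rng(0)
panel = rng.integers(0, 2, size=(8, 3)).astype(float)
pos = np.array([0.0, 1.0, 2.5])
rho = np.abs(pos[:, None] - pos[None, :])
mu, Sigma = prior_moments(panel, rho)
print(mu)
print(Sigma)
```

The identity term $\frac{\theta}{2}(1 - \frac{\theta}{2})I$ adds a strictly positive amount to the diagonal, which keeps $\Sigma$ invertible even when a panel SNP is monomorphic.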

• at selected site

$\log \frac{f_{i,k,1}}{1-f_{i,k,1}} = \mu + \beta g_i + \epsilon_{i,k}$

• conditional distribution

$(f_{i,k,2}, \ldots, f_{i,k,p}) \mid f_{i,k,1}, M$ ~ $MVN(\bar{\mu}, \bar{\Sigma})$

The conditional distribution is easily obtained when we use a result derived [here](http://openwetware.org/wiki/User:Hussein_Alasadi/Notebook/stephens/2013/10/14).

let $X_2 = (f_{i,k,2}, \ldots, f_{i,k,p})$ and $X_1 = f_{i,k,1}$

$X_2 \mid X_1, M$ ~ $N(\vec{\mu_2} + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})$

Thus $\bar{\mu} = \vec{\mu_2} + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \bar{\Sigma} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$

Equivalently, we could derive the distribution of $X_1 \mid X_2, M$, that is, of $f_{i,k,1} \mid (f_{i,k,2}, \ldots, f_{i,k,p}), M$
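The conditional mean and covariance above can be computed directly from a partition of $\mu$ and $\Sigma$. The sketch below assumes the conditioning SNP is stored at index 0, so $\Sigma_{11}$ is a scalar; the function name and the numbers are illustrative only.

```python
import numpy as np

def condition_on_first(mu, Sigma, x1):
    """Return (mu_bar, Sigma_bar) for X_2 | X_1 = x1 under X ~ MVN(mu, Sigma).

    X_1 is the first coordinate, X_2 the remaining p - 1 coordinates.
    """
    mu1, mu2 = mu[0], mu[1:]
    s11 = Sigma[0, 0]          # Sigma_11 (scalar)
    s21 = Sigma[1:, 0]         # Sigma_21 (vector of length p - 1)
    S22 = Sigma[1:, 1:]        # Sigma_22

    mu_bar = mu2 + s21 * (x1 - mu1) / s11
    Sigma_bar = S22 - np.outer(s21, s21) / s11
    return mu_bar, Sigma_bar

# Made-up moments for p = 3 SNPs.
mu = np.array([0.5, 0.4, 0.3])
Sigma = np.array([[0.04, 0.02, 0.01],
                  [0.02, 0.05, 0.015],
                  [0.01, 0.015, 0.03]])
mu_bar, Sigma_bar = condition_on_first(mu, Sigma, x1=0.6)
print(mu_bar)
print(Sigma_bar)
```

Because $\Sigma_{11}$ is a scalar here, no matrix inverse is needed; conditioning in the other direction ($X_1 \mid X_2$) would require solving a linear system with $\Sigma_{22}$.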

• Likelihood for the frequency at the test SNP t given all data

let $f^{obs} = (f_{i,k,j})_{j \neq t}$, the vector of frequencies at all SNPs other than $t$

$L(f_{i,k,t}^{true}) = P(f^{obs} | f_{i,k,t}^{true}, M) = \frac{P( f_{i,k,t}^{true} | M, f^{obs}) P(f^{obs}|M)}{P(f_{i,k,t}^{true} | M)}$
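As a sanity check on the Bayes' rule identity for the likelihood, the sketch below evaluates it two ways under the plain MVN prior: directly from the conditional of the other SNPs given the test SNP, and as the ratio of conditional, marginal and prior densities. It assumes the test SNP sits at index 0 and ignores $\beta$, over-dispersion and measurement error; all function names and numbers are illustrative.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate normal (scalars are treated as 1-d)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)

def log_lik_direct(f_t, f_obs, mu, Sigma):
    """log P(f_obs | f_t, M) from the conditional of the other SNPs given SNP t."""
    s11, s21, S22 = Sigma[0, 0], Sigma[1:, 0], Sigma[1:, 1:]
    mu_bar = mu[1:] + s21 * (f_t - mu[0]) / s11
    Sigma_bar = S22 - np.outer(s21, s21) / s11
    return mvn_logpdf(f_obs, mu_bar, Sigma_bar)

def log_lik_bayes(f_t, f_obs, mu, Sigma):
    """Same quantity as log P(f_t | f_obs, M) + log P(f_obs | M) - log P(f_t | M)."""
    s11, s12, S22 = Sigma[0, 0], Sigma[0, 1:], Sigma[1:, 1:]
    w = np.linalg.solve(S22, s12)            # Sigma_22^{-1} Sigma_21
    post_mean = mu[0] + w @ (f_obs - mu[1:])
    post_var = s11 - s12 @ w
    return (mvn_logpdf(f_t, post_mean, post_var)
            + mvn_logpdf(f_obs, mu[1:], S22)
            - mvn_logpdf(f_t, mu[0], s11))

# Made-up moments and observations for p = 3 SNPs.
mu = np.array([0.5, 0.4, 0.3])
Sigma = np.array([[0.04, 0.02, 0.01],
                  [0.02, 0.05, 0.015],
                  [0.01, 0.015, 0.03]])
f_obs = np.array([0.42, 0.28])
print(log_lik_direct(0.55, f_obs, mu, Sigma))
print(log_lik_bayes(0.55, f_obs, mu, Sigma))
```

The two routes should agree up to floating-point error, since both are factorizations of the same joint normal density.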

Confused here: can we just use the expression derived above for $P( f_{i,k,t}^{true} | M, f^{obs})$? Also, isn't $f_{i,k,t}^{true} | M$ ~ $N(\mu_1, \Sigma_{11})$ and $f^{obs} | M$ ~ $N(\mu_2, \Sigma_{22})$? But how do we then incorporate $\beta$ into the likelihood calculation?

But maybe we want to incorporate dispersion and measurement error parameters.

Then: $f_{i,k,t}^{true} | M$ ~ $N(\mu, \sigma^2 \Sigma)$. The parameter $\sigma^2$ allows for over-dispersion.

$f^{obs} | M$ ~ $N_{p-1}(\mu_2, \sigma^2 \Sigma_{22} + \epsilon^2 I)$, where $\epsilon^2$ allows for measurement error.

and I don't understand $f^{obs} | f_{i,k,t}^{true}, M$. Shouldn't it come from (2.12) and not (2.13)? Ask Matthew.
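For reference, the covariance of $f^{obs} | M$ with over-dispersion and measurement error, $\sigma^2 \Sigma_{22} + \epsilon^2 I$, is a one-line computation. The function name and the parameter values below are made up for illustration; the notes do not say how $\sigma^2$ and $\epsilon^2$ would be chosen or estimated.

```python
import numpy as np

def observed_covariance(Sigma22, sigma2, eps2):
    """Return sigma2 * Sigma22 + eps2 * I for the p - 1 non-test SNPs."""
    p_minus_1 = Sigma22.shape[0]
    return sigma2 * Sigma22 + eps2 * np.eye(p_minus_1)

# Made-up Sigma_22 for two non-test SNPs.
Sigma22 = np.array([[0.05, 0.015],
                    [0.015, 0.03]])
print(observed_covariance(Sigma22, sigma2=1.5, eps2=0.001))
```

Setting `sigma2=1` and `eps2=0` recovers $\Sigma_{22}$, i.e. the model without the extra parameters.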