< User:Hussein Alasadi‎ | Notebook‎ | stephens‎ | 2013‎ | 10
analyzing pooled sequenced data with selection

## Notes from Meeting

Consider a single lineage for now.

$\displaystyle X_j$ = frequency of "1" allele at SNP j in the pool (i.e. the true frequency of the 1 allele in the pool)

• Data:

$\displaystyle (n_j^0, n_j^1)$ = number of "0", "1" alleles at SNP j ($\displaystyle n_j = n_j^0 + n_j^1$ )

• Normal approximation

$\displaystyle n_j^1$ ~ $\displaystyle Bin(n_j, X_j) \approx N(n_jX_j, n_jX_j(1-X_j))$ Normal approximation to binomial

$\displaystyle \frac{n_j^1}{n_j} \approx N(X_j, \frac{X_j(1-X_j)}{n_j})$ The variance of this distribution results from error due to binomial sampling.

To simplify, we just plug in $\displaystyle \hat{X_j} = \frac{n_j^1}{n_j}$ for $\displaystyle X_j$

$\displaystyle \implies \frac{n_j^1}{n_j} | X_j \approx N(X_j, \frac{\hat{X_j}(1-\hat{X_j})}{n_j})$
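As a quick illustration of the plug-in step, here is a minimal sketch that turns allele counts at one SNP into the estimate $\hat{X_j}$ and its approximate sampling variance. The function name `plugin_normal_approx` and the example counts are made up for illustration.

```python
import math

def plugin_normal_approx(n0, n1):
    """Return (x_hat, variance) of the plug-in normal approximation at one SNP.

    n0, n1 are the counts of "0" and "1" alleles; n = n0 + n1.
    """
    n = n0 + n1
    if n == 0:
        raise ValueError("no reads at this SNP")
    x_hat = n1 / n                      # plug-in estimate of X_j
    var = x_hat * (1.0 - x_hat) / n     # binomial sampling variance of n1 / n
    return x_hat, var

# Hypothetical counts: 30 reads carry allele "0", 70 carry allele "1".
x_hat, var = plugin_normal_approx(n0=30, n1=70)
std_err = math.sqrt(var)
```

Note that the plug-in variance is exactly zero when all reads carry the same allele, so low-coverage SNPs may need separate handling.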

• notation

$\displaystyle f_{i,k,j} =$ frequency of reference allele in group i, replicate k, at SNP j.

$\displaystyle \vec{f_{i,k}} = (f_{i,k,1}, \ldots, f_{i,k,p})$, the vector of frequencies across the p SNPs for group i, replicate k

Without loss of generality, we assume that the putative selected site is site $\displaystyle j = 1$

• Model

We assume a prior on our vector of frequencies based on our panel of SNPs $\displaystyle (M)$ of dimension $\displaystyle 2m \times p$

$\displaystyle \vec{f_{i,k}}$ ~ $\displaystyle MVN(\mu, \Sigma)$

$\displaystyle \mu = (1-\theta)f^{panel} + \frac{\theta}{2} 1$

$\displaystyle \Sigma = (1-\theta)^2 S + \frac{\theta}{2}(1 - \frac{\theta}{2})I$

where $\displaystyle S_{i,j} = \Sigma_{i,j}^{panel}$ if i = j, or $\displaystyle e^{-\frac{\rho_{i,j}}{2m}} \Sigma_{i,j}^{panel}$ if i is not equal to j

$\displaystyle \theta = \frac{(\sum_{i=1}^{2m-1} \frac{1}{i})^{-1}}{2m + (\sum_{i=1}^{2m-1} \frac{1}{i})^{-1}}$
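The prior moments can be computed directly from the panel. The sketch below is one way to do it, under several assumptions that are not stated in the notes: $f^{panel}$ and $\Sigma^{panel}$ are taken to be the sample mean and (biased) sample covariance of the panel columns, the off-diagonal entries of $S$ are taken to be the panel covariances shrunk by $e^{-\rho_{i,j}/(2m)}$, and the matrix of $\rho_{i,j}$ values is supplied by the caller. The function names and the toy panel are hypothetical.

```python
import numpy as np

def theta_hat(m):
    """theta = H^{-1} / (2m + H^{-1}) with H = sum_{i=1}^{2m-1} 1/i."""
    h = sum(1.0 / i for i in range(1, 2 * m))
    return (1.0 / h) / (2 * m + 1.0 / h)

def prior_moments(panel, rho):
    """Return (mu, Sigma) of the MVN prior.

    panel : (2m, p) array of 0/1 alleles (the panel M), with p >= 2
    rho   : (p, p) symmetric array of rho_{i,j} values (assumed given)
    """
    two_m, p = panel.shape
    m = two_m // 2
    theta = theta_hat(m)
    f_panel = panel.mean(axis=0)                          # assumed: panel allele frequencies
    sigma_panel = np.cov(panel, rowvar=False, bias=True)  # assumed: panel covariance
    shrink = np.exp(-rho / (2 * m))                       # off-diagonal shrinkage factor
    np.fill_diagonal(shrink, 1.0)                         # diagonal of S is left unshrunk
    S = shrink * sigma_panel
    mu = (1 - theta) * f_panel + theta / 2
    Sigma = (1 - theta) ** 2 * S + (theta / 2) * (1 - theta / 2) * np.eye(p)
    return mu, Sigma

# Toy panel: 2m = 4 haplotypes, p = 3 SNPs (made-up values).
panel = np.array([[0, 1, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 1]])
rho = np.array([[0.0, 0.1, 0.2],
                [0.1, 0.0, 0.1],
                [0.2, 0.1, 0.0]])
mu, Sigma = prior_moments(panel, rho)
```

The $\frac{\theta}{2}(1 - \frac{\theta}{2})I$ term keeps $\Sigma$ invertible even when the panel covariance is rank deficient, which is typical when p exceeds 2m.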

• at selected site

$\displaystyle \log \frac{f_{i,k,1}}{1-f_{i,k,1}} = \mu + \beta g_i + \epsilon_{i,k}$

• conditional distribution

$\displaystyle (f_{i,k,2}, \ldots , f_{i,k,p}) | f_{i,k,1}, M$ ~ $\displaystyle MVN(\bar{\mu}, \bar{\Sigma})$ The conditional distribution is easily obtained when we use a result derived here.

let $\displaystyle X_2 = (f_{i,k,2}, \ldots , f_{i,k,p})$ and $\displaystyle X_1 = f_{i,k,1}$

$\displaystyle X_2 | X_1, M$ ~ $\displaystyle N(\vec{\mu_2} + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})$

Thus $\displaystyle \bar{\mu} = \vec{\mu_2} + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \bar{\Sigma} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$
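Because $X_1$ is a scalar here, $\Sigma_{11}^{-1}$ is just a division, and the conditional moments reduce to a few lines of NumPy. The following is a minimal sketch of the formula above; the function name `condition_on_first` and the two-SNP example values are made up for illustration.

```python
import numpy as np

def condition_on_first(mu, Sigma, x1):
    """Moments of (X_2 | X_1 = x1) when (X_1, X_2) ~ MVN(mu, Sigma).

    X_1 is the first coordinate (the putative selected site),
    X_2 is the remaining p - 1 coordinates.
    """
    mu1, mu2 = mu[0], mu[1:]
    s11 = Sigma[0, 0]       # scalar Sigma_11
    s21 = Sigma[1:, 0]      # vector Sigma_21 (equal to Sigma_12 transposed)
    s22 = Sigma[1:, 1:]     # matrix Sigma_22
    mu_bar = mu2 + s21 * (x1 - mu1) / s11
    Sigma_bar = s22 - np.outer(s21, s21) / s11
    return mu_bar, Sigma_bar

# Two-SNP example with made-up moments.
mu_ex = np.array([0.5, 0.5])
Sigma_ex = np.array([[1.0, 0.5],
                     [0.5, 1.0]])
mu_bar, Sigma_bar = condition_on_first(mu_ex, Sigma_ex, x1=1.5)
```

In this example the conditional mean shifts from 0.5 to 1.0 and the conditional variance drops from 1.0 to 0.75, as the formula predicts.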

• Likelihood for the frequency at the test SNP t given all data

let $\displaystyle f_{obs} = (f_{i,k,j})_{j \not= t}$, the vector of observed frequencies at all SNPs other than t

$\displaystyle L(f_{i,k,t}^{true}) = P(f_{obs} | f_{i,k,t}^{true}, M) = \frac{P( f_{i,k,t}^{true} | M, f_{obs}) P(f_{obs}|M)}{P(f_{i,k,t}^{true} | M)}$

where $\displaystyle f_{i,k,t}^{true} | M$ ~ $\displaystyle N(\mu, \sigma^2 \Sigma)$ The parameter $\displaystyle \sigma^2$ allows for over-dispersion

and $\displaystyle f_{obs}| M$ ~ $\displaystyle N_{p-1} (\mu_2, \sigma^2 \Sigma_{22} + \epsilon^2 I)$, where $\displaystyle \epsilon^2$ allows for measurement error.

and I don't understand $\displaystyle f_{obs} | f_{i,k,t}^{true}, M$
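One possible reading of this conditional, offered only as a sketch: assume the observed frequencies at the other SNPs equal the true frequencies plus independent $N(0, \epsilon^2)$ measurement error, and that the true frequencies at all p SNPs are jointly $N(\mu, \sigma^2 \Sigma)$. Under those assumptions (which go beyond what the notes state), the test-SNP frequency and $f_{obs}$ are jointly Gaussian, so the conditional follows from the same conditioning result used above. The function names and example values below are hypothetical.

```python
import numpy as np

def mvn_logpdf(x, mean, cov):
    """Log density of a multivariate normal, via a Cholesky factor."""
    d = x - mean
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, d)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (len(x) * np.log(2 * np.pi) + log_det + z @ z)

def log_lik_test_snp(f_t, f_obs, mu, Sigma, t, sigma2, eps2):
    """log P(f_obs | true frequency f_t at SNP t, M) under the assumed joint Gaussian."""
    p = len(mu)
    others = np.delete(np.arange(p), t)           # indices j != t
    s_tt = sigma2 * Sigma[t, t]
    s_2t = sigma2 * Sigma[others, t]
    s_22 = sigma2 * Sigma[np.ix_(others, others)] + eps2 * np.eye(p - 1)
    cond_mean = mu[others] + s_2t * (f_t - mu[t]) / s_tt
    cond_cov = s_22 - np.outer(s_2t, s_2t) / s_tt
    return mvn_logpdf(f_obs, cond_mean, cond_cov)

# Two-SNP example with made-up values, no over-dispersion or measurement error.
mu_demo = np.array([0.0, 0.0])
Sigma_demo = np.array([[1.0, 0.5],
                       [0.5, 1.0]])
ll = log_lik_test_snp(f_t=1.0, f_obs=np.array([0.5]),
                      mu=mu_demo, Sigma=Sigma_demo, t=0, sigma2=1.0, eps2=0.0)
```

Evaluating this function over a grid of `f_t` values would trace out the likelihood curve $L(f_{i,k,t}^{true})$, up to the validity of the assumptions above.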