< User:Hussein Alasadi‎ | Notebook‎ | stephens‎ | 2013‎ | 10
Analyzing pooled sequenced data with selection

## Notes from Meeting

Consider a single lineage for now.

${\displaystyle X_{j}}$ = frequency of "1" allele at SNP j in the pool (i.e. the true frequency of the 1 allele in the pool)

• Data:

${\displaystyle (n_{j}^{0},n_{j}^{1})}$ = number of "0", "1" alleles at SNP j (${\displaystyle n_{j}=n_{j}^{0}+n_{j}^{1}}$)

• Normal approximation

${\displaystyle n_{j}^{1}}$ ~ ${\displaystyle Bin(n_{j},X_{j})\approx N(n_{j}X_{j},n_{j}X_{j}(1-X_{j}))}$ Normal approximation to binomial

${\displaystyle {\frac {n_{j}^{1}}{n_{j}}}\approx N(X_{j},{\frac {X_{j}(1-X_{j})}{n_{j}}})}$ The variance of this distribution results from error due to binomial sampling.

To simplify, we just plug in ${\displaystyle {\hat {X_{j}}}={\frac {n_{j}^{1}}{n_{j}}}}$ for ${\displaystyle X_{j}}$

${\displaystyle \implies {\frac {n_{j}^{1}}{n_{j}}}|X_{j}\approx N(X_{j},{\frac {{\hat {X_{j}}}(1-{\hat {X_{j}}})}{n_{j}}})}$
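The plug-in approximation above can be sketched in a few lines of Python. The function name `plugin_normal_approx` and the example counts are illustrative assumptions, not part of the meeting notes.

```python
import math

def plugin_normal_approx(n0, n1):
    """Return (x_hat, variance) for the plug-in normal approximation.

    n0, n1: read counts of the "0" and "1" alleles at one SNP.
    """
    n = n0 + n1
    if n == 0:
        raise ValueError("no reads at this SNP")
    x_hat = n1 / n                    # plug-in estimate of X_j
    var = x_hat * (1.0 - x_hat) / n   # binomial sampling variance with x_hat plugged in
    return x_hat, var

# Hypothetical counts: 30 reads of the "0" allele and 70 reads of the "1" allele.
x_hat, var = plugin_normal_approx(30, 70)
print(x_hat, math.sqrt(var))
```

Note that the plug-in variance is exactly zero whenever all reads carry the same allele, so SNPs with ${\displaystyle {\hat {X_{j}}}}$ equal to 0 or 1 would need separate handling.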

• notation

${\displaystyle f_{i,k,j}=}$ frequency of the reference allele in group i, replicate k, and SNP j.

${\displaystyle {\vec {f_{i,k}}}=}$ vector of frequencies

Without loss of generality, we assume that the putative selected site is site ${\displaystyle j=1}$

• Model

We assume a prior on our vector of frequencies based on our panel of SNPs ${\displaystyle (M)}$ of dimension ${\displaystyle 2m\times p}$

${\displaystyle {\vec {f_{i,k}}}}$ ~ ${\displaystyle MVN(\mu ,\Sigma )}$

${\displaystyle \mu =(1-\theta )f^{panel}+{\frac {\theta }{2}}1}$

${\displaystyle \Sigma =(1-\theta )^{2}S+{\frac {\theta }{2}}(1-{\frac {\theta }{2}})I}$

where ${\displaystyle S_{i,j}=\Sigma _{i,j}^{panel}}$ if i = j, or ${\displaystyle S_{i,j}=e^{-{\frac {\rho _{i,j}}{2m}}}\Sigma _{i,j}^{panel}}$ if i ≠ j

${\displaystyle \theta ={\frac {(\sum _{i=1}^{2m-1}{\frac {1}{i}})^{-1}}{2m+(\sum _{i=1}^{2m-1}{\frac {1}{i}})^{-1}}}}$
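As a rough sketch, the prior moments can be computed directly from these formulas. This assumes the panel term in ${\displaystyle S}$ is the (i, j) entry of the panel covariance matrix, and that the ${\displaystyle \rho _{i,j}}$ values are supplied as a p x p matrix. The names `theta`, `prior_moments`, `sigma_panel` and `rho` are hypothetical.

```python
import math

def theta(m):
    """theta for a panel of 2m haplotypes, as defined above."""
    inv_harmonic = 1.0 / sum(1.0 / i for i in range(1, 2 * m))  # i = 1 .. 2m-1
    return inv_harmonic / (2 * m + inv_harmonic)

def prior_moments(f_panel, sigma_panel, rho, m):
    """Return (mu, Sigma) as plain Python lists.

    f_panel:     length-p list of panel allele frequencies
    sigma_panel: p x p panel covariance matrix (assumed input)
    rho:         p x p matrix of rho_{i,j} values (assumed input)
    """
    th = theta(m)
    p = len(f_panel)
    mu = [(1 - th) * f + th / 2 for f in f_panel]
    Sigma = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(p):
            s = sigma_panel[a][b]
            if a != b:
                s *= math.exp(-rho[a][b] / (2 * m))  # shrink off-diagonal entries
            Sigma[a][b] = (1 - th) ** 2 * s
            if a == b:
                Sigma[a][b] += (th / 2) * (1 - th / 2)  # diagonal term in I
    return mu, Sigma

# Hypothetical two-SNP panel with m = 2 diploid individuals (4 haplotypes).
mu, Sigma = prior_moments(
    f_panel=[0.5, 0.2],
    sigma_panel=[[0.25, 0.1], [0.1, 0.16]],
    rho=[[0.0, 0.0], [0.0, 0.0]],
    m=2,
)
print(theta(2), mu, Sigma)
```

Plain lists keep the sketch dependency-free; for a realistic number of SNPs an array library would be the more natural choice.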

• at selected site

${\displaystyle \log {\frac {f_{i,k,1}}{1-f_{i,k,1}}}=\mu +\beta g_{i}+\epsilon _{i,k}}$
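To see what the logit model implies on the frequency scale, here is a small sketch that inverts it. It treats ${\displaystyle g_{i}}$ as a numeric group-level covariate and ${\displaystyle \epsilon _{i,k}}$ as an additive noise term on the logit scale. Both are assumptions, since the notes do not define them, and the function names are hypothetical.

```python
import math

def logit(f):
    """Log-odds of a frequency strictly between 0 and 1."""
    return math.log(f / (1.0 - f))

def inv_logit(x):
    """Inverse of logit: map a log-odds value back to a frequency."""
    return 1.0 / (1.0 + math.exp(-x))

def selected_site_frequency(mu, beta, g, eps=0.0):
    """Frequency at the selected site implied by logit(f) = mu + beta*g + eps."""
    return inv_logit(mu + beta * g + eps)

# Hypothetical values: intercept -1.0, effect 0.5, covariate g = 1, no noise.
f = selected_site_frequency(mu=-1.0, beta=0.5, g=1.0)
print(f)
```

With ${\displaystyle \beta =0}$ every group has the same expected log-odds, so a nonzero ${\displaystyle \beta }$ is what separates the groups at the selected site.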

• conditional distribution

${\displaystyle (f_{i,k,2},....,f_{i,k,p})|f_{i,k,1},M}$ ~ ${\displaystyle MVN({\bar {\mu }},{\bar {\Sigma }})}$ The conditional distribution follows from the standard result for conditioning one block of a multivariate normal on another.

let ${\displaystyle X_{2}=(f_{i,k,2},....,f_{i,k,p})}$ and ${\displaystyle X_{1}=f_{i,k,1}}$

${\displaystyle X_{2}|X_{1},M}$ ~ ${\displaystyle N({\vec {\mu _{2}}}+\Sigma _{21}\Sigma _{11}^{-1}(x_{1}-\mu _{1}),\Sigma _{22}-\Sigma _{21}\Sigma _{11}^{-1}\Sigma _{12})}$

Thus ${\displaystyle {\bar {\mu }}={\vec {\mu _{2}}}+\Sigma _{21}\Sigma _{11}^{-1}(x_{1}-\mu _{1}),{\bar {\Sigma }}=\Sigma _{22}-\Sigma _{21}\Sigma _{11}^{-1}\Sigma _{12}}$
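Because ${\displaystyle X_{1}}$ is a single frequency, ${\displaystyle \Sigma _{11}}$ is a scalar and no matrix inversion is needed. The following sketch computes ${\displaystyle {\bar {\mu }}}$ and ${\displaystyle {\bar {\Sigma }}}$ under that assumption, with the selected site stored at index 0. The function name and the numbers are illustrative only.

```python
def condition_on_first(mu, Sigma, x1):
    """Condition MVN(mu, Sigma) on its first coordinate being x1.

    Returns (mu_bar, Sigma_bar) for the remaining p-1 coordinates,
    using mu_2 + Sigma_21 Sigma_11^{-1} (x1 - mu_1) and
    Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12 with scalar Sigma_11.
    """
    s11 = Sigma[0][0]
    if s11 <= 0:
        raise ValueError("Sigma_11 must be positive")
    p = len(mu)
    mu_bar = [mu[j] + Sigma[j][0] / s11 * (x1 - mu[0]) for j in range(1, p)]
    Sigma_bar = [
        [Sigma[a][b] - Sigma[a][0] * Sigma[0][b] / s11 for b in range(1, p)]
        for a in range(1, p)
    ]
    return mu_bar, Sigma_bar

# Hypothetical three-SNP example; SNP 1 (index 0) is the putative selected site.
mu = [0.5, 0.3, 0.2]
Sigma = [
    [0.04, 0.02, 0.01],
    [0.02, 0.05, 0.015],
    [0.01, 0.015, 0.03],
]
mu_bar, Sigma_bar = condition_on_first(mu, Sigma, x1=0.6)
print(mu_bar, Sigma_bar)
```

In this example, observing a frequency above its prior mean at the first SNP pulls the conditional means of the positively correlated SNPs upward and shrinks their variances.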

Equivalently, we could derive the distribution of ${\displaystyle X_{1}|X_{2},M}$, i.e. ${\displaystyle (f_{i,k,1}|f_{i,k,2},....,f_{i,k,p},M)}$

• Likelihood for the frequency at the test SNP t given all data

let ${\displaystyle f^{obs}=(f_{i,k,j})_{j\not =t}}$, the vector of observed frequencies at all SNPs other than t

${\displaystyle L(f_{i,k,t}^{true})=P(f^{obs}|f_{i,k,t}^{true},M)={\frac {P(f_{i,k,t}^{true}|M,f^{obs})P(f^{obs}|M)}{P(f_{i,k,t}^{true}|M)}}}$

Confused here: can we just use the expression derived above for ${\displaystyle P(f_{i,k,t}^{true}|M,f^{obs})}$? Also, isn't ${\displaystyle f_{i,k,t}^{true}|M}$ ~ ${\displaystyle N(\mu _{1},\Sigma _{11})}$ and ${\displaystyle f^{obs}|M}$ ~ ${\displaystyle N(\mu _{2},\Sigma _{22})}$? But how do we then incorporate ${\displaystyle \beta }$ into the likelihood calculation?

But maybe we want to incorporate dispersion and measurement error parameters.

Then: ${\displaystyle f_{i,k,t}^{true}|M}$ ~ ${\displaystyle N(\mu ,\sigma ^{2}\Sigma )}$ The parameter ${\displaystyle \sigma ^{2}}$ allows for over-dispersion

${\displaystyle f^{obs}|M}$ ~ ${\displaystyle N_{p-1}(\mu _{2},\sigma ^{2}\Sigma _{22}+\epsilon ^{2}I)}$ where ${\displaystyle \epsilon ^{2}}$ allows for measurement error.

I also don't understand ${\displaystyle f^{obs}|f_{i,k,t}^{true},M}$. Shouldn't it come from (2.12) and not (2.13)? Ask Matthew.