User:Hussein Alasadi/Notebook/stephens/2013/10/03

Analyzing pooled sequencing data with selection


Notes from Meeting

Consider a single lineage for now.

[math]\displaystyle{ X_j }[/math] = frequency of "1" allele at SNP j in the pool (i.e. the true frequency of the 1 allele in the pool)

Data:

[math]\displaystyle{ (n_j^0, n_j^1) }[/math] = number of "0", "1" alleles at SNP j ([math]\displaystyle{ n_j = n_j^0 + n_j^1 }[/math])

Normal approximation

[math]\displaystyle{ n_j^1 }[/math] ~ [math]\displaystyle{ Bin(n_j, X_j) \approx N(n_jX_j, n_jX_j(1-X_j)) }[/math] Normal approximation to binomial

[math]\displaystyle{ \frac{n_j^1}{n_j} \approx N(X_j, \frac{X_j(1-X_j)}{n_j}) }[/math] The variance of this distribution results from error due to binomial sampling.

To simplify, we just plug in [math]\displaystyle{ \hat{X_j} = \frac{n_j^1}{n_j} }[/math] for [math]\displaystyle{ X_j }[/math]

[math]\displaystyle{ \implies \frac{n_j^1}{n_j} | X_j \approx N(X_j, \frac{\hat{X_j}(1-\hat{X_j})}{n_j}) }[/math]
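The plug-in step above can be sketched in a few lines of Python. This is only an illustration: the function name `plugin_normal_approx` and the example counts are made up, not taken from the data.

```python
def plugin_normal_approx(n0, n1):
    """Plug-in normal approximation at one SNP.

    n0, n1: read counts of the "0" and "1" alleles (n_j^0, n_j^1).
    Returns (x_hat, var), where x_hat = n1 / n and
    var = x_hat * (1 - x_hat) / n is the plug-in binomial sampling variance.
    """
    n = n0 + n1
    if n == 0:
        raise ValueError("no reads at this SNP")
    x_hat = n1 / n
    var = x_hat * (1.0 - x_hat) / n
    return x_hat, var


# Hypothetical counts: 30 reads of allele "0", 70 reads of allele "1".
x_hat, var = plugin_normal_approx(n0=30, n1=70)
print(x_hat, var)
```

Note that when the observed frequency is exactly 0 or 1 the plug-in variance is 0, so the approximation degenerates at such SNPs.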

notation

[math]\displaystyle{ f_{i,k,j} = }[/math] frequency of reference allele in group i, replicate k, and SNP j.

[math]\displaystyle{ \vec{f_{i,k}} = }[/math] vector of frequencies

Without loss of generality, we assume that the putative selected site is site [math]\displaystyle{ j = 1 }[/math]

Model

We place a prior on our vector of frequencies based on our panel of SNPs [math]\displaystyle{ (M) }[/math], a matrix of dimension [math]\displaystyle{ 2m \times p }[/math]

[math]\displaystyle{ \vec{f_{i,k}} }[/math] ~ [math]\displaystyle{ MVN(\mu, \Sigma) }[/math]

[math]\displaystyle{ \mu = (1-\theta)f^{panel} + \frac{\theta}{2} 1 }[/math]

[math]\displaystyle{ \Sigma = (1-\theta)^2 S + \frac{\theta}{2}(1 - \frac{\theta}{2})I }[/math]

where [math]\displaystyle{ S_{i,j} = \Sigma_{i,j}^{panel} }[/math] if i = j, and [math]\displaystyle{ S_{i,j} = e^{-\frac{\rho_{i,j}}{2m}} \Sigma_{i,j}^{panel} }[/math] if i is not equal to j

[math]\displaystyle{ \theta = \frac{(\sum_{i=1}^{2m-1} \frac{1}{i})^{-1}}{2m + (\sum_{i=1}^{2m-1} \frac{1}{i})^{-1}} }[/math]
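A minimal sketch of how the prior mean and covariance could be assembled from a panel, using only the Python standard library. Several things here are assumptions rather than part of the notes: [math]\displaystyle{ f^{panel} }[/math] is taken to be the vector of column means of the panel, [math]\displaystyle{ \Sigma^{panel} }[/math] its empirical covariance with divisor 2m, the off-diagonal shrinkage is read as [math]\displaystyle{ e^{-\rho_{i,j}/2m} }[/math] multiplying [math]\displaystyle{ \Sigma_{i,j}^{panel} }[/math], and `rho` is supplied as a precomputed p x p table. The function names are hypothetical.

```python
import math


def theta_from_panel_size(two_m):
    """theta for a panel of 2m haplotypes (requires 2m >= 2)."""
    harmonic = sum(1.0 / i for i in range(1, two_m))  # sum_{i=1}^{2m-1} 1/i
    inv = 1.0 / harmonic
    return inv / (two_m + inv)


def prior_moments(panel, rho):
    """Return (mu, Sigma) of the MVN prior as plain lists.

    panel: list of 2m haplotypes, each a list of p alleles coded 0/1.
    rho:   p x p list of lists; rho[i][j] stands in for rho_{i,j}.
    """
    two_m = len(panel)
    p = len(panel[0])
    theta = theta_from_panel_size(two_m)

    # Assumption: f^panel is the per-SNP mean of the panel haplotypes.
    f_panel = [sum(h[j] for h in panel) / two_m for j in range(p)]
    # Assumption: Sigma^panel is the empirical covariance with divisor 2m.
    sig_panel = [
        [sum((h[i] - f_panel[i]) * (h[j] - f_panel[j]) for h in panel) / two_m
         for j in range(p)]
        for i in range(p)
    ]

    mu = [(1.0 - theta) * f + theta / 2.0 for f in f_panel]
    noise = (theta / 2.0) * (1.0 - theta / 2.0)

    Sigma = []
    for i in range(p):
        row = []
        for j in range(p):
            if i == j:
                s = sig_panel[i][j]
            else:
                s = math.exp(-rho[i][j] / two_m) * sig_panel[i][j]
            row.append((1.0 - theta) ** 2 * s + (noise if i == j else 0.0))
        Sigma.append(row)
    return mu, Sigma


# Toy panel: 4 haplotypes, 2 SNPs in perfect LD; rho values are made up.
mu, Sigma = prior_moments([[0, 0], [1, 1], [0, 0], [1, 1]],
                          [[0.0, 1.0], [1.0, 0.0]])
print(mu, Sigma)
```

Plain lists are used to keep the sketch dependency-free; for a realistic number of SNPs an array library would be the more natural choice.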

at selected site

[math]\displaystyle{ \log \frac{f_{i,k,1}}{1-f_{i,k,1}} = \mu + \beta g_i + \epsilon_{i,k} }[/math]

conditional distribution

[math]\displaystyle{ (f_{i,k,2}, \ldots , f_{i,k,p}) | f_{i,k,1}, M }[/math] ~ [math]\displaystyle{ MVN(\bar{\mu}, \bar{\Sigma}) }[/math] The conditional distribution follows from the standard result for conditioning one block of a multivariate normal on another.

let [math]\displaystyle{ X_2 = (f_{i,k,2}, .... , f_{i,k,p}) }[/math] and [math]\displaystyle{ X_1 = f_{i,k,1} }[/math]

[math]\displaystyle{ X_2 | X_1, M }[/math] ~ [math]\displaystyle{ N(\vec{\mu_2} + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}) }[/math]

Thus [math]\displaystyle{ \bar{\mu} = \vec{\mu_2} + \Sigma_{21} \Sigma_{11}^{-1} (x_1 - \mu_1), \bar{\Sigma} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} }[/math]
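Because [math]\displaystyle{ X_1 }[/math] is a single site here, [math]\displaystyle{ \Sigma_{11} }[/math] is a scalar and no matrix inverse is needed. The sketch below shows that special case in plain Python; the function name `condition_on_first` and the example numbers are hypothetical, and [math]\displaystyle{ \Sigma }[/math] is assumed symmetric so that [math]\displaystyle{ \Sigma_{12} }[/math] is the transpose of [math]\displaystyle{ \Sigma_{21} }[/math].

```python
def condition_on_first(mu, Sigma, x1):
    """Condition MVN(mu, Sigma) on its first coordinate equal to x1.

    mu:    list of length p (prior mean).
    Sigma: p x p list of lists (prior covariance, assumed symmetric).
    Returns (mu_bar, Sigma_bar) for the remaining p-1 coordinates.
    """
    p = len(mu)
    mu1 = mu[0]
    s11 = Sigma[0][0]  # Sigma_11 is a scalar because X_1 is one site
    if s11 <= 0.0:
        raise ValueError("Sigma_11 must be positive")

    # Sigma_21: covariances between sites 2..p and site 1.
    s21 = [Sigma[j][0] for j in range(1, p)]

    # mu_bar = mu_2 + Sigma_21 Sigma_11^{-1} (x1 - mu_1)
    mu_bar = [mu[j + 1] + s21[j] * (x1 - mu1) / s11 for j in range(p - 1)]

    # Sigma_bar = Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
    Sigma_bar = [
        [Sigma[a + 1][b + 1] - s21[a] * s21[b] / s11 for b in range(p - 1)]
        for a in range(p - 1)
    ]
    return mu_bar, Sigma_bar


# Hypothetical two-site example.
mu_bar, Sigma_bar = condition_on_first(
    [0.5, 0.5], [[0.25, 0.1], [0.1, 0.25]], x1=0.75
)
print(mu_bar, Sigma_bar)
```

Observing a frequency above the prior mean at site 1 pulls the conditional mean at a positively correlated site upward and shrinks its variance.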

Equivalently, we could derive the distribution [math]\displaystyle{ X_1 | X_2, M }[/math], that is, [math]\displaystyle{ f_{i,k,1} | (f_{i,k,2}, \ldots , f_{i,k,p}), M }[/math]

Likelihood for frequency at the test SNP t given all data

let [math]\displaystyle{ f_{obs} = (f_{i,k,j})_{j \not= t} }[/math], the vector of observed frequencies at all SNPs other than t

[math]\displaystyle{ L(f_{i,k,t}^{true}) = P(f_{obs} | f_{i,k,t}^{true}, M) = \frac{P( f_{i,k,t}^{true} | M, f_{obs}) P(f_{obs}|M)}{P(f_{i,k,t}^{true} | M)} }[/math]

Confused here: can we just use the expression derived above for [math]\displaystyle{ P( f_{i,k,t}^{true} | M, f_{obs}) }[/math]? Also, isn't [math]\displaystyle{ f_{i,k,t}^{true} | M }[/math] ~ [math]\displaystyle{ N(\mu_1, \Sigma_{11}) }[/math] and [math]\displaystyle{ f_{obs} | M }[/math] ~ [math]\displaystyle{ N(\mu_2, \Sigma_{22}) }[/math]? If so, how do we then incorporate [math]\displaystyle{ \beta }[/math] into the likelihood calculation?

But maybe we want to incorporate dispersion and measurement error parameters

Then: [math]\displaystyle{ f_{i,k,t}^{true} | M }[/math] ~ [math]\displaystyle{ N(\mu, \sigma^2 \Sigma) }[/math] The parameter [math]\displaystyle{ \sigma^2 }[/math] allows for over-dispersion

[math]\displaystyle{ f_{obs} | M }[/math] ~ [math]\displaystyle{ N_{p-1} (\mu_2, \sigma^2 \Sigma_{22} + \epsilon^2 I) }[/math] where [math]\displaystyle{ \epsilon^2 }[/math] allows for measurement error.
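To make the roles of the two extra parameters concrete, here is a small sketch of the covariance [math]\displaystyle{ \sigma^2 \Sigma_{22} + \epsilon^2 I }[/math] in plain Python. The function name `observed_covariance` and the numeric values are hypothetical; nothing here says how [math]\displaystyle{ \sigma^2 }[/math] or [math]\displaystyle{ \epsilon^2 }[/math] should be estimated.

```python
def observed_covariance(Sigma22, sigma2, eps2):
    """Return sigma2 * Sigma22 + eps2 * I as a list of lists.

    Sigma22: (p-1) x (p-1) prior covariance of the non-test SNPs.
    sigma2:  over-dispersion factor scaling the whole prior covariance.
    eps2:    measurement-error variance added to the diagonal only.
    """
    n = len(Sigma22)
    return [
        [sigma2 * Sigma22[a][b] + (eps2 if a == b else 0.0) for b in range(n)]
        for a in range(n)
    ]


# Hypothetical values: doubling the prior covariance, small measurement error.
cov_obs = observed_covariance([[0.25, 0.1], [0.1, 0.25]], sigma2=2.0, eps2=0.01)
print(cov_obs)
```

Setting `sigma2 = 1` and `eps2 = 0` recovers the original [math]\displaystyle{ \Sigma_{22} }[/math], so the earlier model is a special case.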

I also don't understand [math]\displaystyle{ f_{obs} | f_{i,k,t}^{true}, M }[/math]. Shouldn't it come from (2.12) and not (2.13)? Ask Matthew.
