User:Hussein Alasadi/Notebook/stephens/2013/10/13

Analyzing pooled sequenced data with selection

Intro to Wen & Stephens in 2D

Suppose we have only summary-level data for haplotypes $\displaystyle h_1, h_2, \ldots, h_{2n}$ . Specifically, let the summary-level data be denoted by $\displaystyle y = (y_1, y_2)' = \frac{1}{2n} \sum_{i=1}^{2n} h_i$ . In this two-locus model, the first locus is typed and the second locus is untyped. We want to predict the allele frequency at the untyped SNP using information from panel data (perhaps this can be interpreted as our prior). Formally, let $\displaystyle y_1$ denote the allele frequency at the typed SNP and $\displaystyle y_2$ the allele frequency at the untyped SNP. We assume that $\displaystyle h_1, h_2, \ldots, h_{2n}$ are independent and identically distributed draws from $\displaystyle P(M)$ (our prior).
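As a small illustration of the summary statistic, the sketch below computes $\displaystyle y$ as the column-wise mean of a 0/1 haplotype matrix. The function name `summary_frequencies` and the toy data are hypothetical, chosen only for this example.

```python
import numpy as np

def summary_frequencies(haplotypes):
    """Return y = (1/2n) * sum_i h_i for a (2n, p) array of 0/1 alleles."""
    h = np.asarray(haplotypes, dtype=float)
    return h.mean(axis=0)

# Hypothetical toy data: 2n = 4 haplotypes at the two loci.
h = [[1, 0],
     [1, 1],
     [0, 0],
     [1, 1]]
y = summary_frequencies(h)  # y[0] is the typed SNP, y[1] the untyped SNP
```

In the setting described above only $\displaystyle y_1$ would actually be observed; $\displaystyle y_2$ is the quantity to be predicted.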

We assume that $\displaystyle y \sim N(\mu, \Sigma)$ . By properties of the bivariate normal distribution, $\displaystyle y_2 \mid y_1, M \sim N\left(\mu_2 + \rho \frac{\sigma_2}{\sigma_1}(y_1 - \mu_1), (1-\rho^2)\sigma_2^2\right)$ where $\displaystyle \rho = \frac{E[(y_1-\mu_1)(y_2-\mu_2)]}{\sigma_1 \sigma_2}$ is the correlation between $\displaystyle y_1$ and $\displaystyle y_2$ . The key idea of Wen & Stephens is that the distribution of $\displaystyle y_2$ (the untyped SNP) is a function of both the panel data (through $\displaystyle \mu_2$ ) and the typed SNP ($\displaystyle y_1$ ).
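The conditional distribution can be evaluated directly from the standard bivariate normal result, in which the conditional variance is $\displaystyle (1-\rho^2)\sigma_2^2$ . The sketch below does this. The function name and all numerical values are made up for illustration and are not taken from any panel.

```python
def conditional_normal(y1, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of y2 given y1 under a bivariate normal.

    mean = mu2 + rho * (sigma2 / sigma1) * (y1 - mu1)
    var  = (1 - rho^2) * sigma2^2
    """
    mean = mu2 + rho * (sigma2 / sigma1) * (y1 - mu1)
    var = (1.0 - rho ** 2) * sigma2 ** 2
    return mean, var

# Hypothetical values: panel means (mu1, mu2), standard deviations, correlation.
mean, var = conditional_normal(y1=0.30, mu1=0.25, mu2=0.40,
                               sigma1=0.05, sigma2=0.04, rho=0.5)
```

When $\displaystyle \rho = 0$ the prediction falls back to the panel mean $\displaystyle \mu_2$ . As $\displaystyle |\rho| \to 1$ the conditional variance shrinks to zero and the typed SNP determines the prediction.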

Li & Stephens in 2D

We describe the Li & Stephens haplotype copying model. Let $\displaystyle h_1, h_2, \ldots, h_{k}$ denote the k sampled haplotypes at 2 loci. With two alleles at each locus, there are 4 possible haplotypes. The first haplotype is chosen uniformly at random from these four.

Now consider the conditional distribution of $\displaystyle h_{k+1}$ given $\displaystyle h_1, h_2, \ldots, h_k$ . Recall that the intuition is that $\displaystyle h_{k+1}$ is a mosaic of $\displaystyle h_1, h_2, \ldots, h_k$ .

Let $\displaystyle X_j$ denote which haplotype $\displaystyle h_{k+1}$ copies at site j (so $\displaystyle X_j \in \{1,2,\ldots,k\}$ ).

We model $\displaystyle X_j$ as a Markov chain on $\displaystyle \{1,\ldots,k\}$ with initial distribution $\displaystyle P(X_1 = x) = \frac{1}{k}$ . The transition probabilities are:

$\displaystyle P(X_{j+1}=x' \mid X_j = x) = e^{-\frac{\rho_jd_j}{k}} + (1-e^{-\frac{\rho_jd_j}{k}})\frac{1}{k}$ if $\displaystyle x'=x$ and

$\displaystyle (1-e^{-\frac{\rho_jd_j}{k}})\frac{1}{k}$ otherwise. Here $\displaystyle \rho_j$ denotes the recombination rate (per unit physical distance) and $\displaystyle d_j$ the physical distance between sites j and j+1.
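A minimal sketch of the transition matrix, assuming the exponent $\displaystyle \rho_j d_j / k$ appears in both terms so that each row sums to one. The function name `transition_matrix` and the parameter values are hypothetical.

```python
import numpy as np

def transition_matrix(k, rho_j, d_j):
    """k x k matrix with entries P(X_{j+1} = x' | X_j = x).

    With probability exp(-rho_j * d_j / k) the chain keeps copying the same
    haplotype; otherwise it jumps to a haplotype chosen uniformly from all k
    (possibly the same one).
    """
    stay = np.exp(-rho_j * d_j / k)
    T = np.full((k, k), (1.0 - stay) / k)   # off-diagonal: jump, then pick x'
    T[np.diag_indices(k)] += stay           # diagonal: add the no-jump mass
    return T

# Hypothetical values for illustration only.
T = transition_matrix(k=4, rho_j=0.01, d_j=100.0)
```

Each row of `T` sums to one. Larger $\displaystyle \rho_j d_j$ pushes the rows toward the uniform distribution $\displaystyle 1/k$ , meaning the copied haplotype at site j+1 is nearly independent of the one copied at site j.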

In a hidden Markov model there is also the emission process. To mimic the effects of mutation, the copying process may be imperfect: $\displaystyle P(h_{k+1,j} = a \mid X_j = x, h_1,\ldots,h_k) = \frac{k}{k+\theta} + \frac{\theta}{2(k+\theta)}$ if $\displaystyle h_{x,j} = a$ and $\displaystyle \frac{\theta}{2(k+\theta)}$ otherwise. Here $\displaystyle \theta = \left(\sum_{m=1}^{n-1} \frac{1}{m}\right)^{-1}$ (with $\displaystyle n$ the total number of sampled haplotypes, following Li & Stephens), where the motivation is that the more haplotypes are sampled, the less frequently mutation occurs.
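Putting the pieces together, the sketch below computes $\displaystyle P(h_{k+1} \mid h_1,\ldots,h_k)$ for the two-locus case with a forward recursion over the copying states. The function names, the toy panel, and the parameter values are hypothetical. Taking $\displaystyle n = k+1$ when computing $\displaystyle \theta$ is an assumption made only for this example.

```python
import numpy as np

def theta_tilde(n):
    """theta = (sum_{m=1}^{n-1} 1/m)^(-1); requires n >= 2."""
    return 1.0 / sum(1.0 / m for m in range(1, n))

def emission(a, copied, k, theta):
    """P(h_{k+1,j} = a | X_j = x), where copied = h_{x,j} and alleles are 0/1."""
    mutate = theta / (2.0 * (k + theta))
    return (k / (k + theta) + mutate) if a == copied else mutate

def prob_new_haplotype(new_hap, panel, rho, d, theta):
    """Forward recursion over two sites: P(h_{k+1} | h_1, ..., h_k)."""
    panel = np.asarray(panel)
    k = panel.shape[0]
    stay = np.exp(-rho * d / k)
    T = np.full((k, k), (1.0 - stay) / k) + stay * np.eye(k)
    # Emission probabilities at each site, one entry per copied haplotype x.
    e = [np.array([emission(new_hap[j], panel[x, j], k, theta) for x in range(k)])
         for j in range(2)]
    alpha = e[0] / k              # site 1: uniform initial state times emission
    alpha = (alpha @ T) * e[1]    # site 2: transition, then emission
    return float(alpha.sum())

# Hypothetical panel of k = 3 haplotypes at 2 loci.
panel = [[0, 0], [0, 1], [1, 1]]
theta = theta_tilde(len(panel) + 1)   # assumption: n = k + 1
p = prob_new_haplotype((0, 1), panel, rho=0.01, d=100.0, theta=theta)
```

Because the two emission probabilities at a site sum to one, the probabilities of the four possible new haplotypes also sum to one, which gives a quick sanity check on the implementation.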