User:Jonathan R. I. Coleman/Notebook/Notes and Protocols/2014/06/27: Difference between revisions

Revision as of 03:19, 26 January 2015

Imputation Clean-up

Main project page

Converting from IMPUTE2 .impute2 format to single-value genotype format

Joni Coleman, King's College London (Please send comments to jonathan[dot]coleman[at]kcl[dot]ac[dot]uk)

GUIDANCE FOR READING THIS FILE:

Comments look like this

   Commands (on the UNIX command line) look like this

PROBLEM:

IMPUTE2 outputs each genotype as three probabilities, each between 0 and 1, one for each possible genotype (AA, AB, BB). Some downstream applications instead require a single value per genotype, ranging from 0 to 2, where 0 = AA, 1 = AB and 2 = BB.

SOLUTION:

Recode the three genotype probabilities from IMPUTE2 into a single value (the expected number of copies of the B allele) with this basic equation:

[0 * p(AA)] + [1 * p(AB)] + [2 * p(BB)]

which simplifies to:

p(AB) + [2 * p(BB)]
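As a quick worked example of the equation, the sketch below applies it to one made-up probability triplet. The values and the variable names (probs, dosage) are illustrative assumptions, not taken from any real IMPUTE2 output.

```shell
# Hypothetical probabilities for one individual at one SNP: p(AA) p(AB) p(BB)
probs="0 0.25 0.75"

# Single-value genotype = p(AB) + [2 * p(BB)]
dosage=$(echo "$probs" | awk '{ print $2 + 2 * $3 }')
echo "$dosage"   # prints 1.75
```

A value of 1.75 sits between AB (1) and BB (2), reflecting that BB is the more likely genotype here.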

IMPLEMENTATION IN UNIX (Imputed to Phase3): (Credit: Tommy Carstensen)

   zcat impute2.gen.gz | awk '{printf $1"\t"$2; for(i=6; i<NF; i+=3) {printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz

1. Unzips the impute2 file

2. Prints the first two columns: the chromosome number (NB. "---" for unaltered imputed genotypes in impute2 file) and the SNP name (in the Phase3 release, the SNP name also contains the BP position and alleles)

3. Iterates over the genotype probability triplets on each line (one line per SNP, triplets starting at column 6) and prints a single-value dosage score for each individual.

4. Compresses the output with bgzip.
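To see what steps 2 and 3 produce, the sketch below runs the same awk logic on a single made-up line for two individuals. It leaves out zcat and bgzip so it can be run without an input file. The toy line and variable names are assumptions for illustration. The explicit printf formats are a small defensive change, not part of the credited command: they stop a "%" in a SNP name from being read as a format specifier.

```shell
# Toy IMPUTE2 line (made-up values): 5 leading columns, then one
# p(AA) p(AB) p(BB) triplet for each of two individuals.
toy_line='--- rs1:100:A:G 100 A G 1 0 0 0 0.5 0.5'

# Same logic as the Phase3 command, without zcat/bgzip.
toy_dosages=$(printf '%s\n' "$toy_line" | awk '{
    printf "%s\t%s", $1, $2              # chromosome field and SNP name
    for (i = 6; i < NF; i += 3)          # i is the p(AA) column of each triplet
        printf "\t%s", $(i+1) + 2 * $(i+2)
    printf "\n"
}')
printf '%s\n' "$toy_dosages"
```

The output is one tab-separated row: "---", the SNP name, then one dosage per individual (0 for the first individual, 1.5 for the second).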

IMPLEMENTATION IN UNIX (Imputed to Phase1 Integrated Haplotypes): (Credit: Tommy Carstensen)

   zcat impute2.gen.gz | awk '{printf $1"\t"$2"\t"$3"\t"$4"\t"$5; for(i=6; i<NF; i+=3) {printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz

1. The implementation is as above, but the SNP name in earlier releases contains only the rs ID, so this version also prints columns 3 to 5 (the BP position and both alleles).
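Both commands assume every line has 5 leading columns followed by complete probability triplets. The sketch below is an optional sanity check for that assumption; check_gen_fields is a hypothetical helper name, not part of IMPUTE2 or of the commands above.

```shell
# Hypothetical helper: report lines whose column count is not 5 + 3k (k >= 1).
# Exits non-zero if any such line is found.
check_gen_fields() {
    awk 'NF < 8 || (NF - 5) % 3 != 0 { bad++; print "line " NR ": " NF " fields" }
         END { exit (bad > 0) }'
}

# Toy line (made-up values) with two complete triplets: passes the check.
printf '%s\n' '--- rs1 100 A G 1 0 0 0 0.5 0.5' | check_gen_fields && echo "ok"
```

On real data the helper could be fed in the same way as the conversion, for example by piping the output of zcat impute2.gen.gz into check_gen_fields before running either command.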

@@ Line 15: / Line 15: @@
 '''GUIDANCE FOR READING THIS FILE:'''
+Comments look like this
-Instructions look like this
-''Comments look like this''
      Commands (on the UNIX command line) look like this
@@ Line 42: / Line 39: @@
-'''IMPLEMENTATION IN UNIX:'''
+'''IMPLEMENTATION IN UNIX (Imputed to Phase3): (Credit: Tommy Carstensen)'''
+    zcat impute2.gen.gz | awk '{printf $1"\t"$2; for(i=6; i<NF; i+=3) {printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz
-'''(WARNING: INELEGANT AND MESSY – RECOMMENDED THAT THIS IS PERFORMED ON A CHROMOSOME-BY-CHROMOSOME BASIS AT THE LARGEST)'''
+.	Unzips the impute2 file
+.	Prints the chromosome number (NB. "---" for unaltered imputed genotypes in impute2 file) and the SNP name (in Phase3 release, this is also the BP positions and alleles)
-.	Identify the sample size and call it n
+.	Iterates through the impute2 file and makes single-value dosage score for each line.
-''a.	If this is not already know, run:''
+.	Gzips the output file.
-    wc –l [file.name].impute2_info_by_sample
-''b.	The result of this is n + 1 (because there is a header row), so -1 to give n''
+'''IMPLEMENTATION IN UNIX (Imputed to Phase1 Integrated Haplotypes): (Credit: Tommy Carstensen)'''
+    zcat impute2.gen.gz | awk '{printf $1"\t"$2"\t"$3"\t"$4"\t"$5; for(i=6; i<NF; i+=3) {printf "\t"$(i+0)*0+$(i+1)*1+$(i+2)*2}; printf "\n"}' | bgzip > dosages.gz
-.	Identify the number of SNPs and call it m
+.	Implementation is as above, but the SNP name in earlier releases only contains rs ID, so this adds the BP and alleles to the file.
-''a.	If this is not already know, run:''
+<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
-    wc –l [file.name].impute2
-''b.	The result of this is m''
-.	Run this awk command:
-    awk '{for (i=1; i<=((NF-5)/3); i++) print $(3*(i+1)+1)+2*$(3*(i+1)+2)}' < [file.name].impute2 > [file.name].vector
-''a.	This command tells awk to read through the .impute2 file iteratively and print values for column 7 + [2 * column 8] (that is, p(AB) + [2 * p(BB)] for the first SNP), and then do the same for columns 10 and 11, and so on until the end of the file''
-''b.	The output file is a 1 x (n * m) vector, where n is sample size and m is number of SNPs, such that all the genotypes for SNP1 are listed, then all the genotypes for SNP2, etc.
-''c.	We want this to be a matrix of n x m''
-.	Split the vector into m pieces
-    split –a 5 -d –l [n] [file.name].vector [file.name]_pieces
-''a.	This breaks the genotype file into m files, each of which is a vector of length n, called [file.name]_genotypes[XXXXX], where [XXXXX] is a number ranging from 00000 to m-1''.
-''i.	That is, there is now one file of the genotypes for each SNP''
-''b.	–a x gives a suffix of x X’s (so –a 2 would give [file.name]_genotypes[XX], -a 3 would give [file.name]_genotypes[XXX], etc.)''
-.	Put the file back together
-    awk '{print $1, $2, $3, $4, $5}' [file.name].impute2 > front_piece
-''a.	Retrieves the non-genotype information from the .impute2 file''
-    paste –s -d " " [file.name]_pieces* > [file.name]_genotypes
-''b.	Creates a file where each row is the genotypes for each SNP''
-    paste –d " " front_piece [file.name]_genotypes > [file.name].genotypes
-''c.	Combines the non-genotype and genotype information together to make the genotype file''
-''i.	Note that the –s option was used in the first case, but is not used in the second''
-<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->
-|}
 __NOTOC__
