User:Timothee Flutre/Notebook/Postdoc/2012/01/22: Difference between revisions
(→Apply PCA on genotypes with EIGENSOFT: add link hapmap2impute.py) |
(→Apply PCA on genotypes with EIGENSOFT: add commands to apply PCA on HapMap and project study samples on their PCs) |
||
Line 38: | Line 38: | ||
$ gnuplot < commands.gnuplot > newplot.ps; ps2pdf newplot.ps newplot.pdf | $ gnuplot < commands.gnuplot > newplot.ps; ps2pdf newplot.ps newplot.pdf | ||
* Let's now assume that we have some samples and that we want to see "where they are" compare to samples from the HapMap project | |||
* Let's now assume that we have some samples and that we want to see "where they are" compare to samples from the HapMap project, eg. are they closer to CEU or YRI samples? is there any admixed individual in my study? can I identify outliers? ... EIGENSOFT allows to do this easily. | |||
We start by retrieving the genotypes from the HapMap project: | |||
for pop in {CEU,CHB,JPT,YRI}; do for i in {1..22}; do \ | |||
wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/genotypes_chr${i}_${pop}_r28_nr.b36_fwd.txt.gz; \ | |||
done; done | |||
Then, we convert the SNP coordinates to the latest release of the reference human genome. For this, first, we extract coordinates of SNPs having a high quality into one big BED file. Second, we remove duplicated lines. Third, we use the UCSC program "liftOver" to convert the coordinates from hg18 to hg19. | |||
rm -f snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed; for p in {CEU,CHB,JPT,YRI}; do echo ${p}; for c in {1..22}; do echo "chr"${c}; \ | |||
zcat genotypes_chr${c}_${p}_r28_nr.b36_fwd.txt.gz | awk 'NR>1 {if($11 == "QC+") print $3"\t"$4-1"\t"$4"\t"$1"\t1000\t"$5 >> "snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed"}'; \ | |||
done; done | |||
wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed | |||
15638438 | |||
sort -k 4 snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort | |||
uniq snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq | |||
wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq | |||
4038524 | |||
gzip snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq | |||
liftOver snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq.gz hg18ToHg19.over.chain.gz snps_hapmap_r28_nr_b37.bed unMapped | |||
wc -l < snps_hapmap_r28_nr_b37.bed | |||
4037640 | |||
grep -c -v "#" unMapped | |||
884 | |||
cut -f4 snps_hapmap_r28_nr_b37.bed | uniq -d > list_snps_redundant.txt | |||
wc -l < list_snps_redundant.txt | |||
17 | |||
Then, we convert the genotypes from HapMap into the IMPUTE format thanks to [https://github.com/timflutre/quantgen/blob/master/hapmap2impute.py this script]: | |||
for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ | |||
hapmap2impute.py -i genotypes_CHR_${pop}_r28_nr.b36_fwd.txt.gz -o genotypes_hapmap_r28_b37_${pop}.impute.gz \ | |||
-b snps_hapmap_r28_nr_b37.bed.gz -s list_snps_redundant.txt; done | |||
zcat genotypes_hapmap_r28_b37_CEU.impute.gz | wc -l | |||
3907899 | |||
zcat genotypes_hapmap_r28_b37_CHB.impute.gz | wc -l | |||
3933013 | |||
zcat genotypes_hapmap_r28_b37_JPT.impute.gz | wc -l | |||
3931282 | |||
zcat genotypes_hapmap_r28_b37_YRI.impute.gz | wc -l | |||
3862842 | |||
Then, we convert again the genotypes from the IMPUTE format into the EIGENSTRAT format, thus obtaining one set of 3 files per HapMap population: | |||
for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ | |||
impute2eigenstrat.py -i genotypes_hapmap_r28_b37_${pop}.impute.gz -o hapmap_r28_b37_${pop} -l HMr28_${pop}; done | |||
Then we merge these 4 sets of 3 files into one set of 3 files, using the program "mergeit" in the EIGENSFOT package (see the documentation for "mergeit" to write a proper parameter file): | |||
mergeit -p par_mergeCEUandCHB.txt | |||
mergeit -p par_mergeCEUCHBandJPT.txt | |||
mergeit -p par_mergeCEUCHBJPTandYRI.txt | |||
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_genotypes.txt | |||
3146166 | |||
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_indiv.txt | |||
638 | |||
We also need to do the same with the samples from our study: | |||
wc -l < mystudy_eigenstrat_genotypes.txt | |||
418564 | |||
mergeit -p par_mergeCEUCHBJPTYRIandMystudy.txt | |||
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt | |||
417251 | |||
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt | |||
723 | |||
And finally we can launch "smartpca.perl" with the option "-w" to apply the PCA only on the HapMap samples but to project our study samples on their PCs: | |||
echo -e "HMr28_CEU\nHMr28_CHB\nHMr28_JPT\nHMr28_YRI" > poplist | |||
smartpca.perl -i hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt \ | |||
-a hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_snp.txt \ | |||
-b hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt \ | |||
-o hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_pca.txt \ | |||
-p hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_plot \ | |||
-e hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_eigenvalues.txt \ | |||
-l hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_log.txt \ | |||
-w poplist -m 0 | |||
In case we want to easily identify outliers, we can improve the Gnuplot parameter file in order to see the labels of the individuals in our study (see [http://stackoverflow.com/a/9182412/597069 this answer] on SO). | |||
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> | <!-- ##### DO NOT edit below this line unless you know what you are doing. ##### --> |
Revision as of 07:56, 8 February 2012
Project name | <html><img src="/images/9/94/Report.png" border="0" /></html> Main project page <html><img src="/images/c/c3/Resultset_previous.png" border="0" /></html>Previous entry<html> </html>Next entry<html><img src="/images/5/5c/Resultset_next.png" border="0" /></html> |
Apply PCA on genotypes with EIGENSOFT
Assuming that the genotypes are in the IMPUTE format, we need to convert them into the EIGENSTRAT format thanks to this script: impute2eigenstrat.py -i genotypes_allchrs.impute.gz -o mystudy -l MyStudy Now we can run the PCA: smartpca.perl -i mystudy_eigenstrat_genotypes.txt -a mystudy_eigenstrat_snp.txt -b mystudy_eigenstrat_indiv.txt \ -o mystudy_eigenstrat_pca.txt -p mystudy_eigenstrat_plot -e mystudy_eigenstrat_eigenvalues.txt -l mystudy_eigenstrat_log.txt -s 6.0
As an example, let's assume that the output data file from EIGENSOFT containing the PCs is like this (3 individuals, 2 PCs): $ cat data.txt ind1 2.0 2.0 ind2 2.5 1.8 ind3 1.8 2.0 Here is the file with the commands ("+0.1" is an offset which needs to be tuned with respect to the data): $ cat commands.gnuplot set terminal postscript color set title "" set xlabel "eigenvector 1" set ylabel "eigenvector 2" #plot "mystudy_eigenstrat_pca.txt.evec" using 2:3 title "MyStudy" plot "data.txt" using 2:3 title "Data", "" using 2:($3+0.1):1 with labels notitle ## pause 9999 We can obtain the new plot like this: $ gnuplot < commands.gnuplot > newplot.ps; ps2pdf newplot.ps newplot.pdf
We start by retrieving the genotypes from the HapMap project: for pop in {CEU,CHB,JPT,YRI}; do for i in {1..22}; do \ wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/genotypes_chr${i}_${pop}_r28_nr.b36_fwd.txt.gz; \ done; done Then, we convert the SNP coordinates to the latest release of the reference human genome. For this, first, we extract coordinates of SNPs having a high quality into one big BED file. Second, we remove duplicated lines. Third, we use the UCSC program "liftOver" to convert the coordinates from hg18 to hg19. rm -f snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed; for p in {CEU,CHB,JPT,YRI}; do echo ${p}; for c in {1..22}; do echo "chr"${c}; \ zcat genotypes_chr${c}_${p}_r28_nr.b36_fwd.txt.gz | awk 'NR>1 {if($11 == "QC+") print $3"\t"$4-1"\t"$4"\t"$1"\t1000\t"$5 >> "snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed"}'; \ done; done wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed 15638438 sort -k 4 snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort uniq snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq 4038524 gzip snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq liftOver snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq.gz hg18ToHg19.over.chain.gz snps_hapmap_r28_nr_b37.bed unMapped wc -l < snps_hapmap_r28_nr_b37.bed 4037640 grep -c -v "#" unMapped 884 cut -f4 snps_hapmap_r28_nr_b37.bed | uniq -d > list_snps_redundant.txt wc -l < list_snps_redundant.txt 17 Then, we convert the genotypes from HapMap into the IMPUTE format thanks to this script: for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ hapmap2impute.py -i genotypes_CHR_${pop}_r28_nr.b36_fwd.txt.gz -o genotypes_hapmap_r28_b37_${pop}.impute.gz \ -b snps_hapmap_r28_nr_b37.bed.gz -s list_snps_redundant.txt; done zcat genotypes_hapmap_r28_b37_CEU.impute.gz | wc -l 3907899 zcat genotypes_hapmap_r28_b37_CHB.impute.gz | wc -l 3933013 zcat genotypes_hapmap_r28_b37_JPT.impute.gz | wc -l 3931282 zcat genotypes_hapmap_r28_b37_YRI.impute.gz | wc -l 3862842 Then, we convert again the genotypes from the IMPUTE format into the EIGENSTRAT format, thus obtaining one set of 3 files per HapMap population: for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ impute2eigenstrat.py -i genotypes_hapmap_r28_b37_${pop}.impute.gz -o hapmap_r28_b37_${pop} -l HMr28_${pop}; done Then we merge these 4 sets of 3 files into one set of 3 files, using the program "mergeit" in the EIGENSFOT package (see the documentation for "mergeit" to write a proper parameter file): mergeit -p par_mergeCEUandCHB.txt mergeit -p par_mergeCEUCHBandJPT.txt mergeit -p par_mergeCEUCHBJPTandYRI.txt wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_genotypes.txt 3146166 wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_indiv.txt 638 We also need to do the same with the samples from our study: wc -l < mystudy_eigenstrat_genotypes.txt 418564 mergeit -p par_mergeCEUCHBJPTYRIandMystudy.txt wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt 417251 wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt 723 And finally we can launch "smartpca.perl" with the option "-w" to apply the PCA only on the HapMap samples but to project our study samples on their PCs: echo -e "HMr28_CEU\nHMr28_CHB\nHMr28_JPT\nHMr28_YRI" > poplist smartpca.perl -i hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt \ -a hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_snp.txt \ -b hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt \ -o hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_pca.txt \ -p hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_plot \ -e hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_eigenvalues.txt \ -l hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_log.txt \ -w poplist -m 0 In case we want to easily identify outliers, we can improve the Gnuplot parameter file in order to see the labels of the individuals in our study (see this answer on SO). |