User:Timothee Flutre/Notebook/Postdoc/2012/01/22
Project name | Main project page Previous entry Next entry |
Apply PCA on genotypes with EIGENSOFT
Assuming that the genotypes are in the IMPUTE format, we need to convert them into the EIGENSTRAT format thanks to this script: impute2eigenstrat.py -i genotypes_allchrs.impute.gz -o mystudy -l MyStudy Now we can run the PCA: smartpca.perl -i mystudy_eigenstrat_genotypes.txt -a mystudy_eigenstrat_snp.txt -b mystudy_eigenstrat_indiv.txt \ -o mystudy_eigenstrat_pca.txt -p mystudy_eigenstrat_plot -e mystudy_eigenstrat_eigenvalues.txt -l mystudy_eigenstrat_log.txt -s 6.0
As an example, let's assume that the output data file from EIGENSOFT containing the PCs is like this (3 individuals, 2 PCs): $ cat data.txt ind1 2.0 2.0 ind2 2.5 1.8 ind3 1.8 2.0 Here is the file with the commands ("+0.1" is an offset which needs to be tuned with respect to the data): $ cat commands.gnuplot set terminal postscript color set title "" set xlabel "eigenvector 1" set ylabel "eigenvector 2" #plot "mystudy_eigenstrat_pca.txt.evec" using 2:3 title "MyStudy" plot "data.txt" using 2:3 title "Data", "" using 2:($3+0.1):1 with labels notitle ## pause 9999 We can obtain the new plot like this: $ gnuplot < commands.gnuplot > newplot.ps; ps2pdf newplot.ps newplot.pdf
We start by retrieving the genotypes from the HapMap project: for pop in {CEU,CHB,JPT,YRI}; do for i in {1..22}; do \ wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/genotypes_chr${i}_${pop}_r28_nr.b36_fwd.txt.gz; \ done; done Then, we convert the SNP coordinates to the latest release of the reference human genome. For this, first, we extract coordinates of SNPs having a high quality into one big BED file. Second, we remove duplicated lines. Third, we use the UCSC program "liftOver" to convert the coordinates from hg18 to hg19. rm -f snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed; for p in {CEU,CHB,JPT,YRI}; do echo ${p}; for c in {1..22}; do echo "chr"${c}; \ zcat genotypes_chr${c}_${p}_r28_nr.b36_fwd.txt.gz | awk 'NR>1 {if($11 == "QC+") print $3"\t"$4-1"\t"$4"\t"$1"\t1000\t"$5 >> "snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed"}'; \ done; done wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed 15638438 sort -k 4 snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort uniq snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq 4038524 gzip snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq liftOver snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq.gz hg18ToHg19.over.chain.gz snps_hapmap_r28_nr_b37.bed unMapped wc -l < snps_hapmap_r28_nr_b37.bed 4037640 grep -c -v "#" unMapped 884 cut -f4 snps_hapmap_r28_nr_b37.bed | uniq -d > list_snps_redundant.txt wc -l < list_snps_redundant.txt 17 Then, we convert the genotypes from HapMap into the IMPUTE format thanks to this script: for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ hapmap2impute.py -i genotypes_CHR_${pop}_r28_nr.b36_fwd.txt.gz -o genotypes_hapmap_r28_b37_${pop}.impute.gz \ -b snps_hapmap_r28_nr_b37.bed.gz -s list_snps_redundant.txt; done zcat genotypes_hapmap_r28_b37_CEU.impute.gz | wc -l 3907899 zcat genotypes_hapmap_r28_b37_CHB.impute.gz | wc -l 3933013 zcat genotypes_hapmap_r28_b37_JPT.impute.gz | wc -l 3931282 zcat genotypes_hapmap_r28_b37_YRI.impute.gz | wc -l 3862842 Then, we convert again the genotypes from the IMPUTE format into the EIGENSTRAT format, thus obtaining one set of 3 files per HapMap population: for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ impute2eigenstrat.py -i genotypes_hapmap_r28_b37_${pop}.impute.gz -o hapmap_r28_b37_${pop} -l HMr28_${pop}; done Then we merge these 4 sets of 3 files into one set of 3 files, using the program "mergeit" in the EIGENSFOT package (see the documentation for "mergeit" to write a proper parameter file): mergeit -p par_mergeCEUandCHB.txt mergeit -p par_mergeCEUCHBandJPT.txt mergeit -p par_mergeCEUCHBJPTandYRI.txt wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_genotypes.txt 3146166 wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_indiv.txt 638 We also need to do the same with the samples from our study: wc -l < mystudy_eigenstrat_genotypes.txt 418564 mergeit -p par_mergeCEUCHBJPTYRIandMystudy.txt wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt 417251 wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt 723 And finally we can launch "smartpca.perl" with the option "-w" to apply the PCA only on the HapMap samples but to project our study samples on their PCs: echo -e "HMr28_CEU\nHMr28_CHB\nHMr28_JPT\nHMr28_YRI" > poplist smartpca.perl -i hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt \ -a hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_snp.txt \ -b hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt \ -o hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_pca.txt \ -p hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_plot \ -e hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_eigenvalues.txt \ -l hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_log.txt \ -w poplist -m 0 In case we want to easily identify outliers, we can improve the Gnuplot parameter file in order to see the labels of the individuals in our study (see this answer on SO). |