User:Timothee Flutre/Notebook/Postdoc/2012/01/22

From OpenWetWare
Jump to navigationJump to search
Project name Main project page
Previous entry      Next entry

Apply PCA on genotypes with EIGENSOFT

  • Use EIGENSOFT to apply PCA on genotypes in order to check how individuals sampled in the study are related with each other.

Assuming that the genotypes are in the IMPUTE format, we need to convert them into the EIGENSTRAT format thanks to this script:

impute2eigenstrat.py -i genotypes_allchrs.impute.gz -o mystudy -l MyStudy

Now we can run the PCA:

smartpca.perl -i mystudy_eigenstrat_genotypes.txt -a mystudy_eigenstrat_snp.txt -b mystudy_eigenstrat_indiv.txt \
-o mystudy_eigenstrat_pca.txt -p mystudy_eigenstrat_plot -e mystudy_eigenstrat_eigenvalues.txt -l mystudy_eigenstrat_log.txt -s 6.0
  • If GNUPLOT is installed, we now have a plot of the first two principal components (in a pdf document). It may be useful to add the labels of the individuals on this plot. For this, one can modify the file "mystudy_eigenstrat_plot.xtxt" containing the commands for GNUPLOT.

As an example, let's assume that the output data file from EIGENSOFT containing the PCs is like this (3 individuals, 2 PCs):

$ cat data.txt
ind1	2.0	2.0
ind2	2.5	1.8
ind3	1.8	2.0

Here is the file with the commands ("+0.1" is an offset which needs to be tuned with respect to the data):

$ cat commands.gnuplot
set terminal postscript color
set title  "" 
set xlabel  "eigenvector 1" 
set ylabel  "eigenvector 2" 
#plot  "mystudy_eigenstrat_pca.txt.evec" using 2:3 title "MyStudy"
plot  "data.txt" using 2:3 title "Data", "" using 2:($3+0.1):1 with labels notitle
## pause 9999

We can obtain the new plot like this:

$ gnuplot < commands.gnuplot > newplot.ps; ps2pdf newplot.ps newplot.pdf


  • Let's now assume that we have some samples and that we want to see "where they are" compare to samples from the HapMap project, eg. are they closer to CEU or YRI samples? is there any admixed individual in my study? can I identify outliers? ... EIGENSOFT allows to do this easily.

We start by retrieving the genotypes from the HapMap project:

for pop in {CEU,CHB,JPT,YRI}; do for i in {1..22}; do \
    wget http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseII+III/forward/genotypes_chr${i}_${pop}_r28_nr.b36_fwd.txt.gz; \
    done; done

Then, we convert the SNP coordinates to the latest release of the reference human genome. For this, first, we extract coordinates of SNPs having a high quality into one big BED file. Second, we remove duplicated lines. Third, we use the UCSC program "liftOver" to convert the coordinates from hg18 to hg19.

rm -f snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed; for p in {CEU,CHB,JPT,YRI}; do echo ${p}; for c in {1..22}; do echo "chr"${c}; \
    zcat genotypes_chr${c}_${p}_r28_nr.b36_fwd.txt.gz | awk 'NR>1 {if($11 == "QC+") print $3"\t"$4-1"\t"$4"\t"$1"\t1000\t"$5 >> "snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed"}'; \
    done; done
wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed
15638438
sort -k 4 snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort
uniq snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq
wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq
4038524
gzip snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq
liftOver snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq.gz hg18ToHg19.over.chain.gz snps_hapmap_r28_nr_b37.bed unMapped
wc -l < snps_hapmap_r28_nr_b37.bed
4037640
grep -c -v "#" unMapped
884
cut -f4 snps_hapmap_r28_nr_b37.bed | uniq -d > list_snps_redundant.txt
wc -l < list_snps_redundant.txt
17

Then, we convert the genotypes from HapMap into the IMPUTE format thanks to this script:

for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \
hapmap2impute.py -i genotypes_CHR_${pop}_r28_nr.b36_fwd.txt.gz -o genotypes_hapmap_r28_b37_${pop}.impute.gz \
    -b snps_hapmap_r28_nr_b37.bed.gz -s list_snps_redundant.txt; done
zcat genotypes_hapmap_r28_b37_CEU.impute.gz | wc -l
3907899
zcat genotypes_hapmap_r28_b37_CHB.impute.gz | wc -l
3933013
zcat genotypes_hapmap_r28_b37_JPT.impute.gz | wc -l
3931282
zcat genotypes_hapmap_r28_b37_YRI.impute.gz | wc -l
3862842

Then, we convert again the genotypes from the IMPUTE format into the EIGENSTRAT format, thus obtaining one set of 3 files per HapMap population:

for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \
   impute2eigenstrat.py -i genotypes_hapmap_r28_b37_${pop}.impute.gz -o hapmap_r28_b37_${pop} -l HMr28_${pop}; done

Then we merge these 4 sets of 3 files into one set of 3 files, using the program "mergeit" in the EIGENSFOT package (see the documentation for "mergeit" to write a proper parameter file):

mergeit -p par_mergeCEUandCHB.txt
mergeit -p par_mergeCEUCHBandJPT.txt
mergeit -p par_mergeCEUCHBJPTandYRI.txt
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_genotypes.txt
3146166
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_indiv.txt
638

We also need to do the same with the samples from our study:

wc -l < mystudy_eigenstrat_genotypes.txt
418564
mergeit -p par_mergeCEUCHBJPTYRIandMystudy.txt
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt
417251
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt
723

And finally we can launch "smartpca.perl" with the option "-w" to apply the PCA only on the HapMap samples but to project our study samples on their PCs:

echo -e "HMr28_CEU\nHMr28_CHB\nHMr28_JPT\nHMr28_YRI" > poplist
smartpca.perl -i hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt \
    -a hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_snp.txt \
    -b hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt \
    -o hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_pca.txt \
    -p hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_plot \
    -e hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_eigenvalues.txt \
    -l hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_log.txt \
    -w poplist -m 0

In case we want to easily identify outliers, we can improve the Gnuplot parameter file in order to see the labels of the individuals in our study (see this answer on SO).