User:Timothee Flutre/Notebook/Postdoc/2012/01/22

From OpenWetWare
Jump to: navigation, search
Owwnotebook icon.png Project name Report.pngMain project page
Resultset previous.pngPrevious entry      Next entryResultset next.png

Apply PCA on genotypes with EIGENSOFT

  • Use EIGENSOFT to apply PCA on genotypes in order to check how individuals sampled in the study are related with each other.

Assuming that the genotypes are in the IMPUTE format, we need to convert them into the EIGENSTRAT format thanks to this script: -i genotypes_allchrs.impute.gz -o mystudy -l MyStudy

Now we can run the PCA:

smartpca.perl -i mystudy_eigenstrat_genotypes.txt -a mystudy_eigenstrat_snp.txt -b mystudy_eigenstrat_indiv.txt \
-o mystudy_eigenstrat_pca.txt -p mystudy_eigenstrat_plot -e mystudy_eigenstrat_eigenvalues.txt -l mystudy_eigenstrat_log.txt -s 6.0
  • If GNUPLOT is installed, we now have a plot of the first two principal components (in a pdf document). It may be useful to add the labels of the individuals on this plot. For this, one can modify the file "mystudy_eigenstrat_plot.xtxt" containing the commands for GNUPLOT.

As an example, let's assume that the output data file from EIGENSOFT containing the PCs is like this (3 individuals, 2 PCs):

$ cat data.txt
ind1	2.0	2.0
ind2	2.5	1.8
ind3	1.8	2.0

Here is the file with the commands ("+0.1" is an offset which needs to be tuned with respect to the data):

$ cat commands.gnuplot
set terminal postscript color
set title  "" 
set xlabel  "eigenvector 1" 
set ylabel  "eigenvector 2" 
#plot  "mystudy_eigenstrat_pca.txt.evec" using 2:3 title "MyStudy"
plot  "data.txt" using 2:3 title "Data", "" using 2:($3+0.1):1 with labels notitle
## pause 9999

We can obtain the new plot like this:

$ gnuplot < commands.gnuplot >; ps2pdf newplot.pdf

  • Let's now assume that we have some samples and that we want to see "where they are" compare to samples from the HapMap project, eg. are they closer to CEU or YRI samples? is there any admixed individual in my study? can I identify outliers? ... EIGENSOFT allows to do this easily.

We start by retrieving the genotypes from the HapMap project:

for pop in {CEU,CHB,JPT,YRI}; do for i in {1..22}; do \
    wget${i}_${pop}_r28_nr.b36_fwd.txt.gz; \
    done; done

Then, we convert the SNP coordinates to the latest release of the reference human genome. For this, first, we extract coordinates of SNPs having a high quality into one big BED file. Second, we remove duplicated lines. Third, we use the UCSC program "liftOver" to convert the coordinates from hg18 to hg19.

rm -f snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed; for p in {CEU,CHB,JPT,YRI}; do echo ${p}; for c in {1..22}; do echo "chr"${c}; \
    zcat genotypes_chr${c}_${p}_r28_nr.b36_fwd.txt.gz | awk 'NR>1 {if($11 == "QC+") print $3"\t"$4-1"\t"$4"\t"$1"\t1000\t"$5 >> "snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed"}'; \
    done; done
wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed
sort -k 4 snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort
uniq snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort > snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq
wc -l < snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq
gzip snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq
liftOver snps_hapmap_r28_nr_b36_qc+_allchrs_allpops.bed.sort.uniq.gz hg18ToHg19.over.chain.gz snps_hapmap_r28_nr_b37.bed unMapped
wc -l < snps_hapmap_r28_nr_b37.bed
grep -c -v "#" unMapped
cut -f4 snps_hapmap_r28_nr_b37.bed | uniq -d > list_snps_redundant.txt
wc -l < list_snps_redundant.txt

Then, we convert the genotypes from HapMap into the IMPUTE format thanks to this script:

for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ -i genotypes_CHR_${pop}_r28_nr.b36_fwd.txt.gz -o genotypes_hapmap_r28_b37_${pop}.impute.gz \
    -b snps_hapmap_r28_nr_b37.bed.gz -s list_snps_redundant.txt; done
zcat genotypes_hapmap_r28_b37_CEU.impute.gz | wc -l
zcat genotypes_hapmap_r28_b37_CHB.impute.gz | wc -l
zcat genotypes_hapmap_r28_b37_JPT.impute.gz | wc -l
zcat genotypes_hapmap_r28_b37_YRI.impute.gz | wc -l

Then, we convert again the genotypes from the IMPUTE format into the EIGENSTRAT format, thus obtaining one set of 3 files per HapMap population:

for pop in {CEU,CHB,JPT,YRI}; do echo ${pop}; \ -i genotypes_hapmap_r28_b37_${pop}.impute.gz -o hapmap_r28_b37_${pop} -l HMr28_${pop}; done

Then we merge these 4 sets of 3 files into one set of 3 files, using the program "mergeit" in the EIGENSFOT package (see the documentation for "mergeit" to write a proper parameter file):

mergeit -p par_mergeCEUandCHB.txt
mergeit -p par_mergeCEUCHBandJPT.txt
mergeit -p par_mergeCEUCHBJPTandYRI.txt
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_genotypes.txt
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI_eigenstrat_indiv.txt

We also need to do the same with the samples from our study:

wc -l < mystudy_eigenstrat_genotypes.txt
mergeit -p par_mergeCEUCHBJPTYRIandMystudy.txt
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt
wc -l < hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt

And finally we can launch "smartpca.perl" with the option "-w" to apply the PCA only on the HapMap samples but to project our study samples on their PCs:

echo -e "HMr28_CEU\nHMr28_CHB\nHMr28_JPT\nHMr28_YRI" > poplist
smartpca.perl -i hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_genotypes.txt \
    -a hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_snp.txt \
    -b hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_indiv.txt \
    -o hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_pca.txt \
    -p hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_plot \
    -e hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_eigenvalues.txt \
    -l hapmap_r28_b37_CEU-CHB-JPT-YRI-Mystudy_eigenstrat_log.txt \
    -w poplist -m 0

In case we want to easily identify outliers, we can improve the Gnuplot parameter file in order to see the labels of the individuals in our study (see this answer on SO).