LDhelmet workflow

From OpenWetWare

Steps to building a recombination map with LDhelmet

0. Check whether pseudorefs for other Xiphophorus species mapped to your focal genome already exist; if not, you should generate them along with step 1.

1. Map, variant call, and make pseudorefs

Get all high coverage data for the species of interest

  • For X. bir, be sure to include the Family data for later Mendelian error correction


Follow the steps of "Mapping variant calling with GATK" and "g.vcf files to pseudo-fasta file" from the main workflow page:

Schumer lab: Commonly used workflows

You will also want to mask the repeat regions. You'll need a BED-formatted file with the regions found by RepeatMasker and a final column of Ns.

/home/groups/schumer/shared_bin/seqtk mutfa sampleID_pseudoref_focalGenome.fasta focalGenome_repeats.bed > sampleID_pseudoref_focalGenome_repeatMasked.fasta

This slurm submit script does this in a loop, given a text file of tab-delimited shortened sample IDs and full insnp names, plus the reference genome name.

sbatch submitPseudoFASTA.sh insnpKeyList.txt focalGenome

(You should have files called focalGenome.fa and focalGenome_repeats.bed in your folder for this to work)
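The key file is just two tab-separated columns: the shortened sample ID and the full insnp filename. For example (hypothetical names):

```
sample1	sample1_focalGenome.insnp
sample2	sample2_focalGenome.insnp
```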

#!/bin/bash
#SBATCH --job-name=pseudoFASTA
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=32000
#SBATCH -p schumer,hns,owners

IFS=$'\n' read -d '' -r -a file_names < "$1"
for i in "${file_names[@]}"
do
    ID=$(echo "$i" | cut -f1)
    echo $ID
    INSNP=$(echo "$i" | cut -f2)
    /home/groups/schumer/shared_bin/seqtk mutfa "$2".fa $INSNP > "$ID"_pseudoref_"$2".fasta
    /home/groups/schumer/shared_bin/seqtk mutfa "$ID"_pseudoref_"$2".fasta "$2"_repeats.bed > "$ID"_pseudoref_"$2"_repeatMasked.fasta
done

From here you should combine these pseudorefs and split them by chromosome. With a tab-delimited key of sample ID and repeat-masked pseudoref filename you can use this slurm submit script.

sbatch submitSplitConcatFASTAbyChr.sh fastaKey.txt chrName
#!/bin/bash
#SBATCH --job-name=splitFASTA
#SBATCH --time=4:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=32000
#SBATCH -p schumer,hns,owners

CHR=$2
IFS=$'\n' read -d '' -r -a file_names < "$1"
for i in "${file_names[@]}"
do
   ID=$(echo "$i" | cut -f1)
   FASTA=$(echo "$i" | cut -f2)
   echo $ID
   perl /home/groups/schumer/shared_bin/Lab_shared_scripts/getScaffold_samtools.pl $FASTA $CHR > "$ID"_"$CHR".fa	
   sed -i "s|"$CHR"|"$ID":"$CHR"|g" "$ID"_"$CHR".fa
   cat "$ID"_"$CHR".fa >> allsamples_"$CHR".fa
done


You can also submit these in a loop with a text file listing the chromosomes:

./runSubmitSplitConcatFASTAbyChr.sh chrList.txt fastaKey.txt
#!/bin/bash
#To read a file and store each line as variable in an array called "list"
IFS=$'\n' read -d '' -r -a list < "$1"
#loop through all elements
for i in "${list[@]}"
do
   echo "$i"
   sbatch submitSplitConcatFASTAbyChr.sh $2 "$i"
done


2. Pseudorefs to phasing input

You'll then need to get these pseudorefs into PLINK format, either for identifying Mendelian errors or for going straight to phasing. Before running, make the plink-formatted .fam input. If you have family data, the .fam file needs to be formatted correctly to reflect that.
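For reference, a plink .fam file has six whitespace-separated columns: family ID, individual ID, father ID, mother ID, sex (1 = male, 2 = female, 0 = unknown), and phenotype (-9/0 = missing). A trio plus an unrelated sample might look like this (hypothetical IDs):

```
FAM1  mother1  0        0        2  -9
FAM1  father1  0        0        1  -9
FAM1  child1   father1  mother1  0  -9
IND2  sample2  0        0        0  -9
```

Unrelated samples get 0 in the father and mother columns; only trios with real parent IDs are used for Mendelian error detection.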

ml py-biopython/1.70_py27
python /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/scripts/FASTAtoPlink_v5.py --fastaName allsamples_chrName.fa --famInput samples.fam --propMissing 1.0 --outputPrefix allSNPs_allsamples_chrName.fa

This will retain any SNP that differs between two genomes, even if all other genomes are missing data there. If you want to limit the amount of missing data, change the --propMissing flag.

If you have parent and offspring data you can identify Mendelian errors to exclude

plink --file allSNPs_allsamples_chrName.fa --mendel --out allSNPs_chrName --allow-extra-chr

From this output you'll need a list of unique Mendelian errors to exclude per chromosome

Rscript getUniqueMendelErrors.R chromosomeList.txt
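If the R script isn't handy, the core operation is just pulling the unique SNP IDs out of plink's .mendel output. A minimal sketch, assuming plink 1.9's .mendel column layout (FID, KID, CHR, SNP, CODE, ERROR, with the SNP ID in column 4 and one header line); getUniqueMendelErrors.R remains the canonical lab version:

```shell
# Extract unique SNP IDs (column 4, skipping the header) into a
# one-ID-per-line exclude list usable with plink --exclude.
awk 'NR > 1 {print $4}' allSNPs_chrName.mendel | sort -u > allSNPs_chrName.uniMendelSNPs
```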

Then you'll want to remove these errors from your ped files before phasing

plink --file allSNPs_allsamples_chrName.fa --exclude allSNPs_chrName.uniMendelSNPs --out allSNPs_allsamples_mendelRemoved_chrName --allow-extra-chr --recode ped

If you don't have family data you'll just need to reformat your ped before phasing

plink --file allSNPs_allsamples_chrName.fa --out allSNPs_allsamples_reformat_chrName --allow-extra-chr --recode ped


3. Phasing with shapeit2

To phase you'll need the plink-formatted files and recombination information. If no previous recombination map exists for this species, you can use a uniform recombination rate, given as --rho below. If you have a previous recombination map, you can lift it over using cactus for use with shapeit (see step 3b below).

/home/groups/schumer/shared_bin/shapeit.v2.904.2.6.32-696.18.7.el6.x86_64/bin/shapeit --input-ped allSNPs_allsamples_reformat_chrName.ped allSNPs_allsamples_reformat_chrName.map --rho 0.006 --output-max allSNPs_allsamples_chrName.shapeit2.phased.haps allSNPs_allsamples_chrName.shapeit2.phased.sample --duohmm -W 5 --output-graph allSNPs_allsamples_chrName.graph --force
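A quick sanity check on the phasing output can catch truncated runs early. This is a sketch assuming the standard shapeit .haps layout (five metadata columns — chromosome, SNP ID, position, allele 0, allele 1 — followed by two haplotype columns per sample):

```shell
HAPS=allSNPs_allsamples_chrName.shapeit2.phased.haps
# One line per phased SNP; two haplotype columns per individual
# after the five metadata columns.
echo "SNPs phased:          $(wc -l < "$HAPS")"
echo "Haplotypes (2/indiv): $(awk '{print NF - 5; exit}' "$HAPS")"
```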

4. Getting ancestral sequence

LDhelmet needs an ancestral sequence inference and a mutation matrix. You can get these with phyloFit and prequel. You'll need pseudoreferences for other Xiphophorus species mapped to your genome of interest, which usually already exist somewhere under /home/groups/schumer/shared_bin/shared_resources/

Again this is easiest to do by chromosome, so you should combine and split by chromosome. Then run phyloFit, which needs a newick-formatted tree passed to it. Below is an example with Xbir as the focal species, including a sampling of northern swordtails, platyfish, and one southern swordtail as an outgroup:

/home/groups/schumer/shared_bin/phast/bin/phyloFit --tree "((((((Xbir_chrName,Xmal_chrName),Xcor_chrName),((Xmon_chrName,Xcon_chrName),Xnez_chrName)),(Xnig_chrName,Xmul_chrName)),(Xvar_chrName,Xmac_chrName)),Xhel_chrName)" -o phyloFit_chrName allspecies_chrName.fa

Then prequel to get the ancestral sequence

/home/groups/schumer/shared_bin/phast/bin/prequel allspecies_chrName.fa phyloFit_chrName.mod ancestral

All of these steps can be done with this slurm submit script, submitAncSeq.sh, which runs one chromosome and needs a key to the pseudofastas: a tab-delimited file with a short species ID and the full pseudofasta name.

#!/bin/bash
#SBATCH --job-name=ancSeq
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=32000
#SBATCH -p schumer,hns,owners

CHR=$2

IFS=$'\n' read -d '' -r -a file_names < "$1"
for i in "${file_names[@]}"
do
   ID=$(echo "$i" | cut -f1)
   echo $ID
   FASTA=$(echo "$i" | cut -f2)
   perl /home/groups/schumer/shared_bin/Lab_shared_scripts/getScaffold_samtools.pl $FASTA $CHR > "$ID"_"$CHR".fa	
   sed -i "s|"$CHR"|"$ID"_"$CHR"|g" "$ID"_"$CHR".fa
   cat "$ID"_"$CHR".fa >> allspecies_"$CHR".fa
done

/home/groups/schumer/shared_bin/phast/bin/phyloFit --tree "((((((Xbir_"$CHR",Xmal_"$CHR"),Xcor_"$CHR"),((Xmon_"$CHR",Xcon_"$CHR"),Xnez_"$CHR")),(Xnig_"$CHR",Xmul_"$CHR")),(Xvar_"$CHR",Xmac_"$CHR")),Xhel_"$CHR")" -o phyloFit_"$CHR" allspecies_"$CHR".fa

/home/groups/schumer/shared_bin/phast/bin/prequel allspecies_"$CHR".fa phyloFit_"$CHR".mod ancestral


This can be submitted as a loop with this script

./runSubmitAncSeq.sh chromosomeList.txt
#!/bin/bash
IFS=$'\n' read -d '' -r -a list < "$1"
#loop through all elements
for i in "${list[@]}"
do
   echo "$i"
   sbatch submitAncSeq.sh allspecies_fastas.txt "$i"
done

You'll then have output files named ancestral.speciesName_chrName-Xhel_chrName.probs

Downstream scripts are written with Xhel as the outgroup, but this could be modified.

5. Preparing for LDhelmet

Your phased files need to be turned back into fasta-type files. This script will print a text file with the SNP positions, a fasta of just the phased SNPs, and a full-length chromosome file with the phased haplotypes. Any "offspring" individuals will not be printed, since they cannot be used for LDhelmet. If you have a list of unique Mendelian errors, it should also be passed in so those sites are masked (alternatively you could use seqtk mutfa to mask them upstream).


With a list of unique Mendelian errors

ml py-biopython/1.70_py27
python /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/scripts/shapeit_output_to_haplotypes_v4.py --inputFASTA allsamples_chrName.fa --shapeitInput allSNPs_allsamples_mendelRemoved_chrName.shapeit2.phased --mendelInput allSNPs_allsamples_chrName.uniMendelSNPs


Without a list of unique Mendelian errors

ml py-biopython/1.70_py27
python /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/shapeit_output_to_haplotypes_v4.py --inputFASTA allsamples_chrName.fa --shapeitInput allSNPs_allsamples_chrName.shapeit2.phased


Processing the ancestral sequences: this script creates several files that the last step of LDhelmet needs.

I. A SNP positions file in base-0 format. LDhelmet will break if scientific notation is used (e.g. 8e+05), so that is controlled for in the Rscript.

II. The ancestral probabilities at each SNP position. LDhelmet will break if any of these are 0 (the script turns any 0 into 0.0000001, but using Xhel as an outgroup also helps).

III. A mutation matrix file. This script creates one for a single chromosome. It also creates an output file that can be used to create a whole-genome mutation matrix once all chromosomes have been run. Finally, if a mutation matrix already exists for a previous version of the genome, that could be used with LDhelmet instead.
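For reference, the mutation matrix file LDhelmet's -m flag expects is a plain whitespace-delimited 4x4 table of mutation probabilities, with rows (ancestral base) and columns (derived base) in A, C, G, T order. Illustrative values only; use the matrix the R script produces:

```
0.0  0.2  0.6  0.2
0.2  0.0  0.2  0.6
0.6  0.2  0.0  0.2
0.2  0.6  0.2  0.0
```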


Be sure to use the speciesName that appears in the "ancestral.speciesName_chrName-Xhel_chrName.probs" files. This script defaults to using Xhel as the outgroup, but that could be modified. It also needs the shapeit haps output to get the SNP positions.

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/ancestral_priors_mutational_matrix_anySpecies_XhelOutgroup.R chrName speciesName


Once this has run for each chromosome you can make a whole genome mutation matrix

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/ancestral_mutational_matrix_combineChrs_anySpecies.R speciesName chromosomeList.txt


6. Running LD helmet

Once all this has been done you should be good to go! The submit script below runs all the steps to a final human-readable *post.txt for each chromosome. The first step can be run on the whole-genome haplotype file (as described in the LDhelmet manual), but it also seems to work fine passing the snpSeq fasta. The block penalty "smooths" the rjMCMC: lower block penalties allow more rate variation. We mostly use block50 but also produce a block5 version. Finally, change "uniqueIDforSpeciesGenomeVersion" to something that ties the output to the genome it was made for, since these are the final files to be kept.

sbatch submitLDhelmet_generic.sh chrName
#!/bin/bash
#SBATCH --job-name=LDhelmet
#SBATCH --time=48:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=64000
#SBATCH -p schumer,hns,owners

module load gsl/2.3
module load boost/1.64.0 

echo $1

#/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet find_confs --num_threads 10 -w 50 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1".conf haplotype_allsamples_"$1".fa

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet find_confs --num_threads 10 -w 50 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1".conf snpSeq_allsamples_"$1".fa

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet table_gen --num_threads 10 -c LD_map_uniqueIDforSpeciesGenomeVersion-"$1".conf -t 0.001 -r 0.0 0.01 1.0 1.0 10.0 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1".lk

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet pade --num_threads 10 -c LD_map_uniqueIDforSpeciesGenomeVersion-"$1".conf -t 0.001 -x 11 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1".pade

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet rjmcmc --num_threads 10 -w 50 -l LD_map_uniqueIDforSpeciesGenomeVersion-"$1".lk -p LD_map_uniqueIDforSpeciesGenomeVersion-"$1".pade -b 50.0 --snps_file snpSeq_allsamples_"$1".fa --pos_file snp_positions_base0_speciesName_"$1".txt -a ancestral_speciesName_"$1".probs_snps_probs.mod -m ancestral_speciesName_allChrs.probs_mutation_matrix --burn_in 100000 -n 1000000 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1"_block50.post

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet rjmcmc --num_threads 10 -w 50 -l LD_map_uniqueIDforSpeciesGenomeVersion-"$1".lk -p LD_map_uniqueIDforSpeciesGenomeVersion-"$1".pade -b 5.0 --snps_file snpSeq_allsamples_"$1".fa --pos_file snp_positions_base0_speciesName_"$1".txt -a ancestral_speciesName_"$1".probs_snps_probs.mod -m ancestral_speciesName_allChrs.probs_mutation_matrix --burn_in 100000 -n 1000000 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1"_block5.post

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet post_to_text -m -p 0.025 -p 0.50 -p 0.975 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1"_block50.post.txt LD_map_uniqueIDforSpeciesGenomeVersion-"$1"_block50.post

/home/groups/schumer/shared_bin/LDhelmet_v1.10/ldhelmet post_to_text -m -p 0.025 -p 0.50 -p 0.975 -o LD_map_uniqueIDforSpeciesGenomeVersion-"$1"_block5.post.txt LD_map_uniqueIDforSpeciesGenomeVersion-"$1"_block5.post


This workflow should account for the most common errors that will kill LDhelmet, but if you run into any, please contact Quinn.

All these scripts, including the slurm submit scripts and the loops to run them, should be here: /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/


3b. Using Cactus to lift over a previous rec map for phasing

If you want to use a previous rec map for the phasing step, you can align with cactus and then do a liftover of the previous LDhelmet rec map. Again this is faster one chromosome at a time, so the example below is for a single chromosome, assuming you've already pulled the chromosome sequence out of the fasta.

The input file is formatted like this

(Xbir10X_chr2,Xbir2023pacbio_chr2);
Xbir10X_chr2	Xbir10X_ScyDAA6-1196-HRSCAF-1406.fa
Xbir2023pacbio_chr2	Xbir2023pacbio_chr_02.fa

Then running cactus (and checking that the hal is valid)

ml python/3.9.0
virtualenv -p python3.9 /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env
echo "export PATH=/home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/bin:\$PATH" >> /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env/bin/activate
echo "export PYTHONPATH=/home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/lib:\$PYTHONPATH" >> /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env/bin/activate
source /home/groups/schumer/shared_bin/cactus/cactus-bin-v2.2.3/cactus_env/bin/activate

cactus ./jobstoreChr2 ./xbirChr2_cactusInput.txt ./xbirChr2.hal --realTimeLogging

halValidate xbirChr2.hal

halLiftover xbirChr2.hal Xbir10X_chr2 LD_map_xbirchmanni-ScyDAA6-1196-HRSCAF-1406.post.txt_mod.bed Xbir2023pacbio_chr2 LD_map_XbirPacbio_chr_02.bed --bedType 3
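For context, the *_mod.bed passed to halLiftover can be produced from the post_to_text output along these lines. This is a sketch under two assumptions: the post.txt data columns are left_snp, right_snp, mean rho/bp (as in the LDhelmet manual, with header/version lines to skip), and the BED chromosome column must match the sequence name inside the source assembly (the scaffold name here is the one from the 10X example above):

```shell
POST=LD_map_xbirchmanni-ScyDAA6-1196-HRSCAF-1406.post.txt
CHR=ScyDAA6-1196-HRSCAF-1406
# Keep only data rows (first field numeric) and emit
# chrom, start, end, mean rho/bp as a 4-column BED.
awk -v chr="$CHR" '$1 ~ /^[0-9]+$/ {print chr"\t"$1"\t"$2"\t"$3}' "$POST" > "${POST}_mod.bed"
```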


Then to convert to genetic distances using the 10X map.

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/scriptsForLDhelmet/convertLDtoGeneticDistance_cM_using10Xmap_anyGenome.R chr_02 Xbir2023pacbio xbir_10x-pacbio2023_chrs_key.txt