Schumer lab: Important simulation tools

From OpenWetWare
Jump to navigationJump to search

Coalescent simulations with macs

Introduction to macs

macs uses the same command line format at ms but is much faster due to how it tracks recombination.

A useful guide to command line in ms can be found here: File:Msdoc.pdf

For example, to simulate divergence between birchmanni and malinche (assuming the ancestral theta equals birchmanni present-day theta), I would use the following command:

/home/groups/schumer/shared_bin/macs 2 1000000 -I 2 1 1 0 -t 0.001 -r 0.0006 -ej 2 2 1

This simulates a 1 Mb sequence with theta of 0.001 per bp and rho of 0.0006 per bp. The time at which lineage 1 and 2 join is 2 (in units of 4 Ne) generations ago.

This outputs data in macs format, but to convert it to ms format you can pipe the results to the msformatter program:

/home/groups/schumer/shared_bin/macs 2 1000000 -I 2 1 1 0 -t 0.001 -r 0.0006 -ej 2 2 1 | /home/groups/schumer/shared_bin/msformatter

Note: to use msformatter on Sherlock you must first load boost

module load boost

Using macs to simulate sequences

macs output in certain formats can be convert nucleotide sequences using the program seq-gen. To convert the above command to nucleotide sequences using seq-gen, simply add the -T command to decode the output as phylogenetic trees. Then these trees can be used as input for seq-gen:


/home/groups/schumer/shared_bin/macs 2 1000000 -I 2 1 1 0 -T -t 0.001 -r 0.0006 -ej 2 2 1 | /home/groups/schumer/shared_bin/msformatter | grep '\[' > macs_treefile


count the number of trees generated by the simulation:

wc -l macs_treefile

then substitute this value for [PARTITIONS] in the below command line:

/home/groups/schumer/shared_bin/seq-gen -mHKY -l 1000000 -p [PARTITIONS] -s 0.001 -a 0.597 -f 0.305566 0.194762 0.194557 0.305115 <macs_treefile >macs_seqfile.phy

Note: the simulation here uses -s 0.001 because that is the theta simulated above, and also uses particular base composition and transition/transversion ratios. Make sure to adapt this to your simulations

seq-gen also has a useful option of simulating sequences given an "ancestral" sequence, which can be useful in many scenarios. See the seq-gen documentation here

Simulate reads from sequences

You can also simulate illumina reads from these [or any other sequences] once converted to fasta format:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/Phylip2Fasta.pl macs_seqfile.phy macs_seqfile.fa

For example, simulate 1000 paired-end 100 bp reads (see wgsim documentation here for options of SNP, indel, and error rates):

/home/groups/schumer/shared_bin/wgsim -N1000 -1100 -2100 -e0.01 -r0.0006 -R1 macs_seqfile.fa myread1.fq myread2.fq

Admixture simulations with admix'em

Admix'em is a very useful and flexible admixture simulator but is a lot slower than coalescent simulators. Admix'em can simulate arbitrary selection scenarios but is somewhat limited in its ability to simulate complex demographic scenarios. For complete documentation of Admix'em see here

Here I outline some common uses of admix'em in the lab and associated scripts.

Generate input marker and recombination files

There are two methods for generating the recombination and marker input files for admix'em:

1) Run admix'em interactively and follow the prompts to generate the desired number of markers and candidate sites for recombination events:

admixemp

Select Option 1 and follow the prompts


2) Use lab scripts to generate markers and recombination rates defined by your data:

2a) Generate markers based on a list of markers (e.g. from your AncestryHMM run):

perl convert_markerlist_to_admixem_markers.pl markers_list_one_per_line chromosome_lengths_file

The script expects that the markers file will be in the format chromosome:site and that there will be one marker per line. The chromosome lengths file should also have one entry per line with the lengths in order of the chromosomes in the marker input file

2b) Generate recombination file based on LDhelmet rates:

Rscript LDres_to_admixem_rec_map.R LDresmod_file out_file_name chr_length_bp_for_simulation chr_number_for_simulation

Run this script once per chromosome you'd like to simulate and combine the files, e.g:

cat group1_rec.txt group2_rec.txt group3_rec.txt > group1-3_recrates_sim.txt

res files for the X. birchmanni 10x map can be found on Sherlock in this directory and have the extension .post.txt_mod:

/home/groups/schumer/shared_bin/shared_resources/Xbirchmanni_LD_recombination_map_10xgenome_March2019/block_penalty_50_version


Example admix'em marker files, configuration files, and selection files can be found on Sherlock here:

/home/groups/schumer/shared_bin/admixem_example_files

Run admix'em

To run admix'em fill out all sections of the configuration file relevant to your simulation (again see documentation) and then enter:


admixemp myconfig.cfg

Post process simulations

To convert admix'em output into a format similar to ancestry tsv files you can run the following script:

1) Convert the admix'em markers to a header for your ancestry tsv file:

perl convert_admixem_markers_to_header.pl admixem_markers.txt admixed_sim_header.txt

2) Generate ancestry tsv files for your simulation using this header and the admixem results files of interest. These can be find inside your simulation folder and will have the file extension: _markers.txt

perl generate_msg_data_header_locisim.pl GenX_markers.txt outfile_name admixed_sim_header.txt

This will generate a genotypes format file that can be used with other common scripts

3) If needed for downstream applications, you can convert this genotypes file to ancestry tsv files:

perl genotypes_to_ancestrytsv.pl genotypes_file

Writes output files called infile_ancestry_par1.tsv and infile_ancestry_par2.tsv

Complete examples of workflows

Complete examples of these workflows including configuration files, selection files, and wrapper scripts to run the simulations and processing can be found here:


/home/groups/schumer/shared_bin/admixem_example_files

The two complete examples include a neutral simulation of early generation hybrids:

/home/groups/schumer/shared_bin/admixem_example_files/F2_simulations


and a simulation of selection agains hybrid incompatibilities in F2 hybrids:

/home/groups/schumer/shared_bin/admixem_example_files/simulate_selection_against_DMI


Simulations of one-locus selection are fairly straightforward and more details can be found in the admixem wiki here

There is a document in box that has a cheat sheet for simulating two-locus selection: convert_DMIs_to_admixem_code.xlsx


There is also an example here of a collection of scripts and admix'em files for generating and tracking pedigree information:

/home/groups/schumer/shared_bin/admixem_example_files/Pedigree_scripts_config

To run these admix'em simulations, simply run the shell script:

perl simulated_ancestry_pedigree_neutral.pl admixsimul_complex.cfg admix_simulation_complex [NUMBER OF SIMULATIONS] focal_genes_indivs header_markers_group1-24 pedigree-sims


There is also an example of a collection of scripts and admix'em files for generating time series data with selection:

/home/groups/schumer/shared_bin/admixem_example_files/time_series_admixem_scripts

To run these admix'em simulations, simply run the shell script:


perl simulated_ancestry_timeseries.pl cfg_file out_folder_name num_iterations generations_file indivs_file results_file_tag markers_file_header selection_coefficient

example:

perl simulated_ancestry_timeseries.pl admixsimul_test.cfg admix_simulation_complex 2 gens_file indivs_file results_sim_gen80 header_bir10x_group1-3 0.05

Simulations with SELAM and Macs integrated into hybrid genome and read simulator

We've developed a simulator to simulate hybrid genomic data with or without selection. This pipeline has a number of uses:

  • Evaluate the accuracy of local ancestry inference under a range of demographic scenarios (in hybrids, parentals or both)
  • Evaluate the impact of different data collection and analysis choices - number of reads, length of reads, error rates, number of parental haplotypes collected for defining ancestry informative markers, among others
  • Generate hybrid genomes and reads with or with out selection for other applications (proof of principle detecting selection, inferring hybrid recombination maps, etc)

On Sherlock the simulator folder can be found here:

/home/groups/schumer/shared_bin/Simulate_hybrid_genomes


1) Copy the configuration file:

cp /home/groups/schumer/shared_bin/Simulate_hybrid_genomes/hybrid_simulation_configuration.cfg hybrid_simulation_configuration_mix0.75_55gen.cfg


2) Edit the configuration file to input simulation-specific parameters using a text editor. See more information on the GitHub about each configurable parameter [here]


3) To simulate hybrid population demography, include a SELAM parameter file in the configuration file (Note: input a maximum of two parental species). Examples can be found in the Simulate_hybrid_genomes folder or in the SELAM documentation File:SELAM_manual.pdf


4) Set up a slurm script with dependencies and the simulation command:

#SBATCH --ntasks=1

#SBATCH --mem=32000

#SBATCH --cpus-per-task=1

#SBATCH -p schumer

#SBATCH --time=04:00:00

#SBATCH --job-name=sim-hybrid-genomes

#SBATCH --mail-user=youremail@stanford.edu


module load biology py-biopython

module load R

module load perl

module load gsl

module load boost

export PATH="/home/groups/schumer/shared_bin/lib:$PATH"

export PATH="/home/groups/schumer/shared_bin:$PATH"

export LD_LIBRARY_PATH="/home/groups/schumer/shared_bin/lib:$LD_LIBRARY_PATH"


perl simulate_admixed_genomes_v3.pl hybrid_simulation_configuration_mix0.75_55gen.cfg


5) Submit the job, e.g.:

sbatch simulate_job_submit_sherlock_v3.sh


Output files and next steps

The output files of the simulator will be the following:

parent 1 simulated genome:

macs_simulated_parent1.fa

parent 2 simulated genome:

macs_simulated_parent2.fa

read list file, e.g.:

simulated_hybrids_readlist_gen55_prop_par1_0.75

reads folder (including bed files with true ancestry), e.g.:

simulated_hybrids_reads_gen55_prop_par1_0.75

and simulated AIMs and counts:

simulated_AIMs_for_AncestryHMM

simulated_parental_counts_for_AncestryHMM


These files can be used as input to the AncestryHMM pipeline.


After the AncestryHMM pipeline finishes running, you can summarize accuracy relative to the true ancestry by running the following script:

perl /groups/schumer/shared_bin/Simulate_hybrid_genomes/post_hmm_accuracy_shell.pl ancestry-probs1-file ancestry-probs2-file simulation_folder_name path_to_simulator_install

This will generate a file called results_simulation_folder_name which summarizes accuracy by individual, tract, and ancestry type'

example:

perl /home/groups/schumer/shared_bin/Simulate_hybrid_genomes/post_hmm_accuracy_shell.pl ancestry-probs-par1_allchrs.tsv ancestry-probs-par2_allchrs.tsv simulated_hybrids_reads_gen45_prop_par1_0.75 /home/groups/schumer/shared_bin/Simulate_hybrid_genomes