Schumer lab: Commonly used workflows

From OpenWetWare
Jump to navigationJump to search


Please add more workflows and pipelines as you develop them!

Parsing Illumina data - Tn5 libraries

  • Note: this content is also in dropbox in the guide called "parsing_tn5_data.txt"


Before you can use Tn5 data, it needs to be parsed by both i5 and i7 barcode. Check out this helpful document from Patrick Reilly for more background File:IlluminaParsing v1-2.pdf

i5 barcode parsing

First, parse by i5 barcode:

1) generate a file with the plate id and i5 name \t the i5 sequence (see i5_indices.tsv or i5_indices_1-24_revcomp.tsv in Dropbox). For example:

i5-23_COACVI2018 GGCTCGAA

i5-2_CHAFV2018-ACUAV2018 TTGGCGTT

i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015 ATCAGCGC

i5-4_ACUAVI2015-CHAFXI2017-COACXI2017 TAAATAGG

  • Note: whether you need the forward or reverse complement i5 sequences depends on which Illumina sequencer you used - currently NextSeq and HiSeq 4000 need the reverse complement


2) Make sure to save this file as a windows formatted text file (or Tab delimited text)


3) copy the file to the server


4) convert the file to Unix text format

dos2unix [filename]

  • Note: if you look at your file and see ^M characters, run the following as well:

mac2unix [filename]


i7 barcode generation

  • Note: you need to follow these steps once for every plate

1) open "Tn5_i7s_convert_plate_to_barcodefile.xls" from Dropbox

2) paste your plate layout in the space indicated

3) copy columns A & B which are automatically generated and paste using into a new excel document by selecting:

Edit->paste special->Values

4) change the file format to "windows formatted text". Save the file name as Platename_i7_barcodes

  • for example "COAC-V1_2018_CHAF-V-V1_2018_Tn5_data_July2018_i7_barcodes.txt"

5) spot check several cells spread throughout the document to ensure that the right sample name, i7 id, and i7 sequence have been matched up

6) copy the file to the server

7) convert the file to unix text format

dos2unix [filename]

parse by i5 and i7 barcodes

1) Submit a slurm job to run parsing by i5 barcode. The usage of this job is:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i5 barcode file] [which position in FASTQ list is i5 index read]

example:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh 4 "TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_R1_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_R2_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_I1_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_I2_alllanes_combined.fastq.gz" 10 i5_library_COACVI2018_CHAFV2018_ACUAV2018_ACUAVI2015 4 1

  • Note: parsing by i5 is not necessary if you have only used a single i5 in the sequencing run, you can move directly to parsing by i7
  • Note: make sure the number of cores in your slurm job matches the number requested in the command line. For example, for the above an appropriate slurm header would be:

#!/bin/bash

#SBATCH --job-name=parse_i5

#SBATCH --time=24:00:00

#SBATCH --ntasks=10

#SBATCH --cpus-per-task=1

#SBATCH --mem=32000


2) After this job has finished, you can move on the parsing by i7 barcode. You will need to run one i7 job for *each* plate. For example, if the i5 barcode file had five lines, you will need to run five i7 parsing jobs.

If you look at the output of the i5 parsing you will notice that each parsed set generated three files, so in this next round of parsing you will have three files instead of four but otherwise the usage is the same:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i7 barcode file] [which position in FASTQ list is i7 index read]

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh 3 "i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_1.fastq.gz i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_2.fastq.gz i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_3.fastq.gz" 10 20180716_Tn5_library_ACUA_V_2018_CHAF_I_II_2018_ACUA_VI_2015_i7_barcodes.txt 3 1

Quick quality checks after parsing

1) Check the Part*logs files

  • ensure that there are few reads in the control wells (<0.01% in a good library)
  • check that coverage is relatively even among individuals (few individuals <0.3% or >2%)


2) Check duplication levels

  • check duplication levels with picardtools for a few individuals per run
  • acceptable levels are <20% (0.2 in picard output)

module load biology

module load bwa

bwa mem -t 3 -M xma_washu_4.4.2-jhp_0.1_combined-unplaced-mito.fa COACVI1801_i7-13_read_1.fastq.gz COACVI1801_i7-13_read_2.fastq.gz > COACVI1801.sam

java -jar /home/groups/schumer/shared_bin/SortSam.jar INPUT=COACVI1801.sam OUTPUT=COACVI1801.sorted.bam SORT_ORDER=coordinate

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=COACVI1801.sorted.bam

java -jar /home/groups/schumer/shared_bin/MarkDuplicates.jar I=COACVI1801.sorted.bam O=COACVI1801.dedup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=COACVI1801.metrics

Parsing Illumina data - Quail libraries or similar

Quail libraries are parsed in a similar way, but only have a single index so only need a single matching index file.

generate a file with the individual id and index

This files needs to have the individual id and i7/FC-1 name \t the i7/FC-1 sequence (see Quail_FC1_index_sequence.xls). For example:

FC1-9_HUICXI17M01 GATCAG

FC1-10_HUICXI17M04 TAGCTT

FC1-12_HUICXI17JM06 CTTGTA

FC1-13_HUICXI17M02 AGTCAA

FC1-14_HUICXI17M03 AGTTCC

FC1-15_HUICXI17M05 ATGTCA

FC1-16_HUICXI17JM07 CCGTCC

parse by i7 index

Because these libraries are only parsed by i7 barcode, you can drop the second index read in the parsing

Usage:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i7 barcode file] [which position in FASTQ list is i7 index read]

For example:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh 3 "Xcortezi_resequence_HUIC_sc_wt_November2018_S0_R1_allcombined.fastq.gz Xcortezi_resequence_HUIC_sc_wt_November2018_S0_R2_allcombined.fastq.gz Xcortezi_resequence_HUIC_sc_wt_November2018_S0_I1_allcombined.fastq.gz" 10 Xcortezi_spottedcaudal_November2018 3

Ancestry analysis with AncestryHMM

Local ancestry inference is one of the most common workflows in the lab. The steps we use to run local ancestry inference for birchmanni x malinche are outlined below and is also available in dropbox ("Shared_lab_resources/Common_commands_and_pipelines/Running_Ancestry_HMM_on_Sherlock.txt").

For first time use

1) Make a folder within your personal lab member folder to perform the analysis, e.g.:

mkdir Ancestry_HMM_runs

cd Ancestry_HMM_runs


2) make a copy of the sample configuration file for your specific analysis

cp hmm_configuration_file_sherlock.cfg hmm_configuration_file_runCHAF_samples2017.cfg


3) use a text editor or emacs to edit the parameters in this file (see parameter details in Dropbox guide, README, or at: https://github.com/Schumerlab/Ancestry_HMM_pipeline)


4) link to the genome assemblies and ancestry informative sites you plan to use, e.g:

ln -s /home/groups/schumer/shared_bin/shared_resources/xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta ./

ln -s /home/groups/schumer/shared_bin/shared_resources/Xmalinche_illumina-dovetail_assembly.fa ./

ln -s /home/groups/schumer/shared_bin/shared_resources/Xbirchmanni10xgenome_ancestry_informative_sites_filterF1

ln -s /home/groups/schumer/shared_bin/shared_resources/Xbirchmanni10xgenome_Xmalinche_observed_parental_counts_filterF1


5) load required packages and modules:

module load armadillo

module load biology

module load samtools

module load bcftools

module load py-pysam/0.14.1_py27

module load bwa

module load boost

module load R

export PATH="/home/groups/schumer/shared_bin/ngsutils/bin:$PATH"

export PATH="/home/groups/schumer/shared_bin/Ancestry_HMM/src:$PATH"

export PYTHONPATH=/home/groups/schumer/shared_bin:$PYTHONPATH


6) make a batch script for submission, e.g.:

#!/bin/bash

#SBATCH --job-name=run_hmm

#SBATCH --time=01:30:00

#SBATCH --ntasks=1

#SBATCH --cpus-per-task=1

#SBATCH --mem=32000

#SBATCH --mail-user=youremail@stanford.edu

perl Ancestry_HMM_parallel_v5.pl hmm_configuration_file_runCHAF_samples2017.cfg


you can also include the module/export commands in your slurm script after the #SBATCH lines

See example here:

/home/groups/schumer/shared_bin/Ancestry_HMM_pipeline/run_hmm.sh


7) submit your job, e.g.:

sbatch run_hmm.sh

For subsequent use

1) edit your configuration files and job submit script as needed


2) export dependencies or make sure they are present in your slurm submission file

module load armadillo

module load biology

module load samtools

module load bcftools

module load py-pysam/0.14.1_py27

module load bwa

export PATH="/home/groups/schumer/shared_bin/ngsutils/bin:$PATH"

export PATH="/home/groups/schumer/shared_bin/Ancestry_HMM/src:$PATH"

export PYTHONPATH=/home/groups/schumer/shared_bin:$PYTHONPATH


3) submit your job, e.g.:

sbatch run_hmm.sh

Simulating hybrid genomes with Simulate_hybrid_genomes

It's often useful to simulate admixed genomes to evaluate efficacy of ancestry calling under different scenarios.


1) For first time use:

module load perl

cpan Math::Random

Make sure the following programs are in your path:

fastahack

seqtk

wgsim

bedtools2

They are all available in:

/home/groups/schumer/shared_bin

You can either export these each time you use them:

export PATH="/home/groups/schumer/shared_bin:$PATH"

or put this export command in your .bashrc file

make a directory for the analysis:

mkdir Simulate_hybrid_genomes_runs

cd Simulate_hybrid_genomes_runs

ln -s /home/groups/schumer/shared_bin/Simulate_hybrid_genomes/* ./

cp hybrid_simulation_configuration.cfg myrun_hybrid_simulation_configuration.cfg


2) After step 1 or for subsequent use:

Export dependencies:

export PATH="/home/groups/schumer/shared_bin:$PATH"

Edit the configuration file:

emacs myrun_hybrid_simulation_configuration.cfg


3) Run the simulation

perl simulate_admixed_genomes_v2.pl myrun_hybrid_simulation_configuration.cfg


4) You can use these simulated reads as input into the Ancestry_HMM_pipeline


5) Afterwards you can summarize accuracy with the following script (part of the Simulate_hybrid_genomes GitHub):

perl post_hmm_accuracy_shell.pl ancestry-probs-par1_file ancestry-probs_par2_file path_to_simulation_reads_folder

Ancestry tsv files to admixture mapping input

1) After running AncestryHMM, convert your ancestry tsv files to hard calls using parsetsv_to_genotypes_Dec2017_v2.pl

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file genotypes_outputfile_name


1a) It's often a good idea to filter your tsv files to remove markers violating hardy-weingberg equilibrium. This script can be used to do that. This script is set up to run this filter before you convert to genotypes format:

perl determine_HWE_filter_markers.pl ancestry-par1 ancestry-par2 bonferonni_pval_thresh path_to:transpose_nameout.pl

Writes a file named infile_deviating_markers with deviating markers from HWE, and filtered ancestry tsv files: ancestry-par1_HWE.tsv ancestry-par2_HWE.tsv


1b) Optionally filter your genotypes file with a script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file with genotypes_file.identicalfilter.txt


1c) Another option for filtering is to filter based on physical distance rather than redundancy. You may lose more information this way but it is unbiased with respect to errors. Be careful to chose a reasonable threshold based on admixture LD decay in the specific population you are looking at.

perl thin_genotypes_by_physical_distance.pl genotypes_file distance_thresh_bp path_to:transpose_nameout.pl

Writes an output file named genotypes_file_thinned_physical_dist.txt


2) Generate a hybrid index file for these individuals with mixture proportions for each individual

perl parsetsv_ancestry_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file > hybrid_index_file


3) Match your previously generated phenotypes file with genotypes and hybrid indices for each individual

perl match_phenotypes_names_with_genotypes_and_index_file.pl phenotypes_file_name genotypes_file hybrid_index_file

This will generate output files appended with _matched_to_genotypes _matched_to_phenotypes _matched_to_hybrid_index


4) Check file output carefully to make sure each has the appropriate number of lines and that individuals are listed in the same order in each file


5) By default the admixture mapping scripts test the likelihood of a model of phenotype~hybrid_index versus a model of phenotype~genotype_focal_site + hybrid_index. If you want to include more covariates (i.e. body size) you will need to modify the script.

For binomial traits (0/1) Rscript perform_glm_admixture_mapping_v2_binomialtrait.R genotypes_file hybrid_index_file phenotypes_file focal_column_number outfile_name_tag

For continuous traits that follow a gaussian distribution Rscript perform_glm_admixture_mapping_v2_gaussian.R genotypes_file hybrid_index_file phenotypes_file focal_column_number outfile_name_tag

The outfile name will be: genotypes_file_results_gaussian_v2_outfile_name_tag


6) To plot your results (as long as you used the X. birchmanni 10x genome), convert the file format:

perl convert_birchmanni10x_mapping_output_manhattan_plot_input.pl genotypes_file_results_gaussian_v2_outfile_name_tag


7) Then you can make a nice Rplot of these results using the adapted_qqman.R script

in R, source the adapted_qqman.R script:

source("adapted_qqman.R")

Load the data in R:

data<-read.csv(file="genotypes_file_results_gaussian_v2_outfile_name_tag",sep="\t",head=FALSE)

Reformat the data:

data_trim<-{}

data_trim$CHR<-data$chrom

data_trim$BP<-data$marker

data_trim$P<- data$likelihood.diff

data_trim<-as.data.frame(data_trim)

data_trim<-na.omit(data_trim)

Plot the data, e.g.:

manhattan(data_trim,genomewideline=8,suggestiveline=6)

Ancestry tsv files to rQTL input

1) After running AncestryHMM, convert your ancestry tsv files to hard calls using parsetsv_to_genotypes_Dec2017_v2.pl

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file genotypes_outputfile_name


1a) It's often a good idea to filter your tsv files to remove markers violating hardy-weingberg equilibrium. This script can be used to do that. This script is set up to run this filter before you convert to genotypes format:

perl determine_HWE_filter_markers.pl ancestry-par1 ancestry-par2 bonferonni_pval_thresh path_to:transpose_nameout.pl

Writes a file named infile_deviating_markers with deviating markers from HWE, and filtered ancestry tsv files: ancestry-par1_HWE.tsv ancestry-par2_HWE.tsv


2) Optionally filter your genotypes file with a script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file with genotypes_file.identicalfilter.txt'


2b) Another option for filtering is to filter based on physical distance rather than redundancy. You may lose more information this way but it is unbiased with respect to errors:

perl thin_genotypes_by_physical_distance.pl genotypes_file distance_thresh_bp path_to:transpose_nameout.pl

Writes an output file named genotypes_file_thinned_physical_dist.txt


3) Generate a hybrid index file for these individuals with mixture proportions for each individual

perl parsetsv_ancestry_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file > hybrid_index_file


4) Match your previously generated phenotypes file with genotypes and hybrid indices for each individual

perl match_phenotypes_names_with_genotypes_and_index_file.pl phenotypes_file_name genotypes_file hybrid_index_file

This will generate output files appended with _matched_to_genotypes _matched_to_phenotypes _matched_to_hybrid_index


5) Check file output carefully to make sure each has the appropriate number of lines and that individuals are listed in the same order in each file


6) Convert the genotypes to rQTL format

perl genotypes_to_rqtl_April2019_v4.pl genotypes_file.identicalfilter.txt_matched_to_phenos phenotypes_file_phenos_second_column

writes an output file named: infile.rqtl.csv


7) Now you're ready to run rQTL! See rQTL documentation for details: http://www.rqtl.org/manual/qtl-manual.pdf

Use the following structure to read in this data format properly:

data<-read.cross("csv",dir="./","infile.rqtl.csv",na.strings=c("NA"),estimate.map=FALSE)

Ancestry tsv files to average ancestry in windows

1) Generate an file with average ancestry at every site that has greater than a user-defined threshold number of individuals and user-defined posterior probability cutoff (recommended >=0.9)

perl calculate_avg_ancestry_by_site_tsv_files_v5.pl infilepar1 infilepar2 num_ind_thresh posterior_prob_thresh path_to:transpose_nameout.pl > outfile


1a) It's often a good idea to filter your tsv files to remove markers violating hardy-weingberg equilibrium. This script can be used to do that. This script is set up to run this filter before you convert to genotypes format:

perl determine_HWE_filter_markers.pl ancestry-par1 ancestry-par2 bonferonni_pval_thresh path_to:transpose_nameout.pl

Writes a file named infile_deviating_markers with deviating markers from HWE, and filtered ancestry tsv files: ancestry-par1_HWE.tsv ancestry-par2_HWE.tsv


2) Summarize average ancestry in windows based on the intervals in a bed file:

Rscript average_ancestry_bybedintervals.R infile bins.bed chr_name

The script will write an output file called average_infile_ancestry_bins.bed


2a) If you want to summarize aver ancestry in windows for all chromosomes in the bed file:

Rscript average_ancestry_bybedintervals_wholeGenome.R infile bins.bed

The script will write an output file called average_infile_ancestry_bins_WG

Thin Ancestry tsv files by genetic distance

1) After running AncestryHMM, extract the marker names:

head -n 1 ancestry-probs-par1_allchrs.tsv > my_markers

transpose these markers and reformat:

perl transpose_nameout.pl my_markers

perl -pi -e 's/:/\t/g' my_markers_transposed


2) Thin the markers from the X. birchmanni 10x assembly to the desired genetic distance. It's a good idea to also impose a physical distance threshold because of inaccuracies in the recombination map at a fine scale:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R my_markers_transposed cMthresh physical_dist_thresh

This script will write an output file called infile_cMdistances_thinned_distancethresh

Note that there may be greater distance between markers if the region is marker poor, this will be recorded in the output file


3) Select these markers from your tsv file

cut -f 1,2 infile_cMdistances_thinned_distancethresh | perl -p -e 's/\t/:/g' > my_thinned_markers_reformat

perl select_passing_markers_multi_geno_files_v2.pl my_thinned_markers_reformat ancestry-probs-par1_allchrs.tsv path_to:transpose_nameout.pl outfile_name_par1

perl select_passing_markers_multi_geno_files_v2.pl my_thinned_markers_reformat ancestry-probs-par2_allchrs.tsv path_to:transpose_nameout.pl outfile_name_par2

Ancestry tsv to identifying ancestry transitions

1) After running AncestryHMM, convert your ancestry tsv files to hard calls using parsetsv_to_genotypes_Dec2017_v2.pl

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file genotypes_outputfile_name


2) Run intervals script:

Rscript identify_intervals_10x_genomes.R genotypes_outputfile_name path_to:transpose_nameout.pl

Writes output to: genotypes_outputfile_name_focal_chr, with one file for each chromosome

Ancestry tsv files to admixture LD decay

We often want to quantify the change in admixture LD over genetic or physical distance. This is important for dating hybrid populations, understanding what regions are independent from each other in hybrid populations, selection, and much more.

To quantify admixture LD, first convert genotypes files to plink input files. It's a good idea to thin the data first either by uniqueness or by physical or genetic distance.

1) Convert genotypes files to plink input files

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/convert_msg_genotypes_to_plink.pl genotypes_ACUA_2018.txt.identical_filter.txt


1a) You can filter your genotypes files to generate a less redundant input dataset. Either the filter_identical_columns_threshold.pl or thin_genotypes_by_physical_distance.pl will work. I recommend thinning by physical distance for this application:

perl thin_genotypes_by_physical_distance.pl genotypes_file distance_thresh_bp path_to:transpose_nameout.pl

Writes an output file named genotypes_file_thinned_physical_dist.txt

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file with genotypes_file.identicalfilter.txt'


2a) You can filter your data with plink before quantifying LD. This command excludes AIMs that are missing in 50% or more of individuals and that are >98% one ancestry or another

/home/groups/schumer/shared_bin/plink --file genotypes_ACUA_2018.txt.identical_filter.txt --recode --tab --geno 0.5 --maf 0.02 --allow-extra-chr --out genotypes_ACUA_2018.txt.identical_filter.txt.trim


2b) Run plink to quantify admixture LD. The downstream scripts assume that you are outputting R2 and D so modify accordingly if needed

/home/groups/schumer/shared_bin/plink --file genotypes_ACUA_2018.txt.identical_filter.txt.trim --ld-window-kb 3000 --ld-window 20000 --r2 d --hardy --hwe 0.00001 --out ACUA --ld-window-r2 0 --allow-extra-chr

See plink documentation for details on these parameters


3a) Convert plink physical distance into genetic distance based on the X. birchmanni 10x genome (do not use this script for other genome assembly versions):

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/convert_plink_LD_genetic_distance_xbir10x.R ACUA.ld ScyDAA6-2-HRSCAF-26

This will output a file called ACUA.ld_ScyDAA6-2-HRSCAF-26_cMdistances


3b) If you want a genome-wide picture combine files from multiple chromosomes. e.g.:

cat ACUA.ld_ScyDAA6-*_cMdistances | grep -v totalcM > ACUA.ld_allscaffolds_combined_cMdistances

Note that to use the downstream scripts you'll need to add back in the original header


4) To quantify admixture decay over genetic distance in bins of N cM:

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/convert_cM_version_plink_admixture_decay.R ACUA.ld_allscaffolds_combined_cMdistances 0.2

Will produce a file called: ACUA.ld_allscaffolds_cMdistances_admixture_ld_decay_cMdist


5) Use the decay in admixture LD over genetic distance to estimate the time of initial admixture. Note that there are many assumptions that go into this estimate, including large population sizes, a single pulse of migration, etc, so interpret with caution:


Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/estimateage_plink_admixture_decay_v2.R ACUA.ld_allscaffolds_cMdistances_admixture_ld_decay_cMdist 0.1


The last argument here is for excluding distances smaller than mincM - here 0.1 - since the recombination map is not as reliable at small scales


This command will produce a pdf plotting the admixture decay named ACUA.ld_allscaffolds_cMdistances_admixture_ld_decay_cMdist_lddecay.pdf and will print the estimated mixture time and standard deviation to the screen


a is the coefficient of the model, b is the estimated admixture time, and c is the LD decay asymptote

Merging multiple ancestry tsv files run separately

  • Warning!!! it is only appropriate to do this if the runs were performed with the exact same parameters in the cfg file
  • It is always better to run all individuals through the HMM together if possible, but sometimes this is not practically for very large numbers of individuals


The example shown here merges three ancestry tsv files, but can be done for any number:


1) generate lists markers shared between all files

head -n 1 ancestry-probs-par1_allchrs_batch1.tsv > marker_list_batch1

head -n 1 ancestry-probs-par1_allchrs_batch2.tsv > marker_list_batch2

head -n 1 ancestry-probs-par1_allchrs_batch3.tsv > marker_list_batch3

transpose files so there is one marker per line:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/transpose_nameout.pl marker_list_batch1

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/transpose_nameout.pl marker_list_batch2

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/transpose_nameout.pl marker_list_batch3

this will output files named e.g. marker_list_batch1_transposed


2) make a list of marker files:

ls marker_list_batch*_transposed > all_markers_list


3) identify markers that are covered in all tsv files:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/generate_list_of_passing_markers_multi_geno_files.pl all_markers_list

this will write an output file with passing_markers_FILENAME, i.e. for this example: passing_markers_all_markers_list


4) select these markers from your tsv files:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par1_allchrs_batch1.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par1_allchrs_batch1_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par2_allchrs_batch1.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par2_allchrs_batch1_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par1_allchrs_batch2.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par1_allchrs_batch2_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par2_allchrs_batch2.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par2_allchrs_batch2_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par1_allchrs_batch3.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par1_allchrs_batch3_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par2_allchrs_batch3.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par2_allchrs_batch3_selectshared.tsv


5) merge these outfiles:

  • Warning!! carefully check all file names when doing this merging, it is easy to make mistakes such as swapping par1 and par2 files which will have serious consequences


We need to retain the header from one file of each the parent 1 and parent 2 files, so we will leave these two files unmodified:

ancestry-probs-par1_allchrs_batch1_selectshared.tsv

ancestry-probs-par2_allchrs_batch1_selectshared.tsv

For the other files we need to trim the header:

tail -n +2 ancestry-probs-par1_allchrs_batch2_selectshared.tsv > ancestry-probs-par1_allchrs_batch2_selectshared_noheader.tsv

tail -n +2 ancestry-probs-par1_allchrs_batch2_selectshared.tsv > ancestry-probs-par2_allchrs_batch2_selectshared_noheader.tsv

tail -n +2 ancestry-probs-par1_allchrs_batch3_selectshared.tsv > ancestry-probs-par1_allchrs_batch3_selectshared_noheader.tsv

tail -n +2 ancestry-probs-par1_allchrs_batch3_selectshared.tsv > ancestry-probs-par2_allchrs_batch3_selectshared_noheader.tsv

Combine all files for parent 1 and parent 2:

cat ancestry-probs-par1_allchrs_batch1_selectshared.tsv ancestry-probs-par1_allchrs_batch2_selectshared_noheader.tsv ancestry-probs-par1_allchrs_batch3_selectshared_noheader.tsv > ancestry-probs-par1_allchrs_combinedindividuals_sharedmarkers.tsv

cat ancestry-probs-par2_allchrs_batch1_selectshared.tsv ancestry-probs-par2_allchrs_batch2_selectshared_noheader.tsv ancestry-probs-par2_allchrs_batch3_selectshared_noheader.tsv > ancestry-probs-par2_allchrs_combinedindividuals_sharedmarkers.tsv

Mapping and variant calling with GATK

One of the most common workflows in the lab is moving from fastq.gz files to vcf files. This involves a large number of steps and can differ between programs used for mapping and variant calling. Below we outline the most commonly used steps in our lab and two shell scripts that can be used to run them in an automated way:


0) before you start, index your reference genome bwa index, gatk, and samtools. This only needs to be done the first time you run this analysis in a particular directory.

#load modules

module load biology

module load bwa

module load samtools

#index

bwa index ref_genome.fa

java -jar /home/groups/schumer/shared_bin/CreateSequenceDictionary.jar R=ref_genome.fa O=ref_genome.dict

samtools faidx ref_genome.fa


1) mapping reads to the reference genome:

module load biology

module load bwa

bwa mem -t 3 -M -R '@RG\tID:id\tSM:quail_libraryprep\tPL:illumina\tLB:lib1\tPU:illuminaHiSeq' ref_genome.fa read_1.fq.gz read_2.fq.gz > file.sam


2) process and sort sam file

java -jar /home/groups/schumer/shared_bin/SortSam.jar INPUT=file.sam OUTPUT=file.sorted.bam SORT_ORDER=coordinate

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=file.sorted.bam


3) mark read duplicates in the bam file

java -jar /home/groups/schumer/shared_bin/MarkDuplicates.jar INPUT=file.sorted.bam OUTPUT=file.sorted.dedup.bam METRICS_FILE=file.sorted.metrics

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=file.sorted.dedup.bam


4) re-align reads around indels

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref_genome.fa -I file.sorted.dedup.bam -o file.sorted.dedup.bam.list

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T IndelRealigner -R ref_genome.fa -I file.sorted.dedup.bam -targetIntervals file.sorted.dedup.bam.list -o file.sorted.dedup.realigned.bam


5) perform variant calling

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref_genome.fa -I file.sorted.dedup.realigned.bam --genotyping_mode DISCOVERY -L chromosome_targets.list -stand_emit_conf 10 -stand_call_conf 30 -ERC GVCF -o file.sorted.dedup.realigned.bam.rawvariants.g.vcf

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref_genome.fa --variant file.sorted.dedup.realigned.bam.rawvariants.g.vcf --sample_ploidy 2 --max_alternate_alleles 4 --includeNonVariantSites --standard_min_confidence_threshold_for_calling 30 -o file.sorted.dedup.realigned.bam.g.vcf


We have two shell scripts that will run these steps for lists of reads.

The first runs the mapping, sorting, de-duplicates, and realigns indels:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_bwa_jobs_list.pl list_of_fastq_files PE_or_SE slurm_submission_file path_to_picard_tools_and_gatk_jars genome_to_map_to file_name_tag

Example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_bwa_jobs_list.pl combined_read_list_coac_ld_map PE /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_slurm_bwa.sh /home/groups/schumer/shared_bin xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta xbir-10x

The reads file should be in the format: id1_read1.fq.gz\tid1_read2.fq.gz\n


The second shell script, which can be run after the first finishes, runs the actual variant calling steps (see Submitting slurm jobs for tips). It requires a list of sorted, realigned bam files and a chromosome targets list, among other parameters. This can include all chromosomes but splitting into several batches is helpful for running in parallel.

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_gatk-indel-hc_jobs_list.pl list_of_sorted_realigned_bam_files slurm_submission_file_to_use absolute_path_to_GATK_jars genome_assembly_used vcf_target_chromosome_list output_tag

Example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_gatk-indel-hc_jobs_list.pl xmac_mapped_bam_list /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_gatk-indel-hc.sh /home/groups/schumer/shared_bin/ xma_washu_4.4.2-jhp_0.1_combined-unplaced-mito.fa xmac_targets.group1.list group1

Case/Control GWAS from low coverage data

Genome-wide association studies (GWAS) are a common way to identify the genetic basis of polymorphisms that exist within a single (unstructured) population. For simple (ie binary) traits that can be classified as a "case" or as a "control" (eg spotted caudal vs wildtype), it is possible to identify SNPs associated with one trait or the other from low coverage data, given a large enough sample size.

Because there are multiple steps involved, the lab has a number of shell scripts that simplify the process:


1) Getting set up


1a) make a reads list for input in the next step. The format should be the entire path to the file (get this with the "pwd" command) and the file name. For PE reads, this will have two columns, one for read1 and one for read2. This file should be tab delimited:


tail -n 5 gwas_higher_coverage_indivs.txt

/path/to/file/COAC_98SC_J_combined_read_1.fastq.gz /path/to/file/COAC_98SC_J_combined_read_2.fastq.gz

/path/to/file/COAC_98wt_M_combined_read_1.fastq.gz /path/to/file/COAC_98wt_M_combined_read_2.fastq.gz

/path/to/file/COAC_99SC_J_combined_read_1.fastq.gz /path/to/file/COAC_99SC_J_combined_read_2.fastq.gz

/path/to/file/COAC_99wt_M_combined_read_1.fastq.gz /path/to/file/COAC_99wt_M_combined_read_2.fastq.gz

/path/to/file/COAC_9SC_J_combined_read_1.fastq.gz /path/to/file/COAC_9SC_J_combined_read_2.fastq.gz


generate a case control list which is simply a list of 0s and 1s, in the same order and corresponding to the reads list. For example, for the part of the reads list posted above, the case control list for a SC vs WT GWAS would be:


tail -n 5 case_control_status.txt

1

0

1

0

1


1b) the relevant scrips are all stored in /home/groups/schumer/shared_bin/Lab_shared_scripts/

run_samtools_only_parental_v2.pl

run_mpileup_bcftools_GWAS.pl

print_alleles_depth_freq_chi_per_site_GWAS.pl

merge_files_using_two_columns_sharing_values.pl #optional: copy these to your working directory


1c) Load appropriate modules. This is a list of the one's you might need:

module load biology

module load bwa

module load samtools

module load bcftools

module load java

1d) If you haven't already done so, index your reference genome with bwa index, gatk, and samtools. This only needs to be done the first time you run this analysis in a particular directory.

bwa index ref_genome.fa

java -jar /home/groups/schumer/shared_bin/CreateSequenceDictionary.jar R=ref_genome.fa O=ref_genome.dict

samtools faidx ref_genome.fa


2) generate sam file of reads mapped to reference genome (most data in lab is PE [paired end]). reformat sam file to bam file, sort file, remove duplicates, and realigns around indels:

usage: perl run_samtools_only_parental_v2.pl read_list genome SE_or_PE

example: perl run_samtools_only_parental_v2.pl gwas_higher_coverage_indivs.txt ref_genome.fasta PE


3) run samtools mpileup by scaffold (increases speed of run by parallelizing by scaffold). generates a .vcf file

usage: perl run_samtools_multi_indiv_parental.pl read_list case_control_status genome1 samtools_legacy_path focal_scaff

example: perl run_samtools_multi_indiv_parental.pl gwas_higher_coverage_indivs.txt case_control_status.txt ref_genome.fasta /home/groups/schumer/shared_bin/ ScyDAA6-1508-HRSCAF-1794


4) reformats .vcf output for each chromosome into a more easily readable summary output

usage: perl print_alleles_depth_freq_chi_per_site_GWAS.pl mpileup_file_scaff.vcf > mpileup_file_scaff.vcf.summary

example: perl print_alleles_depth_freq_chi_per_site_GWAS.pl gwas_higher_coverage_indivs.txtindivs.allindiv.ScyDAA6-1508-HRSCAF-1794.vcf > gwas_higher_coverage_indivs.txtindivs.allindiv.ScyDAA6-1508-HRSCAF-1794.vcf.summary

4a) check if files are the right size. sometimes step 4a doesn't work: ls -sh *summary


5) combine the summary file for each scaffold into single file containing all scaffolds. precise steps will depend based on analysis. gwas_higher_coverage_indivs.txtindivs.allindiv.allchroms.vcf.summary


6) *optional step*: only keep known, high confidence SNPs

usage: perl merge_files_using_two_columns_sharing_values.pl file1 0 1 file2 0 1

example: perl merge_files_using_two_columns_sharing_values.pl COAC_population_confirmed_snps_Xbirchmanni_10X_genome_flipped_sorted.bed 0 1 gwas_higher_coverage_indivs.txtindivs.allindiv.allchroms.vcf.summary 0 1


7) convert scaffold names to chromosome numbers. there are currently perl scripts for the xbir_10x, xbir_pacbio, and xvar_scaff genomes. output is a file named *_formanhattan with numbers in column 1.

usage: perl convert_birchmanni10x_mapping_output_manhattan_plot_input.pl summary_file

example: perl convert_birchmanni10x_mapping_output_manhattan_plot_input.pl gwas_higher_coverage_indivs.txtindivs.allindiv.allchroms.vcf.summary


You are now ready to plot the output! The significance line will vary and can be determined using permutation, but for the X. birchmanni COAC population, it's in the ballpark of 6.5.

g.vcf files to Admixtools input

Admixtools is a program developed by the Reich lab that can perform a large number of analyses [[1]], including calculating D-statistics for four populations.

The following is a workflow for converting from .g.vcf files generated by GATK to input for Admixtools.


1) for each .g.vcf file, generate an insnp file with masked and variant basepairs. See documentation of insnp_v9_gatk3.4_gvcf.py for hard-call filter options.

python /home/groups/schumer/shared_bin/insnp_v9_gatk3.4_gvcf.py file.g.vcf file.g.vcf.insnp 20 5 40 10 10 4 -12.5 -8.0 5


2) make a list of all insnp files you would like summarized for analysis using Admixtools, one on each line


3) using this file, run the following script:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/join_insnp_files_to_admixtools_v5.pl insnp_list outfile_name_tag 1 5 500 /home/groups/schumer/shared_bin/Lab_shared_scripts

where insnp_list is the list generated in step 2, 1 indicates that the reference individual should be printed [0 to omit], 5 indicates that a site will be excluded if more than 5 individuals have missing/masked basepairs, 500 indicates the physical distance over which to thin adjacent markers, and /home/groups/schumer/shared_bin/ is the path to the dependency script: overlap_list_retain_unmatched-for-join-insnp.pl

This will generate three output files that can be used as input for Admixtools:

outfile_name_tag.geno outfile_name_tag.snp outfile_name_tag.ind


4) edit outfile_name_tag.ind to contain the species names of interest in the third column

5) Admixtools complains if the individual names are too long and if the chromosome names are not chr1, chr2, etc. This might need modification depending on the analysis being run. A quick fix is a perl regex find and replace:

perl -pi -e 's/group1/chr1/g' outfile_name_tag.snp

Batch script to generate insnp files

If you have a lot of g.vcf files to process, it's easier to use a batch script we have in the shared folder to submit all of the inns jobs.

1) make a list of the gvcf files you want to process, if they aren't in the same folder you are working in make sure to include the path


2) run the shell script

usage is: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_insnp_jobs_list.pl vcf_list path_to_dependency_scripts submission_header.sh

example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_insnp_jobs_list.pl COAC_vcf_list /home/groups/schumer/shared_bin/Lab_shared_scripts /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_slurm_insnp.sh

g.vcf files to ped input for use with PLINK

This follows steps 1-5 of the previous section g.vcf files to Admixtools input, then uses the convertf package from Admixtools to convert to ped format:


6) make a parameter file for convertf

cp /home/groups/schumer/shared_bin/AdmixTools/convertf/par.EIGENSTRAT.PED parLDmapbir_eigen_to_ped

Use a text editor to edit the input parameters to match your file names (which are in eigenstrat format):

genotypename: example.eigenstratgeno

snpname: example.snp

indivname: example.ind

and specify the names of the output file for plink:

genotypeoutname: example.ped

snpoutname: example.pedsnp

indivoutname: example.pedind


7) Load the modules required for Admixtools and run the convertf program

module load openblas

module load gsl

/home/groups/schumer/shared_bin/AdmixTools/bin/convertf -p parLDmapbir_eigen_to_ped

D-statistic analysis with Admixtools

Once you have .geno, .snp, and .ind files, you can run D-statistic analysis with Admixtools

1) Generate a parameter file for Admixtools by copying the example file:

cp /home/groups/schumer/shared_bin/AdmixTools/examples/parqpDstat parameter_file_forDstat


2) Edit the parameter file lines:

DIR: ../data/ #path to the directory where your files are, can be ./

SSS: allmap #if your files have a common prefix you can put it here

indivname: DIR/SSS.ind #the path and name of your .ind file, this string corresponds to a file ../data/allmap.ind

snpname: DIR/SSS.snp #the path and name of your snp file

genotypename: DIR/SSS.geno #the path and name of your geno file

poplistname: list_qpDstat #the list of populations you are looking at, one on each line


3) Load the required modules and run Admixtools:

module load openblas

module load gsl

/home/groups/schumer/shared_bin/AdmixTools/bin/qpDstat -p parameter_file_forDstat

F4 ratio tests from fasta files

We have a shell script to convert a list of fasta files to input compatible with admixtools. These scripts and example input files can be found in the folder:

F4_ratio_window_analysis_fasta_alignments_v2

This folder contains all the scripts needed to run F4 ratios in windows based on aligned fasta. The shell script to run these steps is:

run_eigenstrat_to_F4_fasta_alignment_v4.pl

Usage of this script is:

perl run_eigenstrat_to_F4_fasta_alignment_v4.pl aligned_fastas.fa individual_file_for_convertf snp_window_size

See test.indiv for an example of the individual_file_for_convertf and test.fa for an example of the aligned_fastas.fa file

g.vcf files to pseudo-fasta files

Mapping RNAseq data and transcript quantification

There are two options for mapping RNAseq data:

1) transcriptome-based mapping


2) mapping to a genome with an exon-aware mapper

Mapping to a transcriptome

For mapping to a transcriptome I recommend bwa or kallisto. Note that results from both programs will be dependent on the quality of the reference transcriptome.

For bwa:

1) format transcriptome

module load biology

module load bwa

module load samtools

bwa index transcriptome.fa


2) map to transcriptome

bwa mem transcriptome.fa read1.fq.gz read2.fq.gz > out.sam


3) convert to bam

samtools sort out.sam -o out.bam


For kallisto:

1) format transcriptome

module load kallisto

kallisto index -I my_name_tag my_transcriptome.fa


2) run kallisto for each individual. See kallisto documentation for parameter details, here is an example command line:

kallisto quant -I my_name_tag -o outfile_name --single -l 300 -s 50 --rf-stranded --bootstrap-samples=50 mysample.read1.fq.gz mysample.read2.fq.gz


3) output of kallisto should be analyzed with the sleuth bioconductor package

Mapping to a genome

For mapping to a genome with exon-awareness I recommend using STAR:

1) format reference genome

/home/groups/schumer/shared_bin/STAR/bin/Linux_x86_64_static/STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles genome.fa --runThreadN 2 --sjdbGTFfile file.gtf --sjdbOverhang 99


2) Map reads

/home/groups/schumer/shared_bin/STAR/bin/Linux_x86_64_static/STAR --genomeDir ./ --readFilesIn reads1.fq.gz reads2.fq.gz --readFilesCommand zcat --outFileNamePrefix name_prefix --runThreadN 10 --outSAMtype BAM SortedByCoordinate

Transcript quantification

There are lots of options for expression quantification and testing for differential expression after you have generated bam files. Some recommended tools include:

Note: sleuth should be used with kallisto output

Allele specific expression from RNAseq data

Our lab pipeline for ASE is packaged into a parallelized pipeline: ncASE_pipeline

For first time use

For subsequent use

Genome assembly with supernova - 10x data

Below is a quick guide but check out Patrick Reilly's guide (also in box) File:SupernovaAssembly_v1.pdf

1) Make sure supernova is in your path, e.g.:

export PATH=/home/groups/schumer/shared_bin/supernova-2.1.1:$PATH

It's a good idea to embed this in your supernova script

2) set up your slurm script for supernova

  • supernova has high memory and time requirements
  • recommended: at least 60 hours, 20 cpus, and high memory (~10G)

3) Example supernova commands:

supernova run --id Xbirchmanni_10X_ref --fastqs /home/data/Xbirchmanni_10Xchromium_Hudsonalpha_July2018_raw_data/ --description run1 --maxreads 280000000

  • optimal number of reads targets 40-56X coverage

supernova mkoutput --asmdir=/home/data/Xbirchmanni_10Xchromium_Hudsonalpha_July2018_raw_data/Xbirchmanni_10X_ref/outs/assembly --outprefix=Xbirchmanni_10X_assembly --style=pseudohap

Motif and binding site predictions

1) Generate a position weight matrix for your zinc finger sequence at zf.princeton.edu and save as a text file


2) Convert this output into a file format compatible with MEME

Rscript /home/groups/schumer/shared_bin/motiflogo.R cat_prdm9_PWM_zfp.txt


3) Build a background model for the genome of interest

/home/groups/schumer/shared_bin/meme-5.0.2/src/fasta-get-markov -m 2 -dna GCF_000181335.3_Felis_catus_9.0_genomic.fna fcatus.bg


4) Generate a MEME formatted position weight matrix

/home/groups/schumer/shared_bin/meme-5.0.2/scripts/uniprobe2meme -numseqs 1 -bg fcatus.bg -pseudo 1 cat_prdm9_PWM_zfp.txt.PWM.txt > cat_prdm9_PWM_zfp.meme.txt


5) Run FIMO on the genome and position weight matrix:

/home/groups/schumer/shared_bin/meme-5.0.2/src/fimo -bgfile fcatus.bg -oc ./FIMO-results_cat_predictedPRDM9 -thresh 0.00001 --max-stored-scores 300000 -verbosity 5 cat_prdm9_PWM_zfp.meme.txt GCF_000181335.3_Felis_catus_9.0_genomic.fna

Liftovers using cactus alignments

If you are running liftovers using cactus alignments for the first time, make sure to make a local copy of Bernard's alignments:

cp /home/groups/schumer/data/cactus_alignments/xip_complete.hal ./


For example, to liftover between maculatus coordinates and birchmanni coordinates:

1) load required modules

ml gcc


2) run liftover:

/home/groups/schumer/tools/hal/bin/halLiftover xip_complete.hal mac_coords.bed birchmanii bir_liftover_coords.bed


Other useful basic commands with halTools

See basic stats including names of the aligned sequences:

/home/groups/schumer/tools/hal/bin/halStats xip_complete.hal


Pulls out genomes in fasta format, some contigs might be named with NCBI names:

/home/groups/schumer/tools/hal/bin/hal2fasta xip_complete.hal


Generate synteny blocks, useful for looking at genomic rearrangements:

/home/groups/schumer/tools/hal/bin/halSynteny xip_complete.hal


Demographic Inference with ABC

We can estimate hybridization timing, population size, ancestry proportion, and parental migration rates with Approximate Bayesian Computation (ABC) simulating a single swordtail chromosome with SLiM and fitting population specific parameters of average ancestry, variance in ancestry, and median minor parent tract length.

Simulations setup

To make csv of input parameters for 1 million simulations:

Rscript /oak/stanford/groups/schumer/data/ABC_simulation_trees/simParameterInput.R

Example to run a single SLiM simulation:

slim -d SEED=1 -d POPSIZE=2000 -d GEN=120 -d INIT_PROP=0.75 -d PAR1MIG=4.5e-06 -d PAR2MIG=0.002 /home/groups/schumer/shared_bin/Lab_shared_scripts/neutral_admixture_ABC_migration.slim


Post simulations population specific inference

You can also use the 1 million simulation .tree files that have already been created and reside here:

/oak/stanford/groups/schumer/data/ABC_simulation_trees/treeFiles/

To make demographic inferences you will need population specific parameters: mean hybrid index, coefficient of variation in hybrid index, median minor parent tract length. It's best to get these just from chromosome 2 since that is what's simulated with the SLiM script. Finally you will also need the number of individuals you used to get these parameters.

For each simulated tree for each population you can create a population specific simulated tsv matching the number of individuals used to get population parameters.

python3 /oak/stanford/groups/schumer/data/ABC_simulation_trees/slim_genetree_to_indiv_ancestries.py dem_abc_sim*seed*.trees *number_of_individuals_in_population* dem_abc_sim_mig*seed*.tsv

The output name can be change to whatever you want.

These tsvs can be summarized to get the simulation mean and coefficient of variation of ancestry and length of minor tracts. Rscript /oak/stanford/groups/schumer/data/ABC_simulation_trees/abc_summary_stats_chr_win.R *seed* dem_abc_sim_mig*seed*.tsv >> summary_params.txt

Once this has been done for a large set of simulations you can then determine which simulations fall within 5% of the empirical population values.