Schumer lab: Commonly used workflows

Please add more workflows and pipelines as you develop them!

Parsing Illumina data - Tn5 libraries

  • Note: this content is also available in Dropbox in the guide called "parsing_tn5_data.txt"


Before you can use Tn5 data, it needs to be parsed by both i5 and i7 barcode. For more background, check out this helpful document from Patrick Reilly: File:IlluminaParsing v1-2.pdf

i5 barcode parsing

First, parse by i5 barcode:

1) generate a tab-delimited file with the plate ID and i5 name in the first column and the i5 sequence in the second (see i5_indices.tsv or i5_indices_1-24_revcomp.tsv in Dropbox). For example:

i5-23_COACVI2018 GGCTCGAA

i5-2_CHAFV2018-ACUAV2018 TTGGCGTT

i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015 ATCAGCGC

i5-4_ACUAVI2015-CHAFXI2017-COACXI2017 TAAATAGG

  • Note: whether you need the forward or reverse complement i5 sequences depends on which Illumina sequencer you used; currently, NextSeq and HiSeq 4000 need the reverse complement


2) Save this file as a Windows-formatted text file (or tab-delimited text)

3) copy the file to the server
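
For example, using scp from your local machine (a sketch; assumes you are copying to Sherlock, so adjust the username and destination path):

scp i5_indices.txt yourusername@login.sherlock.stanford.edu:~/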

4) convert the file to Unix text format

dos2unix [filename]

  • Note: if you look at your file and see ^M characters, run the following as well:

mac2unix [filename]
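
To check for stray carriage returns directly, you can print non-visible characters; carriage returns show up as ^M at the ends of lines:

cat -v [filename] | head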


i7 barcode generation

  • Note: you need to follow these steps once for every plate

1) open "Tn5_i7s_convert_plate_to_barcodefile.xls" from Dropbox

2) paste your plate layout in the space indicated

3) copy columns A & B (which are automatically generated) and paste them into a new Excel document by selecting:

Edit->Paste Special->Values

4) change the file format to "Windows formatted text" and save the file as Platename_i7_barcodes

  • for example "COAC-V1_2018_CHAF-V-V1_2018_Tn5_data_July2018_i7_barcodes.txt"

5) spot check several cells spread throughout the document to ensure that the right sample name, i7 id, and i7 sequence have been matched up

6) copy the file to the server

7) convert the file to unix text format

dos2unix [filename]
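
As a supplement to the manual spot check in step 5, a quick scripted pass can catch formatting problems (a sketch; assumes the file has exactly two tab-separated columns and that i7 sequences should be unique within a plate):

#flag lines that do not have exactly two tab-separated fields
awk -F'\t' 'NF != 2 {print "bad line "NR": "$0}' Platename_i7_barcodes.txt

#flag duplicate i7 sequences within the plate
cut -f 2 Platename_i7_barcodes.txt | sort | uniq -d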

parse by i5 and i7 barcodes

1) Submit a slurm job to run parsing by i5 barcode. The usage of this job is:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i5 barcode file] [which position in FASTQ list is i5 index read]

example:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh 4 "TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_R1_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_R2_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_I1_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_I2_alllanes_combined.fastq.gz" 10 i5_library_COACVI2018_CHAFV2018_ACUAV2018_ACUAVI2015 4 1

  • Note: parsing by i5 is not necessary if you used only a single i5 in the sequencing run; you can move directly to parsing by i7
  • Note: make sure the number of cores in your slurm job matches the number requested in the command line. For example, for the above an appropriate slurm header would be:

#!/bin/bash

#SBATCH --job-name=parse_i5

#SBATCH --time=24:00:00

#SBATCH --ntasks=10

#SBATCH --cpus-per-task=1

#SBATCH --mem=32000

2) After this job has finished, you can move on to parsing by i7 barcode. You will need to run one i7 job for *each* plate. For example, if the i5 barcode file had five lines, you will need to run five i7 parsing jobs (see the submission sketch below).

If you look at the output of the i5 parsing, you will notice that each parsed set generated three files, so in this next round of parsing you will have three input files instead of four; otherwise the usage is the same:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i7 barcode file] [which position in FASTQ list is i7 index read]

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh 3 "i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_1.fastq.gz i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_2.fastq.gz i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_3.fastq.gz" 10 20180716_Tn5_library_ACUA_V_2018_CHAF_I_II_2018_ACUA_VI_2015_i7_barcodes.txt 3 1
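
Since one i7 job is needed per plate, it can be convenient to loop over the i5 barcode file and submit one job per plate. A minimal sketch, assuming each plate's i7 barcode file is named plate_i7_barcodes.txt and that submit_parse_i7.sh is a slurm script you have written that takes the plate name and barcode file as arguments:

#submit one i7 parsing job for each plate (plate names are the first column of the i5 barcode file)
while read plate barcode; do
    sbatch submit_parse_i7.sh "$plate" "${plate}_i7_barcodes.txt"
done < i5_barcode_file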

Quick quality checks after parsing

1) Check the Part*logs files

  • ensure that there are few reads in the control wells (<0.01% in a good library)
  • check that coverage is relatively even among individuals (few individuals <0.3% or >2%)
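
You can also compute per-sample read counts directly from the parsed FASTQs rather than the logs (a sketch; a FASTQ record is four lines, so read counts are line counts divided by 4):

#count reads per parsed sample and a running total
total=0
for f in *_read_1.fastq.gz; do
    n=$(( $(zcat "$f" | wc -l) / 4 ))
    echo -e "$f\t$n"
    total=$(( total + n ))
done
echo "total reads: $total"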

2) Check duplication levels

  • check duplication levels with picardtools for a few individuals per run
  • acceptable levels are <20% (0.2 in picard output)

module load biology

module load bwa

bwa mem -t 3 -M xma_washu_4.4.2-jhp_0.1_combined-unplaced-mito.fa COACVI1801_i7-13_read_1.fastq.gz COACVI1801_i7-13_read_2.fastq.gz > COACVI1801.sam

java -jar /home/groups/schumer/shared_bin/SortSam.jar INPUT=COACVI1801.sam OUTPUT=COACVI1801.sorted.bam SORT_ORDER=coordinate

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=COACVI1801.sorted.bam

java -jar /home/groups/schumer/shared_bin/MarkDuplicates.jar I=COACVI1801.sorted.bam O=COACVI1801.dedup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=COACVI1801.metrics
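
To pull the duplication rate out of the metrics file without opening it, something like the following works (a sketch; it locates the PERCENT_DUPLICATION column by its header name, so it should be robust to picard version differences):

#print the PERCENT_DUPLICATION value from the picard metrics table
awk -F'\t' '/^LIBRARY/ {for (i=1; i<=NF; i++) if ($i == "PERCENT_DUPLICATION") col=i; next} col && NF > 1 {print $col; col=0}' COACVI1801.metrics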

Parsing Illumina data - Quail libraries or similar

Quail libraries are parsed in a similar way, but only have a single index so only need a single matching index file.

generate a file with the individual id and index

This file needs to have the individual ID and i7/FC-1 name, a tab, and the i7/FC-1 sequence (see Quail_FC1_index_sequence.xls). For example:

FC1-9_HUICXI17M01 GATCAG

FC1-10_HUICXI17M04 TAGCTT

FC1-12_HUICXI17JM06 CTTGTA

FC1-13_HUICXI17M02 AGTCAA

FC1-14_HUICXI17M03 AGTTCC

FC1-15_HUICXI17M05 ATGTCA

FC1-16_HUICXI17JM07 CCGTCC

parse by i7 index

Because these libraries are only parsed by i7 barcode, you can drop the second index read in the parsing

Usage:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i7 barcode file] [which position in FASTQ list is i7 index read]

For example:

/home/groups/schumer/shared_bin/Lab_shared_scripts/divideConquerParser.sh 3 "Xcortezi_resequence_HUIC_sc_wt_November2018_S0_R1_allcombined.fastq.gz Xcortezi_resequence_HUIC_sc_wt_November2018_S0_R2_allcombined.fastq.gz Xcortezi_resequence_HUIC_sc_wt_November2018_S0_I1_allcombined.fastq.gz" 10 Xcortezi_spottedcaudal_November2018 3

Ancestry analysis with AncestryHMM

Local ancestry inference is one of the most common workflows in the lab. The steps we use to run local ancestry inference for birchmanni x malinche are outlined below and are also available in Dropbox ("Shared_lab_resources/Common_commands_and_pipelines/Running_Ancestry_HMM_on_Sherlock.txt").

For first time use

1) Make a folder within your personal lab member folder to perform the analysis, e.g.:

mkdir Ancestry_HMM_runs

cd Ancestry_HMM_runs

2) link to all of the scripts used in the analysis from our lab bin

ln -s /home/groups/schumer/shared_bin/Ancestry_HMM_pipeline/* ./

3) make a copy of the sample configuration file for your specific analysis

cp hmm_configuration_file_sherlock.cfg hmm_configuration_file_runCHAF_samples2017.cfg

4) use a text editor or emacs to edit the parameters in this file (see parameter details in Dropbox guide, README, or at: https://github.com/Schumerlab/Ancestry_HMM_pipeline)

5) link to the genome assemblies you plan to use, e.g:

ln -s /home/groups/schumer/shared_bin/shared_resources/xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta ./

ln -s /home/groups/schumer/shared_bin/shared_resources/Xmalinche_dovetail_assembly.fa ./

6) load required packages and modules:

module load armadillo

module load biology

module load samtools

module load bcftools

module load py-pysam

module load bwa

export PATH="/home/groups/schumer/shared_bin/ngsutils/bin:$PATH"

export PATH="/home/groups/schumer/shared_bin/Ancestry_HMM/src:$PATH"

export PYTHONPATH=/home/groups/schumer/shared_bin:$PYTHONPATH

7) make a batch script for submission, e.g.:

#!/bin/bash

#SBATCH --job-name=run_hmm

#SBATCH --time=01:30:00

#SBATCH --ntasks=1

#SBATCH --cpus-per-task=1

#SBATCH --mem=32000

#SBATCH --mail-user=youremail@stanford.edu

perl Ancestry_HMM_parallel_v5.pl hmm_configuration_file_runCHAF_samples2017.cfg

you can also include the module/export commands in your slurm script after the #SBATCH lines

8) submit your job, e.g.:

sbatch run_hmm.sh

For subsequent use

1) edit your configuration files and job submit script as needed

2) export dependencies or make sure they are present in your slurm submission file

module load armadillo

module load biology

module load samtools

module load bcftools

module load py-pysam

module load bwa

export PATH="/home/groups/schumer/shared_bin/ngsutils/bin:$PATH"

export PATH="/home/groups/schumer/shared_bin/Ancestry_HMM/src:$PATH"

export PYTHONPATH=/home/groups/schumer/shared_bin:$PYTHONPATH

3) submit your job, e.g.:

sbatch run_hmm.sh

Simulating hybrid genomes with Simulate_hybrid_genomes

It's often useful to simulate admixed genomes to evaluate the efficacy of ancestry calling under different scenarios.

1) For first time use:

module load perl

cpan Math::Random

Make sure the following programs are in your path:

fastahack

seqtk

wgsim

bedtools2

They are all available in:

/home/groups/schumer/shared_bin

You can either export these each time you use them:

export PATH="/home/groups/schumer/shared_bin:$PATH"

or put this export command in your .bashrc file
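
For example, to make the change permanent:

echo 'export PATH="/home/groups/schumer/shared_bin:$PATH"' >> ~/.bashrc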

make a directory for the analysis:

mkdir Simulate_hybrid_genomes_runs

cd Simulate_hybrid_genomes_runs

ln -s /home/groups/schumer/shared_bin/Simulate_hybrid_genomes/* ./

cp hybrid_simulation_configuration.cfg myrun_hybrid_simulation_configuration.cfg

2) After step 1 or for subsequent use:

Export dependencies:

export PATH="/home/groups/schumer/shared_bin:$PATH"

Edit the configuration file:

emacs myrun_hybrid_simulation_configuration.cfg

3) Run the simulation

perl simulate_admixed_genomes_v2.pl myrun_hybrid_simulation_configuration.cfg

4) You can use these simulated reads as input into the Ancestry_HMM_pipeline

5) Afterwards you can summarize accuracy with the following script (part of the Simulate_hybrid_genomes GitHub):

perl post_hmm_accuracy_shell.pl ancestry-probs-par1_file ancestry-probs_par2_file path_to_simulation_reads_folder

Ancestry tsv files to admixture mapping input

1) After running AncestryHMM, convert your ancestry tsv files to hard calls using parsetsv_to_genotypes_Dec2017_v2.pl

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file genotypes_outputfile_name

1b) Optionally filter your genotypes file to remove redundant columns. The number on the command line sets how many markers must differ between adjacent columns for a column to be retained:

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file named genotypes_file.identicalfilter.txt

1c) Another option for filtering is to filter based on physical distance rather than redundancy. You may lose more information this way, but it is unbiased with respect to errors. Be careful to choose a reasonable threshold based on admixture LD decay in the specific population you are looking at.

perl thin_genotypes_by_physical_distance.pl genotypes_file distance_thresh_bp path_to:transpose_nameout.pl

Writes an output file named genotypes_file_thinned_physical_dist.txt

2) Generate a hybrid index file for these individuals with mixture proportions for each individual

perl parsetsv_ancestry_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file > hybrid_index_file

3) Match your previously generated phenotypes file with genotypes and hybrid indices for each individual

perl match_phenotypes_names_with_genotypes_and_index_file.pl phenotypes_file_name genotypes_file hybrid_index_file

This will generate output files appended with _matched_to_genotypes, _matched_to_phenotypes, and _matched_to_hybrid_index

4) Check file output carefully to make sure each has the appropriate number of lines and that individuals are listed in the same order in each file
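
One way to script this check (a sketch; assumes individual IDs are in the first column of each matched file, and the your_file... names are placeholders for your actual matched files):

#line counts should agree across the matched files
wc -l *_matched_to_genotypes *_matched_to_phenotypes *_matched_to_hybrid_index

#individual IDs should appear in the same order in each file
diff <(cut -f 1 your_file_matched_to_genotypes) <(cut -f 1 your_file_matched_to_hybrid_index) && echo "order matches"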

5) By default the admixture mapping scripts test the likelihood of a model of phenotype~hybrid_index versus a model of phenotype~genotype_focal_site + hybrid_index. If you want to include more covariates (e.g., body size) you will need to modify the script.

For binomial traits (0/1):

Rscript perform_glm_admixture_mapping_v2_binomialtrait.R genotypes_file hybrid_index_file phenotypes_file focal_column_number outfile_name_tag

For continuous traits that follow a gaussian distribution:

Rscript perform_glm_admixture_mapping_v2_gaussian.R genotypes_file hybrid_index_file phenotypes_file focal_column_number outfile_name_tag

The outfile name will be: genotypes_file_results_gaussian_v2_outfile_name_tag

6) To plot your results (as long as you used the X. birchmanni 10x genome), convert the file format:

perl convert_birchmanni10x_mapping_output_manhattan_plot_input.pl genotypes_file_results_gaussian_v2_outfile_name_tag

7) Then you can make a nice Rplot of these results using the adapted_qqman.R script

in R, source the adapted_qqman.R script:

source("adapted_qqman.R")

Load the data in R:

data<-read.csv(file="genotypes_file_results_gaussian_v2_outfile_name_tag",sep="\t",header=TRUE)

Reformat the data:

data_trim<-{}

data_trim$CHR<-data$chrom

data_trim$BP<-data$marker

data_trim$P<- data$likelihood.diff

data_trim<-as.data.frame(data_trim)

data_trim<-na.omit(data_trim)

Plot the data, e.g.:

manhattan(data_trim,genomewideline=8,suggestiveline=6)

Ancestry tsv files to rQTL input

1) After running AncestryHMM, convert your ancestry tsv files to hard calls using parsetsv_to_genotypes_Dec2017_v2.pl

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file genotypes_outputfile_name

2) Optionally filter your genotypes file to remove redundant columns. The number on the command line sets how many markers must differ between adjacent columns for a column to be retained:

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file named genotypes_file.identicalfilter.txt

2b) Another option for filtering is to filter based on physical distance rather than redundancy. You may lose more information this way but it is unbiased with respect to errors:

perl thin_genotypes_by_physical_distance.pl genotypes_file distance_thresh_bp path_to:transpose_nameout.pl

Writes an output file named genotypes_file_thinned_physical_dist.txt

3) Generate a hybrid index file for these individuals with mixture proportions for each individual

perl parsetsv_ancestry_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file > hybrid_index_file

4) Match your previously generated phenotypes file with genotypes and hybrid indices for each individual

perl match_phenotypes_names_with_genotypes_and_index_file.pl phenotypes_file_name genotypes_file hybrid_index_file

This will generate output files appended with _matched_to_genotypes, _matched_to_phenotypes, and _matched_to_hybrid_index

5) Check file output carefully to make sure each has the appropriate number of lines and that individuals are listed in the same order in each file

6) Convert the genotypes to rQTL format

perl genotypes_to_rqtl_April2019_v4.pl genotypes_file.identicalfilter.txt_matched_to_phenos phenotypes_file_phenos_second_column

writes an output file named: infile.rqtl.csv

7) Now you're ready to run rQTL! See rQTL documentation for details: http://www.rqtl.org/manual/qtl-manual.pdf

Use the following structure to read in this data format properly:

data<-read.cross("csv",dir="./","infile.rqtl.csv",na.strings=c("NA"),estimate.map=FALSE)

Ancestry tsv files to average ancestry in windows

1) Generate a file with average ancestry at every site that passes a user-defined threshold number of individuals and a user-defined posterior probability cutoff (recommended >=0.9)

perl calculate_avg_ancestry_by_site_tsv_files_v5.pl infilepar1 infilepar2 num_ind_thresh posterior_prob_thresh path_to:transpose_nameout.pl > outfile

2) Summarize average ancestry in windows based on the intervals in a bed file:

Rscript average_ancestry_bybedintervals.R infile bins.bed chr_name

The script will write an output file called average_infile_ancestry_bins.bed

Thin Ancestry tsv files by genetic distance

1) After running AncestryHMM, extract the marker names:

head -n 1 ancestry-probs-par1_allchrs.tsv > my_markers

transpose these markers and reformat:

perl transpose_nameout.pl my_markers

perl -pi -e 's/:/\t/g' my_markers_transposed

2) Thin the markers from the X. birchmanni 10x assembly to the desired genetic distance. It's a good idea to also impose a physical distance threshold because of inaccuracies in the recombination map at a fine scale:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R my_markers_transposed cMthresh physical_dist_thresh

This script will write an output file called infile_cMdistances_thinned_distancethresh

Note that there may be greater distance between markers if the region is marker-poor; this will be recorded in the output file

3) Select these markers from your tsv file

cut -f 1,2 infile_cMdistances_thinned_distancethresh | perl -p -e 's/\t/:/g' > my_thinned_markers_reformat

perl select_passing_markers_multi_geno_files_v2.pl my_thinned_markers_reformat ancestry-probs-par1_allchrs.tsv path_to:transpose_nameout.pl outfile_name_par1

perl select_passing_markers_multi_geno_files_v2.pl my_thinned_markers_reformat ancestry-probs-par2_allchrs.tsv path_to:transpose_nameout.pl outfile_name_par2

Ancestry tsv to identifying ancestry transitions

1) After running AncestryHMM, convert your ancestry tsv files to hard calls using parsetsv_to_genotypes_Dec2017_v2.pl

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1_file ancestry-probs-par2_file genotypes_outputfile_name

2) Run intervals script:

Rscript identify_intervals_10x_genomes.R genotypes_outputfile_name path_to:transpose_nameout.pl

Writes output to: genotypes_outputfile_name_focal_chr, with one file for each chromosome

Ancestry tsv files to admixture LD decay

We often want to quantify the change in admixture LD over genetic or physical distance. This is important for dating hybrid populations, understanding which regions are independent from each other in hybrid populations, detecting selection, and much more.

To quantify admixture LD, first convert genotypes files to plink input files. It's a good idea to thin the data first either by uniqueness or by physical or genetic distance.

1) Convert genotypes files to plink input files

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/convert_msg_genotypes_to_plink.pl genotypes_ACUA_2018.txt.identical_filter.txt

1a) You can filter your genotypes files to generate a less redundant input dataset. Either filter_identical_columns_threshold.pl or thin_genotypes_by_physical_distance.pl will work; I recommend thinning by physical distance for this application:

perl thin_genotypes_by_physical_distance.pl genotypes_file distance_thresh_bp path_to:transpose_nameout.pl

Writes an output file named genotypes_file_thinned_physical_dist.txt

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file named genotypes_file.identicalfilter.txt

2a) You can filter your data with plink before quantifying LD. This command excludes AIMs that are missing in 50% or more of individuals and that are >98% one ancestry or another

/home/groups/schumer/shared_bin/plink --file genotypes_ACUA_2018.txt.identical_filter.txt --recode --tab --geno 0.5 --maf 0.02 --allow-extra-chr --out genotypes_ACUA_2018.txt.identical_filter.txt.trim


2b) Run plink to quantify admixture LD. The downstream scripts assume that you are outputting R2 and D so modify accordingly if needed

/home/groups/schumer/shared_bin/plink --file genotypes_ACUA_2018.txt.identical_filter.txt.trim --ld-window-kb 3000 --ld-window 20000 --r2 d --hardy --hwe 0.00001 --out ACUA --ld-window-r2 0 --allow-extra-chr

See plink documentation for details on these parameters


3a) Convert plink physical distance into genetic distance based on the X. birchmanni 10x genome (do not use this script for other genome assembly versions):

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/convert_plink_LD_genetic_distance_xbir10x.R ACUA.ld ScyDAA6-2-HRSCAF-26

This will output a file called ACUA.ld_ScyDAA6-2-HRSCAF-26_cMdistances


3b) If you want a genome-wide picture, combine files from multiple chromosomes, e.g.:

cat ACUA.ld_ScyDAA6-*_cMdistances | grep -v totalcM > ACUA.ld_allscaffolds_combined_cMdistances

Note that to use the downstream scripts you'll need to add back in the original header


4) To quantify admixture decay over genetic distance in bins of 0.1 cM:

Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/convert_cM_version_plink_admixture_decay.R ACUA.ld_allscaffolds_combined_cMdistances 0.1

Will produce a file called: ACUA.ld_allscaffolds_cMdistances_admixture_ld_decay_cMdist


5) Use the decay in admixture LD over genetic distance to estimate the time of initial admixture. Note that there are many assumptions that go into this estimate, including large population sizes, a single pulse of migration, etc., so interpret with caution:


Rscript /home/groups/schumer/shared_bin/Lab_shared_scripts/estimateage_plink_admixture_decay.R ACUA.ld_allscaffolds_cMdistances_admixture_ld_decay_cMdist


This will produce a pdf plotting the admixture decay named ACUA.ld_allscaffolds_cMdistances_admixture_ld_decay_cMdist_lddecay.pdf and will print the estimated mixture time and standard deviation to the screen

Merging multiple ancestry tsv files run separately

  • Warning!!! It is only appropriate to do this if the runs were performed with the exact same parameters in the cfg file
  • It is always better to run all individuals through the HMM together if possible, but sometimes this is not practical for very large numbers of individuals


The example shown here merges three ancestry tsv files, but the same procedure works for any number:


1) generate a list of the markers in each file

head -n 1 ancestry-probs-par1_allchrs_batch1.tsv > marker_list_batch1

head -n 1 ancestry-probs-par1_allchrs_batch2.tsv > marker_list_batch2

head -n 1 ancestry-probs-par1_allchrs_batch3.tsv > marker_list_batch3

transpose files so there is one marker per line:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/transpose_nameout.pl marker_list_batch1

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/transpose_nameout.pl marker_list_batch2

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/transpose_nameout.pl marker_list_batch3

this will output files named e.g. marker_list_batch1_transposed


2) make a list of marker files:

ls marker_list_batch*_transposed > all_markers_list


3) identify markers that are covered in all tsv files:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/generate_list_of_passing_markers_multi_geno_files.pl all_markers_list

this will write an output file named passing_markers_FILENAME, i.e. for this example: passing_markers_all_markers_list


4) select these markers from your tsv files:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par1_allchrs_batch1.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par1_allchrs_batch1_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par2_allchrs_batch1.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par2_allchrs_batch1_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par1_allchrs_batch2.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par1_allchrs_batch2_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par2_allchrs_batch2.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par2_allchrs_batch2_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par1_allchrs_batch3.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par1_allchrs_batch3_selectshared.tsv

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/select_passing_markers_list_genotype_files_v2.pl passing_markers_all_markers_list ancestry-probs-par2_allchrs_batch3.tsv /home/groups/schumer/shared_bin/Lab_shared_scripts/ ancestry-probs-par2_allchrs_batch3_selectshared.tsv


5) merge these outfiles:

  • Warning!! Carefully check all file names when doing this merging; it is easy to make mistakes, such as swapping par1 and par2 files, which will have serious consequences


We need to retain the header from one file each of the parent 1 and parent 2 files, so we will leave these two files unmodified:

ancestry-probs-par1_allchrs_batch1_selectshared.tsv

ancestry-probs-par2_allchrs_batch1_selectshared.tsv

For the other files we need to trim the header:

tail -n +2 ancestry-probs-par1_allchrs_batch2_selectshared.tsv > ancestry-probs-par1_allchrs_batch2_selectshared_noheader.tsv

tail -n +2 ancestry-probs-par2_allchrs_batch2_selectshared.tsv > ancestry-probs-par2_allchrs_batch2_selectshared_noheader.tsv

tail -n +2 ancestry-probs-par1_allchrs_batch3_selectshared.tsv > ancestry-probs-par1_allchrs_batch3_selectshared_noheader.tsv

tail -n +2 ancestry-probs-par2_allchrs_batch3_selectshared.tsv > ancestry-probs-par2_allchrs_batch3_selectshared_noheader.tsv

Combine all files for parent 1 and parent 2:

cat ancestry-probs-par1_allchrs_batch1_selectshared.tsv ancestry-probs-par1_allchrs_batch2_selectshared_noheader.tsv ancestry-probs-par1_allchrs_batch3_selectshared_noheader.tsv > ancestry-probs-par1_allchrs_combinedindividuals_sharedmarkers.tsv

cat ancestry-probs-par2_allchrs_batch1_selectshared.tsv ancestry-probs-par2_allchrs_batch2_selectshared_noheader.tsv ancestry-probs-par2_allchrs_batch3_selectshared_noheader.tsv > ancestry-probs-par2_allchrs_combinedindividuals_sharedmarkers.tsv
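
Given the warning above, a quick consistency check on the merged files is worthwhile (a sketch; assumes the first line of each tsv is the shared marker header and each subsequent line is one individual):

#both merged files should have the same number of lines
wc -l ancestry-probs-par1_allchrs_combinedindividuals_sharedmarkers.tsv ancestry-probs-par2_allchrs_combinedindividuals_sharedmarkers.tsv

#the marker headers should be identical between the par1 and par2 files
diff <(head -n 1 ancestry-probs-par1_allchrs_combinedindividuals_sharedmarkers.tsv) <(head -n 1 ancestry-probs-par2_allchrs_combinedindividuals_sharedmarkers.tsv) && echo "headers match"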

Mapping and variant calling

One of the most common workflows in the lab is moving from fastq.gz files to vcf files. This involves a large number of steps and can differ between programs used for mapping and variant calling. Below we outline the most commonly used steps in our lab and two shell scripts that can be used to run them in an automated way:

0) before you start, index your reference genome for bwa, GATK (a picard sequence dictionary), and samtools. This only needs to be done the first time you run this analysis in a particular directory.

#load modules

module load biology

module load bwa

module load samtools

#index

bwa index ref_genome.fa

java -jar /home/groups/schumer/shared_bin/CreateSequenceDictionary.jar R=ref_genome.fa O=ref_genome.dict

samtools faidx ref_genome.fa

1) mapping reads to the reference genome:

module load biology

module load bwa

bwa mem -t 3 -M -R '@RG\tID:id\tSM:quail_libraryprep\tPL:illumina\tLB:lib1\tPU:illuminaHiSeq' ref_genome.fa read_1.fq.gz read_2.fq.gz > file.sam

2) process and sort sam file

java -jar /home/groups/schumer/shared_bin/SortSam.jar INPUT=file.sam OUTPUT=file.sorted.bam SORT_ORDER=coordinate

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=file.sorted.bam


3) mark read duplicates in the bam file

java -jar /home/groups/schumer/shared_bin/MarkDuplicates.jar INPUT=file.sorted.bam OUTPUT=file.sorted.dedup.bam METRICS_FILE=file.sorted.metrics

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=file.sorted.dedup.bam

4) re-align reads around indels

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref_genome.fa -I file.sorted.dedup.bam -o file.sorted.dedup.bam.list

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T IndelRealigner -R ref_genome.fa -I file.sorted.dedup.bam -targetIntervals file.sorted.dedup.bam.list -o file.sorted.dedup.realigned.bam

5) perform variant calling

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref_genome.fa -I file.sorted.dedup.realigned.bam --genotyping_mode DISCOVERY -L chromosome_targets.list -stand_emit_conf 10 -stand_call_conf 30 -ERC GVCF -o file.sorted.dedup.realigned.bam.rawvariants.g.vcf

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref_genome.fa --variant file.sorted.dedup.realigned.bam.rawvariants.g.vcf --sample_ploidy 2 --max_alternate_alleles 4 --includeNonVariantSites --standard_min_confidence_threshold_for_calling 30 -o file.sorted.dedup.realigned.bam.g.vcf


We have two shell scripts that will run these steps for lists of reads.

The first runs the mapping and sorting, marks duplicates, and realigns around indels:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_bwa_jobs_list.pl list_of_fastq_files PE_or_SE slurm_submission_file path_to_picard_tools_and_gatk_jars genome_to_map_to file_name_tag

Example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_bwa_jobs_list.pl combined_read_list_coac_ld_map PE /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_slurm_bwa.sh /home/groups/schumer/shared_bin xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta xbir-10x

The reads file should be in the format: id1_read1.fq.gz\tid1_read2.fq.gz\n
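
If your parsed files follow a consistent naming scheme, you can build this list with a short loop (a sketch; assumes files named id_read_1.fq.gz / id_read_2.fq.gz, so adjust the suffixes to match your data):

#pair read 1 and read 2 files into a tab-separated list
for r1 in *_read_1.fq.gz; do
    r2=${r1%_read_1.fq.gz}_read_2.fq.gz
    [ -f "$r2" ] && echo -e "$r1\t$r2"
done > list_of_fastq_files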


The second shell script, which can be run after the first finishes, runs the actual variant calling steps (see Submitting slurm jobs for tips). It requires a list of sorted, realigned bam files and a chromosome targets list, among other parameters. The targets list can include all chromosomes, but splitting it into several batches is helpful for running jobs in parallel.

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_gatk-indel-hc_jobs_list.pl list_of_sorted_realigned_bam_files slurm_submission_file_to_use absolute_path_to_GATK_jars genome_assembly_used vcf_target_chromosome_list output_tag

Example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_gatk-indel-hc_jobs_list.pl xmac_mapped_bam_list /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_gatk-indel-hc.sh /home/groups/schumer/shared_bin/ xma_washu_4.4.2-jhp_0.1_combined-unplaced-mito.fa xmac_targets.group1.list group1

g.vcf files to Admixtools input

Admixtools is a program developed by the Reich lab that can perform a large number of analyses, including calculating D-statistics for four populations.

The following is a workflow for converting from .g.vcf files generated by GATK to input for Admixtools.

1) for each .g.vcf file, generate an insnp file with masked and variant basepairs. See documentation of insnp_v9_gatk3.4_gvcf.py for hard-call filter options.

python /home/groups/schumer/shared_bin/insnp_v9_gatk3.4_gvcf.py file.g.vcf file.g.vcf.insnp 20 5 40 10 10 4 -12.5 -8.0 5

2) make a list of all insnp files you would like summarized for analysis using Admixtools, one on each line

3) using this file, run the following script:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/join_insnp_files_to_admixtools_v5.pl insnp_list outfile_name_tag 1 5 500 /home/groups/schumer/shared_bin/Lab_shared_scripts

where insnp_list is the list generated in step 2, 1 indicates that the reference individual should be printed [0 to omit], 5 indicates that a site will be excluded if more than 5 individuals have missing/masked basepairs, 500 indicates the physical distance over which to thin adjacent markers, and /home/groups/schumer/shared_bin/Lab_shared_scripts is the path to the dependency script: overlap_list_retain_unmatched-for-join-insnp.pl

This will generate three output files that can be used as input for Admixtools:

outfile_name_tag.geno outfile_name_tag.snp outfile_name_tag.ind

4) edit outfile_name_tag.ind to contain the species names of interest in the third column

5) Admixtools complains if the individual names are too long and if the chromosome names are not chr1, chr2, etc. This might need modification depending on the analysis being run. A quick fix is a perl regex find and replace:

perl -pi -e 's/group1/chr1/g' outfile_name_tag.snp

Batch script to generate insnp files

If you have a lot of g.vcf files to process, it's easier to use a batch script we have in the shared folder to submit all of the insnp jobs.

1) make a list of the g.vcf files you want to process; if they aren't in the folder you are working in, make sure to include the path

2) run the shell script

usage is: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_insnp_jobs_list.pl vcf_list path_to_dependency_scripts submission_header.sh

example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_insnp_jobs_list.pl COAC_vcf_list /home/groups/schumer/shared_bin/Lab_shared_scripts /home/groups/schumer/shared_bin/Lab_shared_scripts/submit_slurm_insnp.sh

g.vcf files to ped input for use with PLINK

This follows steps 1-5 of the previous section g.vcf files to Admixtools input, then uses the convertf package from Admixtools to convert to ped format:

6) make a parameter file for convertf

cp /home/groups/schumer/shared_bin/AdmixTools/convertf/par.EIGENSTRAT.PED parLDmapbir_eigen_to_ped

Use a text editor to edit the input parameters to match your file names (which are in eigenstrat format):

genotypename: example.eigenstratgeno

snpname: example.snp

indivname: example.ind

and specify the names of the output file for plink:

genotypeoutname: example.ped

snpoutname: example.pedsnp

indivoutname: example.pedind

7) Load the modules required for Admixtools and run the convertf program

module load openblas

module load gsl

/home/groups/schumer/shared_bin/AdmixTools/bin/convertf -p parLDmapbir_eigen_to_ped

D-statistic analysis with Admixtools

Once you have .geno, .snp, and .ind files, you can run D-statistic analysis with Admixtools

1) Generate a parameter file for Admixtools by copying the example file:

cp /home/groups/schumer/shared_bin/AdmixTools/examples/parqpDstat parameter_file_forDstat


2) Edit the parameter file lines:

DIR: ../data/ #path to the directory where your files are, can be ./

SSS: allmap #if your files have a common prefix you can put it here

indivname: DIR/SSS.ind #the path and name of your .ind file, this string corresponds to a file ../data/allmap.ind

snpname: DIR/SSS.snp #the path and name of your snp file

genotypename: DIR/SSS.geno #the path and name of your geno file

poplistname: list_qpDstat #the list of populations you are looking at, one on each line
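
For example, list_qpDstat might contain (hypothetical population labels; they must match the labels in the third column of your .ind file):

birchmanni
malinche
cortezi
outgroup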


3) Load the required modules and run Admixtools:

module load openblas

module load gsl

/home/groups/schumer/shared_bin/AdmixTools/bin/qpDstat -p parameter_file_forDstat

g.vcf files to pseudo-fasta files

Mapping RNAseq data and transcript quantification

There are two options for mapping RNAseq data:

1) transcriptome-based mapping

2) mapping to a genome with an exon-aware mapper

Mapping to a transcriptome

For mapping to a transcriptome I recommend bwa or kallisto. Note that results from both programs will be dependent on the quality of the reference transcriptome.

For bwa:

1) format transcriptome

module load biology

module load bwa

module load samtools

bwa index transcriptome.fa

2) map to transcriptome

bwa mem transcriptome.fa read1.fq.gz read2.fq.gz > out.sam

3) convert to bam

samtools sort out.sam -o out.bam


For kallisto:

1) format transcriptome

module load kallisto

kallisto index -i my_name_tag my_transcriptome.fa

2) run kallisto for each individual. See kallisto documentation for parameter details, here is an example command line:

kallisto quant -i my_name_tag -o outfile_name --single -l 300 -s 50 --rf-stranded --bootstrap-samples=50 mysample.read1.fq.gz mysample.read2.fq.gz

3) output of kallisto should be analyzed with the sleuth R package

Mapping to a genome

For mapping to a genome with exon-awareness I recommend using STAR:

1) format reference genome

/home/groups/schumer/shared_bin/STAR/bin/Linux_x86_64_static/STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles genome.fa --runThreadN 2 --sjdbGTFfile file.gtf --sjdbOverhang 99

2) Map reads

/home/groups/schumer/shared_bin/STAR/bin/Linux_x86_64_static/STAR --genomeDir ./ --readFilesIn reads1.fq.gz reads2.fq.gz --readFilesCommand zcat --outFileNamePrefix name_prefix --runThreadN 10 --outSAMtype BAM SortedByCoordinate

Transcript quantification

RSEM

htseq-count
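
For htseq-count, a minimal sketch (assumes the coordinate-sorted BAM produced by the STAR step above, the same file.gtf, and that htseq-count is available in your path; check the strandedness setting against your library prep):

#count reads per gene from the coordinate-sorted STAR output
htseq-count -f bam -r pos -s no name_prefixAligned.sortedByCoord.out.bam file.gtf > name_prefix_counts.txt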

Allele specific expression from RNAseq data

Our lab's ASE analysis is packaged into a parallelized pipeline: ncASE_pipeline

For first time use

For subsequent use

Genome assembly with supernova - 10x data

Motif and binding site predictions

1) Generate a position weight matrix for your zinc finger sequence at zf.princeton.edu and save as a text file

2) Convert this output into a file format compatible with MEME

Rscript /home/groups/schumer/shared_bin/motiflogo.R cat_prdm9_PWM_zfp.txt

3) Build a background model for the genome of interest

/home/groups/schumer/shared_bin/meme-5.0.2/src/fasta-get-markov -m 2 -dna GCF_000181335.3_Felis_catus_9.0_genomic.fna fcatus.bg

4) Generate a MEME formatted position weight matrix

/home/groups/schumer/shared_bin/meme-5.0.2/scripts/uniprobe2meme -numseqs 1 -bg fcatus.bg -pseudo 1 cat_prdm9_PWM_zfp.txt.PWM.txt > cat_prdm9_PWM_zfp.meme.txt

5) Run FIMO on the genome and position weight matrix:

/home/groups/schumer/shared_bin/meme-5.0.2/src/fimo -bgfile fcatus.bg -oc ./FIMO-results_cat_predictedPRDM9 -thresh 0.00001 --max-stored-scores 300000 -verbosity 5 cat_prdm9_PWM_zfp.meme.txt GCF_000181335.3_Felis_catus_9.0_genomic.fna