Schumer lab: Commonly used workflows

From OpenWetWare
Jump to navigationJump to search

Please add more workflows and pipelines as you develop them!

Parsing Illumina data - Tn5 libraries

  • Note: this content is also in dropbox in the guide called "parsing_tn5_data.txt"

Before you can use Tn5 data, it needs to be parsed by both i5 and i7 barcode. Check out this helpful document from Patrick Reilly for more background File:IlluminaParsing v1-2.pdf

i5 barcode parsing

First, parse by i5 barcode:

1) generate a file with the plate id and i5 name \t the i5 sequence (see i5_indices.tsv or i5_indices_1-24_revcomp.tsv in Dropbox). For example:





  • Note: whether you need the forward or reverse complement i5 sequences depends on which Illumina sequencer you used - currently NextSeq and HiSeq 4000 need the reverse complement

2) Make sure to save this file as a windows formatted text file (or Tab delimited text)

3) copy the file to the server

4) convert the file to Unix text format

dos2unix [filename]

  • Note: if you look at your file and see ^M characters, run the following as well:

mac2unix [filename]

i7 barcode generation

  • Note: you need to follow these steps once for every plate

1) open "Tn5_i7s_convert_plate_to_barcodefile.xls" from Dropbox

2) paste your plate layout in the space indicated

3) copy columns A & B which are automatically generated and paste using into a new excel document by selecting:

Edit->paste special->Values

4) change the file format to "windows formatted text". Save the file name as Platename_i7_barcodes

  • for example "COAC-V1_2018_CHAF-V-V1_2018_Tn5_data_July2018_i7_barcodes.txt"

5) spot check several cells spread throughout the document to ensure that the right sample name, i7 id, and i7 sequence have been matched up

6) copy the file to the server

7) convert the file to unix text format

dos2unix [filename]

parse by i5 and i7 barcodes

1) Submit a slurm job to run parsing by i5 barcode. The usage of this job is:

/home/groups/schumer/shared_bin/Lab_shared_scripts/ [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i5 barcode file] [which position in FASTQ list is i5 index read]


/home/groups/schumer/shared_bin/Lab_shared_scripts/ 4 "TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_R1_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_R2_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_I1_alllanes_combined.fastq.gz TLMCI2015_COACVI2018-2017_CHAFV2018-2017_ACUAV2018_ACUAVI2015_S0_I2_alllanes_combined.fastq.gz" 10 i5_library_COACVI2018_CHAFV2018_ACUAV2018_ACUAVI2015 4 1

  • Note: parsing by i5 is not necessary if you have only used a single i5 in the sequencing run, you can move directly to parsing by i7
  • Note: make sure the number of cores in your slurm job matches the number requested in the command line. For example, for the above an appropriate slurm header would be:


#SBATCH --job-name=parse_i5

#SBATCH --time=24:00:00

#SBATCH --ntasks=10

#SBATCH --cpus-per-task=1

#SBATCH --mem=32000

2) After this job has finished, you can move on the parsing by i7 barcode. You will need to run one i7 job for *each* plate. For example, if the i5 barcode file had five lines, you will need to run five i7 parsing jobs.

If you look at the output of the i5 parsing you will notice that each parsed set generated three files, so in this next round of parsing you will have three files instead of four but otherwise the usage is the same:

/home/groups/schumer/shared_bin/Lab_shared_scripts/ [# read files] [List of FASTQs IN QUOTES] [# cores to use] [i7 barcode file] [which position in FASTQ list is i7 index read]

/home/groups/schumer/shared_bin/Lab_shared_scripts/ 3 "i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_1.fastq.gz i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_2.fastq.gz i5-3_ACUAV2018-CHAFI-II2018-ACUAVI2015_read_3.fastq.gz" 10 20180716_Tn5_library_ACUA_V_2018_CHAF_I_II_2018_ACUA_VI_2015_i7_barcodes.txt 3 1

Quick quality checks after parsing

1) Check the Part*logs files

  • ensure that there are few reads in the control wells (<0.01% in a good library)
  • check that coverage is relatively even among individuals (few individuals <0.3% or >2%)

2) Check duplication levels

  • check duplication levels with picardtools for a few individuals per run
  • acceptable levels are <20% (0.2 in picard output)

module load biology

module load bwa

bwa mem -t 3 -M xma_washu_4.4.2-jhp_0.1_combined-unplaced-mito.fa COACVI1801_i7-13_read_1.fastq.gz COACVI1801_i7-13_read_2.fastq.gz > COACVI1801.sam

java -jar /home/groups/schumer/shared_bin/SortSam.jar INPUT=COACVI1801.sam OUTPUT=COACVI1801.sorted.bam SORT_ORDER=coordinate

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=COACVI1801.sorted.bam

java -jar /home/groups/schumer/shared_bin/MarkDuplicates.jar I=COACVI1801.sorted.bam O=COACVI1801.dedup.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=COACVI1801.metrics

Parsing Illumina data - Quail libraries or similar

Ancestry analysis with AncestryHMM

Local ancestry inference is one of the most common workflows in the lab. The steps we use to run local ancestry inference for birchmanni x malinche are outlined below and is also available in dropbox ("Shared_lab_resources/Common_commands_and_pipelines/Running_Ancestry_HMM_on_Sherlock.txt").

For first time use

1) Make a folder within your personal lab member folder to perform the analysis, e.g.:

mkdir Ancestry_HMM_runs cd Ancestry_HMM_runs

2) link to all of the scripts used in the analysis from our lab bin

ln -s /home/groups/schumer/shared_bin/Ancestry_HMM_pipeline/* ./

3) make a copy of the sample configuration file for your specific analysis

cp hmm_configuration_file_sherlock.cfg hmm_configuration_file_runCHAF_samples2017.cfg

4) use a text editor or emacs to edit the parameters in this file (see parameter details in Dropbox guide, README, or at:

5) link to the genome assemblies you plan to use, e.g:

ln -s /home/groups/schumer/shared_bin/shared_resources/xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta ./

ln -s /home/groups/schumer/shared_bin/shared_resources/Xmalinche_dovetail_assembly.fa ./

6) load required packages and modules:

module load armadillo

module load biology

module load samtools

module load bcftools

module load py-pysam

module load bwa

export PATH="/home/groups/schumer/shared_bin/ngsutils/bin:$PATH"

export PATH="/home/groups/schumer/shared_bin/Ancestry_HMM/src:$PATH"

export PYTHONPATH=/home/groups/schumer/shared_bin:$PYTHONPATH

7) make a batch script for submission, e.g.:


#SBATCH --job-name=run_hmm

#SBATCH --time=01:30:00

#SBATCH --ntasks=1

#SBATCH --cpus-per-task=1

#SBATCH --mem=32000


perl hmm_configuration_file_runCHAF_samples2017.cfg

you can also include the module/export commands in your slurm script after the #SBATCH lines

8) submit your job, e.g.:


For subsequent use

1) edit your configuration files and job submit script as needed

2) export dependencies or make sure they are present in your slurm submission file

module load armadillo

module load biology

module load samtools

module load bcftools

module load py-pysam

module load bwa

export PATH="/home/groups/schumer/shared_bin/ngsutils/bin:$PATH"

export PATH="/home/groups/schumer/shared_bin/Ancestry_HMM/src:$PATH"

export PYTHONPATH=/home/groups/schumer/shared_bin:$PYTHONPATH

3) submit your job, e.g.:


Mapping and variant calling

One of the most common workflows in the lab is moving from fastq.gz files to vcf files. This involves a large number of steps and can differ between programs used for mapping and variant calling. Below we outline the most commonly used steps in our lab and two shell scripts that can be used to run them in an automated way:

0) before you start, index your reference genome bwa index, gatk, and samtools. This only needs to be done the first time you run this analysis in a particular directory.

#load modules

module load biology

module load bwa

module load samtools


bwa index ref_genome.fa

java -jar /home/groups/schumer/shared_bin/CreateSequenceDictionary.jar R=ref_genome.fa O=ref_genome.dict

samtools faidx ref_genome.fa

1) mapping reads to the reference genome:

module load biology

module load bwa

bwa mem -t 3 -M -R '@RG\tID:id\tSM:quail_libraryprep\tPL:illumina\tLB:lib1\tPU:illuminaHiSeq' ref_genome.fa read_1.fq.gz read_2.fq.gz > file.sam

2) process and sort sam file

java -jar /home/groups/schumer/shared_bin/SortSam.jar INPUT=file.sam OUTPUT=file.sorted.bam SORT_ORDER=coordinate

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=file.sorted.bam

3) mark read duplicates in the bam file

java -jar /home/groups/schumer/shared_bin/MarkDuplicates.jar INPUT=file.sorted.bam OUTPUT=file.sorted.dedup.bam METRICS_FILE=file.sorted.metrics

java -jar /home/groups/schumer/shared_bin/BuildBamIndex.jar INPUT=file.sorted.dedup.bam

4) re-align reads around indels

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref_genome.fa -I file.sorted.dedup.bam -o file.sorted.dedup.bam.list

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T IndelRealigner -R ref_genome.fa -I file.sorted.dedup.bam -targetIntervals file.sorted.dedup.bam.list -o file.sorted.dedup.realigned.bam

5) perform variant calling

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R ref_genome.fa -I file.sorted.dedup.realigned.bam --genotyping_mode DISCOVERY -L chromosome_targets.list -stand_emit_conf 10 -stand_call_conf 30 -ERC GVCF -o file.sorted.dedup.realigned.bam.rawvariants.g.vcf

java -jar /home/groups/schumer/shared_bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref_genome.fa --variant file.sorted.dedup.realigned.bam.rawvariants.g.vcf --sample_ploidy 2 --max_alternate_alleles 4 --includeNonVariantSites --standard_min_confidence_threshold_for_calling 30 -o file.sorted.dedup.realigned.bam.g.vcf

We have two shell scripts that will run these steps for lists of reads.

The first runs the mapping, sorting, de-duplicates, and realigns indels:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ list_of_fastq_files PE_or_SE slurm_submission_file path_to_picard_tools_and_gatk_jars genome_to_map_to file_name_tag

Example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ combined_read_list_coac_ld_map PE /home/groups/schumer/shared_bin/Lab_shared_scripts/ /home/groups/schumer/shared_bin xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta xbir-10x

The reads file should be in the format: id1_read1.fq.gz\tid1_read2.fq.gz\n

The second shell script, which can be run after the first finishes, runs the actual variant calling steps (see Submitting slurm jobs for tips). It requires a list of sorted, realigned bam files and a chromosome targets list, among other parameters. This can include all chromosomes but splitting into several batches is helpful for running in parallel.

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ list_of_sorted_realigned_bam_files slurm_submission_file_to_use absolute_path_to_GATK_jars genome_assembly_used vcf_target_chromosome_list output_tag

Example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ xmac_mapped_bam_list /home/groups/schumer/shared_bin/Lab_shared_scripts/ /home/groups/schumer/shared_bin/ xma_washu_4.4.2-jhp_0.1_combined-unplaced-mito.fa xmac_targets.group1.list group1

g.vcf files to Admixtools input

Admixtools is a program developed by the Reich lab that can perform a large number of analyses [[1]], including calculating D-statistics for four populations.

The following is a workflow for converting from .g.vcf files generated by GATK to input for Admixtools.

1) for each .g.vcf file, generate an insnp file with masked and variant basepairs. See documentation of for hard-call filter options.

python /home/groups/schumer/shared_bin/ file.g.vcf file.g.vcf.insnp 20 5 40 10 10 4 -12.5 -8.0 5

2) make a list of all insnp files you would like summarized for analysis using Admixtools, one on each line

3) using this file, run the following script:

perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ insnp_list outfile_name_tag 1 5 500 /home/groups/schumer/shared_bin/Lab_shared_scripts

where insnp_list is the list generated in step 2, 1 indicates that the reference individual should be printed [0 to omit], 5 indicates that a site will be excluded if more than 5 individuals have missing/masked basepairs, 500 indicates the physical distance over which to thin adjacent markers, and /home/groups/schumer/shared_bin/ is the path to the dependency script:

This will generate three output files that can be used as input for Admixtools:

outfile_name_tag.geno outfile_name_tag.snp outfile_name_tag.ind

4) edit outfile_name_tag.ind to contain the species names of interest in the third column

5) Admixtools complains if the individual names are too long and if the chromosome names are not chr1, chr2, etc. This might need modification depending on the analysis being run. A quick fix is a perl regex find and replace:

perl -pi -e 's/group1/chr1/g' outfile_name_tag.snp

g.vcf files to ped input for use with PLINK

This follows steps 1-5 of the previous section g.vcf files to Admixtools input, then uses the convertf package from Admixtools to convert to ped format:

6) make a parameter file for convertf

cp /home/groups/schumer/shared_bin/AdmixTools/convertf/par.EIGENSTRAT.PED parLDmapbir_eigen_to_ped

Use a text editor to edit the input parameters to match your file names (which are in eigenstrat format):

genotypename: example.eigenstratgeno

snpname: example.snp

indivname: example.ind

and specify the names of the output file for plink:

genotypeoutname: example.ped

snpoutname: example.pedsnp

indivoutname: example.pedind

7) Load the modules required for Admixtools and run the convertf program

module load openblas

module load gsl

/home/groups/schumer/shared_bin/AdmixTools/bin/convertf -p parLDmapbir_eigen_to_ped

Batch script to generate insnp files

If you have a lot of g.vcf files to process, it's easier to use a batch script we have in the shared folder to submit all of the inns jobs.

1) make a list of the gvcf files you want to process, if they aren't in the same folder you are working in make sure to include the path

2) run the shell script

usage is: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ vcf_list path_to_dependency_scripts

example: perl /home/groups/schumer/shared_bin/Lab_shared_scripts/ COAC_vcf_list /home/groups/schumer/shared_bin/Lab_shared_scripts /home/groups/schumer/shared_bin/Lab_shared_scripts/

D-statistic analysis with Admixtools

Once you have .geno, .snp, and .ind files, you can run D-statistic analysis with Admixtools

1) Generate a parameter file for Admixtools by copying the example file:

cp /home/groups/schumer/shared_bin/AdmixTools/examples/parqpDstat parameter_file_forDstat

2) Edit the parameter file lines:

DIR: ../data/ #path to the directory where your files are, can be ./

SSS: allmap #if your files have a common prefix you can put it here

indivname: DIR/SSS.ind #the path and name of your .ind file, this string corresponds to a file ../data/allmap.ind

snpname: DIR/SSS.snp #the path and name of your snp file

genotypename: DIR/SSS.geno #the path and name of your geno file

poplistname: list_qpDstat #the list of populations you are looking at, one on each line

3) Load the required modules and run Admixtools:

module load openblas

module load gsl


g.vcf files to pseudo-fasta files

Mapping RNAseq data and RNA quantification

Allele specific expression from RNAseq data

Genome assembly with supernova - 10x data

Motif and binding site predictions

1) Generate a position weight matrix for your zinc finger sequence at and save as a text file

2) Convert this output into a file format compatible with MEME

Rscript /home/groups/schumer/shared_bin/motiflogo.R cat_prdm9_PWM_zfp.txt

3) Build a background model for the genome of interest

/home/groups/schumer/shared_bin/meme-5.0.2/src/fasta-get-markov -m 2 -dna GCF_000181335.3_Felis_catus_9.0_genomic.fna

4) Generate a MEME formatted position weight matrix

/home/groups/schumer/shared_bin/meme-5.0.2/scripts/uniprobe2meme -numseqs 1 -bg -pseudo 1 cat_prdm9_PWM_zfp.txt.PWM.txt >

5) Run FIMO on the genome and position weight matrix:

/home/groups/schumer/shared_bin/meme-5.0.2/src/fimo -bgfile -oc ./FIMO-results_cat_predictedPRDM9 -thresh 0.00001 --max-stored-scores 300000 -verbosity 5 GCF_000181335.3_Felis_catus_9.0_genomic.fna