Schumer lab: Commonly used scripts

From OpenWetWare
Jump to navigationJump to search

Note: Please add any script that you write and have thoroughly tested to the lab git repository [1] and document it here

On sherlock our lab shared bin with these scripts can be found here: /home/groups/schumer/shared_bin/Lab_shared_scripts

Usage of commonly used scripts

Scripts that convert between file formats

Convert fasta to phylip format:

perl infile.fa > outfile.phy

Convert phylip to fasta:

perl infile.phy outfile.fa

Convert MSG genotypes files to plink file format:

perl genotypes_file_name

Writes two output file: genotypes_file_name.ped;

Convert a fasta file to a fastq file:

perl infile.fasta > outfile.fq

Convert the ancestry tsv output from AncestryHMM to a hard-calls genotypes files:

perl ancestry-probs-par1.tsv ancestry-probs-par2.tsv genotypes_output_file_name

Convert the 3-way ancestry tsv files from AncestryHMM to a hard-calls multi genotypes file:

perl ancestry-file-stem posterior_prob_threshold > genotypes_output_file_name


perl allchrs.tsv 0.9 > genotypes_CAPS_all3ways.txt

Convert the MSG genotypes file to a format that is compatible with rQTL (still requires phenotypes added, see workflow documentation):

perl genotypes_file_name

Writes an output file: genotypes_file_name.rqtl.csv

Convert a phylip file to an input file for treemix:

perl input.phy population_keys_list > outfile

The population keys list contains the number of individuals per population, e.g.: 2\n1\n3\n

Covert a samtools vcf file to input counts for ASE pipeline:

perl infile.vcf

Writes an output file: infile.vcf_ASE_counts

Convert a sequence in a clustal alignment to a fasta alignment:

perl clustal_alignment.txt seq_name_to_extract

Scripts that merge files, filter files, or match file contents to lists

  • Note: there are many useful scripts for combining and filtering files on the FAS scriptome [2]*

Combine read files for two different sequencing runs of the same individual. The two files to be combined *must* be in the same order:

perl list_full_path_to_file1_set list_full_path_to_file2_set

Writes a new file for each individual, using the name in list 1 appended with _combined

Script to identify markers in tsv files violating hardy-weinberg equilibrium and filter them:

perl ancestry-par1 ancestry-par2 bonferonni_pval_thresh

Writes a file named infile_deviating_markers with deviating markers from HWE, and filtered ancestry tsv files: ancestry-par1_HWE.tsv ancestry-par2_HWE.tsv

Script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl genotypes_file num_markers_differentiation

Writes an output file with genotypes_file.identicalfilter.txt. Note: columns that differ by NAs are treated as different from the previous column

Script for grepping a list from a separate file with an option to use grep -w or regular grep

perl select_list file_to_grep_from outfile_name grep_w_1_or_0

Similar script but automatically retains header line:

perl select_list file_to_grep_from outfile_name grep_w_1_or_0

Alternately, exclude a list from a file:

perl list_to_grep_v focal_file outfile_name grepw_0_1

Takes a list of files, reads all of their entries, and writes out a combined file with duplicate lines removed:

perl list_of_files outfile_name

Filter a genotypes file based on an allowed proportion of missing markers. Depends on the script which is on Sherlock in /home/groups/schumer/shared_bin/Lab_shared_scripts

perl infile_genotypes prop_missing_allowed

Writes out an outfile named infile_genotypes_filtered_at_thresh

Merge files based shared values in two columns. See FAS scriptome [3] for other merging options based on a single column:

perl file1 column1_file1 column2_file1 file2 column1_file2 column2_file2

Subset an MSG genotypes file based on a list of markers. Depends on the script (on Sherlock /home/groups/schumer/shared_bin)

perl list_of_markers_to_select genotypes_file outfile_name

Subset files based on shared values in the first two columns to exclude values in file1 that are also found in file2:

perl file_to_select_from file_to_exclude_from

Wites an output file called file_to_select_from_excluded_overlap_file_to_exclude_from

Scripts [or commands] that extract sequences from files

Extract fasta sequences using entries from a bed-formatted list:

perl list_of_seqs.bed fasta_to_extract outfile_tag_name

Writes output to a file named list_of_seqs.bed_tag.fa

Generate predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file for the X. maculatus genome :

perl gene_of_interest.gtf fasta outfile_tag

Writes output to a file named gene_of_interest.gtf_tag.fa

This script is the same as above but prints to standard out for the X. maculatus genome :

perl gene_of_interest.gtf fasta name_tag

This script extracts the predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file for the X. birchmanni and X. malinche 10x genomes:

perl list_of_exons.gtf fasta_file outfile_tag

This script is the same as above but prints to standard out for the X. birchmanni and X. malinche 10x genomes:

perl list_of_exons.gtf fasta_file outfile_tag

This script is similar to the above scripts, works on the 10x gtfs, but extracts exons without merging them. Each exon coordinate should be a line in the gtf file, the name tag will be appended to the exon name:

perl exons_single_gene.gtf fasta_file name_tag

fastahack is also an incredibly useful program from extracting sequences from fasta files. It can be found on Sherlock in /home/groups/schumer/shared_bin

/home/groups/schumer/shared_bin/fastahack genome.fa -r chr:start_coord..stop_coord > outfile.fa

To extract the predicted cDNA sequences for a list of transcripts you can use the following scripts.

This script extracts predicted cDNA sequences for a list of genes in the 10x genome format, given a gtf file, and genome sequence fasta:

perl my_gene_list gtf_file genome_sequence outfile_name

This script extracts predicted cDNA sequences including introns for a list of genes in the 10x genome format, given a gtf file, and genome sequence fasta:

perl my_gene_list gtf_file genome_sequence outfile_name

To extract predicted cDNA sequences and run codeml to calculate dN/dS, you can use the following script. This script takes two genomes and gtf files and runs pairwise dN/dS analysis with a list of genes.

perl list_of_genes_to_analyze gtf_species1.gtf gtf_species2.gtf species1.fa species2.fa codeml.ctl_file

Note: It can be used on two different species’ gtfs if you have carefully curated them to include single-copy genes of the same length between the species, but it tends to work better with a pseudo reference for one of the species. Here is an example of how you might use it to calculate dN/dS in X. birchmanni and X. malinche:

perl list_of_genes xiphophorus_birchmanni_10x_12Sep2018_yDAA6.gtf xiphophorus_birchmanni_10x_12Sep2018_yDAA6.gtf xiphophorus_birchmanni_10x_12Sep2018_yDAA6.fasta Xmalinche_backbone_10x_12Sep2018_yDAA6.fasta codeml.ctl

Scripts for managing fastq files

Scripts for parsing fastq files by i5 and i7 index (see details in: parsing_tn5_data.txt in Dropbox and on the wiki). This script must point to a directory with so if you work with it on a different server edit the path 4 "reads_R1_allanes_combined.fastq.gz reads_R2_allanes_combined.fastq.gz reads_I1_allanes_combined.fastq.gz reads_I2_allanes_combined.fastq.gz" 10 i5_library 4 1

Count the number of reads in a list of zipped fastq files:

perl list_of_fastqgz_files

Writes an outfile named list_of_fastqgz_files_counts

Count the number of reads in a list of pair-end zipped fastq files:

perl list_of_PE_fastqgz_files

Format of reads list should be:


Writes an outfile named list_of_PE_fastqgz_files_counts

Count the number of reads in the batch of negatives on a Tn5 plate. This script is set up to recognize any file that has NEG in the name, and should be run in the folder of interest.

perl library_name

Writes an outfile named library_name_negative_counts_results

Scripts for mapping or variant calling

This script takes a list of SE or PE reads and maps them to a particular reference genome and uses samtools to take them through the generation of sorted bam files:

perl read_list_file genome_to_map_to read_type_SE_or_PE

The read list should contain one column if the data is SE and two columns \t delimited if the data is paired end. The output file for each entry in the read list will be: samplename.read1.fq.sorted.unique.bam

This script can be run after running to run bcftools on the bams generated by the mapping/samtools script:

perl reads_list genome_assembly

The reads_list file is the same file used as input in and the command should be executed in the same directory. For each bam file, the script will generate a pileup and vcf file.

This script can be run after executing a case-control GWAS with samtools legacy. This script takes the results of a case-control GWAS using the samtools legacy vcf and reformats them for plotting:

perl infile.legacy.vcf > outfile

Scripts that manipulate or summarize fasta files

Take a fasta file with IUPAC ambiguity codes and perform a base by base coin flip for use in programs that do not accomodate polymorphism (e.g. RAxML):

perl infile.fa > outfile_coinflip.fa

Calculate divergence and polymorphism (dxy and pi) in two co-linear fasta sequences:

perl fasta_file1 fasta_file2 list_of_scaffolds_to_analyze window_size_bp

Extract a scaffold by name from a fasta file:

perl file.fa contig_name > outfile

Summarize the N50 and length distribution of an assembly and optionally print scaffolds greater than a particular size:

perl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Summarize the N50 and length distribution of an assembly and optionally print scaffolds smaller than a particular size:

perl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Count polymorphic basepairs based on IUPAC ambiguity codes in the fasta file. Note: not all fasta files include this coding of polymorphic bases.

perl fasta_file > outfile

Count polymorphic basepairs based on ambiguity codes in a fasta file using a given window size:

perl fasta_file window_size_inbp > outfile

Summarize base composition in a list of windows:

perl list_of_windows_to_analyze.bed genome.fa path_to_fastahack > outfile

Remove a list of sequences from a fasta file:

perl list_of_sequences_to_remove fasta_file

Writes an outfile with these sequences removed named fasta_file.filtered

Translate a fasta file in a give frame:

python inputfile.fa outputfile_prefix DNA 1 1 NOBIN 20

Generate reverse complement of a fasta sequence:

perl fasta_name

Writes out a file appended with -revcom.fa

Extract variable sites from an aligned multi-fasta file:

module load biology py-biopython

python -v myfasta.fa -o myvariablefa.fa

For first time use of this script you may need to install progressbar2:

pip install progressbar2

Split fasta files into shorter segments as specified by a bed file:

perl fasta_file_list windows.bed

Split results will be written to individual files called focal_file_start_stop.fa

Extract sequences from a bed file and format for RAxML analysis:

perl list_of_fasta_files random_select_file.bed outfile_tag_name

Format of fasta file:

file1.fa\n file2.fa\n

Writes out a file called: outfile_tag_name_concatenated.fasta

Scripts for QTL and admixture mapping

See also: QTL mapping and Admixture mapping

Match a phenotypes file with a genotypes and hybrid index file:

perl phenotypes_file genotypes_file hybrid_index

Write output files appended with modified names indicating matching

Scripts for admixture mapping with binomial traits, see workflows here also: Admixture mapping

Rscript perform_glm_admixture_mapping_v2_binomialtrait.R genotypes_file hybrid_index_file phenotypes_file focal_column_number name_tag

Rscript perform_glm_admixture_mapping_v2_gaussian.R genotypes_file hybrid_index_file phenotypes_file focal_column_number name_tag

Convert the output of the glm scripts to easy input for plotting in R:

perl mapping_results.txt

Scripts to calculate hybrid index and make ancestry plots post ancestryinfer

Generate hybrid index file for a pair of ancestry-tsv files:

perl ancestry-probs-par1.tsv ancestry-probs-par2.tsv

Writes hybrid index and heterozygosity per individual

Generate hybrid index file for 3 way ancestry HMM files (six files total):

perl file_name_stem posterior_probability_threshold


perl allchrs.tsv 0.9

Writes the proportion of the genome from each parent and ancestry heterozygosity of each type per individual

Generate plots of ancestry per individual:

First convert ancestry-tsv files to hard called genotypes:

perl ancestry-probs-par.tsv ancestry-probs-par2.tsv genotypes_file_name.txt

Then get list of individuals in genotypes file:

cut -f 1 genotypes_file_name.txt > id_list

Then create plots.

For a single chromosome:

perl genotypes_to_ancestry_plot.R genotypes_file_name.txt scaffold_name id_list

Or for several chromosomes that will all be in the same pdf:

perl genotypes_to_ancestry_plot_wChrList.R genotypes_file_name.txt scaffold_list.txt id_list

Scripts to average by site or in windows

Write average ancestry per site for every site in two ancestry tsv files (par1 and par2 files):

perl infilepar1 infilepar2 num_ind_thresh posterior_prob_thresh > outfile

Write average ancestry per gene using the ancestry by site file generated with

perl ancestry_by_site_file gene_list gtf_file

Writes output to average_ancestrybysitesfilename_ancestry_intervals.bed

Write average ancestry per site for every site in an MSG genotypes file:

perl genotypes_file

Writes average ancestry per site to an output file named after the input file and appended with _ancestry_by_site

Average ancestry by site files in windows specified in a bed file. See workflow here: Ancestry tsvs to average ancestry

Rscript average_ancestry_bybedintervals.R ancestry_by_site_file bins.bed chr_name

Newer version runs the script across all chromosome:

Rscript average_ancestry_bybedintervals_v2.R ancestry_by_site_results bins.bed

This script takes the output of and averages them in bed intervals and outputs average ancestry per window. The outfiles will be generated by chromosome and will be called: average_[infile]_ancestry_[bedfile]_[chromosome]

Average recombination rates output by LDhelmet in windows specified in a bed file, appropriately weighing per base pair recombination rates:

Rscript average_LDhelmet_rates_res_file_window.R modified_LD_helmet_output chromosome_name window_size_bps chrom_length_file

Note the latest recombination maps can be found here: /home/groups/schumer/shared_bin/Xbirchmanni_LD_recombination_map_10xgenome_March2019

Scripts to summarize features in windows (i.e. coding, conserved, repetitive basepairs)

Determine the number of coding, conserved, or repetitive basepairs in a set of bed-formatted windows, using features in another file (can be bed or gtf formatted):

Rscript calculate_features_per_window.R bins.bed features.gtf_or_features.bed path_to_bedtools_bin/[0 for global install] outfile_name

for example:

Rscript calculate_features_per_window.R combined_0.05cM_results_TLMC_March2019.bed /home/groups/schumer/shared_bin/shared_resources/Xbirchmanni-10x_12Sep2018_yDAA6-annotation-output/xiphophorus_birchmanni_10x_12Sep2018_yDAA6.gtf 0 combined_TLMC_0.05cM_windows_addcoding

This will calculate the number of coding basepairs per window based on the windows provided in combined_0.05cM_results_TLMC_March2019.bed

For just conserved basepairs for all chromosomes across the bed file.

Rscript calculate_conservedBPs_per_window_v2.R windowBins conserved.bed

for example:

Rscript calculate_conservedBPs_per_window_v2.R xbir_allChrs_100kb_windows xma_mostconserved_bases_v2_xbir.bed

This counts the number of conserved base pairs in a windowed bed file and creates a version with _wConservedBPs.bed as the suffix.

To count synonymous and non-synonymous basepairs in windowed bins.

Rscript calculate_dNdSsites_per_window.R windowBins ntChanges.bed

This creates a version with _wSynNonsyn.bed as the suffix. Where the last column is the non-synonymous count and the 2nd to last column is the synonymous count.

Scripts to identify recombination transitions

Use a genotypes file produced from ancestry-par1 and ancestry-par2 tsv files to identify ancestry transitions in each individual and the interval over which they occur:

Rscript identify_intervals_10x_genomes.R genotypes_file_name

Writes out outfiles for each chromosome named: genotypes_file_name_chr_intervals. Each line is a ancestry transition event with the start and stop of the interval and the individual it was detected in:

1407648 1801361 2

3158148 3306561 2

12714653 12933987 2

18340742 18501540 2

Note that this is coded for the X. birchmanni 10x genome and would need to be modified to provide chromosome names for other genome versions

Scripts relating to ancestry tracts

This script takes a genotypes file and outputs inferred ancestry tracts for a given chromosome. The individual id list allows you to match the tract to the individual it came from:

Rscript genotypes_to_ancestry_tract_lengths.R genotypes_format_file.txt chromosome_name_to_run individual_id_list_with_header outfile_name

Note: the individual id list can be generated using the following steps: head -n 1 genotypes_format_file.txt > id_list_tmp perl id_list_tmp

This script prints inferred recombination intervals based on a genotypes file. It will output one file per chromosome in the 10x genome, appended with the word “_intervals”:

Rscript identify_intervals_10x_genomes_v2.R genotypes_format_file.txt

This script takes the file output by genotypes_to_ancestry_tract_lengths.R and converts it to estimated length in cMs, given a modified LDhelmet bed outfile. These modified bed LDhelmet outfiles can be found here on Sherlock: /home/groups/schumer/shared_bin/shared_resources/Xbirchmanni_LD_recombination_map_10xgenome_March2019/block_penalty_50_version

Rscript convert_ancestry_tract_lengths_to_cM_lengths.R tract_length_file_chr rec_file_chr

The output file is the input tract length file name with “_cM_lengths” appended. Examples of file formats can be found at the top of the script

Scripts to summarize genetic distance or admixture LD decay

Generate bed intervals for windows of a certain genetic size in cMs for the X. birchmanni 10x recombination map:

Rscript convert_LD_rate_to_cM_rate.R chromname-bir10x window_size_cM

Generates an output file called: cM_windows_window_size_xbirchmanni10x_chr

Generate files based on plink output for approximate dating of time of admixture using the decay of admixture LD (see Commonly used workflows for details on generating this files and workflows):

Add distance between the two markers in cMs to a plink-generated LD file:

Rscript convert_plink_LD_genetic_distance_xbir10x.R plink_ld_file focal_chromosome_name

Outputs a file called: plink_ld_file_chrom_cMdistances

these results can then be summarized by running additional scripts. To average LD metrics in windows of a particular cM distances:

Rscript convert_cM_version_plink_admixture_decay.R plink_ld_file_cMdistances window_size_cM

Outputs a file called: plink_ld_file_chrom_cMdistances_admixture_ld_decay_cMdist

these results can then be used to estimate age of admixture and plot decay in D:

Rscript estimateage_plink_admixture_decay.R plink_output_ld_decay_D.ld_admixture_ld_decay

Scripts that generate or use insnp files

Generate a masking insnp file based on a vcf file. For example, use GATK's variant selecting tools to identify variants that have high or low depth for masking GATK Variant Selector

perl gatk.g.vcf

writes an outfile called infile.mask.insnp

Generates an insnp file to update and mask a fasta file. Variants that pass the quality thresholds are marked for updating and variants that fail the quality thresholds as well as invariants of low coverage/quality are marked for masking:

python file.g.vcf file.g.vcf.insnp gq_cutoff dpcutoff mq_cutoff qd_cutoff fs_cutoff sor_cutoff mqrs_cutoff readpos_cutoff indel_window

All but the indel window parameter (which masks basepairs around indels) are GATK parameters so see more details at their website GATK hard-call parameters

Currently suggested parameters for swordtails:

20 10 40 10 10 4 -12.5 -8.0 5

Scripts for admix'em simulations

See this section of the wiki for all scripts related to admix'em simulations: Admixture simulations with admix'em

Scripts for SLiM simulations or SLiM output

This script takes a SLiM vcf and generages input files for running admixtools:

perl SLiM_format.vcf

This script takes a SLiM vcf and generages input files for running admixtools. The file names will be the input vcf name appended with .geno, .snp, and .ind


Script to link locally to a list of files elsewhere on the server:

perl infile_list

The infile list should contain the full path to the files and have one file per line

Script to remove a list of files, use with caution!

perl infile_list

The infile list should contain the full path to the files and have one file per line

Cancel a list of slurm jobs. The list can either be of slurm job ids or slurm file:

perl slurm_list

Submit a slurm script recursively for every chromosome in the birchmanni 10x genome:

perl name_of_starting_chrom

Retain only every n windows from a bed file:

perl file.bed retain_every_n_windows[integer]

Writes an output file called file.bed_thinned_by_n.bed

Script to rename X. birchmanni 10x scaffold names as their corresponding chromosome names:

perl file_to_rename

Writes an output file named file_to_rename.chromnames

Estimate genetic distance between markers in the X. birchmanni 10x genome and optionally thin by physical and genetic distance:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R markers_to_thin cM_threshold physical_threshold

For example:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R ACUA_markers_to_thin 0.25 2000

the input markers for thinning should be in this format:




This script can be run on two completely co-linear genomes to document all non-polymorphic and non-missing sites that differ between them:

perl fasta1 fasta2 > outfile

Calculate base composition in a fasta file. This script takes a fasta as input and prints out base composition for each contig/sequence:

perl fasta > outfile

This script takes a PWM produced by and converts to the PWM format used by MEME and FIMO:

Rscript motiflogo.R input.PWM.txt species_name

Need to be annotated