Schumer lab: Commonly used scripts

From OpenWetWare
Jump to navigationJump to search

Note: Please add any script that you write and have thoroughly tested to the lab git repository [1] and document it here

On sherlock our lab shared bin with these scripts can be found here: /home/groups/schumer/shared_bin/Lab_shared_scripts

Usage of commonly used scripts

Scripts that convert between file formats

Convert fasta to phylip format:

perl infile.fa > outfile.phy

Convert phylip to fasta:

perl infile.phy outfile.fa

Convert MSG genotypes files to plink file format:

perl genotypes_file_name

Writes two output file: genotypes_file_name.ped;

Convert a fasta file to a fastq file:

perl infile.fasta > outfile.fq

Convert the ancestry tsv output from AncestryHMM to a hard-calls genotypes files:

perl ancestry-probs-par1.tsv ancestry-probs-par2.tsv genotypes_output_file_name

Convert the MSG genotypes file to a format that is compatible with rQTL (still requires phenotypes added, see workflow documentation):

perl genotypes_file_name

Writes an output file: genotypes_file_name.rqtl.csv

Convert a phylip file to an input file for treemix:

perl input.phy population_keys_list > outfile

The population keys list contains the number of individuals per population, e.g.: 2\n1\n3\n

Covert a samtools vcf file to input counts for ASE pipeline:

perl infile.vcf

Writes an output file: infile.vcf_ASE_counts

Convert a sequence in a clustal alignment to a fasta alignment:

perl clustal_alignment.txt seq_name_to_extract

Scripts that merge files, filter files, or match file contents to lists

  • Note: there are many useful scripts for combining and filtering files on the FAS scriptome [2]*

Combine read files for two different sequencing runs of the same individual. The two files to be combined *must* be in the same order:

perl list_full_path_to_file1_set list_full_path_to_file2_set

Writes a new file for each individual, using the name in list 1 appended with _combined

Script to identify markers in tsv files violating hardy-weinberg equilibrium and filter them:

perl ancestry-par1 ancestry-par2 bonferonni_pval_thresh

Writes a file named infile_deviating_markers with deviating markers from HWE, and filtered ancestry tsv files: ancestry-par1_HWE.tsv ancestry-par2_HWE.tsv

Script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl genotypes_file num_markers_differentiation

Writes an output file with genotypes_file.identicalfilter.txt. Note: columns that differ by NAs are treated as different from the previous column

Script for grepping a list from a separate file with an option to use grep -w or regular grep

perl select_list file_to_grep_from outfile_name grep_w_1_or_0

Similar script but automatically retains header line:

perl select_list file_to_grep_from outfile_name grep_w_1_or_0

Alternately, exclude a list from a file:

perl list_to_grep_v focal_file outfile_name grepw_0_1

Takes a list of files, reads all of their entries, and writes out a combined file with duplicate lines removed:

perl list_of_files outfile_name

Filter a genotypes file based on an allowed proportion of missing markers. Depends on the script which is on Sherlock in /home/groups/schumer/shared_bin/Lab_shared_scripts

perl infile_genotypes prop_missing_allowed

Writes out an outfile named infile_genotypes_filtered_at_thresh

Merge files based shared values in two columns. See FAS scriptome [3] for other merging options based on a single column:

perl file1 column1_file1 column2_file1 file2 column1_file2 column2_file2

Subset an MSG genotypes file based on a list of markers. Depends on the script (on Sherlock /home/groups/schumer/shared_bin)

perl list_of_markers_to_select genotypes_file outfile_name

Subset files based on shared values in the first two columns to exclude values in file1 that are also found in file2:

perl file_to_select_from file_to_exclude_from

Wites an output file called file_to_select_from_excluded_overlap_file_to_exclude_from

Scripts [or commands] that extract sequences from files

Extract fasta sequences using entries from a bed-formatted list:

perl list_of_seqs.bed fasta_to_extract outfile_tag_name

Writes output to a file named list_of_seqs.bed_tag.fa

Generate predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file for the X. maculatus genome :

perl gene_of_interest.gtf fasta outfile_tag

Writes output to a file named gene_of_interest.gtf_tag.fa

This script is the same as above but prints to standard out for the X. maculatus genome :

perl gene_of_interest.gtf fasta name_tag

This script extracts the predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file for the X. birchmanni and X. malinche 10x genomes:

perl list_of_exons.gtf fasta_file outfile_tag

This script is the same as above but prints to standard out for the X. birchmanni and X. malinche 10x genomes:

perl list_of_exons.gtf fasta_file outfile_tag

fastahack is also an incredibly useful program from extracting sequences from fasta files. It can be found on Sherlock in /home/groups/schumer/shared_bin

/home/groups/schumer/shared_bin/fastahack genome.fa -r chr:start_coord..stop_coord > outfile.fa

Scripts for managing fastq files

Scripts for parsing fastq files by i5 and i7 index (see details in: parsing_tn5_data.txt in Dropbox and on the wiki). This script must point to a directory with so if you work with it on a different server edit the path 4 "reads_R1_allanes_combined.fastq.gz reads_R2_allanes_combined.fastq.gz reads_I1_allanes_combined.fastq.gz reads_I2_allanes_combined.fastq.gz" 10 i5_library 4 1

Count the number of reads in a list of zipped fastq files:

perl list_of_fastqgz_files

Writes an outfile named list_of_fastqgz_files_counts

Scripts that manipulate or summarize fasta files

Take a fasta file with IUPAC ambiguity codes and perform a base by base coin flip for use in programs that do not accomodate polymorphism (e.g. RAxML):

perl infile.fa > outfile_coinflip.fa

Calculate divergence and polymorphism (dxy and pi) in two co-linear fasta sequences:

perl fasta_file1 fasta_file2 list_of_scaffolds_to_analyze window_size_bp

Extract a scaffold by name from a fasta file:

perl file.fa contig_name > outfile

Summarize the N50 and length distribution of an assembly and optionally print scaffolds greater than a particular size:

perl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Summarize the N50 and length distribution of an assembly and optionally print scaffolds smaller than a particular size:

perl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Count polymorphic basepairs based on IUPAC ambiguity codes in the fasta file. Note: not all fasta files include this coding of polymorphic bases.

perl fasta_file > outfile

Count polymorphic basepairs based on ambiguity codes in a fasta file using a given window size:

perl fasta_file window_size_inbp > outfile

Summarize base composition in a list of windows:

perl list_of_windows_to_analyze.bed genome.fa path_to_fastahack > outfile

Remove a list of sequences from a fasta file:

perl list_of_sequences_to_remove fasta_file

Writes an outfile with these sequences removed named fasta_file.filtered

Translate a fasta file in a give frame:

python inputfile.fa outputfile_prefix DNA 1 1 NOBIN 20

Generate reverse complement of a fasta sequence:

perl fasta_name

Writes out a file appended with -revcom.fa

Extract variable sites from an aligned multi-fasta file:

module load biology py-biopython

python -v myfasta.fa -o myvariablefa.fa

For first time use of this script you may need to install progressbar2:

pip install progressbar2

Split fasta files into shorter segments as specified by a bed file:

perl fasta_file_list windows.bed

Split results will be written to individual files called focal_file_start_stop.fa

Scripts for QTL and admixture mapping

See also: QTL mapping and Admixture mapping

Match a phenotypes file with a genotypes and hybrid index file:

perl phenotypes_file genotypes_file hybrid_index

Write output files appended with modified names indicating matching

Scripts for admixture mapping with binomial traits, see workflows here also: Admixture mapping

Rscript perform_glm_admixture_mapping_v2_binomialtrait.R genotypes_file hybrid_index_file phenotypes_file focal_column_number name_tag

Rscript perform_glm_admixture_mapping_v2_gaussian.R genotypes_file hybrid_index_file phenotypes_file focal_column_number name_tag

Convert the output of the glm scripts to easy input for plotting in R:

perl mapping_results.txt

Scripts to average by site or in windows

Write average ancestry per site for every site in two ancestry tsv files (par1 and par2 files):

perl infilepar1 infilepar2 num_ind_thresh posterior_prob_thresh > outfile

Write average ancestry per site for every site in an MSG genotypes file:

perl genotypes_file

Writes average ancestry per site to an output file named after the input file and appended with _ancestry_by_site

Generate hybrid index file for a pair of ancestry-tsv files:

perl ancestry-probs-par1.tsv ancestry-probs-par2.tsv

Writes hybrid index and heterozygosity per individual

Average ancestry by site files in windows specified in a bed file. See workflow here: Ancestry tsvs to average ancestry

Rscript average_ancestry_bybedintervals.R ancestry_by_site_file bins.bed chr_name

Average recombination rates output by LDhelmet in windows specified in a bed file, appropriately weighing per base pair recombination rates:

Rscript average_LDhelmet_rates_res_file_window.R modified_LD_helmet_output chromosome_name window_size_bps chrom_length_file

Note the latest recombination maps can be found here: /home/groups/schumer/shared_bin/Xbirchmanni_LD_recombination_map_10xgenome_March2019

Scripts to summarize features in windows (i.e. coding, conserved, repetitive basepairs)

Determine the number of coding, conserved, or repetitive basepairs in a set of bed-formatted windows, using features in another file (can be bed or gtf formatted):

Rscript calculate_features_per_window.R bins.bed features.gtf_or_features.bed path_to_bedtools_bin/[0 for global install] outfile_name

for example:

Rscript calculate_features_per_window.R combined_0.05cM_results_TLMC_March2019.bed /home/groups/schumer/shared_bin/shared_resources/Xbirchmanni-10x_12Sep2018_yDAA6-annotation-output/xiphophorus_birchmanni_10x_12Sep2018_yDAA6.gtf 0 combined_TLMC_0.05cM_windows_addcoding

This will calculate the number of coding basepairs per window based on the windows provided in combined_0.05cM_results_TLMC_March2019.bed

Scripts to identify recombination transitions

Use a genotypes file produced from ancestry-par1 and ancestry-par2 tsv files to identify ancestry transitions in each individual and the interval over which they occur:

Rscript identify_intervals_10x_genomes.R genotypes_file_name

Writes out outfiles for each chromosome named: genotypes_file_name_chr_intervals. Each line is a ancestry transition event with the start and stop of the interval and the individual it was detected in:

1407648 1801361 2

3158148 3306561 2

12714653 12933987 2

18340742 18501540 2

Note that this is coded for the X. birchmanni 10x genome and would need to be modified to provide chromosome names for other genome versions

Scripts to summarize genetic distance or admixture LD decay

Generate bed intervals for windows of a certain genetic size in cMs for the X. birchmanni 10x recombination map:

Rscript convert_LD_rate_to_cM_rate.R chromname-bir10x window_size_cM

Generates an output file called: cM_windows_window_size_xbirchmanni10x_chr

Generate files based on plink output for approximate dating of time of admixture using the decay of admixture LD (see Commonly used workflows for details on generating this files and workflows):

Add distance between the two markers in cMs to a plink-generated LD file:

Rscript convert_plink_LD_genetic_distance_xbir10x.R plink_ld_file focal_chromosome_name

Outputs a file called: plink_ld_file_chrom_cMdistances

these results can then be summarized by running additional scripts. To average LD metrics in windows of a particular cM distances:

Rscript convert_cM_version_plink_admixture_decay.R plink_ld_file_cMdistances window_size_cM

Outputs a file called: plink_ld_file_chrom_cMdistances_admixture_ld_decay_cMdist

these results can then be used to estimate age of admixture and plot decay in D:

Rscript estimateage_plink_admixture_decay.R plink_output_ld_decay_D.ld_admixture_ld_decay

Scripts that generate or use insnp files

Generate a masking insnp file based on a vcf file. For example, use GATK's variant selecting tools to identify variants that have high or low depth for masking GATK Variant Selector

perl gatk.g.vcf

writes an outfile called infile.mask.insnp

Generates an insnp file to update and mask a fasta file. Variants that pass the quality thresholds are marked for updating and variants that fail the quality thresholds as well as invariants of low coverage/quality are marked for masking:

python file.g.vcf file.g.vcf.insnp gq_cutoff dpcutoff mq_cutoff qd_cutoff fs_cutoff sor_cutoff mqrs_cutoff readpos_cutoff indel_window

All but the indel window parameter (which masks basepairs around indels) are GATK parameters so see more details at their website GATK hard-call parameters

Currently suggested parameters for swordtails:

20 10 40 10 10 4 -12.5 -8.0 5

Scripts for admix'em simulations

See this section of the wiki for all scripts related to admix'em simulations: Admixture simulations with admix'em


Script to link locally to a list of files elsewhere on the server:

perl infile_list

The infile list should contain the full path to the files and have one file per line

Script to remove a list of files, use with caution!

perl infile_list

The infile list should contain the full path to the files and have one file per line

Cancel a list of slurm jobs. The list can either be of slurm job ids or slurm file:

perl slurm_list

Submit a slurm script recursively for every chromosome in the birchmanni 10x genome:

perl name_of_starting_chrom

Retain only every n windows from a bed file:

perl file.bed retain_every_n_windows[integer]

Writes an output file called file.bed_thinned_by_n.bed

Script to rename X. birchmanni 10x scaffold names as their corresponding chromosome names:

perl file_to_rename

Writes an output file named file_to_rename.chromnames

Estimate genetic distance between markers in the X. birchmanni 10x genome and optionally thin by physical and genetic distance:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R markers_to_thin cM_threshold physical_threshold

For example:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R ACUA_markers_to_thin 0.25 2000

the input markers for thinning should be in this format:




Need to be annotated