Schumer lab: Commonly used scripts

From OpenWetWare
Jump to navigationJump to search

Note: Please add any script that you write and have thoroughly tested to the lab git repository [1] and document it here

On sherlock our lab shared bin with these scripts can be found here: /home/groups/schumer/shared_bin/Lab_shared_scripts


Usage of commonly used scripts

Scripts that convert between file formats

Convert fasta to phylip format:

perl Fasta2Phylip.pl infile.fa > outfile.phy


Convert phylip to fasta:

perl Phylip2Fasta.pl infile.phy outfile.fa


Convert MSG genotypes files to plink file format:

perl convert_msg_genotypes_to_plink.pl genotypes_file_name

Writes two output file: genotypes_file_name.ped; genotypes_file_name.map


Convert a fasta file to a fastq file:

perl fasta_to_fastq.pl infile.fasta > outfile.fq


Convert the ancestry tsv output from AncestryHMM to a hard-calls genotypes files:

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1.tsv ancestry-probs-par2.tsv genotypes_output_file_name


Convert the MSG genotypes file to a format that is compatible with rQTL (still requires phenotypes added, see workflow documentation):

perl genotypes_to_rqtl_Feb2018_v3.pl genotypes_file_name

Writes an output file: genotypes_file_name.rqtl.csv


Convert a phylip file to an input file for treemix:

perl phy_to_treemix.pl input.phy population_keys_list > outfile

The population keys list contains the number of individuals per population, e.g.: 2\n1\n3\n


Covert a samtools vcf file to input counts for ASE pipeline:

perl samtools_vcf_to_ASE_counts.pl infile.vcf

Writes an output file: infile.vcf_ASE_counts


Convert a sequence in a clustal alignment to a fasta alignment:

perl convert_clustal_alignment_to_fasta_alignment.pl clustal_alignment.txt seq_name_to_extract

Scripts that merge files, filter files, or match file contents to lists

  • Note: there are many useful scripts for combining and filtering files on the FAS scriptome [2]*

Combine read files for two different sequencing runs of the same individual. The two files to be combined *must* be in the same order:

perl combine_reads_two_lists.pl list_full_path_to_file1_set list_full_path_to_file2_set

Writes a new file for each individual, using the name in list 1 appended with _combined


Script to identify markers in tsv files violating hardy-weinberg equilibrium and filter them:

perl determine_HWE_filter_markers.pl ancestry-par1 ancestry-par2 bonferonni_pval_thresh path_to:transpose_nameout.pl

Writes a file named infile_deviating_markers with deviating markers from HWE, and filtered ancestry tsv files: ancestry-par1_HWE.tsv ancestry-par2_HWE.tsv


Script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file with genotypes_file.identicalfilter.txt. Note: columns that differ by NAs are treated as different from the previous column


Script for grepping a list from a separate file with an option to use grep -w or regular grep

perl grep_list.pl select_list file_to_grep_from outfile_name grep_w_1_or_0


Similar script but automatically retains header line:

perl grep_list_keep_header.pl select_list file_to_grep_from outfile_name grep_w_1_or_0


Alternately, exclude a list from a file:

perl grep_v_list.pl list_to_grep_v focal_file outfile_name grepw_0_1


Takes a list of files, reads all of their entries, and writes out a combined file with duplicate lines removed:

perl combine_lines_remove_duplicates_list.pl list_of_files outfile_name


Filter a genotypes file based on an allowed proportion of missing markers. Depends on the script transpose_nameout.pl which is on Sherlock in /home/groups/schumer/shared_bin/Lab_shared_scripts

perl filter_missing_markers_genotypes_file_thresh_v2.pl infile_genotypes path_to:transpose_nameout.pl prop_missing_allowed

Writes out an outfile named infile_genotypes_filtered_at_thresh


Merge files based shared values in two columns. See FAS scriptome [3] for other merging options based on a single column:

perl merge_files_using_two_columns_sharing_values.pl file1 column1_file1 column2_file1 file2 column1_file2 column2_file2


Subset an MSG genotypes file based on a list of markers. Depends on the script transpose_nameout.pl (on Sherlock /home/groups/schumer/shared_bin)

perl select_passing_markers_multi_geno_files_v2.pl list_of_markers_to_select genotypes_file path_to:transpose_nameout.pl outfile_name


Subset files based on shared values in the first two columns to exclude values in file1 that are also found in file2:

perl exclude_shared_values_lists_based_on_first_two_columns.pl file_to_select_from file_to_exclude_from

Wites an output file called file_to_select_from_excluded_overlap_file_to_exclude_from

Scripts [or commands] that extract sequences from files

Extract fasta sequences using entries from a bed-formatted list:

perl extract_bed_seqs.pl list_of_seqs.bed fasta_to_extract outfile_tag_name

Writes output to a file named list_of_seqs.bed_tag.fa


Generate predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file for the X. maculatus genome :

perl extract_gtf_seqs_mergetranscript_xmac_genome.pl gene_of_interest.gtf fasta outfile_tag

Writes output to a file named gene_of_interest.gtf_tag.fa


This script is the same as above but prints to standard out for the X. maculatus genome :

perl extract_gtf_seqs_mergetranscript_printstdout_xmac_genome.pl gene_of_interest.gtf fasta name_tag


This script extracts the predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file for the X. birchmanni and X. malinche 10x genomes:

perl extract_gtf_seqs_mergetranscript_10x_assembly.pl list_of_exons.gtf fasta_file outfile_tag


This script is the same as above but prints to standard out for the X. birchmanni and X. malinche 10x genomes:

perl extract_gtf_seqs_mergetranscript_printstdout_10x_assembly.pl list_of_exons.gtf fasta_file outfile_tag


fastahack is also an incredibly useful program from extracting sequences from fasta files. It can be found on Sherlock in /home/groups/schumer/shared_bin

/home/groups/schumer/shared_bin/fastahack genome.fa -r chr:start_coord..stop_coord > outfile.fa

Scripts for managing fastq files

Scripts for parsing fastq files by i5 and i7 index (see details in: parsing_tn5_data.txt in Dropbox and on the wiki). This script must point to a directory with barcode_splitter.py so if you work with it on a different server edit the path

divideConquerParser.sh 4 "reads_R1_allanes_combined.fastq.gz reads_R2_allanes_combined.fastq.gz reads_I1_allanes_combined.fastq.gz reads_I2_allanes_combined.fastq.gz" 10 i5_library 4 1


Count the number of reads in a list of zipped fastq files:

perl count_reads_fastq_list.pl list_of_fastqgz_files

Writes an outfile named list_of_fastqgz_files_counts

Scripts that manipulate or summarize fasta files

Take a fasta file with IUPAC ambiguity codes and perform a base by base coin flip for use in programs that do not accomodate polymorphism (e.g. RAxML):

perl coinflip_fasta_poly.pl infile.fa > outfile_coinflip.fa


Calculate divergence and polymorphism (dxy and pi) in two co-linear fasta sequences:

perl fa2polydiv_window_list_v3.pl fasta_file1 fasta_file2 list_of_scaffolds_to_analyze window_size_bp


Extract a scaffold by name from a fasta file:

perl getScaffold_samtools.pl file.fa contig_name > outfile


Summarize the N50 and length distribution of an assembly and optionally print scaffolds greater than a particular size:

perl getScaffLengthDist.pl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1


Summarize the N50 and length distribution of an assembly and optionally print scaffolds smaller than a particular size:

perl getScaffLengthDist_printsmallerthan.pl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1


Count polymorphic basepairs based on IUPAC ambiguity codes in the fasta file. Note: not all fasta files include this coding of polymorphic bases.

perl count_IUPAC_poly_single_genome.pl fasta_file > outfile


Count polymorphic basepairs based on ambiguity codes in a fasta file using a given window size:

perl count_IUPAC_poly_single_genome_window.pl fasta_file window_size_inbp > outfile


Summarize base composition in a list of windows:

perl count_basecomp_window.pl list_of_windows_to_analyze.bed genome.fa path_to_fastahack > outfile


Remove a list of sequences from a fasta file:

perl remove_record_list_from_fasta.pl list_of_sequences_to_remove fasta_file

Writes an outfile with these sequences removed named fasta_file.filtered


Translate a fasta file in a give frame:

python seqs_processor_and_translator_bin_V118_AGCT.py inputfile.fa outputfile_prefix DNA 1 1 NOBIN 20

Generate reverse complement of a fasta sequence:

perl rev_com_v2.pl fasta_name

Writes out a file appended with -revcom.fa


Extract variable sites from an aligned multi-fasta file:

module load biology py-biopython

python Variable_sites_extractor.py -v myfasta.fa -o myvariablefa.fa

For first time use of this script you may need to install progressbar2:

pip install progressbar2

Split fasta files into shorter segments as specified by a bed file:

perl split_fastas_list_by_bed_windows.pl fasta_file_list windows.bed

Split results will be written to individual files called focal_file_start_stop.fa

Scripts for QTL and admixture mapping

See also: QTL mapping and Admixture mapping

Match a phenotypes file with a genotypes and hybrid index file:

perl match_phenotypes_names_with_genotypes_and_index_file.pl phenotypes_file genotypes_file hybrid_index

Write output files appended with modified names indicating matching


Scripts for admixture mapping with binomial traits, see workflows here also: Admixture mapping

Rscript perform_glm_admixture_mapping_v2_binomialtrait.R genotypes_file hybrid_index_file phenotypes_file focal_column_number name_tag

Rscript perform_glm_admixture_mapping_v2_gaussian.R genotypes_file hybrid_index_file phenotypes_file focal_column_number name_tag


Convert the output of the glm scripts to easy input for plotting in R:

perl convert_birchmanni10x_mapping_output_manhattan_plot_input.pl mapping_results.txt

Scripts to average by site or in windows

Write average ancestry per site for every site in two ancestry tsv files (par1 and par2 files):

perl calculate_avg_ancestry_by_site_tsv_files_v5.pl infilepar1 infilepar2 num_ind_thresh posterior_prob_thresh path_to:transpose_nameout.pl > outfile


Write average ancestry per site for every site in an MSG genotypes file:

perl parse_genotypes_ancestry_bysite.pl genotypes_file path_to:transpose_nameout.pl

Writes average ancestry per site to an output file named after the input file and appended with _ancestry_by_site


Generate hybrid index file for a pair of ancestry-tsv files:

perl parsetsv_ancestry_v2.pl ancestry-probs-par1.tsv ancestry-probs-par2.tsv

Writes hybrid index and heterozygosity per individual


Average ancestry by site files in windows specified in a bed file. See workflow here: Ancestry tsvs to average ancestry

Rscript average_ancestry_bybedintervals.R ancestry_by_site_file bins.bed chr_name


Average recombination rates output by LDhelmet in windows specified in a bed file, appropriately weighing per base pair recombination rates:

Rscript average_LDhelmet_rates_res_file_window.R modified_LD_helmet_output chromosome_name window_size_bps chrom_length_file

Note the latest recombination maps can be found here: /home/groups/schumer/shared_bin/Xbirchmanni_LD_recombination_map_10xgenome_March2019

Scripts to summarize features in windows (i.e. coding, conserved, repetitive basepairs)

Determine the number of coding, conserved, or repetitive basepairs in a set of bed-formatted windows, using features in another file (can be bed or gtf formatted):

Rscript calculate_features_per_window.R bins.bed features.gtf_or_features.bed path_to_bedtools_bin/[0 for global install] outfile_name

for example:

Rscript calculate_features_per_window.R combined_0.05cM_results_TLMC_March2019.bed /home/groups/schumer/shared_bin/shared_resources/Xbirchmanni-10x_12Sep2018_yDAA6-annotation-output/xiphophorus_birchmanni_10x_12Sep2018_yDAA6.gtf 0 combined_TLMC_0.05cM_windows_addcoding

This will calculate the number of coding basepairs per window based on the windows provided in combined_0.05cM_results_TLMC_March2019.bed

Scripts to identify recombination transitions

Use a genotypes file produced from ancestry-par1 and ancestry-par2 tsv files to identify ancestry transitions in each individual and the interval over which they occur:

Rscript identify_intervals_10x_genomes.R genotypes_file_name path_to:transpose_nameout.pl

Writes out outfiles for each chromosome named: genotypes_file_name_chr_intervals. Each line is a ancestry transition event with the start and stop of the interval and the individual it was detected in:

1407648 1801361 2

3158148 3306561 2

12714653 12933987 2

18340742 18501540 2

Note that this is coded for the X. birchmanni 10x genome and would need to be modified to provide chromosome names for other genome versions

Scripts to summarize genetic distance or admixture LD decay

Generate bed intervals for windows of a certain genetic size in cMs for the X. birchmanni 10x recombination map:

Rscript convert_LD_rate_to_cM_rate.R chromname-bir10x window_size_cM

Generates an output file called: cM_windows_window_size_xbirchmanni10x_chr


Generate files based on plink output for approximate dating of time of admixture using the decay of admixture LD (see Commonly used workflows for details on generating this files and workflows):

Add distance between the two markers in cMs to a plink-generated LD file:

Rscript convert_plink_LD_genetic_distance_xbir10x.R plink_ld_file focal_chromosome_name

Outputs a file called: plink_ld_file_chrom_cMdistances

these results can then be summarized by running additional scripts. To average LD metrics in windows of a particular cM distances:

Rscript convert_cM_version_plink_admixture_decay.R plink_ld_file_cMdistances window_size_cM

Outputs a file called: plink_ld_file_chrom_cMdistances_admixture_ld_decay_cMdist

these results can then be used to estimate age of admixture and plot decay in D:

Rscript estimateage_plink_admixture_decay.R plink_output_ld_decay_D.ld_admixture_ld_decay

Scripts that generate or use insnp files

Generate a masking insnp file based on a vcf file. For example, use GATK's variant selecting tools to identify variants that have high or low depth for masking GATK Variant Selector

perl gatk_vcf_to_masked_insnp.pl gatk.g.vcf

writes an outfile called infile.mask.insnp


Generates an insnp file to update and mask a fasta file. Variants that pass the quality thresholds are marked for updating and variants that fail the quality thresholds as well as invariants of low coverage/quality are marked for masking:

python insnp_v9_gatk3.4_gvcf.py file.g.vcf file.g.vcf.insnp gq_cutoff dpcutoff mq_cutoff qd_cutoff fs_cutoff sor_cutoff mqrs_cutoff readpos_cutoff indel_window

All but the indel window parameter (which masks basepairs around indels) are GATK parameters so see more details at their website GATK hard-call parameters

Currently suggested parameters for swordtails:

20 10 40 10 10 4 -12.5 -8.0 5

Scripts for admix'em simulations

See this section of the wiki for all scripts related to admix'em simulations: Admixture simulations with admix'em

Miscellaneous

Script to link locally to a list of files elsewhere on the server:

perl generate_local_link_file_list.pl infile_list

The infile list should contain the full path to the files and have one file per line


Script to remove a list of files, use with caution!

perl rm_list.pl infile_list

The infile list should contain the full path to the files and have one file per line


Cancel a list of slurm jobs. The list can either be of slurm job ids or slurm file:

perl slurm_cancel_jobs_list.pl slurm_list


Submit a slurm script recursively for every chromosome in the birchmanni 10x genome:

perl submit_all_10x_chroms_shell.pl name_of_starting_chrom slurm.sh


Retain only every n windows from a bed file:

perl thin_to_every_n_windows.pl file.bed retain_every_n_windows[integer]

Writes an output file called file.bed_thinned_by_n.bed


Script to rename X. birchmanni 10x scaffold names as their corresponding chromosome names:

perl replace_bir10x_names_with_chrom_names.pl file_to_rename

Writes an output file named file_to_rename.chromnames


Estimate genetic distance between markers in the X. birchmanni 10x genome and optionally thin by physical and genetic distance:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R markers_to_thin cM_threshold physical_threshold

For example:

Rscript calculate_genetic_distance_and_thin_adjacent_markers.R ACUA_markers_to_thin 0.25 2000

the input markers for thinning should be in this format:

Scaffold_name\tpos1

Scaffold_name\tpos2

Scaffold_name\tposn

Need to be annotated

gvcf_to_pseudo_fasta.pl

random_flip_samtools_low_coverage_vcf_for_PCA.pl