Schumer lab: Commonly used scripts

From OpenWetWare

Note: Please add any script that you write and have thoroughly tested to the lab git repository [1] and document it here

On sherlock our lab shared bin with these scripts can be found here: /home/groups/schumer/shared_bin/Lab_shared_scripts


Usage of commonly used scripts

Scripts that convert between file formats

Convert fasta to phylip format:

perl Fasta2Phylip.pl infile.fa > outfile.phy

Convert phylip to fasta:

perl Phylip2Fasta.pl infile.phy outfile.fa

Convert MSG genotypes files to plink file format:

perl convert_msg_genotypes_to_plink.pl genotypes_file_name

Writes two output file: genotypes_file_name.ped; genotypes_file_name.map

Convert a fasta file to a fastq file:

perl fasta_to_fastq.pl infile.fasta > outfile.fq

Convert the ancestry tsv output from AncestryHMM to a hard-calls genotypes files:

perl parsetsv_to_genotypes_Dec2017_v2.pl ancestry-probs-par1.tsv ancestry-probs-par2.tsv genotypes_output_file_name

Convert the MSG genotypes file to a format that is compatible with rQTL (still requires phenotypes added, see workflow documentation):

perl genotypes_to_rqtl_Feb2018_v3.pl genotypes_file_name

Writes an output file: genotypes_file_name.rqtl.csv

Convert a phylip file to an input file for treemix:

perl phy_to_treemix.pl input.phy population_keys_list > outfile

The population keys list contains the number of individuals per population, e.g.: 2\n1\n3\n

Covert a samtools vcf file to input counts for ASE pipeline:

perl samtools_vcf_to_ASE_counts.pl infile.vcf

Writes an output file: infile.vcf_ASE_counts

Convert a two-sequence clustal alignment to a fasta alignment:

perl convert_clustal_alignment_to_fasta_alignment.pl clustal_alignment

Scripts that merge files, filter files, or match file contents to lists

  • Note: there are many useful scripts for combining and filtering files on the FAS scripture [2]*

Combine read files for two different sequencing runs of the same individual. The two files to be combined *must* be in the same order:

perl combine_reads_two_lists.pl list_full_path_to_file1_set list_full_path_to_file2_set

Writes a new file for each individual, using the name in list 1 appended with _combined

Script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl filter_identical_columns_threshold.pl genotypes_file num_markers_differentiation path_to:transpose_nameout.pl

Writes an output file with genotypes_file.identicalfilter.txt

Script for grepping a list from a separate file with an option to use grep -w or regular grep

perl grep_list.pl select_list file_to_grep_from outfile_name grep_w_1_or_0

Similar script but automatically retains header line:

perl grep_list_keep_header.pl select_list file_to_grep_from outfile_name grep_w_1_or_0

Match a phenotypes file with a genotypes and hybrid index file:

perl match_phenotypes_names_with_genotypes_and_index_file.pl phenotypes_file genotypes_file hybrid_index

Write output files appended with modified names indicating matching

Takes a list of files, reads all of their entries, and writes out a combined file with duplicate lines removed:

perl combine_lines_remove_duplicates_list.pl list_of_files outfile_name

Filter a genotypes file based on an allowed proportion of missing markers. Depends on the script transpose_nameout.pl which is on Sherlock in /home/groups/schumer/shared_bin/Lab_shared_scripts

perl filter_missing_markers_genotypes_file_thresh_v2.pl infile_genotypes path_to:transpose_nameout.pl prop_missing_allowed

Writes out an outfile named infile_genotypes_filtered_at_thresh

Merge files based shared values in two columns. See FAS scriptome [3] for other merging options based on a single column:

perl merge_files_using_two_columns_sharing_values.pl file1 column1_file1 column2_file1 file2 column1_file2 column2_file2

Subset an MSG genotypes file based on a list of markers. Depends on the script transpose_nameout.pl (on Sherlock /home/groups/schumer/shared_bin)

perl select_passing_markers_multi_geno_files_v2.pl list_of_markers_to_select genotypes_file path_to:transpose_nameout.pl outfile_name

Scripts [or commands] that extract sequences from files

Extract fasta sequences using entries from a bed-formatted list:

perl extract_bed_seqs.pl list_of_seqs.bed fasta_to_extract outfile_tag_name

Writes output to a file named list_of_seqs.bed_tag.fa

Generate predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file:

perl extract_gtf_seqs_mergetranscript.pl gene_of_interest.gtf fasta outfile_tag

Writes output to a file named gene_of_interest.gtf_tag.fa

fastahack is also an incredibly useful program from extracting sequences from fasta files. It can be found on Sherlock in /home/groups/schumer/shared_bin

/home/groups/schumer/shared_bin/fastahack genome.fa -r chr:start_coord..stop_coord > outfile.fa

Scripts for managing fastq files

Scripts for parsing fastq files by i5 and i7 index (see details in: parsing_tn5_data.txt in Dropbox and on the wiki). This script must be in the same directory as barcode_splitter.py

./divideConquerParser.sh 4 "reads_R1_allanes_combined.fastq.gz reads_R2_allanes_combined.fastq.gz reads_I1_allanes_combined.fastq.gz reads_I2_allanes_combined.fastq.gz" 10 i5_library 4 1

Count the number of reads in a list of zipped fastq files:

perl count_reads_fastq_list.pl list_of_fastqgz_files

Writes an outfile named list_of_fastqgz_files_counts

Scripts that summarize data or run analyses

Write average ancestry per site for every site in an MSG genotypes file:

perl parse_genotypes_ancestry_bysite.pl genotypes_file path_to:transpose_nameout.pl

Writes average ancestry per site to an output file named after the input file and appended with _ancestry_by_site

perl parsetsv_ancestry_v2.pl ancestry-probs-par1.tsv ancestry-probs-par2.tsv

Writes hybrid index and heterozygosity per individual

Generate a file based on plink output for approximate dating of time of admixture using the decay of admixture LD (see [Schumer lab: Commonly used pipeines] for details on generating these files with plink):

Rscript plink_output_admixture_age_estimate.R plink_output_ld_decay_D.ld window_size_bp corresponding_length_genetic_distance

these results can then be summarized by running the following summary script:

Rscript estimateage_plink_admixture_decay.R plink_output_ld_decay_D.ld_admixture_ld_decay

Scripts that manipulate or summarize fasta files

Take a fasta file with IUPAC ambiguity codes and perform a base by base coin flip for use in programs that do not accomodate polymorphism (e.g. RAxML):

perl coinflip_fasta_poly.pl infile.fa > outfile_coinflip.fa

Extract a scaffold by name from a fasta file:

perl getScaffold_samtools.pl file.fa contig_name > outfile

Summarize the N50 and length distribution of an assembly and optionally print scaffolds greater than a particular size:

perl getScaffLengthDist.pl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Summarize the N50 and length distribution of an assembly and optionally print scaffolds smaller than a particular size:

perl getScaffLengthDist.pl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Count polymorphic basepairs based on IUPAC ambiguity codes in the fasta file. Note: not all fasta files include this coding of polymorphic bases.

perl count_IUPAC_poly_single_genome.pl fasta_file > outfile

Count polymorphic basepairs based on ambiguity codes in a fasta file using a given window size:

perl count_IUPAC_poly_single_genome.pl fasta_file window_size_inbp > outfile

Summarize base composition in a list of windows:

perl count_basecomp_window.pl list_of_windows_to_analyze.bed genome.fa path_to_fastahack > outfile

Remove a list of sequences from a fasta file:

perl remove_record_list_from_fasta.pl list_of_sequences_to_remove fasta_file

Writes an outfile with these sequences removed named fasta_file.filtered

Translate a fasta file in a give frame:

python seqs_processor_and_translator_bin_V118_AGCT.py inputfile.fa outputfile_prefix DNA 1 1 NOBIN 20

Translate a fasta sequence:

perl rev_com.pl

When prompted, enter the name of the fasta file to translate. Writes out a file appended with -reverse_complement

Miscellaneous