Schumer lab: Commonly used scripts

From OpenWetWare

Note: Please add any script that you write and have thoroughly tested to the lab git repository [1] and document it here

On sherlock our lab shared bin with these scripts can be found here: /home/groups/schumer/shared_bin/Lab_shared_scripts

Usage of commonly used scripts

Scripts that convert between file formats

Convert fasta to phylip format:

perl infile.fa > outfile.phy

Convert phylip to fasta:

perl infile.phy outfile.fa

Convert MSG genotypes files to plink file format:

perl genotypes_file_name

Writes two output file: genotypes_file_name.ped;

Convert a fasta file to a fastq file:

perl infile.fasta > outfile.fq

Convert the ancestry tsv output from AncestryHMM to a hard-calls genotypes files:

perl ancestry-probs-par1.tsv ancestry-probs-par2.tsv genotypes_output_file_name

Convert the MSG genotypes file to a format that is compatible with rQTL (still requires phenotypes added, see workflow documentation):

perl genotypes_file_name

Writes an output file: genotypes_file_name.rqtl.csv

Convert a phylip file to an input file for treemix:

perl input.phy population_keys_list > outfile

The population keys list contains the number of individuals per population, e.g.: 2\n1\n3\n

Covert a samtools vcf file to input counts for ASE pipeline:

perl infile.vcf

Writes an output file: infile.vcf_ASE_counts

Convert a two-sequence clustal alignment to a fasta alignment:

perl clustal_alignment

Scripts that merge files, filter files, or match file contents to lists

  • Note: there are many useful scripts for combining and filtering files on the FAS scripture [2]*

Combine read files for two different sequencing runs of the same individual. The two files to be combined *must* be in the same order:

perl list_full_path_to_file1_set list_full_path_to_file2_set

Writes a new file for each individual, using the name in list 1 appended with _combined

Script for filtering genotypes file of redundant columns. The number in the command line corresponds to the number of markers that can differ between adjacent columns for the column to be retained:

perl genotypes_file num_markers_differentiation

Writes an output file with genotypes_file.identicalfilter.txt

Script for grepping a list from a separate file with an option to use grep -w or regular grep

perl select_list file_to_grep_from outfile_name grep_w_1_or_0

Similar script but automatically retains header line:

perl select_list file_to_grep_from outfile_name grep_w_1_or_0

Match a phenotypes file with a genotypes and hybrid index file:

perl phenotypes_file genotypes_file hybrid_index

Write output files appended with modified names indicating matching

Takes a list of files, reads all of their entries, and writes out a combined file with duplicate lines removed:

perl list_of_files outfile_name

Filter a genotypes file based on an allowed proportion of missing markers. Depends on the script which is on Sherlock in /home/groups/schumer/shared_bin/Lab_shared_scripts

perl infile_genotypes prop_missing_allowed

Writes out an outfile named infile_genotypes_filtered_at_thresh

Merge files based shared values in two columns. See FAS scriptome [3] for other merging options based on a single column:

perl file1 column1_file1 column2_file1 file2 column1_file2 column2_file2

Subset an MSG genotypes file based on a list of markers. Depends on the script (on Sherlock /home/groups/schumer/shared_bin)

perl list_of_markers_to_select genotypes_file outfile_name

Scripts [or commands] that extract sequences from files

Extract fasta sequences using entries from a bed-formatted list:

perl list_of_seqs.bed fasta_to_extract outfile_tag_name

Writes output to a file named list_of_seqs.bed_tag.fa

Generate predicted transcript sequence using a gtf file containing the exon coordinates and strand information and the relevant fasta file:

perl gene_of_interest.gtf fasta outfile_tag

Writes output to a file named gene_of_interest.gtf_tag.fa

fastahack is also an incredibly useful program from extracting sequences from fasta files. It can be found on Sherlock in /home/groups/schumer/shared_bin

/home/groups/schumer/shared_bin/fastahack genome.fa -r chr:start_coord..stop_coord > outfile.fa

Scripts for managing fastq files

Scripts for parsing fastq files by i5 and i7 index (see details in: parsing_tn5_data.txt in Dropbox and on the wiki). This script must be in the same directory as

./ 4 "reads_R1_allanes_combined.fastq.gz reads_R2_allanes_combined.fastq.gz reads_I1_allanes_combined.fastq.gz reads_I2_allanes_combined.fastq.gz" 10 i5_library 4 1

Count the number of reads in a list of zipped fastq files:

perl list_of_fastqgz_files

Writes an outfile named list_of_fastqgz_files_counts

Scripts that summarize data or run analyses

Write average ancestry per site for every site in an MSG genotypes file:

perl genotypes_file

Writes average ancestry per site to an output file named after the input file and appended with _ancestry_by_site

perl ancestry-probs-par1.tsv ancestry-probs-par2.tsv

Writes hybrid index and heterozygosity per individual

Generate a file based on plink output for approximate dating of time of admixture using the decay of admixture LD (see [Schumer lab: Commonly used pipeines] for details on generating these files with plink):

Rscript plink_output_admixture_age_estimate.R plink_output_ld_decay_D.ld window_size_bp corresponding_length_genetic_distance

these results can then be summarized by running the following summary script:

Rscript estimateage_plink_admixture_decay.R plink_output_ld_decay_D.ld_admixture_ld_decay

Scripts that manipulate or summarize fasta files

Take a fasta file with IUPAC ambiguity codes and perform a base by base coin flip for use in programs that do not accomodate polymorphism (e.g. RAxML):

perl infile.fa > outfile_coinflip.fa

Extract a scaffold by name from a fasta file:

perl file.fa contig_name > outfile

Summarize the N50 and length distribution of an assembly and optionally print scaffolds greater than a particular size:

perl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Summarize the N50 and length distribution of an assembly and optionally print scaffolds smaller than a particular size:

perl fasta_file min_contig_length print_scaffolds_greater_than_min_contig_length_0_1

Count polymorphic basepairs based on IUPAC ambiguity codes in the fasta file. Note: not all fasta files include this coding of polymorphic bases.

perl fasta_file > outfile

Count polymorphic basepairs based on ambiguity codes in a fasta file using a given window size:

perl fasta_file window_size_inbp > outfile

Summarize base composition in a list of windows:

perl list_of_windows_to_analyze.bed genome.fa path_to_fastahack > outfile

Remove a list of sequences from a fasta file:

perl list_of_sequences_to_remove fasta_file

Writes an outfile with these sequences removed named fasta_file.filtered

Translate a fasta file in a give frame:

python inputfile.fa outputfile_prefix DNA 1 1 NOBIN 20

Translate a fasta sequence:


When prompted, enter the name of the fasta file to translate. Writes out a file appended with -reverse_complement