VonHoldt:High Throughput Sequencing Resources

From OpenWetWare

Della and Tigress

The DELLA processing and TIGRESS data storage servers of the High Performance Computing center of Princeton are our analytical powerhouses and we have specific locations on the server to do specific jobs. It is stored in a lovely server closet and so the way to access it is though a secure shell (ssh). Your username and password are obtained through the IT staff. Once you have logged on, there are a series of commands and "server etiquette" you will need to follow. The PU website has more information on basic usage and tutorials if you are interested.

You should familiarize yourself with some basic Unix commands by doing a few tutorials. Here is also a nice website with a large number of linux commands.


  • ssh netid@della.princeton.edu --- to secure login
    • If you are on wifi, you need to use VPN for secure access!! This makes it possible to ssh remotely from Small World Coffee!
  • slogin netid@della.princeton.edu --- to secure login
  • uname -a --- to learn about the server
  • passwd --- to change the default password you are given
  • logout (or control+D) --- to logout


  • Della is only to be used to execute code via a formal job submission program (qsub command)
    • You only have 1GB of space on your Della home directory
    • You have 500GB of SCRATCH space on your Della home directory
  • Tigress is for storage of all data! Write all output to this server, as well
    • ln -s --- soft link to your Tigress data files using the command

Shared data - the vonHoldt Group

  • All data is available to members of the vonHoldt Group on Tigress
  • To access the data
    • ssh into Della using your own netID and password
    • cd to /tigress/VONHOLDT
      • /tigress/VONHOLDT --- contains directories for scripts, programs, and data
        • Also keep your scripts in your own scripts folder in this directory
      • /tigress/VONHOLDT/data --- contains shared data
  • Make sure you don't move data around - we will use scripts to link to each file's absolute path and moving data around will break our code
  • You can create your own directory in /tigress/VONHOLDT if you need to test code, manipulate new files, etc
  • Please take all efforts to avoid overwriting any data! It will be backed-up locally as well but it will be a hassle to fix (for both you and me)! Thank you!

Submitting jobs using qsub
The to_run.sh script needs to be kept in your Della home directory.

Here is a good format for your own to_run.sh scripts:

#SBATCH -N 4 # nodes=4
#SBATCH --ntasks-per-node=20 # ppn=20
#SBATCH -t 14:00 # 14 minutes walltime
#SBATCH --mail-user=username@princeton.edu
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
module load openmpi
srun ./myprogram

Use sbatch to_run.sh to run jobs on Della. Make sure your active part of the to_run.sh (being the ./batchAwk.sh) points to the right directories on Tigress that contain either more scripts or the data. Note that the new nodes have 20 cores and 128G of memory. Again, any problems/questions, please email cses@princeton.edu.


  • slurm --- to submit a script (e.g. jobs_to_run.sh) on Della which can point to a perl/python/R/shell scripts on Tigress that does the actual work
  • sbatch --- to submit your script/job
  • Job length: Initially estimate 2x the amount of time you think your job will take to complete. You can refine this value over time.
    • Test queue
      • 1 hour limit
      • 2 job maximum per user and NOT to be used for production mode
    • Short queue
      • 24 hour limit
      • 40 job maximum
    • Medium queue
      • 72 hour limit
      • 16 jobs maximum per user
      • 432 total cores
  • qstat --- to check the job progress on Della
  • You can ssh into any node once you have the node ID from your qsub to check on the job status using traditional commands:
    • htop --- use to view real-time CPU usage
    • top --- displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime.

A schematic on how Della and Tigress are setup and basic usage.

Basic unix

Just some basics on unix!

  • If you don't know something, use manual
    • man ls --- to look up the functionality of the ls tool, use Google
  • mpstat --- to display the utilization of each CPU individually. It reports processors related statistics
  • mpstat -P ALL --- the mpstat command display activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported
  • sar --- displays the contents of selected cumulative activity counters in the operating system
  • ps -u yourusername --- lists your processes
  • kill PID --- kills (ends) the process with that process ID
  • ps -u username --- lists all the current jobs for a specified username

Installing programs yourself (locally on the lab computers)

  • Check if it's already installed
  • mkdir ~/bin --- to creak a directory in your home folder
  • cat .bash_profile --- put it in your path or check to see if it's already there
  • PATH=$PATH:$HOME/bin
  • export PATH
  • compile it with prefix ~/bin --- install programs to bin

Data transfer (network)

  • scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files
  • scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
  • FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI)
  • df -h -- check disk usage
  • du -hs /path --- check disk space used by a directory
  • du -h -max-depth=1 /path --- check disk space used by a directory


  • ls --- lists your files
  • ls -l --- lists your files in long format
  • ls -a --- shows hidden files... this is actually a critical command! If you *think* you are using little space but it turns out you have a million hidden files... voila, hidden files can be managed.
  • ls -t --- sorted by time modified instead of name
  • ls -h --- lists your files in "human" format
  • ls -hla --- gives you all from combing the three commands; it's beautiful.
  • more filename --- shows first part of a file; hit space bar to see more
  • head filename --- print to screen the top 10 lines or so of the specified file
  • tail filename --- print to screen the last 10 lines or so of the specified file
  • emacs filename --- an editor for editing a file
    • When you enter in emacs, you can edit a file
    • To save once you are done editing, press Control+X+S
    • To exit, press Control+X+C
  • cp filename1 filename2 --- copies a file in your current location
  • cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
  • mv filename1 destination/filename1 --- moves your file to a new location
    • mv *.bam ../ --- moves all of the bam files from the current directory up one level
  • rm filename --- permanently remove a file (Caution! This cannot be undone!)
  • diff filename1 filename2 --- compares files and shows where they differ
  • wc filename --- tells you how many lines (whitespace or newline delimited), words, and characters (bytes) are in a file
  • wc -l filename --- tells you how many lines are in a file (whitespace or newline delimited)
  • wc -w filename --- tells you how many words are in a file
  • wc -c filename --- tells you how many characters (bytes) are in a file
  • chmod options filename --- change the read, write, and execute permissions for a file (Google this!)

File compression [see also the gzip usage website]

  • gzip filename --- compresses file to make a file with a .gz extension
    • gzip -d *.gz --- uncompresses all gzip files
  • gzip -c filename >filename.gz --- compress file into tar.gz; the ">" means print to outfile filename.gz
  • gunzip filename ---uncompress a gzip file
  • tar -xzf filename.tar.gz --- decompressing a tar.gz file
  • gzcat filename --- lets you look at a gzipped file without having to gunzip it


  • pwd --- prints working directory (your current location)
  • cd /path/to/desired/location --- change directories by providing path
  • cd ../ --- go up one directory
  • mkdir directoryName --- make a new directory
  • rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
  • rmdir -r directoryName --- recursively remove directory and the files it contains...Remember that you cannot undo this move!
  • rmdir filename --- remove specified file...Remember that you cannot undo this move!

File permissions
File permissions may be the nagging factor keeping you from writing to an outfile, or changing the path of an executable. If you need additional help, please see this tutorial on permissions and talk to Dr. vonHoldt on any issues that come up for directory/file access. We want to prevent any raw data from being overwritten but yet not thwart your analysis.

Finding things

  • whereis [filename, command] --- lists all occurances of filename or command
  • ff --- finds files anywhere on the system
  • ff -p --- finds a file by just typing in the beginning of the file name
  • grep string filename(s) --- looks for strings in the files (use man grep for more information)
  • ~/path --- tilde designated a shortcut for the path to your home directory
  • nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out)
  • screen --- to initiate a new screen session to start a new background job (ctrl+a+d if you need to detach; screen -ls to list running screens; reattach screen pid)

Data editing

  • vim filename --- to edit the file


  • ctrl+r --- searching history
  • history --- display history
  • !#cmd_num --- display history
  • Arrow up is a short cut to scroll through recently used commands

And when you get good enough, try out the screen functionality!

Globus secure file transfer

  • Go to www.globus.org
  • Login, select Princeton and then continue
  • Duo will ask for authentication
  • Chose Princeton Tigress as one of the endpoints (it doesn't matter which one you chose)
  • Make a directory in your Tigress user account into which all data that is to be transferred should be copied (e.g. tigress/vonholdt/globus)
  • Then, whoever is going to receive the data, ask them for their DTN and file (that is readable) for you to transfer data to. This should give you access via globus website to select that destination.
  • Once all to/from settings are in, click the arrow for the direction of files to move.
  • You can log out and it will continue to transfer files!!!

High throughput (HT) platform and read types

Take a moment to check out this Cornell site describing the specs of a few platforms!

  • Illumina single-end vs. paired-end - this is typically the data type we will work with and collect for any future projects. The read lengths are being extended, but this has proven to be a nice platform for any HTS data collection.
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa

Directional and Nondirectional Sequencing

Take a moment to check out this site describing the difference between directional and nondirectional sequencing.

File formats and conversions

  • blc
  • qseq
  • fastq <--> fasta
  • sam <--> bam
  • bcf <--> vcf

Many conversion scripts exist either on the internet (googling fastq/qseq conversion shows many are readily available) or via SAMtools for post-mapping file management.

RSYNC, FTP, and remote login

To obtain data from remote servers using a variety of methods:

  • rsync
rsync -rav [source location]:path [destination path]
rsync {options} {source} {destination}
rsync --help
  • ftp
ftp -i islay.qb3.berkeley.edu
cd path/of/data
mget *

The "-i" at login will turn off the prompt where ftp asks you if you want to copy every file in the specified directory. The command "mget *" is to copy everything in your specified directory to your local directory. That local directory is set from wherever you login to ftp. I suggest locally changing to the directory to where you want the files copied. Then initiate the ftp session and mget.

Deplexing using barcoded sequence tags

    Use this python script from the HTSEQ bioinformatics support to deplex single-end reads: barcode_splitter.py

    You will have to change the extension from .txt to .py to run as follows:
    python barcode_splitter.py [options] fastq_read1 [fastq_read2] [fastq_read3]

    and also make a companion text file that contains barcode information in two columns of information:
    • column 1: sample ID
    • column 2: index sequence

    This script uses a perfect match criteria in order to deplex pooled sequence lanes. It also only works for single-end sequencing to date. Talk to Bridgett if you have a paired-end dual-index sequence tag system.

Quality control

  • Fastx tools
  • Using mapping as the quality control for reads
  • For PE data Fastqc is preferable to Fastx

Trimming and clipping

To begin processing your HTS data (i.e. if you have single-end Illumina sequence data and you need to begin processing the FASTQ files), the first thing you should do is to run cutadapt for trimming low quality and clipping remnant sequences.

  • Trim based on low quality scored per nucleotide position within a read
  • Clip sequence artefacts (e.g. adapters, primers)
    • cutadapt for SE reads cutadapt download and run from your personal programs or scripts folder (Note: when running cutadapt on della the cutadapt script in the bin directory must be edited to specify python26, rather than just python.)
    • trimgalore for PE reads trimgalore download and run from your personal programs or scripts folder (also runs fastqc which is installed on sirius)

For example, for TruSeq paired-end clipping/trimming:
$ /home/vonholdt/VONHOLDT/BIN/cutadapt-1.8.1/bin/cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT --minimum-length 20 -q 20 -o 2769_read1_trimmed.fastq -p 2769_read2_trimmed.fastq 2769_read1.fastq 2769_read2.fastq

For single-end TruSeq:
$ /home/vonholdt/VONHOLDT/BIN/cutadapt-1.8.1/bin/cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 20 -q 20 -o 250031_R1_trimmed.fastq 250031_R1.fastq

For RRBS single-end read processing:
$ /home/vonholdt/VONHOLDT/BIN/cutadapt-1.8.1/bin/cutadapt -a NNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 20 -q 20 -o 038M_trimmed.fastq 038M_read_1.fastq

Mapping to a reference

Next, after trimming, you can map these reads to a reference genome!

First, make sure you're genome is build and indexed for mapping of short-reads using stampy on Della.

To do this, run the blow scripts in this order to build both the genome .stidx and .sthash files:
./stampy.py --species=human --assembly=hg18_ncbi36 -G hg18 /data/genomes/hg18/*.fa.gz
./stampy.py -g hg18 -H hg18

Then you are ready to map your trimmed/clipped reads to the indexed genome.

Other mapping software:

  • BWA
  • Bowtie2 on Della
  • CLC Genomics Workbench

FASTQC and FASTX tools

A few of the basic QC tools are built into trim_galore for trimming and clipping. However, if you want to run a serious of other functions, please visit the website for more details!

  • FASTQC tools
    • Recall that this functionality is build into trim_galore though that provides trimming/clipping QC
  • FASTX toolkit

BED and SAM tools

    • samtools can also be used for making pileups, calling variants, and filtering variants (this has been implemented following mapping with stumpy)
    • Make sure to think about filtering thresholds: quality, read coverage, or frequency. Oftentimes 10x is a good minimum requirement in order to keep a variant found in ~20% of the reads. But of course, these settings can vary.

You may also be interested in identifying variants among a group of samples, this is an additional step beyond identifying heterozygous sites within a genome.

  • Make sure to also check and edit your BAM header if needed. Using samtools:
samtools view -H name.sorted.bam 
  • This will show you if your sorted BAM is actually sorted
@HD	VN:1.0	SO:unsorted
  • If the "SO" column is sorted using samtools, it should say "coordinate" instead of "unsorted". If it lists "unsorted", then use the following to edit your header and re-make it as a BAM file:
samtools view -H 250031_aln.sorted.bam | sed -e 's/SO:unsorted/SO:coordinate/' | samtools reheader - 250031_aln.sorted.bam > 250031_reheadered.bam

GATK variant calling

GATK and GATK Guide
Then filter variants based on their quality, read coverage, or frequency. Oftentimes 10x is a good minimum requirement in order to keep a variant found in ~20% of the reads. But of course, these settings can vary.

You may also be interested in identifying variants among a group of samples, this is an additional step beyond identifying heterozygous sites within a genome.

Variant Call File (VCF)

  • Here is some information on the VCF format, one that is generated by SAMtools and GATK. VCF is a text file format, containing meta-information lines, a header line, and then data lines each containing information about a position in the genome.

##source=CLC Genomics Workbench 6.5 build 65094986
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total number of filtered reads per sample used by variant caller">
##FORMAT=<ID=CLCAD,Number=.,Type=Integer,Description="Allelic depth, number of filtered reads supporting alleles in the order listed in the GT field">
chr1 549 . A C . . . GT:CLCAD:DP 1:11:11
chr1 552 . GA AT . . . GT:CLCAD:DP 1:14:14
chr1 578 . G C . . . GT:CLCAD:DP 1:13:13
chr1 589 . T A . . . GT:CLCAD:DP 1:16:16
chr1 596 . GG AT . . . GT:CLCAD:DP 1:17:17
chr1 636 . C CCAGGT . . . GT:CLCAD:DP 1:8:11
chr1 640 . C A . . . GT:CLCAD:DP 1:11:11
chr1 654 . GT G . . . GT:CLCAD:DP 1:13:13
chr1 657 . G A . . . GT:CLCAD:DP 1:13:13

  • QUAL: typically lists the quality score if you haven't already filtered out low quality variants (which was done at 10x in the above example)
  • GT: genotype, encoded as allele values separated by either of ”/” (unphased) or “|” (phased_. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1, 1|0, or 1/2, etc. For haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like 0/0/1. If a call cannot be made for a sample at a given locus, ”.”should be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for haploid genotype).
  • AD: allelic depth for the alleles listed in the order they appear in the GT column
  • DP: read depth at this position for this sample (Integer).

Annotation features of your data

We often want to know if a site is in an exon, intron, promoter, or if it is intergenic. There is a script on Della that can help you annotate your data!


There is a read_me file there describing it's usage.

R basics

Here is a file with some helpful R commands for inputting data, making basic plots, statistics, etc. courtesy of Los Lobos.

Also, refer to the following websites for help:

Additionally, with high throughput genome sequence data, we often need modules that are implemented in R's Bioconductor. Here is a great website and course material from a short course on using R and Bioconductor

Python basics

Here is a file with helpful commands in Python, BioPython, EggLib, etc., from Los Lobos.

Also, here are several links to help you get going:

HT sequence analysis using R (and Bioconductor)

DNA methylation analysis

Primarily, the vonHoldt lab works on reduced representation bisulfite sequencing (RRBS) of wild populations.

The data will be pre-processed using scripts to both deplex the sequences but also QC the sequences (clip low-qualiuty nucleotides and trim adapter sequences) prior to mapping and calling methylation.

  1. Obtain or know where the paired qseq files are located on /tigress/VONHOLDT
    1. s_1_1_[digits] --- contain target sequence
    2. s_1_2_[digits] --- contain tag sequence
  2. illumina2fastq.pl
    1. 96x2 fastq files produced
  3. Deplex using a barcode splitter script
    1. Then concatenate each barcode tag into a single file per individual that was pooled into the lane
  4. Trim-galore
    1. Clip by quality and trim adapter sequence
    2. Cutadapt readme instructions and install help
    3. FASTQC capabilities for overall sequence descriptive statistics
  5. Mapping using BS-Seeker2 calling bowtie2 (pysam tools must be installed locally).
    1. Produces a log file that contains the number of sequences mapped, tagged, etc.
    2. Produces .bam files
    3. Either take individual .bam files to next step or you can merge them to call methylation on pooled samples
    4. Example: samtools merge /data/males.bam /data/002Maligned.bam /data/006Maligned.bam /data/029Maligned.bam /data/113Maligned.bam
  6. Methylation calling using BS-Seeker2
    1. Produces a log file
    2. Produces outfiles: .ATCGmap and .CGmap (this is the one with the useful data)

BS Seeker2 workaround for processing/calling methylation:

  1. When BSseeker2 is used to map sequence data the resultant BAM file has optional tags used for designating various characteristics.
  2. One of these flags reads that show evidence of low bisulfate conversion; it is the 13th column, with the tag for low conversion being in the form “XS:i:1”.
  3. When calling methylation in BSseeker there is a “-X” flag optional argument that can be used to specify the removal of any such low-bisulfite-conversion reads.
    1. However, this –X flag breaks with some versions of pysam.
  4. The workaround requires converting to a SAM file using samtools, filtering out all rows with the low-conversion tag “XS:i:1”, and converting back to BAM.
    1. In the process of this workaround, however, the conversion back to BAM breaks on some other problematic tags in column 19 that include hidden characters or are incomplete (this also seems to be due to a pysam incompatibility).
  5. You can therefore implement another workaround that searched for the regular expression “.*:.*:.*” and replaced any tags not matching this expression with an error tag “YE:i:Error”.
    1. This enables conversion back to BAM and this flag is not required for later analysis (but we have retained the information of which reads contained the error).

Here are some suggested analytical methods with tutorials.

To use methylKit the data must have the following columns:

  • ID (e.g. Chromosome.position)
  • Chromosome
  • Position
  • Read direction (F/R)
  • Number of reads
  • %Methylated Reads (as a number 0-100)
  • %Unmethylated Reads (as a number 0-100)

Example: chr01.11979 chr01 11979 F 4 100.0 0.0

Then follow the tutorial for analysis steps.

Other notes:

  • Memory limited (when running R on personal machine).
  • To look at CpG islands, must provide defined islands (either from genomic data or from predefined grouping of the data).
  • Can find differentially methylated regions between two groups only (male/female, treatment/control).

  • QDMR
    • Data must already be grouped into regions of comparison (see sample data).
    • Identification of DMRs by standard deviation threshold.
    • Visualization techniques best for more than two samples or an organism with the genome in QDMR.
  • swdmr

And for any sort of GO enrichment analysis

RNA-seq analysis

Common objectives of transcriptome analysis:

For a reasonably thorough list of RNA-seq bioinformatic tools, please see this site!

SOLiD software tools

Passing Arguments to Scripts and Programs Using xargs

  • xargs passes commands from the bash shell command line to a shell script and to other scripts or programs called in the script.
    • Although the argument is always simply called as $1 in a script, xargs works iteratively, going through the script with the first argument then the second, and so on.
  • Create this simple script:
 #! /bin/bash
 #check that a base file name argument was supplied
 if [ $# -eq 0 ]  # if no arguments were entered the script will complain and then stop
     echo "Please supply argument .... "
     echo "Useage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
     echo $1
  • Call it using:
 echo arg1 arg2 arg3 | xargs -n 1 script.sh 
  • The -n flag to xargs specifies how many arguments at a time to supply to the given command. -n 1 tells xargs to supply 1 argument to the command. The command will be invoked repeatedly until all input is exhausted.
    • This means you can also use xargs for a command that needs two or more arguments.
      • For instance you could use this to supply read group information to the picard AddReadGroups command.
  • Another option -P # will tell xargs to split job into # different cores. -P 4 uses 4 cores.
    • This only works if you have multiple jobs that can be run in PARALLEL, ie one command run multiple times, once with each xarg or set of xargs

  • You can pass arguments to a program like fastqc, tophat, samtools etc.
    • I split up my aligned reads by chromosome to speed up processing.
      • With xargs I can call them all at once and process them on more than one core, something that samtools can't do by itself.
      • The follwing command would pileup three samples and do it sequentially for however many chromosomes I call in xargs.
 #! /bin/bash
 #check that a base file name argument was supplied
 if [ $# -eq 0 ]  # if no arguments were entered
     echo "Please supply argument .... "
     echo "Useage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
     samtools mpileup -uf referencefilename /path/sample1$1.bam /path/sample2$1.bam /path/sample3$1.bam | bcftools view -bvcg - > /path/$1var.raw.bcf

  • You can pass the arguments to a python script by using sys.argv to supply arguments to the python script and calling the python script as myscript.py arg1
  • Save this simple script:
 #! /bin/bash
 if [ $# -eq 0 ]  # if no arguments were entered
     echo "Please supply argument .... "
     echo "Useage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
     test.py $1
  • Save the following as test.py. It will be called by the last shell script above.
    • This is a very simple example but number could just as easily designate a file to be opened by the python script.
 #! /usr/bin/env python
 import sys #
 number = sys.argv[1]
 print "This is argument number ", number