Difference between revisions of "User talk:Darek Kedra/sandbox 28"

From OpenWetWare
Jump to: navigation, search
(GATK pipeline)
Line 1,101: Line 1,101:
#caveats (ribosomal RNA contamination)
#mapping RNASeq
web site: http://tophat.cbcb.umd.edu/
#creating gene models from RNASeq (cufflinks)
Version: 2.0.10 release 11/13/2013
It is the most popular program for mapping RNASeq reads to genomes. Requires that you have installed and available on your PATH following programs:
*bowtie2 and bowtie2-align
* bowtie2-inspect
#run tophat2 (with bowtie2) using also gene annotation:
tophat2  -G Mus_musculus.GRCm38.70.gtf --output-dir=myreads_12_vs_Mm.GRCm38.70.tophat_out MmENS70.bow2 myreads_1.fastq myreads_2.fastq
===creating gene models from RNASeq (cufflinks)===
web site: http://cufflinks.cbcb.umd.edu/
Version: 2.1.1
cufflinks -p 8 --multi-read-correct  --min-intron-length 21 --max-intron-length 200000 --output-dir myreads
--frag-bias-correct ref_genome.fa -g  ref_genome.gtf  myreads_12_vs_ref
Once we mapped our RNASeq reads to the genome we can try obtain gene models derived from these mappings.
The cufflinks overpredict gene models, but it is still better than other programs using mapping to genomic reference.
Keep in mind that cufflinks uses its own expression metrics, and that these numbers are not compatible with R based statistical packages. Moreover the cufflinks cuffdiff module for predicting differentially expressed genes has not a good overlap with leading R packages (DESeq and EdgeR).
=Additional info=
=Additional info=

Latest revision as of 05:14, 15 November 2013


Winterschool program day 1

Before we start

Who are the teachers

  • Darek Kedra
  • Emilio Palumbo

We both work at CRG (Centre for Genomic Regulation) in Barcelona, Spain. Roderic Guigo (DK & EP) and Cedric Notredame (DK) groups. Our work place looks like this: CRG image

Our main research topics start with RNASeq (DK, EP), scientific software development (EP), genome assembly and annotation (DK), SNP calling (DK). As for the farm animals we have been working with sequences mostly from chicken and cow, plus few species (duck, pig, sheep).

Organization etc.

  • lets turn out /silence the cell phones
  • our aim today is to teach you the bare necessities about command line and mapping Illumina reads to genomic DNA
  • we go through the page for about 50-60mins, then we switch to exercises for 20-30mins
  • no exposure to command line interface / Unix is assumed (we start from novice level)
  • this page will stay online after the course "forever", and we will improve it over time (old version will be accessible through History link on the top)
  • you are welcome to register as a wiki user and comment in the Talk page/fix any errors/ ask for enhancements
  • in the boxed parts of the page lines starting with # are comments, and what below are examples of commands. After typing equivalents of these in your terminal, press Enter

XXX Click here to download todays excercize files XXX

Introduction to Linux and the command line

Why Linux?

  • runs on everything from cell phones to supercomputers
  • long history of stable GNU/Unix tools
  • most of the bioinformatics software was written and intended to run on Linux

XXX Emilio: start vagrant or Linux from inside your virtual box machine, log in XXX

vagrant instructions go here

logging in, connecting to other servers with ssh / sftp

As with other computers, one requires username password combination to connect to a specific computer. This combination can be specific to each of the computers or shared between i.e. all workstations at a given location.

SSH is a name for secure, encrypted connections between computers. It consist of two components, ssh server running on a remote machine and a ssh client on your laptop / workstation. The client is included in the default installations of recent Linux and OS X (Mac), but on Windows one has to install it ( http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html ). It is not required you have it for the course, but it is a good idea to have it on your computer if you want to access remote servers on the command line.

With properly configured ssh connection (on Linux and Mac) you can run not just command line programs but also graphical user interface. This is also possible on Windows, but it is way more complicated to set up.

#to connect to a remote server, command line (we can not do it here, since we have closed ssh ports at this network
ssh username@remote.server.name

#to connect withe the same username as on the workstation you are already logged in:
ssh remote.server.name

#to connect with ability to run remotely programs requiting GUI
ssh -X remote.server.name

#to download single file from a host to a local directory
scp username@remote.host.name:/path/to/file/filename .

#to transfer a file from a local machine to a remote host:
scp local_filename username@remote.host.name:/directory/to/upload

#to establish sftp connection 
sftp username@remote.host.name:/path/to/directory_of_interest/

#after establishing the connection we can list the content of remote directory:

#download  files
get remote_filename

file and directories naming

Linux is case sensitive, do MyFile.txt is different from MYFILE.TXT or myfile.txt. Try to use some consistent naming schemes for your project directories, input data or result files. You can use very long, descriptive names, like these result files from ENCODE project:


Things to keep in mind:

  • never ever use space/tabs in your file/directory names
  • never use Unix special characters in file/directory names (!?"'%&^~*$|/\{}[]()<>:)

Linux directory structure/navigating

As with Windows and Mac systems one can imagine the systems of directories (synonym of folders) as a giant tree, where each of the branches has an upper directory and may contain lower level directories (sub-directories). At the top of this branched hierarchy it is just one place from where everything starts (no C:, D: X: drives), and in Unix speak it is called root directory. Some examples of directory naming:





Individual users have separate directories in which they store their data, results etc.

Some shortcuts to remember (what is after # sign is a description):

#root directory:

#your home directory:

#one directory above (two dots):

#current directory (one dot)

There are few commands to move between directories, create new ones, and list what is located there:

#print working directory = show where you are

#list what is located in the current directory

#list what is located in the current directory with details
ls -l 

#go to this directory
cd /home/linus/projects/chicken 

# special shortcut to go back to the directory you have been before
cd - 

#go to the root directory
cd / 

#go to your home directory 
cd ~ 

#go todirectory one level above the current one
cd .. 

#create new directory
mkdir mynew_directory 

#remove some directory (it must be empty!)
rmdir myold_directory 

absolute vs relative directory naming

We can do all the operations on directories or files using two conventions:

# absolute directory path starting with root "/"

ls /home/linus/projects/chicken/genome.fasta

#relative directory path (lets assume we are in  /home/linus/projects/banana/)

ls ../chicken/genome.fasta

This is useful shortcut for saving typing and avoiding errors when i.e. working on different systems.


In order to restrict actions associated with a given file or directory, each entity has 3 flags (r = read, w = write, e = execute) for 3 groups of users (file owner, group to which file owner belongs, i.e. students, and "all" for all the remaining users on that computer). So we have 9 fields describing what each of these groups can do with the file. On the top of it (or rather in the front of the string) we have another flag to tell us about what kind of thing it is ("-" = just a regular file, d = directory, l = link, etc.)

By default, owner has read+write permissions, his group members can read but not write (modify or delete) his files, and the rest of the word should have no right to see content of his files. These permissions are visible when listing content of a directory with "ls -l":

-rwxr-xr-x 1 vagrant users    49 Nov  7 14:43 my_program.py
-rw-r----- 1 vagrant users  1812 Nov  7 14:39 myfile01.txt
---------- 1 vagrant users  9045 Nov  7 14:41 myfile02.txt
-r--r--r-- 1 vagrant users  2016 Nov  7 14:39 myfile03.txt
-rw------- 1 vagrant users 67863 Nov  7 14:40 myfile04.txt
-rw-rw-rw- 1 vagrant users  8125 Nov  7 14:40 myfile05.txt
-rw-rw---- 1 vagrant users  7233 Nov  7 14:40 myfile06.txt
lrwxrwxrwx 1 vagrant users    12 Nov  7 14:50 myfile07.txt -> myfile06.txt
drwxr-xr-x 2 vagrant users     0 Nov  7 14:50 test_dir1

To change permissions, we have to specify first the group we want to modify, action (add permissions or remove them) then the permission themselves, finaly the name of the file(s):

#add execute permission to for all
chmod a+x my_new_script.py  

# remove write permission for owner, group and the rest
chmod a-w 

This is often useful for sharing data / results with other users on the same machine or when writing scripts/executing some programs downloaded from the net.

copy, rename/move files, create symbolic links

Often we need to organize files in directories from the existing ones, which requires moving thing around, making copies, renaming directories and files. To save space or to avoid using long path names, we can create links to other existing files or folders.

CAVEAT: both cp and mv commands will perform the task without warning if the destination file exist. It is one of the ways of destroying your data. To avoid this error use "cp -i" or "mv -i" which will ask you for confirmation when you are overwriting existing file.

#copy move rename pattern: 
cp/mv source destination

#make a copy of  file1.txt and name it file2.txt
cp file1.txt file2.txt

#make a copy of  file1.txt with the same name but in a different directory
cp file1.txt /some/directory/of/your/choice/ 

#notice the space and the dot "." at the end of the command. The command is copy the file to the current directory
cp /home/linus/projecs/linus/file1.txt . 

Instead of doing copy (original file stays intact) one can do mv (move/rename) where only the new copy remains and the old one is destroyed.

#rename  file1.txt to file2.txt
mv  file1.txt file2.txt

#move  file1.txt to another directory
mv file1.txt /some/directory/of/your/choice/ 

#notice the space and the dot "." at the end of the command. The command is copy the file to the current directory
mv /home/linus/projecs/linus/file1.txt . 

Linking is a way of organizing your data. I.e. you have one big genome fasta file, and want to map your next generation sequencing reads using different mappers, which will require using this one file as an input to several programs. You can store your fasta file in one directory (i.e. /home/vagrant/projects/chicken/genome1.fa) and create links to this file in several subdirectories /home/vagrant/projects/chicken/bwa_mapping/, /home/vagrant/projects/chicken/bowtie2_mapping/, etc.). The command:

cd /home/vagrant/projects/chicken/bwa_mapping/
ln -s  /home/vagrant/projects/chicken/genome1.fa .

cd /home/vagrant/projects/chicken/bowtie2_mapping/
ln -s  /home/vagrant/projects/chicken/genome1.fa .

view files (more/less, head, tail), count (wc)

There are two basic types of files, text files and binary files. A lot of file formats in bioinformatics belong to simple text files, so the basic viewing/processing them is essential. Never ever do not try to open big text files (say few GB FASTQ file) in an editor, be it an editor on Linux or Microsoft Word...

Basic checking of the file content:

#displays content of the file screen by screen
more myfile04.txt 

#same as above but you can scroll back
less myfile04.txt 

#displays just the first 10 top lines of the file
head myfile04.txt 

#displays the first 100 lines 
head -100 myfile04.txt 

#displays just the last 10 lines of the file
tail myfile04.txt  

#displays  the last 100 lines of the file
tail -100 myfile04.txt  

wc myfile04.txt #newline, word, and byte counts for the file

# just count the lines of the file
wc -l myfile04.txt 

Using these commands is very useful for basic sanity check, so we can be sure that the FASTA file looks OK, that we have X lines in the file etc. First practical NGS application:

Q: how many reads do we have in our FASTQ file A: count the lines in FASTQ file, divide by 4 (each sequence occupies 4 lines)

compressing / uncompressing files (gzip, tar)

The sequence files are often really big, and we save space and time during downloads by compressing them. There are several compression programs, but the most frequently used is gzip:

#we will get some_file.gz, and the original uncompressed one will be removed
gzip some_file

# reversing the process, we get some_file in an uncompressed form
gunzip some_file.gz 

Program sources (almost always) or data are distributed not as single file but bundled in one file with "tar" program, then compressed for easier transfers/storage.

#we downloaded program_123.tar
#this will unpack the content of the tar archive
tar xfv program_123.tar 

#we downloaded program_123.tar.gz
#this will unpack the content of the compressed by gzip tar archive
tar xfvz program_123.tar.gz 

#we want to create tar archive containing all data_001.txt to data_100.txt (assuming there are only 100such files in this directory)
#tar cfv result.tar files_to_be_archived

tar cfvz my data.tar data_*.txt
#to create gzipped archive
tar cfvz my data.tar.gz data_*.txt

checking the integrity of the files

While with fast internet access we get what we wanted most of the time (meaning: downloading data files), from time to time we can get a corrupt file for various reasons. Also when performing data analyses the common source of endless checks and rechecks is a data file with the same name but slightly different content. A great way to assure that the transferred data were not corrupted, or that we and our collaborators are talking about the same input data or results, is to perform file "fingerprinting", or calculate md5sum hashes.

md5sum file_2_fingerprint

md5 bioinfo.txt
MD5 (bioinfo.txt) = 06d5cd9ffb9154d7912350250e99951b

In short, this string is strongly influenced even by minuscule changes in the file (both text and binary, including software programs), and for practical reasons if 2 hashes are identical then the content of the files is also identical.

For more info about md5sum: http://en.wikipedia.org/wiki/Md5sum

pattern matching, redirection and piping

Instead of using full name of a file each time, we can substitute part of the file name with special characters. Say we want to list the names all the files with .txt suffix ( myfile01.txt, myfile02.txt, etc. ) in a directory. And lets say we have about 500 of these:

ls *.txt

On Linux we can capture the output of one simple command (assuming it produces a text output) and do something useful with it, instead of just reading it from the screen. If we want to count how many such files we got, we can pipe "|" (vertical line, not the letter "I") the output of the ls command to wc command:

#display the number of .txt files in this directory
ls *.txt | wc -l 

We can also redirect the output to a file, creating i.e. list of txt files with names starting with some letter (redirection with single ">":

#we create a new file listing all the names of m*.txt files, one name per line
ls m*.txt > list_of_txt_files_starting_with_m.fof   

Or we can keep on adding the text to some file, which will be created if it does not exist (redirection with double ">>"):

#file is created and we get just a*.txt file names in it
ls a*.txt >> list_of_txt_files_starting_with_a_or_m.fof  

#file exists, so we are just appending   m*.txt file names to it
ls m*.txt >> list_of_txt_files_starting_with_a_or_m.fof  

#checking what is inside
more list_of_txt_files_starting_with_a_or_m.fof

search for strings / replace strings (grep & sed)

We can search the text files using command called "grep". Few examples:

#general pattern:
grep some_word text_file_name

#print all lines containing the word ITALY
grep ITALY my_file.txt 

#"reverse grep" prints all lines not containing ITALY
grep -v ITALY my_file.txt 

#just counts the number of lines containing ITALY
grep -c ITALY my_file.txt 

#prints all FASTA headers (lines starting with ">" sign)
grep "^>" multiple_sequences.fasta

# prints number of sequences in the multiple fasta file
grep -c "^>" multiple_sequences.fasta 

Instead of opening an editor and doing search and replace, we can do it on a command line, using utility called sed, and operate both on files or use pipes.

# replaces each occurrence of ITALY with SPAIN
sed 's/ITALY/SPAIN/g' my_file.txt 

# we get all FASTA headers and then remove the ">" sign, getting just the proper FASTA IDs
grep "^>" multiple_sequences.fasta  | sed 's/>//g' 

awk in few minutes

Awk is a simple programing language used for basic text processing. It is still being used because often it is faster to write and execute a command in awk than write small small script in more advanced languages, such as perl, or python. The basic concept is that awk splits the lines of the text into individual words/numbers (== great for text files in a column form), which we can then access using $column_number notation

# prints just first column of the file
awk '{print $1}' my_data.txt 

# prints first column but also "_my_new_label" string just after it
awk '{print $1"_my_new_label"}' 

We can perform simple calculations, (assuming we have columns of numbers).

#this will simply sum all the numbers from the first column then print that number
awk '{ sum+=$1} END {print sum}' my_data.txt 

Where to go from there

more information

All the commands used above have multiple options impossible to cover during the short introduction. Once you know the name of the program, you can learn about it functions and usage reading installed manual:

man ls
man grep

The recommended reference book is:

Unix Power Tools, 3rd Edition

By Jerry Peek, Shelley Powers, Tim O'Reilly, Mike Loukides

Publisher: O'Reilly Media

Released: October 2002

Pages: 1156

shell scripts

Sometimes we need to do the same thing on a set of files, like submitting a hundred jobs to a computer cluster using the same program (lets say for mapping FASTQ files) but with different input data (FASTQ files) and differently named outputs (BAM files). For doing it we need to first create 100 individual "program files" (== shell scripts), here called "map_script" i.e. using python (next section). But instead of submitting the jobs one by one we can do instead on a command line:

for file in map_script*. #press enter
do #press enter
qsub $file #press enter
done #press enter


Despite that there is a lot of bioinformatics software written in Perl (ENSEMBL, GBrowse, BioPerl), Python is a good choice as a first (and maybe even only) programming language you need for day to day bioinformatics tasks. It is easier to learn and read than Perl, but still fast enough to deal with i.e. FASTQ files.

There are two versions of Python, 2.7 and new Python 3 series which differ in subtle ways (Python 3 is an improvement). Because of the legacy software, it is still safer for time being to use the older version. One thing which distinguishes Python from other languages is that spaces/tabs/line ends are syntactic elements crucial for the language. In short, instead if using () to mark the beginning and the end of some function, Python uses indentations (tab or spaces, by default 4) to mark code blocks. With almost any programming editor, keeping track of these is painless.

  • multiple file mover/renamer
#!/usr/bin/env python


This is just and example how to rename a bunch of files using external mv
Renames all mapping_result_001.bam to mapping_result_001.bwa_cow_ENS73.bam
import glob, os

#in the pattern below the ??? stand for any 3 characters which must occur at that position in file name  
input_file_pattern = "mapping_result_???.bam"

#glob.glob below is a way to get back a list of files matching the pattern. 

for input_file_name in glob.glob(input_file_pattern):
    #here we remove the last 3 characters from the input_file_name, then we add the "bwa_cow_ENS73.bam" to it
    output_file_name = input_file_name[:-3] + "bwa_cow_ENS73.bam"
    #here we construct the whole command to be executed. Observe extra space needed between names if the input and output files
    command = "mv " + input_file_name + " " + output_file_name
    #this is an idiom using os library to execute command

  • Fasta header fixer
#!/usr/bin/env python
Comment: this script simplifies names of sequences in FASTA file limiting them to just first name
Usage: fasta_header_fixer.py some_fasta_file_I want_to_transform > some_fasta_file_I want_to_transform.fixed_file


import sys
input_fasta_fn = sys.argv[1]

#input_fasta_fn ="some_fasta_file_I want_to_transform"

for line in open(input_fasta_fn).readlines():
   if line[0] == ">":
       new_fasta_header = line.split()[0]
       print new_fasta_header
       print line[:-1]   
  • XXX ??? some mapper runner ??? XXX

computer clusters

Individual workstations or even multi CPU servers are not good enough for running hundred of CPU-demanding jobs. For that we have computer clusters, where we use tens or hundreds of computer nodes. To distribute loads on such clusters we use special software maintained by system admins, i.e. Sun Grid Engine (SGE). It takes care that not all jobs are run at once the moment they are submitted by the user, and it distributes the jobs among the nodes balancing the loads. Because individual cluster installations at your institutions are likely different from each other (they may use different software than SGE), this section is mostly for convincing you that whenever you have some cluster you can use, you should investigate how you can use it and run your next generation sequencing jobs on them. There are few concepts you need to know about typical cluster configuration and how it needs to be taken into account:

  • clusters are often partitioned into smaller entities, so for example one person from one group can not block the use of the cluster for the whole institution. These partitions are being called queues.
  • cluster jobs may have differer in their requirements. I.e genome assembly jobs often require large memory servers, and may take a week or more to complete. On the other end we can have tasks which are trivial to split into say 50 separate jobs by the user, and which will finish in minutes or few hours. So cluster admins often create separate queues for such different tasks, and may reserve small group on nodes for each research group, so the most urgent/or the longest running tasks can be run that way.
  • It is a good policy to require that all individual jobs executing threads/processes on several CPUs in parallel should be labeled as such and wait their turn until node with required number of CPUs becomes free.
  • same thing is valid for memory usage
  • often jobs without information about number of CPUs/RAM requirements will have set some default values (i.e. 1 CPU) by the admins

Each job run on the cluster creates two files, capturing the output of the job which normally is printed on the screen. One is for standard messages ("mapping reads from file XYZ") other for error messages ("file XYZ not found, terminating"). These are crucial for debugging. The naming convention is as follows:

#two jobs we submitted to run

#These will receive job numbers once they start running, ie:
12345 bwa_map_01.sh
12346 bwa_map_02.sh

#the standard output files will be names:

#the standard error files will be names:

#if we re-run these scripts, the job numbers will be different, so we can get:


Caveat: these files may be created in your home directory, not where the scripts and data are, when you forgot to specify -cwd during submitting your jobs

#submit a single job to a cluster to the mygroup_cluster_queue queue (with default values). 
#Change to the current working directory (one from which the script was submitted
qsub -q mygroup_cluster_queue -cwd some_script.sh 

#monitor the status of your jobs on the cluster

#the typical output will look like this XXXX get cluster qstat output example XXXX

#delete job with a number 23456 submitted to a cluster 
qdel 23456

Tip: it makes sense that when submitting different groups of jobs to a cluster you use different script names. If something goes wrong it will be easier to find jobs without a chance of running to completion, nd remove them.

Lets say we have 50 jobs bwa_map_001 to bwa_map_050 and another 30 jobs bowtie_map_001 to bowtie_map_030. We want to remove all our still running or submitted and waiting bowtie_map jobs, but do not touch bwa_map jobs. Here is the command:

qstat | grep myuser_name | grep bowtie_map | awk '{print $1}' | xargs qdel

Explanation of the steps:

  • qstat: we list our jobs running or waiting
  • grep myuser_name: just keep the lines listing the jobs, not qmon extra lines
  • grep bowtie_map : we keep lines containing that string (our jobs to delete). Keep in mind that qmon does not have space to display print the whole name, so we have to check first that such string is present in qmon listing
  • awk '{print $1}': we get just the first qmon column (job number)
  • xargs : special command telling the next command that the parameter will be passed from the pipeline
  • qdel : delete cluster job with a given job number


Format and quality checks

Already in the 90ties when all sequencing was being done using Sanger method, the big breakthrough in genome assembly was when individual bases in the reads (ACTG) were assigned some quality values. In short, some parts of sequences had multiple bases with a lower probability of being called right. So it makes sense that matches between high quality bases are given a higher score, be it during assembly or mapping that i.e. end of the reads with multiple doubtful / unreliable calls. This concept was borrowed by Next Generation Sequencing. While we can hardly read by eye the individual bases in some flowgrams, it is still possible for the Illumina/454/etc. software to calculate base qualities. The FASTQ format, (usually files have suffixes .fq or .fastq) contains 4 lines per sequence:

  1. sequence name (should be unique in the file)
  2. sequence string itself with ACTG and N
  3. extra line starting with "+" sign, which contained repeated sequence name in the past
  4. string of quality values (one letter/character per base) where each letter is translated in a number by the downstream programs

Here it is how it looks:

@SRR867768.249999 HWUSI-EAS1696_0025_FC:3:1:2892:17869/1

Unfortunately Solexa/Illumina did not follow the same quality encoding as people doing Sanger sequencing, so there are few iterations of the standard, with quality encodings containing different characters. For the inquisitive: http://en.wikipedia.org/wiki/FASTQ_format#Quality

What we need to remember from it, that we must know which quality encoding we have in our data, because this is an information required by mappers, and getting it wrong will make our mappings either impossible (some mappers may quit when encountering wrong quality value) or at best unreliable.

There are two main quality encodings: Sanger and Two other terms, offset 33 and offset 64 are also being used for describing quality encodings:

  • offset 33 == Sanger / Illumina 1.9
  • offset 64 == Illumina 1.3+ to Illumina 1.7

For that, if we do not have direct information from the sequencing facility which version of the Illumina software was used, we can still find it out if we investigate the FASTQ files themselves. Instead of going by eye, we use a program FastQC. For the best results/full report we need to use the whole FASTQ file as an input, but for quick and dirty quality encoding recognition using 100K of reads is enough:

head -400000 my_reads.fastq > 100K_head_my_reads.fastq 
fastqc 100K_head_my_reads.fastq
#we got here 100K_head_my_reads.fastq_fastqc/ directory

grep Encoding 100K_head_my_reads.fastq_fastqc/fastqc_data.txt

Encoding	Sanger / Illumina 1.9

Here is a bash script containing awk oneliner to detect quality encoding in both gzip-ed and not-compressed FASTQ files.


if [[ $file ]]; then
    if [[ $file =~ .*\.gz ]];then
    command="$command $file | "

command="${command}awk 'BEGIN{for(i=1;i<=256;i++){ord[sprintf(\"%c\",i)]=i;}}NR%4==0{split(\$0,a,\"\");for(i=1;i<=length(a);i++){if(ord[a[i]]<59){print \"Offset 33\";exit 0}if(ord[a[i]]>74){print \"Offs
et 64\";exit 0}}}'"

eval $command

Types of data

  • read length

from 35bp in some old Illumina reads to 250+ in MiSeq. The current sweet spot is between 70-120bp.

  • single vs paired

Just one side of the insert sequenced or sequencing is done from both ends. Single ones are cheaper and faster to produce, but paired reads allow for more accurate mapping, detection of large insertions/deletions in the genome.

Most of the time forward and reverse reads facing each other end-to-end are

  • insert length

With the standard protocol, the inserts are anywhere between 200-500bp. Sometimes especially for de novo sequencing, insert sizes can be smaller (160-180bp) with 100bp long reads allowing for overlap between ends of the reads. This can improve the genome assembly (i.e. when using Allpaths-LG assembler requiring such reads). Also with some mappers (LAST) using longer reads used to give better mappings (covering regions not unique enough for shorter reads) than 2x single end mapping. With paired end mappings the effects are modest.

Program for combining overlapping reads: FLASH: http://ccb.jhu.edu/software/FLASH/

For improving the assembly or improving the detection of larger genome rearrangements there are other libraries with various insert sizes, such as 2.5-3kb or 5kb and more. Often sequencing yields from such libs are lower than from the conventional ones.

  • stranded vs unstranded (RNASeq only)

We can obtain reads just from a given strand using special Illumina wet lab kits. This is of a great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.

quality checking (FastQC)

It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze average quality, over-represented sequences, number of Ns along the read and many other parameters. The program to use is FastQC, and it can be run in command line or GUI mode.

  • good quality report:


  • bad quality FastQC report


trimming & filtering

Depending on the application, we can try to improve the quality of our data set by removing bad quality reads, clipping the last few problematic bases, or search for sequencing artifacts, as Illumina adapters. All this makes much sense for de novo sequencing, were genome assemblies can be improved by data clean up. It has a low priority for mapping, especially when we have high coverage. Bad quality reads etc. will simply be discarded by the mapper.

  • Tagdust

Tagdust is a program for removing Illumina adapter sequences from the reads containing them. Such reads containing 6-8 bases not from genome will be impossible to map using typical mappers having often just 2 mismatch base limit. Tagdust works in an unpaired mode, so when using paired reads we have to "mix and match" two outputs to allow for paired mappings.

tagdust -o my_reads.clean.out.fq -a my_reads.artifact.out.fq adapters.fasta my_reads.input.fq

source of published FASTQ data: Short Read Archive vs ENA

While we will often have our data sequenced in house/provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely we can do something new and useful with already published data, even if treating it as a control to our pipeline. Also doing exactly the same thing, say assembling genes from RNASeq data but with a newer versions of the software and or more data will likely improve on the results of previous studies. There are two main places to get such data sets:

  • European Nucleotide Archive


Which one to use? ENA is easier as you get gzipped fastq files not SRA archives requiring extra processing, sometimes painful (at one stage the funding for SRA programs was cut). But NCBI tools may have better interface at times, so you can search for interesting data set at NCBI, then store the names of experiments and download fastq.gz from ENA.

Genomic fasta and gtf/gff gene annotation

ENSEMBL together with UCSC are two "clearing houses" when it comes to genome assembly and annotation. For mot of the genomes at least the genome versions (== sequences) are identical, but from time to time we can encounter a switch, like with bovine genome (UMD 3.1 @ENSEMBL and Baylor Btau 4.6.1 @UCSC). While it can be that you will have several releases of ENSEMBL with unchanged genome and annotation, it is still good idea to always include in the protocol / directories & file names the version used for the mappings/analyzes.

Side note

Since we are EU project, we tend to use ENSEMBL for the vast majority of the analyzes. There are some tools / data which are easier to use at UCSC, i.e. liftOver utility, for "mapping" genomic intervals from one mammalian genome (usually human or mouse) to another one (like cow). This is useful when we got large set of human genomic features (say ChIPSeq peaks) and want to find corresponding regions in cow. Check: http://genome.ucsc.edu/cgi-bin/hgLiftOver


For getting the genome and gene annotations:

1. go to http://www.ensembl.org/info/data/ftp/index.html
2. put in white on blue bar search window (default text "Filter") i.e. "chicken". 
# you will get just one row with multiple entries, two of our interest at the moment:

* Gene sets (GTF)

3. Click on DNA (FASTA)
# in the directory:

we are interested in: 

Do not click on it as doing it 50x on slow WiFi will kill the connection.

4. go back in the browser to http://www.ensembl.org/info/data/ftp/index.html (with just chicken row)
5. click on GTF
# you get to ftp://ftp.ensembl.org/pub/release-73/gtf/gallus_gallus
file to download (again NOT NOW!) is: Gallus_gallus.Galgal4.73.gtf.gz

Keep in mind that there are several species which were not included for various reasons in the main ENSEMBL (currently 73) release, but are accessible in preENSEMBL:


This time not every species has downloadable genome, there are just few available here: ftp://ftp.ensembl.org/pub/pre/fasta/dna/

As for GTFs it is even more unordered set, with all annotations laying around: ftp://ftp.ensembl.org/pub/pre/gtf/

Still, in times of need this is probably better than no genome & no annotation.

grepping fasta headers

This was covered in the grep session, so just a quick reminder:

#count number of sequences in multiple fasta
grep -c "^>" multi_fasta.fa

#list sequence names in fasta, display screen by screen with "q" to escape:
grep "^>" | less 

fasta reformat from exonerate

Most likely well written program for short read mapping should cope with long fasta headers:

>GI12345|QW76543 bovine contig from Angus containing genes X, Y and Z

or sequences with one chromosome per line. Still, it is in general a good idea to simplify and reformat multiple fasta genomic draft sequences to avoid potential problems problems down the road. For fixing FASTA headers (assuming that the first part of the header is unique), use Python script from the above Python section.

For reformatting sequences so that all sequences will have 70 columns:

fastareformat input.fa > output.fa

GTF / GFF annotation formats

For storing information about multiple, often genomic features the most comonly used format are few versions of GFF, in particular GTF. The GTF (General Transfer Format) is identical to GFF version 2.

Section adopted from: http://www.ensembl.org/info/website/upload/gff.html

GTF Fields:

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
Sample GFF output from Ensembl export:

===Example ===

X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.	
X	Ensembl	Variation	2413805	2413805	.	+	.

Mapping genomic reads

overview of mappers

There are tens of different mappers for aligning millions of short reads to genomes. There are huge differences in speed, memory usage, handling of errors in the read. We will take a look at just few, most commonly used by various pipelines. For the time being we will treat these programs as black boxes, not going into algorithms used inside.

We need reference genome (FASTA) from the same species to which we will map our reads, and short read data (Illumina FASTQ). Before mapping we need to do some basic checks.

checking the genome

If you get the genome for canonical species from ENSEMBL (cow, chicken, pig) you can trust that it contains proper number of chromosomes and mitochondrial sequences. This is not the case with draft genomes available from individual research groups. It is always a good idea to look at number and names of individual contigs (see grep section), and check that i.e sequences are present. In case of genomes lacking certain sequences we have bigger chances to map reads to wrong locations, which in turn will give us wrong coverage, SNPs or expression in case of RNA-Seq.

checking reads

See FastQC above and Tagdust.

basic mapping steps

  • indexing

Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.

  • mapping

This is often the longest step, with options specific for each mapper

  • postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.


BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

  • Illumina reads up to 100bp: bwa-backtrack (the legacy bwa)
  • sequences from 70bp up to 1Mbp:

There are two algorithms for these: BWA-SW (Smith Waterman) and BWA-MEM(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))

Please note that BWA-SW requires different algorithm for indexing the genome. The default indexing algorithm is called IS.

#creating genome index
bwa index ref.fa

#mapping single end reads using MEM algorithm
bwa mem -p ref.bwa_is ref.fa reads.fq > short_reads.bwa_mem.sam

#mapping paired end reads using MEM algorithm
bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single and reads
bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai
bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads
bwa aln ref.bwa_is  short_read_1.fq > short_read_1.bwa_aln.sai
bwa aln ref.bwa_is  short_read_2.fq > short_read_2.bwa_aln.sai
bwa sampe ref.bwa_is  short_read_1.bwa_aln.sai  short_read_2.bwa_aln.sai   short_read_1.fq   short_read_2.fq >  short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm
bwa index -p ref.bwa_sw -a bwtsw ref.fa
bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.

To create usable BAM files we can process SAM files using Picard's SortSam

java -jar /path/to/SortSam.jar I=reads_vs_reference.bwa.unsorted.sam O=reads_vs_reference.bwa.sorted.sam SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true


This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai
samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches
lastdb -m1111110  ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info 
lastal -Q1 -e120 ref_genome.lastdb  input_reads.fastq  | last-map-probs.py -s150 > input_reads_vs_ ref_genome.last.maf

#convert from MAF to SAM format
maf-convert.py sam  input_reads_vs_ ref_genome.last.maf >  input_reads_vs_ ref_genome.last.sam 

#convert SAM to BAM inserting header
samtools view -but ref_genome.fa.fai  input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ ref_genome.last.unsorted.bam 

#sort BAM
samtools sort input_reads_vs_ ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted

#create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai)
samtools index input_reads_vs_ ref_genome.last.sorted.bam 


#creating index, note that fires goes fasta file, index name prefix as second
bowtie2-build ref_gen.fa ref_gen.bowtie2

#map reads in unpaired mode, output as SAM
 bowtie2 -x ref_gen.bowtie2 ggal_test_1.fq -S ggal_test_1_vs_ref_gen.bowtie2.sam 

#map paired reads as 
 bowtie2 -x ref_gen.bowtie2 -1 ggal_test_1.fq -2 ggal_test_2.fq -S ggal_test_12_vs_ref_gen.bowtie2.sam 

#convert SAM 2 BAM, sort the reads and create BAM index in one step
 java -jar ~/soft/picard-tools-1.101/SortSam.jar I=ggal_test_1_vs_ref_gen.bowtie2.sam O=ggal_test_1_vs_ref_gen.bowtie2.bam CREATE_INDEX=true SO=coordinate

Stampy with bwa

Stampy is a quite slow but at times more accurate mapper, allowing for improvement over simple BWA mappings. The basic usage is as follows:

#creating two special index files 
stampy.py --species=chicken --assembly=ens73_toy -G ens73_toygenome ref_gen.fa
stampy.py -g ens73_toygenome -H ens73_toy   

#remapping reads already mapped with BWA (prefered option)
stampy.py -g ens73_toygenome -h ens73_toy -t2 --bamkeepgoodreads -M ggal_test_1_vs_ref_gen.bwa_aln.bam  > ggal_test_1_vs_ref_gen.stampy.sam

Winterschool program day 2

SAM and BAM file formats

The SAM file format serves to store information about result of mapping of reads to the genome. It starts with a header, describing the format version, sorting order (SO) of the reads, genomic sequences to which the reads were mapped. The minimal version looks like this:

@HD	VN:1.0	SO:unsorted
@SQ	SN:1	LN:171001
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0

It can contain both mapped and unmapped reads (we are mostly interested in mapped ones). Here is the example:

SRR197978.9007654       177     1       189     0       50M     12      19732327        0       CAGATTTCAGTAGTCTAAACAAAAACGTACTCACTACACGAGAACGACAG      5936A><<4=<=5=;=;?@<?BA@4A@B<AAB9BB;??B?=;<B@A@BCB      XT:A:R  NM:i:3  SM:i:0  AM:i:0  X0:i:58 XM:i:3  XO:i:0  XG:i:0  MD:Z:0A0G0T47
SRR197978.9474929       69      1       238     0       *       =       238     0       GAGAAAAGGCTGCTGTACCGCACATCCGCCTTGGCCGTACAGCAGAGAAC      B9B@BBB;@@;::@@>@<@5&+2;702?;?.3:6=9A5-124=4677:7+
SRR197978.9474929       137     1       238     0       50M     =       238     0       GTTAGCTGCCTCCACGGTGGCGTGGCCGCTGCTGATGGCCTTGAACCGGC      B;B9=?>AA;?==;?>;(2A;=/=<1357,91760.:4041=;(6535;%      XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:62 XM:i:0  XO:i:0  XG:i:0  MD:Z:50

In short, it is a complex format, where in each line we have detailed information about the mapped read, its quality mapped position(s), strand, etc. The exact description of it takes (with BAM and examples) 15 pages: http://samtools.sourceforge.net/SAMv1.pdf There are multiple tools to process and extract information from SAM and its compressed form, BAM files, so it is better to learn how to use them than decipher it and access with often slow scripts.

Analyzing BAM files

sorting / indexing

During the mapping reads are mapped on first in first out, so the output is ordered by order of reads not the position along the chromosomes. Searching such BAM files is very inefficient, and the files themselves are unnecessary large. So every BAM file to be used downstream needs to be first sorted, and then indexed. BAM index (.bai suffix) is just for external programs to access a given part of the genome even more effectively. For sorting and indexing we will use samtools:

samtools sort unsorted_input.bam sorted_output #notice no bam suffix!!!
samtools index sorted_output.bam #creates sorted_output.bam.bai  
  1. viewing the mappings in IGV

IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome anmotation or even genom-2-genome alignment (not a typical usage). The order of steps:

  1. start IGV (it needs to be started specifying the amount of RAM being used by the program). This depends in the coverage, number of BAM files oppened at the same time. In short, more RAM assigned, faster scrolling.
  2. select exact the same version of te genome with contigs named also the same way as in your BAM files
  3. open BAM files (need to be sorted and indexed), plus any annotation you may need.

tools for processing BAM files




bamtools filter -in SRR197978_12_vs_GgENS73.bwa_aln.sorted.bam -out 1_48850000_49020000_reads.bam -region "1:48850000..49020000"

getting mapping stats

  1. extracting reads mapping to regions
  2. getting coverage info for selected regions

Detecting SNPs

  1. general procedure

The detection of SNPs depends on mapping, which in turn depends on the input read length, quality and coverage. Very short reads (i.e. 36bp) are hard to map to many regions.

GATK pipeline

Genome Analysis Toolkit (GATK) from Broad is the de facto standard for detecting Single Nucleotide Polymorphisms (SNPs). There are very good and extensive manuals available on their site: http://www.broadinstitute.org/gatk/index.php

This is the step by step procedure to follow their best practice.

It is essential that we do have group info included in our BAM files (we assume that these have been already sorted and indexed). If we have not done it during the mapping with bwa, we can still fix it easily with AddOrReplaceReadGroups from picard:

#mysample in the read group info should be replaced by some mnemonic describing the #experiment/sample. Shortened file name, or SRA file prefix, like SRR197978 are a good choices.

java -jar ~/soft/picard-tools-1.101/AddOrReplaceReadGroups.jar \
I=chicken_genomic_12_vs_refgen.bwa_mem.bam \
O=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam \
RGLB=mysampleLB RGPL=Illumina RGPU=mysamplePU RGSM=mysampleSM \

At this stage we have mapped reads with group info as BAM. The next step is to mark duplicate reads (~PCR artifacts) in this file. We can almost always use CREATE_INDEX=true, so we do not need to run extra indexing when using some picard utilities

java -jar ~/soft/picard-tools-1.101/MarkDuplicates.jar I=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam O=chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam METRICS_FILE=metrics.file CREATE_INDEX=true

After getting marked duplicated reads, the next step is to realign read around indels. This is being done in two steps. Also at this stage it becomes more and more cumbersome to execute these steps as commands on the command line. The solution is to cut and past them into scipt files, then change the script permission and execute them instead.

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T RealignerTargetCreator \ 
-R ref_gen.fa \ 
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \ 
-L 20 \ 
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list 

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T IndelRealigner \ 
-R ref_gen.fa \ 
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \ 
-targetIntervals chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list \ 
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam

The recommended Best Practices step here is to run base recalibration, meaning that the base quality is being reestimated after taking into account mapping results. It adds several steps, and while it may be worthwhile, Illumina got better at estimating base qualities of it reads, so the results may not justify the extra complexity.

Another optional (by Best Practices) step is to reduce the complecity of the BAM. Since it is not necessary, we will skip it this time, but it is recommended to run it when dealing with multiple / large data sets.


java -Xmx240G -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper \ 
-R ref_gen.fa \
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam.gatk.vcf

Working with VCF files

  1. VCF file format
  2. viewing VCFs in IGV
  3. filtering SNPs by quality
  4. set operations on VCF files (common SNPs, unique SNPs)



web site: http://tophat.cbcb.umd.edu/

Version: 2.0.10 release 11/13/2013

It is the most popular program for mapping RNASeq reads to genomes. Requires that you have installed and available on your PATH following programs:

  • bowtie2 and bowtie2-align
  • bowtie2-inspect
  • bowtie2-build
  • samtools
#run tophat2 (with bowtie2) using also gene annotation:
tophat2  -G Mus_musculus.GRCm38.70.gtf --output-dir=myreads_12_vs_Mm.GRCm38.70.tophat_out MmENS70.bow2 myreads_1.fastq myreads_2.fastq

creating gene models from RNASeq (cufflinks)

web site: http://cufflinks.cbcb.umd.edu/

Version: 2.1.1

cufflinks -p 8 --multi-read-correct  --min-intron-length 21 --max-intron-length 200000 --output-dir myreads
 --frag-bias-correct ref_genome.fa -g  ref_genome.gtf  myreads_12_vs_ref

Once we mapped our RNASeq reads to the genome we can try obtain gene models derived from these mappings. The cufflinks overpredict gene models, but it is still better than other programs using mapping to genomic reference. Keep in mind that cufflinks uses its own expression metrics, and that these numbers are not compatible with R based statistical packages. Moreover the cufflinks cuffdiff module for predicting differentially expressed genes has not a good overlap with leading R packages (DESeq and EdgeR).

Additional info

Software list


  1. linux Ubuntu 12.04.3 vs Debian 7.1 (think about 32 vs 64 bit versions)
  2. java http://www.java.com/en/download/linux_manual.jsp?locale=en

Specific tools 1

  1. FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  1. TagDust: http://genome.gsc.riken.jp/osc/english/software/src/tagdust.tgz
  2. fastareformat from fastareformat exonerate-2.2.0 [1]
  3. fixing fasta headers (gff fields) with python? small script
  4. GEM [2]
    1. CAVEAT: (problem with cores on different laptops...)


  1. BWA http://sourceforge.net/projects/bio-bwa/files/
  2. Stampy http://www.well.ox.ac.uk/~gerton/software/Stampy/stampy-1.0.22r1848.tgz
  3. last http://last.cbrc.jp/ (the 362 versiona has split and splice-mapping options)
  4. bowtie http://bowtie-bio.sourceforge.net/bowtie2/index.shtml (bowtie2)
  5. samtools http://sourceforge.net/projects/samtools/files/
  6. picard http://sourceforge.net/projects/picard/files/
  7. IGV/ IGVtools http://www.broadinstitute.org/software/igv/download
  1. bamtools https://github.com/pezmaster31/bamtools
    1. requires cmake: http://www.cmake.org/files/v2.8/cmake-2.8.12.tar.gz (or apt get)
  2. bedtools http://code.google.com/p/bedtools/downloads/list
  3. GATK http://www.broadinstitute.org/gatk/auth?package=GATK (download yourself: license!)
  4. vcftools http://sourceforge.net/projects/vcftools/files/

Specific tools 2/RNA-Seq

  1. tophat http://tophat.cbcb.umd.edu/
  2. cufflinks http://cufflinks.cbcb.umd.edu/ (may require Boost libs!)
  3. GEMtools https://github.com/gemtools/gemtools

Vagrant fixes

For X11 forwarding the Vagrantfile has to contain

config.ssh.forward_x11 = true