Wikiomics:WinterSchool day2

From OpenWetWare
Jump to navigationJump to search

Winterschool program day 2

Creating a tiny genome for exercises

Because working over the network and inside vagrant instances on laptops with the whole mammalian or even avian genomes would be too slow, I created a tiny "genome" using chicken ENSEMBL 73 genomic sequence and annotations (gtf and vcf files downloadable from ENSEMBL).

The sequence was extracted using [pyfasta](not working inside our network at WinterSchool). The region of interest was: 1:48850000-49020000

You can also do it easily for a single region using ENSEMBL.

1. do to
2. from species pulldown menu select chicken
3. search for 1:48850000-49020000
#you should get:
4. from th Manu on the left: Export data
#new small window pops up
5. click Next button
#you get: Export Configuration - Output Format window
6. select compressed text
7. download ensembl.txt.gz
8. gunzip  ensembl.txt.gz
9. mv  ensembl.txt to ref_gen.fa


#bed file with just one line used for bedtools intersect
1	48850000	     49020000

Getting the FASTQ files containing reads mapping to our genomic fragment of choice

#extract portion of the BAM mapping to a specific genomic interval 
bamtools filter -in SRR197978_12_vs_GgENS73.bwa_aln.sorted.bam -out 1_48850000_49020000_reads.bam -region "1:48850000..49020000"

get the BamToFastq command here XXX

Getting the annotation:

#to construct annotation for our exercise data set we used (note subtraction for positions
bedtools intersect -a Gallus_gallus.Galgal4.72.gtf -b get_region_annot.bed | awk '{print $1"\t"$2"\t"$3"\t"$4-48850000"\t"$5-48850000"\t"$6"\t"$7"\t"$8"\t"$9" "$10" "$11" "$12" "$13" "$14" "$15" "$16}' > exercise_annotation.gtf

#to obtain SNPs present in our exercise data set
bedtools intersect -a Gallus_gallus.vcf -b get_region_annot.bed | awk '{print $1"\t"$2-48850000"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}' > exercise_annotation.vcf

Mapping genomic reads

overview of mappers

There are tens of different mappers for aligning millions of short reads to genomes. There are huge differences in speed, memory usage, handling of errors in the read. We will take a look at just few, most commonly used by various pipelines. For the time being we will treat these programs as black boxes, not going into algorithms used inside.

We need reference genome (FASTA) from the same species to which we will map our reads, and short read data (Illumina FASTQ). Before mapping we need to do some basic checks.

checking the genome

If you get the genome for canonical species from ENSEMBL (cow, chicken, pig) you can trust that it contains proper number of chromosomes and mitochondrial sequences. This is not the case with draft genomes available from individual research groups. It is always a good idea to look at number and names of individual contigs (see grep section), and check that i.e sequences are present. In case of genomes lacking certain sequences we have bigger chances to map reads to wrong locations, which in turn will give us wrong coverage, SNPs or expression in case of RNA-Seq.

checking reads

See FastQC above and Tagdust.

basic mapping steps

  • indexing

Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.

  • mapping

This is often the longest step, with options specific for each mapper

  • postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.


BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

  • Illumina reads up to 100bp: bwa-backtrack (the legacy bwa)
  • sequences from 70bp up to 1Mbp:

There are two algorithms for these: BWA-SW (Smith Waterman) and BWA-MEM(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))

Please note that BWA-SW requires different algorithm for indexing the genome. The default indexing algorithm is called IS.

#creating genome index
bwa index -p ref.bwa_is  ref.fa

#mapping single end reads using MEM algorithm
bwa mem ref.bwa_is  reads.fq > reads.bwa_mem.sam

#mapping paired end reads using MEM algorithm
bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single and reads
bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai
bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads
bwa aln ref.bwa_is  short_read_1.fq > short_read_1.bwa_aln.sai
bwa aln ref.bwa_is  short_read_2.fq > short_read_2.bwa_aln.sai
bwa sampe ref.bwa_is  short_read_1.bwa_aln.sai  short_read_2.bwa_aln.sai   short_read_1.fq   short_read_2.fq >  short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm
bwa index -p ref.bwa_sw -a bwtsw ref.fa
bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.

To create usable BAM files we can process SAM files using Picard's SortSam

java -jar /path/to/SortSam.jar I=reads_vs_reference.bwa.unsorted.sam O=reads_vs_reference.bwa.sorted.bam SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

For subsequent processing the mapping files with GATK (SNP calling) it is easier to introduce necessary information at the mapping stage, than run an extra step using picard. What is required by GATK is so called reads group info. We will cover it later, but at this stage is good to know that bwa can be run with extra parameters saving us one extra step.

#below is the example read group info needed to be passed to bwa on the command line: 

#here is the mapping step where in the place of string in <> we put group info from above.
#different samples should have different group info, like this:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' ref_gen.bwa_is chicken_genomic_short_1.fq chicken_genomic_short_2.fq > chicken_genomic_12_vs_refgen.bwa_mem.rg.sam 


web site:

current version: 475 (Sep 2014)

This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai
samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches
lastdb -m1111110  ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the for more info 
lastal -Q1 -e120 ref_genome.lastdb  input_reads.fastq  | -s150 > input_reads_vs_ref_genome.last.maf

#convert from MAF to SAM format sam  input_reads_vs_ref_genome.last.maf >  input_reads_vs_ref_genome.last.sam 

#convert SAM to BAM inserting header
samtools view -but ref_genome.fa.fai  input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ref_genome.last.unsorted.bam 

#sort BAM
samtools sort input_reads_vs_ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted

#create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai)
samtools index input_reads_vs_ref_genome.last.sorted.bam 


#creating index, note that fires goes fasta file, index name prefix as second
bowtie2-build ref_gen.fa ref_gen.bowtie2

#map reads in unpaired mode, output as SAM
 bowtie2 -x ref_gen.bowtie2 ggal_test_1.fq -S ggal_test_1_vs_ref_gen.bowtie2.sam 

#map paired reads as 
 bowtie2 -x ref_gen.bowtie2 -1 ggal_test_1.fq -2 ggal_test_2.fq -S ggal_test_12_vs_ref_gen.bowtie2.sam 

#convert SAM 2 BAM, sort the reads and create BAM index in one step
 java -jar ~/soft/picard-tools-1.101/SortSam.jar I=ggal_test_1_vs_ref_gen.bowtie2.sam O=ggal_test_1_vs_ref_gen.bowtie2.bam CREATE_INDEX=true SO=coordinate

Stampy with bwa

Stampy is a quite slow but at times more accurate mapper, allowing for improvement over simple BWA mappings. The basic usage is as follows:

#creating two special index files --species=chicken --assembly=ens73_toy -G ens73_toygenome ref_gen.fa -g ens73_toygenome -H ens73_toy   

#remapping reads already mapped with BWA (prefered option) -g ens73_toygenome -h ens73_toy -t2 --bamkeepgoodreads -M ggal_test_1_vs_ref_gen.bwa_aln.bam  > ggal_test_1_vs_ref_gen.stampy.sam


This is a fast and accurate mapper developed at CRG.

#creating index
gem-indexer -i ref_gen.fa -o ref_gen.gem

#mapping single end reads. ggal_test_12_vs_ref_gen.gem will be creared
gem-mapper  -I ref_gen.gem.gem -i ggal_test_1.fq -q offset-33 -o ggal_test_1_vs_ref_gen.gem 

#mapping in a paired mode, will be created
gem-mapper  -I ref_gen.gem.gem -1 ggal_test_1.fq -2 ggal_test_2.fq -q offset-33 -o ggal_test_12_vs_ref_gen.gem

#converting from GEM map format 2 SAM:
gem-2-sam -I ref_gen.gem.gem -i -o ggal_test_1_vs_ref_gen.gem.sam -q offset-33 --expect-single-end-reads

SAM and BAM file formats

The SAM file format serves to store information about result of mapping of reads to the genome. It starts with a header, describing the format version, sorting order (SO) of the reads, genomic sequences to which the reads were mapped. The minimal version looks like this:

@HD	VN:1.0	SO:unsorted
@SQ	SN:1	LN:171001
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0

It can contain both mapped and unmapped reads (we are mostly interested in mapped ones). Here is the example:

SRR197978.9007654       177     1       189     0       50M     12      19732327        0       CAGATTTCAGTAGTCTAAACAAAAACGTACTCACTACACGAGAACGACAG      5936A><<4=<=5=;=;?@<?BA@4A@B<AAB9BB;??B?=;<B@A@BCB      XT:A:R  NM:i:3  SM:i:0  AM:i:0  X0:i:58 XM:i:3  XO:i:0  XG:i:0  MD:Z:0A0G0T47
SRR197978.9474929       69      1       238     0       *       =       238     0       GAGAAAAGGCTGCTGTACCGCACATCCGCCTTGGCCGTACAGCAGAGAAC      B9B@BBB;@@;::@@>@<@5&+2;702?;?.3:6=9A5-124=4677:7+
SRR197978.9474929       137     1       238     0       50M     =       238     0       GTTAGCTGCCTCCACGGTGGCGTGGCCGCTGCTGATGGCCTTGAACCGGC      B;B9=?>AA;?==;?>;(2A;=/=<1357,91760.:4041=;(6535;%      XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:62 XM:i:0  XO:i:0  XG:i:0  MD:Z:50

In short, it is a complex format, where in each line we have detailed information about the mapped read, its quality mapped position(s), strand, etc. The exact description of it takes (with BAM and examples) 15 pages: There are multiple tools to process and extract information from SAM and its compressed form, BAM files, so it is better to learn how to use them than decipher it and access with often slow scripts.

Analyzing BAM files

sorting / indexing

During the mapping reads are mapped on first in first out, so the output is ordered by order of reads not the position along the chromosomes. Searching such BAM files is very inefficient, and the files themselves are unnecessary large. So every BAM file to be used downstream needs to be first sorted, and then indexed. BAM index (.bai suffix) is just for external programs to access a given part of the genome even more effectively. For sorting and indexing we will use samtools:

samtools sort unsorted_input.bam sorted_output  #notice no bam suffix!!!
samtools index sorted_output.bam #creates sorted_output.bam.bai  

simple stats

We can easily calculate the number of reads mapped to each of the chromosomes/contigs using samtools:

samtools idxstats chicken_genomic_12_vs_refgen.bwa_mem.bam > chicken_genomic_12_vs_refgen.bwa_mem.bam.idxstats

The output looks like this:

1	171001	1827 	1
*	0	0	0

Where here we have just one contig named "1", 171001bp long, to which we go 1827 reads mapped. To calculate the total number of reads, we can use this awk line:

awk '{ sum+=$3} END {print sum}'

viewing the mappings in IGV

IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genome-2-genome alignment (not a typical usage). The order of steps:

  • start IGV (it needs to be started specifying the amount of RAM being used by the program). This depends in the coverage, number of BAM files opened at the same time. In short, more RAM assigned, faster scrolling.
  • select exact the same version of the genome with contigs named also the same way as in your BAM files
  • open BAM files (need to be sorted and indexed), plus any annotation you may need.

To work with IGV:

#assuming that is on the PATH like in vagrant:

We have a large number of genomes available through IGV pull down menus, but we may need to create our own genome for viewing (top menu): Genomes > Create .genome file

We need to have FASTA genome reference file and its index (ref_gen.fa, ref_gen.fa.fai). The later one (FASTA index) we create as follows:

samtools faidx ref_gen.fa 

In the IGV new genome creation menu we have to put unique names for our genome, then FASTA , and if we have it, also genome annotation file.

So now we will be ready to load some BAM files mapped to our ref_gen sequence.

There are multiple options to display the reads. The important thing to notice are mismatches in reads, the coverage track, and paired display.

GATK pipeline

Genome Analysis Toolkit (GATK) from Broad is the de facto standard for detecting Single Nucleotide Polymorphisms (SNPs). There are very good and extensive manuals available on their site:

This is the step by step procedure to follow their best practice.

Caveat: GATK requires that you have more than one sequence in your reference genome. If not, it reports a strange error about wrong IUPAC (sequence character).

#Create index and dictionary for reference
samtools faidx ref_gen.fa
java -jar ~/soft/picard-tools-1.101/CreateSequenceDictionary.jar R=ref_gen.fa O=ref_gen.dict

It is essential that we do have group info included in our BAM files (we assume that these have been already sorted and indexed). If we have not done it during the mapping with bwa, we can still fix it easily with AddOrReplaceReadGroups from picard:

#mysample in the read group info should be replaced by some mnemonic describing the #experiment/sample. Shortened file name, or SRA file prefix, like SRR197978 are a good choices.

java -jar ~/soft/picard-tools-1.101/AddOrReplaceReadGroups.jar \
I=chicken_genomic_12_vs_refgen.bwa_mem.bam \
O=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam \
RGLB=mysampleLB RGPL=Illumina RGPU=mysamplePU RGSM=mysampleSM \

At this stage we have mapped reads with group info as BAM. The next step is to mark duplicate reads (~PCR artifacts) in this file. We can almost always use CREATE_INDEX=true, so we do not need to run extra indexing when using some picard utilities

java -jar ~/soft/picard-tools-1.101/MarkDuplicates.jar I=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam O=chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam METRICS_FILE=metrics.file CREATE_INDEX=true

After getting marked duplicated reads, the next step is to realign read around indels. This is being done in two steps. Also at this stage it becomes more and more cumbersome to execute these steps as commands on the command line. The solution is to cut and past them into scipt files, then change the script permission and execute them instead.

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T RealignerTargetCreator \ 
-R ref_gen.fa \ 
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \ 
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list 

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T IndelRealigner \ 
-R ref_gen.fa \ 
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \ 
-targetIntervals chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list \ 
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam

The recommended Best Practices step here is to run base recalibration, meaning that the base quality is being reestimated after taking into account mapping results. It adds several steps, and while it may be worthwhile, Illumina got better at estimating base qualities of it reads, so the results may not justify the extra complexity.

Another optional (by Best Practices) step is to reduce the complecity of the BAM. Since it is not necessary, we will skip it this time, but it is recommended to run it when dealing with multiple / large data sets.


java -Xmx240G -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ref_gen.fa -I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam -o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam.gatk.vcf


Analysis Workflows

An example of a standard workflow we use for RNAseq analyses can be seen [here].

Common analyses that can be performed on RNAseq data are:

  • expression quantitative traits loci (eQTL)
  • splicing quantitative traits loci (sQTL)
  • differantial gene/isoform expression

Input files

To perform RNAseq analysis we need:

  • reference genome sequence
  • reference gene annotation
  • the data

Important note: Please make sure the contig names for your reference genome and annotation correspond.


Specific variables to consider when mapping RNAseq:

  • intron size
  • overhang (number of bases from each side of the junction that should be covered by a certain read)
  • splice site consensus
  • donor/acceptor splice site consensus sequences
  • junction “filtering”:
    • chromosome/strand
    • block order
    • min/max distance


The [GEM mapper] is a mapping program for next generation sequencing developed in collaboration between CRG and CNAG in Barcelona. Many high-performance standalone programs (splice mapper, concersion tool, etc.) are provided along with the mapper; in general, new algorithms and tools can be easily implemented on the top of these.

[Gemtools] is a powerful set of high-level pipelines which greatly simplifies the use of the GEM mapper. Using gemtools one can index references and/or map several kinds of data from a simple command-line interface, without having to type complicated commands. In particular, gemtools contains a fast and accurate pipeline for mapping RNA-sequencing data.

The default gemtools RNAseq pipeline is shown [here].

Element quantification

  • exon
  • intron/all-intronic region
  • splice junction
  • transcript
  • gene

Transcript expression quantification

Transcript quantification is a complex problem. Quantifying the expression of a gene is simple. We just need to count the RNA-seq reads that fall within the exons of this gene. However, to quantify expression of a transcript we can have reads mapping to an exon of the gene where multiple transcripts overlap. The process of assigning a read to a certain transcript is called read deconvolution or isoform expression quantification.

For transcript expression we use [Flux Capacitor] developed at the CRG in Barcelona.

Running the pipeline

In this section the commands for two components of our RNAseq standard pipeline are explained.

The gemtools mapping pipeline consists of three main steps. First the reference genome needs to be indexed. The then genome index is used with the reference annotation to compute a transcriptome and index it. Last step is the actual run of the mapping pipeline.

# indexing the genome
gemtools index -i genome.fa

# computing the transcriptome and indexing
gemtools t-index -i genome.gem -a annotation.gtf -m 110

# running the mapping pipeline
gemtools rna-pipeline -f sample.fastq.gz -q 33 -i genome.gem -a annotation.gtf -m 110

To run the Flux Capacitor a bam file sorted by genomic position and indexed is needed. The output bam file coming from a gemtools pipeline a run is already sorted and indexed.

In case you need to sort and index another bam file see Analyzing BAM files.

You can then run the transcript quantifications in the following way:

# running Flux Capacitor
flux-capacitor -i sample.bam -a annotation.gtf -o sample.gtf



web site:

Version: 2.0.10 release 11/13/2013

It is the most popular program for mapping RNASeq reads to genomes. Requires that you have installed and available on your PATH following programs:

  • bowtie2 and bowtie2-align
  • bowtie2-inspect
  • bowtie2-build
  • samtools
#run tophat2 (with bowtie2) using also gene annotation:
tophat2  -G Mus_musculus.GRCm38.70.gtf --output-dir=myreads_12_vs_Mm.GRCm38.70.tophat_out MmENS70.bow2 myreads_1.fastq myreads_2.fastq


Web site:

Versions: 2013-10-28

This is another RNA mapper with long history of updates and (with GMAP) used by some gene prediction pipelines (PASA, Trinity). The GMAP program is still being used for ESTs mapping.

#build the database for both GMAP and GSNAP

gmap_build -d genomeDBid ref_gen.fa

#there are 3 Makefiles created which then are used to build and install DB

    make -f Makefile.genomeDBid coords
    make -f Makefile.genomeDBid gmapdb
    make -f Makefile.genomeDBid install

gsnap -D /path/to/gmap/libs/ -d reg_gen.DBid --format=sam --quality-protocol=illumina --force-xs-dir --nofails --batch=5   myreads.fq  > myreads_vs_ref_gen.gsnap.sam

creating gene models from RNASeq (cufflinks)

web site:

Version: 2.1.1

cufflinks -p 8 --multi-read-correct  --min-intron-length 21 --max-intron-length 200000 --output-dir myreads
 --frag-bias-correct ref_genome.fa -g  ref_genome.gtf  myreads_12_vs_ref

Once we mapped our RNASeq reads to the genome we can try obtain gene models derived from these mappings. The cufflinks overpredict gene models, but it is still better than other programs using mapping to genomic reference. Keep in mind that cufflinks uses its own expression metrics, and that these numbers are not compatible with R based statistical packages. Moreover the cufflinks cuffdiff module for predicting differentially expressed genes has not a good overlap with leading R packages (DESeq and EdgeR).



web site:

Bamtools is a set of tools for processing/extracting information from the BAM files. Some of the typical usage:

#extract portion of the BAM mapping to a specific genomic interval 
bamtools filter -in SRR197978_12_vs_GgENS73.bwa_aln.sorted.bam -out 1_48850000_49020000_reads.bam -region "1:48850000..49020000"

#counting all the reads mapping to certain region:
bamtools filter -in SRR867768_1_vs_GgENS73.last.split.sorted.bam  -region "1:48850000..49020000" | bamtools count 

#counting all the reads mapping to certain region, restricting this to just one (reverse)strand:
bamtools filter -in SRR867768_1_vs_GgENS73.last.split.sorted.bam  -region "1:48850000..49020000" -isReverseStrand true| bamtools count


It is another set of utilities for working on genomic intervals. It can output intersections between two different datasets, such as BED file and a GTF or VCF file, merge overlapping intervals or find elements closest to the intervals/positions in a one file.

Commands moved to another section. This part will be updated.