Difference between revisions of "Wikiomics:WinterSchool day2"

From OpenWetWare
Jump to: navigation, search
(GATK pipeline)
(GATK pipeline)
Line 281: Line 281:
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam.gatk.vcf
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam.gatk.vcf

Revision as of 01:08, 15 November 2013


Winterschool program day 2

Mapping genomic reads

overview of mappers

There are tens of different mappers for aligning millions of short reads to genomes. There are huge differences in speed, memory usage, handling of errors in the read. We will take a look at just few, most commonly used by various pipelines. For the time being we will treat these programs as black boxes, not going into algorithms used inside.

We need reference genome (FASTA) from the same species to which we will map our reads, and short read data (Illumina FASTQ). Before mapping we need to do some basic checks.

checking the genome

If you get the genome for canonical species from ENSEMBL (cow, chicken, pig) you can trust that it contains proper number of chromosomes and mitochondrial sequences. This is not the case with draft genomes available from individual research groups. It is always a good idea to look at number and names of individual contigs (see grep section), and check that i.e sequences are present. In case of genomes lacking certain sequences we have bigger chances to map reads to wrong locations, which in turn will give us wrong coverage, SNPs or expression in case of RNA-Seq.

checking reads

See FastQC above and Tagdust.

basic mapping steps

  • indexing

Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.

  • mapping

This is often the longest step, with options specific for each mapper

  • postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.


BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

  • Illumina reads up to 100bp: bwa-backtrack (the legacy bwa)
  • sequences from 70bp up to 1Mbp:

There are two algorithms for these: BWA-SW (Smith Waterman) and BWA-MEM(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))

Please note that BWA-SW requires different algorithm for indexing the genome. The default indexing algorithm is called IS.

#creating genome index
bwa index -p ref.bwa_is  ref.fa

#mapping single end reads using MEM algorithm
bwa mem ref.bwa_is  reads.fq > reads.bwa_mem.samp

#mapping paired end reads using MEM algorithm
bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single and reads
bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai
bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads
bwa aln ref.bwa_is  short_read_1.fq > short_read_1.bwa_aln.sai
bwa aln ref.bwa_is  short_read_2.fq > short_read_2.bwa_aln.sai
bwa sampe ref.bwa_is  short_read_1.bwa_aln.sai  short_read_2.bwa_aln.sai   short_read_1.fq   short_read_2.fq >  short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm
bwa index -p ref.bwa_sw -a bwtsw ref.fa
bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.

To create usable BAM files we can process SAM files using Picard's SortSam

java -jar /path/to/SortSam.jar I=reads_vs_reference.bwa.unsorted.sam O=reads_vs_reference.bwa.sorted.sam SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

For subsequent processing the mapping files with GATK (SNP calling) it is easier to introduce necessary information at the mapping stage, than run an extra step using picard. What is required by GATK is so called reads group info. We will cover it later, but at this stage is good to know that bwa can be run with extra parameters saving us one extra step.

#below is the example read group info needed to be passed to bwa on the command line: 

#here is the mapping step where in the place of string in <> we put group info from above.
#different samples should have different group info, like this:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' ref_gen.bwa_is chicken_genomic_short_1.fq chicken_genomic_short_2.fq > chicken_genomic_12_vs_refgen.bwa_mem.rg.sam 


This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai
samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches
lastdb -m1111110  ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info 
lastal -Q1 -e120 ref_genome.lastdb  input_reads.fastq  | last-map-probs.py -s150 > input_reads_vs_ ref_genome.last.maf

#convert from MAF to SAM format
maf-convert.py sam  input_reads_vs_ ref_genome.last.maf >  input_reads_vs_ ref_genome.last.sam 

#convert SAM to BAM inserting header
samtools view -but ref_genome.fa.fai  input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ ref_genome.last.unsorted.bam 

#sort BAM
samtools sort input_reads_vs_ ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted

#create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai)
samtools index input_reads_vs_ ref_genome.last.sorted.bam 


#creating index, note that fires goes fasta file, index name prefix as second
bowtie2-build ref_gen.fa ref_gen.bowtie2

#map reads in unpaired mode, output as SAM
 bowtie2 -x ref_gen.bowtie2 ggal_test_1.fq -S ggal_test_1_vs_ref_gen.bowtie2.sam 

#map paired reads as 
 bowtie2 -x ref_gen.bowtie2 -1 ggal_test_1.fq -2 ggal_test_2.fq -S ggal_test_12_vs_ref_gen.bowtie2.sam 

#convert SAM 2 BAM, sort the reads and create BAM index in one step
 java -jar ~/soft/picard-tools-1.101/SortSam.jar I=ggal_test_1_vs_ref_gen.bowtie2.sam O=ggal_test_1_vs_ref_gen.bowtie2.bam CREATE_INDEX=true SO=coordinate

Stampy with bwa

Stampy is a quite slow but at times more accurate mapper, allowing for improvement over simple BWA mappings. The basic usage is as follows:

#creating two special index files 
stampy.py --species=chicken --assembly=ens73_toy -G ens73_toygenome ref_gen.fa
stampy.py -g ens73_toygenome -H ens73_toy   

#remapping reads already mapped with BWA (prefered option)
stampy.py -g ens73_toygenome -h ens73_toy -t2 --bamkeepgoodreads -M ggal_test_1_vs_ref_gen.bwa_aln.bam  > ggal_test_1_vs_ref_gen.stampy.sam

SAM and BAM file formats

The SAM file format serves to store information about result of mapping of reads to the genome. It starts with a header, describing the format version, sorting order (SO) of the reads, genomic sequences to which the reads were mapped. The minimal version looks like this:

@HD	VN:1.0	SO:unsorted
@SQ	SN:1	LN:171001
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0

It can contain both mapped and unmapped reads (we are mostly interested in mapped ones). Here is the example:

SRR197978.9007654       177     1       189     0       50M     12      19732327        0       CAGATTTCAGTAGTCTAAACAAAAACGTACTCACTACACGAGAACGACAG      5936A><<4=<=5=;=;?@<?BA@4A@B<AAB9BB;??B?=;<B@A@BCB      XT:A:R  NM:i:3  SM:i:0  AM:i:0  X0:i:58 XM:i:3  XO:i:0  XG:i:0  MD:Z:0A0G0T47
SRR197978.9474929       69      1       238     0       *       =       238     0       GAGAAAAGGCTGCTGTACCGCACATCCGCCTTGGCCGTACAGCAGAGAAC      B9B@BBB;@@;::@@>@<@5&+2;702?;?.3:6=9A5-124=4677:7+
SRR197978.9474929       137     1       238     0       50M     =       238     0       GTTAGCTGCCTCCACGGTGGCGTGGCCGCTGCTGATGGCCTTGAACCGGC      B;B9=?>AA;?==;?>;(2A;=/=<1357,91760.:4041=;(6535;%      XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:62 XM:i:0  XO:i:0  XG:i:0  MD:Z:50

In short, it is a complex format, where in each line we have detailed information about the mapped read, its quality mapped position(s), strand, etc. The exact description of it takes (with BAM and examples) 15 pages: http://samtools.sourceforge.net/SAMv1.pdf There are multiple tools to process and extract information from SAM and its compressed form, BAM files, so it is better to learn how to use them than decipher it and access with often slow scripts.

Analyzing BAM files

sorting / indexing

During the mapping reads are mapped on first in first out, so the output is ordered by order of reads not the position along the chromosomes. Searching such BAM files is very inefficient, and the files themselves are unnecessary large. So every BAM file to be used downstream needs to be first sorted, and then indexed. BAM index (.bai suffix) is just for external programs to access a given part of the genome even more effectively. For sorting and indexing we will use samtools:

samtools sort unsorted_input.bam sorted_output  #notice no bam suffix!!!
samtools index sorted_output.bam #creates sorted_output.bam.bai  

simple stats

We can easily calculate the number of reads mapped to each of the chromosomes/contigs using samtools:

samtools idxstats chicken_genomic_12_vs_refgen.bwa_mem.bam > chicken_genomic_12_vs_refgen.bwa_mem.bam.idxstats

The output looks like this:

1	171001	1827 	1
*	0	0	0

Where here we have just one contig named "1", 171001bp long, to which we go 1827 reads mapped. To calculate the total number of reads, we can use this awk line:

awk '{ sum+=$3} END {print sum}'

viewing the mappings in IGV

IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genome-2-genome alignment (not a typical usage). The order of steps:

  • start IGV (it needs to be started specifying the amount of RAM being used by the program). This depends in the coverage, number of BAM files opened at the same time. In short, more RAM assigned, faster scrolling.
  • select exact the same version of the genome with contigs named also the same way as in your BAM files
  • open BAM files (need to be sorted and indexed), plus any annotation you may need.

To work with IGV:

#assuming that igv.sh is on the PATH like in vagrant:


We have a large number of genomes available through IGV pull down menus, but we may need to create our own genome for viewing (top menu): Genomes > Create .genome file

We need to have FASTA genome reference file and its index (ref_gen.fa, ref_gen.fa.fai). The later one (FASTA index) we create as follows:

samtools faidx ref_gen.fa 

In the IGV new genome creation menu we have to put unique names for our genome, then FASTA , and if we have it, also genome annotation file.

So now we will be ready to load some BAM files mapped to our ref_gen sequence.

There are multiple options to display the reads. The important thing to notice are mismatches in reads, the coverage track, and paired display.

GATK pipeline

Genome Analysis Toolkit (GATK) from Broad is the de facto standard for detecting Single Nucleotide Polymorphisms (SNPs). There are very good and extensive manuals available on their site: http://www.broadinstitute.org/gatk/index.php

This is the step by step procedure to follow their best practice.

It is essential that we do have group info included in our BAM files (we assume that these have been already sorted and indexed). If we have not done it during the mapping with bwa, we can still fix it easily with AddOrReplaceReadGroups from picard:

#mysample in the read group info should be replaced by some mnemonic describing the #experiment/sample. Shortened file name, or SRA file prefix, like SRR197978 are a good choices.

java -jar ~/soft/picard-tools-1.101/AddOrReplaceReadGroups.jar \
I=chicken_genomic_12_vs_refgen.bwa_mem.bam \
O=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam \
RGLB=mysampleLB RGPL=Illumina RGPU=mysamplePU RGSM=mysampleSM \

At this stage we have mapped reads with group info as BAM. The next step is to mark duplicate reads (~PCR artifacts) in this file. We can almost always use CREATE_INDEX=true, so we do not need to run extra indexing when using some picard utilities

java -jar ~/soft/picard-tools-1.101/MarkDuplicates.jar I=chicken_genomic_12_vs_refgen.bwa_mem.rg.bam O=chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam METRICS_FILE=metrics.file CREATE_INDEX=true

After getting marked duplicated reads, the next step is to realign read around indels. This is being done in two steps. Also at this stage it becomes more and more cumbersome to execute these steps as commands on the command line. The solution is to cut and past them into scipt files, then change the script permission and execute them instead.

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T RealignerTargetCreator \ 
-R ref_gen.fa \ 
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \ 
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list 

java -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T IndelRealigner \ 
-R ref_gen.fa \ 
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam \ 
-targetIntervals chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.bam.target_intervals.list \ 
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam

The recommended Best Practices step here is to run base recalibration, meaning that the base quality is being reestimated after taking into account mapping results. It adds several steps, and while it may be worthwhile, Illumina got better at estimating base qualities of it reads, so the results may not justify the extra complexity.

Another optional (by Best Practices) step is to reduce the complecity of the BAM. Since it is not necessary, we will skip it this time, but it is recommended to run it when dealing with multiple / large data sets.


java -Xmx240G -jar ~/soft/GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper \ 
-R ref_gen.fa \
-I chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam
-o chicken_genomic_12_vs_refgen.bwa_mem.rg.dedup.realignd.bam.gatk.vcf