Wikiomics:WinterSchool day2

From OpenWetWare

(Difference between revisions)
Jump to: navigation, search
(Analyzing BAM files)
(bwa)
Line 34: Line 34:
<pre>
<pre>
#creating genome index
#creating genome index
-
bwa index ref.fa
+
bwa index -p ref.bwa_is  ref.fa
#mapping single end reads using MEM algorithm
#mapping single end reads using MEM algorithm
-
bwa mem -p ref.bwa_is ref.fa reads.fq > short_reads.bwa_mem.sam
+
bwa mem ref.bwa_is reads.fq > reads.bwa_mem.samp
#mapping paired end reads using MEM algorithm
#mapping paired end reads using MEM algorithm

Revision as of 22:40, 14 November 2013

Contents

http://openwetware.org/wiki/Wikiomics:WinterSchool_day2

Winterschool program day 2

Mapping genomic reads

overview of mappers

There are tens of different mappers for aligning millions of short reads to genomes. There are huge differences in speed, memory usage, handling of errors in the read. We will take a look at just few, most commonly used by various pipelines. For the time being we will treat these programs as black boxes, not going into algorithms used inside.

We need reference genome (FASTA) from the same species to which we will map our reads, and short read data (Illumina FASTQ). Before mapping we need to do some basic checks.

checking the genome

If you get the genome for canonical species from ENSEMBL (cow, chicken, pig) you can trust that it contains proper number of chromosomes and mitochondrial sequences. This is not the case with draft genomes available from individual research groups. It is always a good idea to look at number and names of individual contigs (see grep section), and check that i.e sequences are present. In case of genomes lacking certain sequences we have bigger chances to map reads to wrong locations, which in turn will give us wrong coverage, SNPs or expression in case of RNA-Seq.

checking reads

See FastQC above and Tagdust.

basic mapping steps

  • indexing

Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.

  • mapping

This is often the longest step, with options specific for each mapper

  • postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.

bwa

BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

  • Illumina reads up to 100bp: bwa-backtrack (the legacy bwa)
  • sequences from 70bp up to 1Mbp:

There are two algorithms for these: BWA-SW (Smith Waterman) and BWA-MEM(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))

Please note that BWA-SW requires different algorithm for indexing the genome. The default indexing algorithm is called IS.

#creating genome index
bwa index -p ref.bwa_is  ref.fa

#mapping single end reads using MEM algorithm
bwa mem ref.bwa_is  reads.fq > reads.bwa_mem.samp

#mapping paired end reads using MEM algorithm
bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single and reads
bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai
bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads
bwa aln ref.bwa_is  short_read_1.fq > short_read_1.bwa_aln.sai
bwa aln ref.bwa_is  short_read_2.fq > short_read_2.bwa_aln.sai
bwa sampe ref.bwa_is  short_read_1.bwa_aln.sai  short_read_2.bwa_aln.sai   short_read_1.fq   short_read_2.fq >  short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm
bwa index -p ref.bwa_sw -a bwtsw ref.fa
bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.

To create usable BAM files we can process SAM files using Picard's SortSam

java -jar /path/to/SortSam.jar I=reads_vs_reference.bwa.unsorted.sam O=reads_vs_reference.bwa.sorted.sam SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

last

This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai
samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches
lastdb -m1111110  ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info 
lastal -Q1 -e120 ref_genome.lastdb  input_reads.fastq  | last-map-probs.py -s150 > input_reads_vs_ ref_genome.last.maf

#convert from MAF to SAM format
maf-convert.py sam  input_reads_vs_ ref_genome.last.maf >  input_reads_vs_ ref_genome.last.sam 

#convert SAM to BAM inserting header
samtools view -but ref_genome.fa.fai  input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ ref_genome.last.unsorted.bam 

#sort BAM
samtools sort input_reads_vs_ ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted

#create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai)
samtools index input_reads_vs_ ref_genome.last.sorted.bam 
 
 

bowtie2

#creating index, note that fires goes fasta file, index name prefix as second
bowtie2-build ref_gen.fa ref_gen.bowtie2

#map reads in unpaired mode, output as SAM
 bowtie2 -x ref_gen.bowtie2 ggal_test_1.fq -S ggal_test_1_vs_ref_gen.bowtie2.sam 

#map paired reads as 
 bowtie2 -x ref_gen.bowtie2 -1 ggal_test_1.fq -2 ggal_test_2.fq -S ggal_test_12_vs_ref_gen.bowtie2.sam 

#convert SAM 2 BAM, sort the reads and create BAM index in one step
 java -jar ~/soft/picard-tools-1.101/SortSam.jar I=ggal_test_1_vs_ref_gen.bowtie2.sam O=ggal_test_1_vs_ref_gen.bowtie2.bam CREATE_INDEX=true SO=coordinate

Stampy with bwa

Stampy is a quite slow but at times more accurate mapper, allowing for improvement over simple BWA mappings. The basic usage is as follows:

#creating two special index files 
stampy.py --species=chicken --assembly=ens73_toy -G ens73_toygenome ref_gen.fa
stampy.py -g ens73_toygenome -H ens73_toy   

#remapping reads already mapped with BWA (prefered option)
stampy.py -g ens73_toygenome -h ens73_toy -t2 --bamkeepgoodreads -M ggal_test_1_vs_ref_gen.bwa_aln.bam  > ggal_test_1_vs_ref_gen.stampy.sam


SAM and BAM file formats

The SAM file format serves to store information about result of mapping of reads to the genome. It starts with a header, describing the format version, sorting order (SO) of the reads, genomic sequences to which the reads were mapped. The minimal version looks like this:

@HD	VN:1.0	SO:unsorted
@SQ	SN:1	LN:171001
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0

It can contain both mapped and unmapped reads (we are mostly interested in mapped ones). Here is the example:

SRR197978.9007654       177     1       189     0       50M     12      19732327        0       CAGATTTCAGTAGTCTAAACAAAAACGTACTCACTACACGAGAACGACAG      5936A><<4=<=5=;=;?@<?BA@4A@B<AAB9BB;??B?=;<B@A@BCB      XT:A:R  NM:i:3  SM:i:0  AM:i:0  X0:i:58 XM:i:3  XO:i:0  XG:i:0  MD:Z:0A0G0T47
SRR197978.9474929       69      1       238     0       *       =       238     0       GAGAAAAGGCTGCTGTACCGCACATCCGCCTTGGCCGTACAGCAGAGAAC      B9B@BBB;@@;::@@>@<@5&+2;702?;?.3:6=9A5-124=4677:7+
SRR197978.9474929       137     1       238     0       50M     =       238     0       GTTAGCTGCCTCCACGGTGGCGTGGCCGCTGCTGATGGCCTTGAACCGGC      B;B9=?>AA;?==;?>;(2A;=/=<1357,91760.:4041=;(6535;%      XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:62 XM:i:0  XO:i:0  XG:i:0  MD:Z:50

In short, it is a complex format, where in each line we have detailed information about the mapped read, its quality mapped position(s), strand, etc. The exact description of it takes (with BAM and examples) 15 pages: http://samtools.sourceforge.net/SAMv1.pdf There are multiple tools to process and extract information from SAM and its compressed form, BAM files, so it is better to learn how to use them than decipher it and access with often slow scripts.

Analyzing BAM files

sorting / indexing

During the mapping reads are mapped on first in first out, so the output is ordered by order of reads not the position along the chromosomes. Searching such BAM files is very inefficient, and the files themselves are unnecessary large. So every BAM file to be used downstream needs to be first sorted, and then indexed. BAM index (.bai suffix) is just for external programs to access a given part of the genome even more effectively. For sorting and indexing we will use samtools:

samtools sort unsorted_input.bam sorted_output  #notice no bam suffix!!!
samtools index sorted_output.bam #creates sorted_output.bam.bai  

viewing the mappings in IGV

IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genom-2-genome alignment (not a typical usage). The order of steps:

  • start IGV (it needs to be started specifying the amount of RAM being used by the program). This depends in the coverage, number of BAM files opened at the same time. In short, more RAM assigned, faster scrolling.
  • select exact the same version of the genome with contigs named also the same way as in your BAM files
  • open BAM files (need to be sorted and indexed), plus any annotation you may need.
Personal tools