Difference between revisions of "Wikiomics:WinterSchool day2"
(New page: =http://openwetware.org/wiki/Wikiomics:WinterSchool_day2= =Winterschool program day 2= ==Mapping genomic reads== ===overview of mappers=== There are tens of different mappers for alignin...)
Revision as of 17:56, 14 November 2013
- 1 http://openwetware.org/wiki/Wikiomics:WinterSchool_day2
- 2 Winterschool program day 2
Winterschool program day 2
Mapping genomic reads
overview of mappers
There are tens of different mappers for aligning millions of short reads to genomes. There are huge differences in speed, memory usage, handling of errors in the read. We will take a look at just few, most commonly used by various pipelines. For the time being we will treat these programs as black boxes, not going into algorithms used inside.
We need reference genome (FASTA) from the same species to which we will map our reads, and short read data (Illumina FASTQ). Before mapping we need to do some basic checks.
checking the genome
If you get the genome for canonical species from ENSEMBL (cow, chicken, pig) you can trust that it contains proper number of chromosomes and mitochondrial sequences. This is not the case with draft genomes available from individual research groups. It is always a good idea to look at number and names of individual contigs (see grep section), and check that i.e sequences are present. In case of genomes lacking certain sequences we have bigger chances to map reads to wrong locations, which in turn will give us wrong coverage, SNPs or expression in case of RNA-Seq.
See FastQC above and Tagdust.
basic mapping steps
Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.
This is often the longest step, with options specific for each mapper
The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.
BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:
- Illumina reads up to 100bp: bwa-backtrack (the legacy bwa)
- sequences from 70bp up to 1Mbp:
There are two algorithms for these: BWA-SW (Smith Waterman) and BWA-MEM(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))
Please note that BWA-SW requires different algorithm for indexing the genome. The default indexing algorithm is called IS.
#creating genome index bwa index ref.fa #mapping single end reads using MEM algorithm bwa mem -p ref.bwa_is ref.fa reads.fq > short_reads.bwa_mem.sam #mapping paired end reads using MEM algorithm bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam #mapping single and reads bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam #mapping paired reads bwa aln ref.bwa_is short_read_1.fq > short_read_1.bwa_aln.sai bwa aln ref.bwa_is short_read_2.fq > short_read_2.bwa_aln.sai bwa sampe ref.bwa_is short_read_1.bwa_aln.sai short_read_2.bwa_aln.sai short_read_1.fq short_read_2.fq > short_read_12.bwa_aln.sam #mapping long reads using bwasw algorithm bwa index -p ref.bwa_sw -a bwtsw ref.fa bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam
The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.
To create usable BAM files we can process SAM files using Picard's SortSam
java -jar /path/to/SortSam.jar I=reads_vs_reference.bwa.unsorted.sam O=reads_vs_reference.bwa.sorted.sam SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true
This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:
#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai samtools faidx ref_genome.fa #index ref_genome for last, with a preference for short, exact matches lastdb -m1111110 ref_genome.lastdb ref_genome.fa #map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info lastal -Q1 -e120 ref_genome.lastdb input_reads.fastq | last-map-probs.py -s150 > input_reads_vs_ ref_genome.last.maf #convert from MAF to SAM format maf-convert.py sam input_reads_vs_ ref_genome.last.maf > input_reads_vs_ ref_genome.last.sam #convert SAM to BAM inserting header samtools view -but ref_genome.fa.fai input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ ref_genome.last.unsorted.bam #sort BAM samtools sort input_reads_vs_ ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted #create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai) samtools index input_reads_vs_ ref_genome.last.sorted.bam
#creating index, note that fires goes fasta file, index name prefix as second bowtie2-build ref_gen.fa ref_gen.bowtie2 #map reads in unpaired mode, output as SAM bowtie2 -x ref_gen.bowtie2 ggal_test_1.fq -S ggal_test_1_vs_ref_gen.bowtie2.sam #map paired reads as bowtie2 -x ref_gen.bowtie2 -1 ggal_test_1.fq -2 ggal_test_2.fq -S ggal_test_12_vs_ref_gen.bowtie2.sam #convert SAM 2 BAM, sort the reads and create BAM index in one step java -jar ~/soft/picard-tools-1.101/SortSam.jar I=ggal_test_1_vs_ref_gen.bowtie2.sam O=ggal_test_1_vs_ref_gen.bowtie2.bam CREATE_INDEX=true SO=coordinate
Stampy with bwa
Stampy is a quite slow but at times more accurate mapper, allowing for improvement over simple BWA mappings. The basic usage is as follows:
#creating two special index files stampy.py --species=chicken --assembly=ens73_toy -G ens73_toygenome ref_gen.fa stampy.py -g ens73_toygenome -H ens73_toy #remapping reads already mapped with BWA (prefered option) stampy.py -g ens73_toygenome -h ens73_toy -t2 --bamkeepgoodreads -M ggal_test_1_vs_ref_gen.bwa_aln.bam > ggal_test_1_vs_ref_gen.stampy.sam
Analyzing BAM files
sorting / indexing
During the mapping reads are mapped on first in first out, so the output is ordered by order of reads not the position along the chromosomes. Searching such BAM files is very inefficient, and the files themselves are unnecessary large. So every BAM file to be used downstream needs to be first sorted, and then indexed. BAM index (.bai suffix) is just for external programs to access a given part of the genome even more effectively. For sorting and indexing we will use samtools:
samtools sort unsorted_input.bam sorted_output #notice no bam suffix!!! samtools index sorted_output.bam #creates sorted_output.bam.bai
viewing the mappings in IGV
IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genom-2-genome alignment (not a typical usage). The order of steps:
- start IGV (it needs to be started specifying the amount of RAM being used by the program). This depends in the coverage, number of BAM files opened at the same time. In short, more RAM assigned, faster scrolling.
- select exact the same version of the genome with contigs named also the same way as in your BAM files
- open BAM files (need to be sorted and indexed), plus any annotation you may need.