Link to day 1 page: http://openwetware.org/wiki/Wikiomics:EMBO_Tunis_1

Introduction

Since we have students at the vastly various levels when it comes to bioinformatics, we will rush through few points necessary for the so far unexposed to these tasks people to comprehend the steps. But to make sure that we all return home able to use the IGV, we will start with it, and then go through other steps.

Feel free to skip anything too obvious and continue to play with other parts of this tutorial.

If you feel stranded unix command line wise, try checking this tutorial:

http://openwetware.org/wiki/Wikiomics:WinterSchool_day1#Introduction_to_Linux_and_the_command_line

Todays goals

IGV

create custom new .genome in IGV
load various tracks at the same time, move around IGV
learn about increasing the amount of memory available for IGV through igv.sh
switch views, explore various visualizing options

Mapping

learn about BWA mapping algorithms
map using mem/create BAM file for IGV
learn about LAST mapper, perform mappings to Lmex

Genome comparisons

align few Leishmania genoms to Lmex
convert them to BAMs

SNP calling

explore GATK pipeline
call SNPs if time allows

Command line

This part is meant to be with an active participation of students (& teacher), therefore resembling a bit more real life situations where working directories do not contain all files needed at the given step.

You should either create symbolic links

ln -s /path/to/desired/file .
#or copy them
cp -i /path/to/desired/file .

If you feel stranded unix command line wise, try checking this tutorial: http://openwetware.org/wiki/Wikiomics:WinterSchool_day1#Introduction_to_Linux_and_the_command_line

Starting

there will be 0917_kd directory in which you will find a set of subdirectories:

000_igv
00_fasta
01_fastq
02_sambam
03_gtf
04_bed
05_vcf
06_wig
07_bwa
08_last
09_last_genomes

Lets go there:

#jump to home directory if you happen to be somewhere else
cd 

#go to this directory
cd 0917_kd

#list the content
ls
ls -R | less
press q if you need to quit

Viewing mapping results with IGV

starting with new genome

IGV is a java program primarily for viewing mappings of short reads to a genome. But it can also be used for viewing SNPs (VCF files), genome annotation or even genome-2-genome alignment (not a typical usage). The order of steps:

start IGV (it needs to be started specifying the amount of RAM being used by the program, see the content of igv.sh). This depends in the coverage, number of BAM files opened at the same time. In short, more RAM assigned, faster scrolling.
select exact the same version of the genome with contigs named also the same way as in your BAM files
open BAM files (need to be sorted and indexed), plus any annotation you may need.

To work with IGV:

#assuming that igv.sh is on the PATH:

igv.sh

We have a large number of genomes available through IGV pull down menus, but we may need to create our own genome for viewing (top menu): Genomes > Create .genome file

We need to have:

FASTA genome reference file (ref_gen.fa)
FASTA index (ref_gen.fa.fai).

The later one (FASTA index) we create as follows:

samtools faidx ref_gen.fa

Optional:

genome annotation GFF

In the IGV new genome creation menu we have to put unique names for our genome, then location of the genomic FASTA file, and if we have it, also genome annotation file.

For more details see: See manual: http://www.broadinstitute.org/igv/loaddata

loading data into IGV

So now we will be ready to load some BAM, WIG, VCF and IVF files computed using our ref_gen sequence. The prerequisite is that BAM and VCF files should be indexed prior to loading. Some other files, depending on their size, can be loaded into IGV without this:

WIG
IGV

#in top menu line:
File > Load from file > Select the correct file in File Browser.

Exploring various views in IGV

There are multiple options to display the reads. The important thing to notice are mismatches in reads, the coverage track, and paired display.

on the left panel right-click on  LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.bam
select Rename Track
put ERR307343_12.BWA

#practice it with two other tracks

#in paired ends mode with BAM LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.bam
cut and paste: LmxM.01:161,874-162,350 in locator window

observe just one pair of reads spanning the assembly gap
what is the insert size?

cut and paste: LmxM.01:108,424-116,060 in locator window
observe read pairs
what is going on here?

What to believe as a mutation:

#in paired ends mode with BAM LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.bam
go to: LmxM.01:106,694-106,752

what is happening here?

Mapping Illumina reads to the genome

basic mapping steps

indexing

Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.

mapping

This is often the longest step, with options specific for each mapper

postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.

bwa

BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

Illumina reads up to 100bp: BWA-BACKTRACK (the legacy bwa)

commands: "bwa aln" followed by bwa samse/sampe

sequences from 70bp up to 1Mbp:
- Illumina data: "BWA-MEM"

(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))

command: bwa mem

- very long reads/contigs: "BWA-SW" (Smith Waterman)

command: bwa bwasw

BWA has two different genome indexing algorithms:

IS (compatible with aln and mem), the default indexing (or: "bwa index -a is"
SW (for SW only): "bwa index bwtsw"

Please note that SW indexing (and therefore mapping using SW by BWA) does not work for short genomes

#creating genome index
bwa index -p ref.bwa_is  ref.fa

#mapping single end reads using MEM algorithm
bwa mem ref.bwa_is  reads.fq > reads.bwa_mem.sam

#mapping paired end reads using MEM algorithm
bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single and reads
bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai
bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads
bwa aln ref.bwa_is  short_read_1.fq > short_read_1.bwa_aln.sai
bwa aln ref.bwa_is  short_read_2.fq > short_read_2.bwa_aln.sai
bwa sampe ref.bwa_is  short_read_1.bwa_aln.sai  short_read_2.bwa_aln.sai   short_read_1.fq   short_read_2.fq >  short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm
bwa index -p ref.bwa_sw -a bwtsw ref.fa
bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.

To create usable BAM files we can process SAM files using Picard's SortSam

java -jar /path/to/SortSam.jar \
I=reads_vs_reference.bwa.unsorted.sam \
O=reads_vs_reference.bwa.sorted.bam \
SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

For subsequent processing the mapping files with GATK (SNP calling) it is easier to introduce necessary information at the mapping stage, than run an extra step using picard. What is required by GATK is so called reads group info. We will cover it later, but at this stage is good to know that bwa can be run with extra parameters saving us one extra step.

#below is the example read group info needed to be passed to bwa on the command line: 
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 

#here is the mapping step where in the place of string in <> we put group info from above.
#different samples should have different group info, like this:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' ref_gen.bwa_is myreads_1.fq myreadst_2.fq > myreads_12_vs_refgen.bwa_mem.rg.sam

last

web site: http://last.cbrc.jp/

current version: 475 (Sep 2014)

This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai
samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches
lastdb -m1111110  ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info 
lastal -Q1 -e120 ref_genome.lastdb  input_reads.fastq  | last-map-probs.py -s150 > input_reads_vs_ref_genome.last.maf

#convert from MAF to SAM format
maf-convert.py sam  input_reads_vs_ref_genome.last.maf >  input_reads_vs_ref_genome.last.sam 

#convert SAM to BAM inserting header
samtools view -but ref_genome.fa.fai  input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ref_genome.last.unsorted.bam 

#sort BAM
samtools sort input_reads_vs_ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted

#create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai)
samtools index input_reads_vs_ref_genome.last.sorted.bam

CAVEAT: this LAST pipeline does not compute mapping quality values comparable with other mappers.

Genomes comparisons using LAST

The proper multiple genome alignment is a complex procedure often mildly successful with fragmented assemblies. Quick but less accurate solution is to compare to our genome of interest, L.mexicana, few other Leishmania genomes to get the information about level of conservation/SNPs. The end goal is to compare short read mapping results with other genomes alignment.

#create LAST genome index 
lastdb Lmex_genome.last Lmex_genome.fa

#genome 2 genome alignment
lastal   Lmex_genome.last Lmaj_genome.fa |  last-split > Lmaj_genome_v_Lmex80.lst475.maf

#convert from MAF format to SAM
maf-convert.py sam Lmaj_genome_v_Lmex80.lst475.maf > Lmaj_genome_v_Lmex80.lst475.sam

#fix missing SAM header with Lmex_genome.fa.fai file and samtools creating BAM
samtools view -but Lmex_genome.fa.fai  Lmaj_genome_v_Lmex80.lst475.sam -o Lmaj_genome_v_Lmex80.lst475.us.bam

#sort BAM
samtools sort Lmaj_genome_v_Lmex80.lst475.us.bam Lmaj_genome_v_Lmex80.lst475.sorted 

#index BAM
samtools index Lmaj_genome_v_Lmex80.lst475.sorted.bam

We can repeat this procedure with other genomes, but only for the similar genomes this makes sense. In our case, despite that L.amazonensis and L.enriettii belong to the Leishmania mexicana species complex (according to NCBI taxonomy), L.enriettii seems to be more distant to L.mexicana than L.major.

SNP discovery

GATK pipeline

Genome Analysis Toolkit (GATK) from Broad is the de facto standard for detecting Single Nucleotide Polymorphisms (SNPs). There are very good and extensive manuals available on their site: http://www.broadinstitute.org/gatk/index.php

This is the step by step procedure to follow their best practice.

Prerequisites

reference genome fasta file
reference genome index

samtools faidx Lmex_genome.fa

Result: Lmex_genome.fa.fai

reference genome dictionary

java -jar $PICARD_PATH/CreateSequenceDictionary.jar R=Lmex_genome.fa O=Lmex_genome.dict

Result: Lmex_genome.dict

BAM file preferably bwa mapped (from previous steps)

CAVEAT: GATK requires meta information in BAM files to work. This information is often not there after performing the mappings. If we have not done it during the mapping with bwa, we can still fix it easily with AddOrReplaceReadGroups from picard:

 
java -jar  $PICARD_PATH/AddOrReplaceReadGroups.jar \
I=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.bam \
O=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam \
RGID=ERR307343 RGPL=Illumina RGLB=ERR307343_lib  RGPU=ERR307343_pu RGSM=ERR307343_sm \
VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE SO=coordinate 


Result: 
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam 
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bai

Mark duplicates

At this stage we have mapped reads with group info as BAM. The next step is to mark duplicate reads (~PCR artifacts) in this file. We can almost always use CREATE_INDEX=true, so we do not need to run extra indexing when using some Picard utilities

java -jar $PICARD_PATH/MarkDuplicates.jar \
I=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam \
O=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam \
METRICS_FILE=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam.metrics CREATE_INDEX=true



Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bai
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam.metric

Realign read around indels

After getting marked duplicated reads, the next step is to realign read around indels. This is being done in two steps. Also at this stage it becomes more and more cumbersome to execute these steps as commands on the command line. The solution is to cut and paste them into script files, then change the script permission and execute them instead.

CAVEAT: notice the small -o in the command

#computing intervals in which reads will be realigned

java -jar  $GATK_PATH/GenomeAnalysisTK.jar  -T RealignerTargetCreator \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam \
-o  LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.target.intervals


Result:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.target.intervals

#realignment
java -jar  $GATK_PATH/GenomeAnalysisTK.jar  -T IndelRealigner \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam \
-targetIntervals  LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.target.intervals  \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam  \
--filter_bases_not_stored 

Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bai

The recommended Best Practices step here is to run base re-calibration, meaning that the base quality is being re-estimated after taking into account mapping results. It adds several steps, and while it may be worthwhile, Illumina got better at estimating base qualities of it reads, so the results may not justify the extra complexity.

Another optional (by Best Practices) step is to reduce the complexity of the BAM. Since it is not necessary, we will skip it this time, but it is recommended to run it when dealing with multiple / large data sets.

Call mutations (SNPs)

GATK has two different algorithms for calling SNPs:

UnifiedGenotyper: older, suitable for pooled diploid or non-diploid samples. Also a choice when number of samples in your BAM > 100.
HaplotypeCaller: newer, more accurate, first choice unless you have any of the listed above reasons to use UnifiedGenotyper

UnifiedGenotyper

#SNP only:
java -jar  $GATK_PATH/GenomeAnalysisTK.jar  -T UnifiedGenotyper \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam   \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.ug.vcf 

#Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.ug.vcf
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.vcf.ug.idx

#SNP + indels on tetraploid genome
java -jar  $GATK_PATH/GenomeAnalysisTK.jar  -T UnifiedGenotyper \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam   \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.ug.vcf \
--genotype_likelihoods_model BOTH
--ploidy 4

#Results :
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.ug.vcf
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.vcf.ug.idx

HaplotypeCaller

java -jar  $GATK_PATH/GenomeAnalysisTK.jar  -T HaplotypeCaller \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam   \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.hc.vcf 

#Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.ug.vcf
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.vcf.hc.idx


#SNP + indels on non-pooled diploid DNA, less than 100 samples
java -jar  $GATK_PATH/GenomeAnalysisTK.jar  -T HaplotypeCaller \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam   \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.hc.vcf \
--genotype_likelihoods_model BOTH

#Results HaplotypeCaller:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.hc.vcf
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.vcf.hc.idx

Wikiomics:EMBO Tunis 2

Contents