Before we start

Teacher

Darek Kedra (darek dot kedra at gmail dot com)

I work at CRG (Centre for Genomic Regulation) in Barcelona, Spain, in Cedric Notredame (author of t-coffee ligner) group. Our work place looks like this: CRG image

My current research topics include RNASeq, genome assembly and annotation, SNP calling, lncRNA discovery. As for the parasites I am working with sequences from L.donovani, L.major and T.brucei.

Organization etc.

lets turn out /silence the cell phones
some exposure to command line interface / Unix is assumed (cut and paste of commands will be OK)
in case of Unix command line problems check the page from the other course [here]
this page will stay online after the course "forever", and we will improve it over time (old version will be accessible through History link on the top)
you are welcome to register as a wiki user and comment in the Talk page/fix any errors/ ask for enhancements
in the boxed parts of the page lines starting with # are comments, and what below are examples of commands. After typing equivalents of these in your terminal, press Enter

Why Next Generation Sequencing

Typical uses cases of NGS data include:

de novo genome assembly
mapping reads to a known genome
1. whole genome random reads (DNA)
2. targeted resequencing: regions selected by long range PCR/hybridisation
3. exome mapping: DNA fragments are preselected using microarrays/beads with attached exonic sequences, so only fragments hybridising to them are selected
4. RNASeq: just RNA, but specific protocols may target just 5- or 3-prime ends of transcripts or miRNA with small size selection. RNASeq is the application in which stranded libraries are used, i.e. to differentiate between sense and antisense transcripts
5. ChIP-Seq: chromatin immunoprecipitation combined with sequencing of DNA fragments bound by histones/transcription factors.
6. bisulfite sequencing (non-methylated cytosines converted to uracil, read as Ts on the non-methylated sequences)
metagenomics: sequencing populations of microorganisms to investigate the diversity of species/strains

NGS file formats overview

There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:

text based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG)
binary (BAM, BCF, SFF(454 sequencer data))

In principle, we can view and manipulate text based formats without special tools, but we will need these to access and view binary formats. To make things a bit more complicated, the text-based format are often compressed to save space making them de facto binary, but still easy to read by eye using standard Unix tools. Also despite that one can read values in several columns and from tens of rows, we still need dedicated programs to make sense of millions of rows and i.e. encoded columns.

On the top of these data/results files, some programs require that for faster access we need a companion file (often called index). See i.e.

FASTA (.fa & .fai)
BAM and BAI formats (suffixes .bam & .bai),
VCF (.vcf & .vcf.idx).

The important thing for all files containing positions on chromosomes (mappings, features) is their numbering. To make things complicated, we have formats starting with 1, or with 0.

1 based formats: GFF/GFT, SAM/BAM, WIG,
0 BED, CN (copy number data, not covered)

Programs automatically deduce the correct numbering scheme, but whenever you are converting from one format to another watch for this "feature".

More info on file formats used by IGV: http://www.broadinstitute.org/igv/?q=book/export/html/16

Fasta (just few tricks)

#count number of sequences in multiple fasta
grep -c "^>" multi_fasta.fa

#list sequence names in fasta, display screen by screen with "q" to escape:
grep "^>" | less

While most likely mature NGS programs will handle complex FASTA sequence names correctly, you may sometimes have problems with various scripts. To fix it and just keep the first, hopefully unique single word sequence name:

!/usr/bin/env python
"""
name: fix_fasta_header.py
usage: ./fix_fasta_header.py input_fasta.fa > output_fasta.fa

CAVEAT: it does not check if the resulting sequence names are unique
"""
import sys

fn = sys.argv[1]

for line in open(fn).readlines():
	if line[0] == ">":
		line = line.split()[0]
		print line
	else:
		print line,

For reformatting sequences so that all sequences will have 70 columns you can use this program from a package called exonerate from: https://github.com/nathanweeks/exonerate

fastareformat input.fa > output.fa

For extracting sequences from your FASTA file you can use pyfasta library:

#single contig/chromosome
pyfasta extract --header --fasta Lmex_genome.sort.fa LmxM.01 > LmxM.01_single.fa

#list of sequences in the desired order
#input seqids.txt file (just two lines with sequence names):
LmxM.01
LmxM.00

#command
pyfasta extract --header --fasta Lmex_genome.sort.fa --file seqids.txt  > LmxM.two_contigs.fa

FASTQ

Format and quality encoding checks

Already in the 90ties when all sequencing was being done using Sanger method, the big breakthrough in genome assembly was when individual bases in the reads (ACTG) were assigned some quality values. In short, some parts of sequences had multiple bases with a lower probability of being called right. So it makes sense that matches between high quality bases are given a higher score, be it during assembly or mapping that i.e. end of the reads with multiple doubtful / unreliable calls. This concept was borrowed by Next Generation Sequencing. While we can hardly read by eye the individual bases in some flowgrams, it is still possible for the Illumina/454/etc. software to calculate base qualities. The FASTQ format, (usually files have suffixes .fq or .fastq) contains nowadays 4 lines per sequence:

sequence name (should be unique in the file)
sequence string itself with ACTG and N
extra line starting with "+" sign, which contained repeated sequence name in the past
string of quality values (one letter/character per base) where each letter is translated in a number by the downstream programs

Here it is how it looks:

@SRR867768.249999 HWUSI-EAS1696_0025_FC:3:1:2892:17869/1
CAGCAAGTTGATCTCTCACCCAGAGAGAAGTGTTTCATGCTAAGTGGCACTGGTGCAGAACAGTTCTGCAATGG
+
IHIIHDHIIIHIIIIIIHIIIDIIHGGIIIEIIIIIIIIIIIIGGGHIIIHIIIIIIBBIEDGGFHHEIHGIGE

CAVEAT: When planning to do SNP calling using GATK, do not try to modify part of the name describing location on the sequencing lane:

HWUSI-EAS1696_0025_FC:3:1:2892:17869

This part is used for looking for optical replicates.

Unfortunately Solexa/Illumina did not follow the same quality encoding as people doing Sanger sequencing, so there are few iterations of the standard, with quality encodings containing different characters. For the inquisitive: http://en.wikipedia.org/wiki/FASTQ_format#Quality

What we need to remember from it, that we must know which quality encoding we have in our data, because this is an information required by mappers, and getting it wrong will make our mappings either impossible (some mappers may quit when encountering wrong quality value) or at best unreliable.

There are two main quality encodings: Sanger and Two other terms, offset 33 and offset 64 are also being used for describing quality encodings:

offset 33 == Sanger / Illumina 1.9
offset 64 == Illumina 1.3+ to Illumina 1.7

For that, if we do not have direct information from the sequencing facility which version of the Illumina software was used, we can still find it out if we investigate the FASTQ files themselves. Instead of going by eye, we use a program FastQC. For the best results/full report we need to use the whole FASTQ file as an input, but for quick and dirty quality encoding recognition using 100K of reads is enough:

head -400000 my_reads.fastq > 100K_head_my_reads.fastq 
fastqc 100K_head_my_reads.fastq
#we got here 100K_head_my_reads.fastq_fastqc/ directory

grep Encoding 100K_head_my_reads.fastq_fastqc/fastqc_data.txt

#output: 
Encoding	Sanger / Illumina 1.9

CAVEAT: all this works only on unfiltered FASTQ files. Once you remove the lower quality bases/reads containing them, guessing which encoding format is present in your files is problematic.

Here is a bash script containing awk oneliner to detect quality encoding in both gzip-ed and not-compressed FASTQ files.

#!/bin/bash
file=$1

if [[ $file ]]; then
    command="cat"
    if [[ $file =~ .*\.gz ]];then
        command="zcat"
    fi
    command="$command $file | "
fi

command="${command}awk 'BEGIN{for(i=1;i<=256;i++){ord[sprintf(\"%c\",i)]=i;}}NR%4==0{split(\$0,a,\"\");for(i=1;i<=length(a);i++){if(ord[a[i]]<59){print \"Offset 33\";exit 0}if(ord[a[i]]>74){print \"Offs
et 64\";exit 0}}}'"

eval $command

Types of data

read length

from 35bp in some old Illumina reads to 250+ in MiSeq. The current sweet spot is between 70-150bp.

single vs paired

Just one side of the insert sequenced or sequencing is done from both ends. Single ones are cheaper and faster to produce, but paired reads allow for more accurate mapping, detection of large insertions/deletions in the genome.

Most of the time forward and reverse reads facing each other end-to-end are

insert length

With the standard protocol, the inserts are anywhere between 200-500bp. Sometimes especially for de novo sequencing, insert sizes can be smaller (160-180bp) with 100bp long reads allowing for overlap between ends of the reads. This can improve the genome assembly (i.e. when using Allpaths-LG assembler requiring such reads). Also with some mappers (LAST) using longer reads used to give better mappings (covering regions not unique enough for shorter reads) than 2x single end mapping. With paired end mappings the effects are modest.

Program for combining overlapping reads: FLASH: http://ccb.jhu.edu/software/FLASH/ To run it:

flash SRR611712_1.fastq SRR611712_2.fastq -o SRR611712_12.flash

For improving the assembly or improving the detection of larger genome rearrangements there are other libraries with various insert sizes, such as 2.5-3kb or 5kb and more. Often sequencing yields from such libs are lower than from the conventional ones.

stranded vs unstranded (RNASeq only)

We can obtain reads just from a given strand using special Illumina wet lab kits. This is of a great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.

quality checking (FastQC)

It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze average quality, over-represented sequences, number of Ns along the read and many other parameters. The program to use is FastQC, and it can be run in command line or GUI mode.

good quality report:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

bad quality FastQC report

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

trimming & filtering

Depending on the application, we can try to improve the quality of our data set by removing bad quality reads, clipping the last few problematic bases, or search for sequencing artifacts, as Illumina adapters. All this makes much sense for de novo sequencing, were genome assemblies can be improved by data clean up. It has a low priority for mapping, especially when we have high coverage. Bad quality reads etc. will simply be discarded by the mapper.

You can read more about quality trimming for genome assembly in the two blog posts by Nick Loman:

http://pathogenomics.bham.ac.uk/blog/2013/04/adaptor-trim-or-die-experiences-with-nextera-libraries/

Trimmomatic

web: http://www.usadellab.org/cms/index.php?page=trimmomatic

version (Sep 2014): 0.32

From the manual:

Paired End:

java -jar /path/to/trimmomatic-0.32.jar PE --phred33 \
input_forward.fq.gz input_reverse.fq.gz \
output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters Remove leading low quality or N bases (below quality 3) Remove trailing low quality or N bases (below quality 3) Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 Drop reads below the 36 bases long

Single End:

java -jar /path/to/trimmomatic-0.32.jar SE --phred33 \
input.fq.gz output.fq.gz \
ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the same steps, using the single-ended adapter file

Tagdust (for simple unpaired reads)

Tagdust is a program for removing Illumina adapter sequences from the reads containing them. Such reads containing 6-8 bases not from genome will be impossible to map using typical mappers having often just 2 mismatch base limit. Tagdust works in an unpaired mode, so when using paired reads we have to "mix and match" two outputs to allow for paired mappings.

tagdust -o my_reads.clean.out.fq -a my_reads.artifact.out.fq adapters.fasta my_reads.input.fq

Error correction

For some applications, like de novo genome assembly, one can correct the sequencing errors in the reads by comparing them with other reads with almost identical sequence. One of the programs which do perform this and are relatively easy to install and make it running is Coral.

Coral

web site: http://www.cs.helsinki.fi/u/lmsalmel/coral/ version: 1.4

It requires large RAM machine for correcting individual Illumina files (run it on 96GB RAM).

#Illumina reads
./coral -fq input.fq -o output.fq  -illumina  

#454 reads
./coral -fq input.454.fq -o output.454.fq  -454  

#correcting 454 reads with Illumina reads
Coral can not use more than one input file, therefore one has to combine Illumina & 454 reads into one FASTQ file, noting the number of Illumina reads used. To prepare such file:

cat 10Millions_illumina_reads.fq > input_4_coral.fq
cat some_number_of_454_reads.fq >> input_4_coral.fq

## run coral with the command:
coral -454 -fq  input_4_coral.fq -o  input_4_coral.corrected.fq-i 10000000 -j 10000000

#This will correct just the 454 reads,not the Illumina ones
#In real life you have to count the number of reads in your Illumina FASTQ file, i.e. (assuming you do not have wrapped sequence/qualities FASTQ 1 read=4lines ) :

wc -l illumina_reads.fq | awk '{print $1/4}' 

#if in doubt, use fastqc to get the numbers

Public FASTQ data: Short Read Archive vs ENA

While we will often have our data sequenced in house/provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely we can do something new and useful with already published data, even if treating it as a control to our pipeline. Also doing exactly the same thing, say assembling genes from RNASeq data but with a newer versions of the software and or more data will likely improve on the results of previous studies. There are two main places to get such data sets:

NCBI Short Read Archive / Taxonomy Browser:
- SRA: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=search_obj

For both of them it is a good idea to install program Aspera Connect to reduce download times:

http://asperasoft.com/software/transfer-clients/connect-web-browser-plug-in/

Click on Resources, download and install the web plugin.

* go there 
* put Leishmania major RNA
* we get "SRA Experiments	6"	
* click on "6"
* one of them is: "Transcriptome Analysis of Leishmania major strain Friedlin", ERX190352

Taxonomy Browser http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

* go there
* put Leishmania major
* click on the checkbox SRA Experiments
* click on Display button
* we got 58 public experiment, 7of which are RNA
* click on RNA (7) on the left
* note 
"Whole Genome Sequencing of Leishmania major strain Friedlin"
Accession: SRX203187

CAVEAT: not everybody submits the right descriptions of their experiments. In case of doubt, download and map.

Extracting the FASTQ from SRA archive is easy with sra-toolkit installed on your computer You get it here: https://github.com/ncbi/sratoolkit

fastq-dump SRR1016916.sra

#Result:
SRR1016916.fastq

You can also get the multiple SRA files in one step and without 20 clicks from NCBI with configured ASPERA by writing a shell script (one line per file), preferably using a (Python/Perl/Ruby) script and a list of files to get:


ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012154/  ./

ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \
anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012155/  ./

Which one to use? ENA may be easier as you get gzipped fastq files directly. But NCBI tools may have better interface at times, so you can search for interesting data set at NCBI, then store the names of experiments and download fastq.gz from ENA.

SAM and BAM file formats

The SAM file format serves to store information about result of mapping of reads to the genome. It starts with a header, describing the format version, sorting order (SO) of the reads, genomic sequences to which the reads were mapped. Numbering starts from base 1, contains header, and it is line oriented (one sequence per line).

The minimal version looks like this:

@HD	VN:1.0	SO:unsorted
@SQ	SN:1	LN:171001
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0

It can contain both mapped and unmapped reads (we are mostly interested in mapped ones). Here is the example:

#unmapped reads:
SRR594632.1	4	*	0	0	*	*	0	0	NTGAGGACAAAGCTCCGCTGCTGCTCGCTTTCCCCT	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SRR594632.2	4	*	0	0	*	*	0	0	NTGAGCGAACTATTGATATTCGGCATAGAACAAAAA	BQKKHNMKOR___T_TTRWWVTVTVVTTOV___QQQ
SRR594632.3	4	*	0	0	*	*	0	0	NTGTTATTCAATCTGAGCATCGTTATCTTGACCATG	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SRR

#mapped reads
SRR594632.20	0	LmxM.20	3139684	25	36M	*	0	0	NTGAAAATAAAGAACATCTGAACATTTCTCTGTAAG	BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBXT:A:U	NM:i:2	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:0A0A34
SRR594632.21	16	LmxM.22	312482	25	36M	*	0	0	ATACAAGGCACAAAAATAAAGCAGTCGAGAGAGCAN	BBBBBBBBBBBBBBBB__________RRRRRLLLQBXT:A:U	NM:i:2	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:34T0G0
SRR594632.22	16	LmxM.01	33973	25	36M	*	0	0	AACACTGTGCGTGTGCGCTTGTCTAAGATGAGCCAN	_WTRWWTOTVTTVPTT__________PPLOPJJJKBXT:A:U	NM:i:2	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:34T0A0

In short, it is a complex format, where in each line we have detailed information about the mapped read, its quality mapped position(s), strand, etc. The exact description of it takes (with BAM and examples) 15 pages: http://samtools.sourceforge.net/SAMv1.pdf Use BAMs instead of SAM files (speed of access, size, compatibility with tools.There are multiple tools to process and extract information from SAM and its compressed form, BAM files. The few used on daily/basis:

samtools
Picard
bamtools
bedtools

Typical tasks:

#view the BAM file
samtools view your_bam | less

#view the header of BAM file
samtools view -H your_bam | less

#sort the BAM/SAM file
java -jar /path/to/SortSam.jar I=mapped_reads.unsorted.sam O=mapped_reads.sorted.bam \
SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

#merge (sorted) BAM files:
java -jar /path/to/SortSam.jar \
I=mapped_reads.set1.bam \
I=mapped_reads.set2.bam \
I=mapped_reads.set3.bam \
SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true ASSUME_SORTED=true

#extracting as BAM reads mapped to certain chromosome (LmxM.01):
bamtools filter -in ERR307343_12.Lmex.bwa_mem.Lmex.bwa.sorted.bam  \
-out chr_1_ERR307343_12.Lmex.bwa_mem.Lmex.bam -region LmxM.01

GTF/GFF

The most commonly used sequence annotation format with few flavors.

GTF Fields:

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

Sample GFF output from Ensembl export:

X	Ensembl	Repeat	2419108	2419128	42	.	.	hid=trf; hstart=1; hend=21
X	Ensembl	Repeat	2419108	2419410	2502	-	.	hid=AluSx; hstart=1; hend=303
X	Ensembl	Repeat	2419108	2419128	0	.	.	hid=dust; hstart=2419108; hend=2419128
X	Ensembl	Pred.trans.	2416676	2418760	450.19	-	2	genscan=GENSCAN00000019335
X	Ensembl	Variation	2413425	2413425	.	+	.	
X	Ensembl	Variation	2413805	2413805	.	+	.

And from TriTrypDB (gff3)

LmjF.01 TriTrypDB       CDS     12073   12642   .       -       0       ID=cds_LmjF.01.0040-1;Name=cds;description=.;size=570;Parent=rna_LmjF.01.0040-1
LmjF.01 TriTrypDB       exon    12073   12642   .       -       .       ID=exon_LmjF.01.0040-1;Name=exon;description=exon;size=570;Parent=rna_LmjF.01.0040-1
LmjF.01 TriTrypDB       gene    15025   17022   .       -       .       ID=LmjF.01.0050;Name=LmjF.01.0050;description=carboxylase%2C+putative;size=1998;web_id=LmjF.01.0050;locus_tag=LmjF.01.0050;size=1998;Alias=321438056,389592315,LmjF1.0050,LmjF01.0050,LmjF.01.0050,LmjF01.0050:pep,LmjF01.0050:mRNA

The detailed description of GFF3: http://www.sequenceontology.org/gff3.shtml

IGV-centered CAVEAT: GFF files downloaded from various sources often contain more annotations than a single track of IGV can handle. Typical offenders are chromosome/scaffold lines, causing non-visibility of gene annotations. Since the authors are quite inventive in naming their tracks, plus key words like "chromosome" may be in column 9, you can:

#see what is there:
grep -v "^#" TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff | awk '{print $3}' | sort | uniq -c | sort -n 

#result TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff:
     1 random_sequence
     16 rRNA
     34 chromosome
     83 tRNA
    714 transcript
   8250 mRNA
   8336 CDS
   9063 gene
   9149 exon
 908081

As you can see, we can try to remove just 34+1 (chromosome + random_sequence) and check that we will get 908081-35 entries when running it the same command on the result. Just remember that we removed the header, which we may want to keep.

BED

Another text annotation file format from UCSC, compatible with its Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway

Full format description: http://genome.ucsc.edu/FAQ/FAQformat.html#format1

CAVEAT: first base in contig/chromosome is numbered as 0

Essential 3 columns

1: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2: chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3: chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

#just regular genomic intervals for i.e GC calculations 
LmxM.01	0	10000
LmxM.01	10000	20000
LmxM.01	20000	30000
LmxM.01	30000	40000

Often used next 3 columns

4: name - Defines the name of the BED line. I

5: score - A score between 0 and 1000.

6: strand - Defines the strand - either '+' or '-'.

#just regular genomic intervals for i.e GC calculations 
LmxM.01 210     720     Rep1    100     -
LmxM.01 800     1234    ABC|cds 1000    +
LmxM.01 2100    3030    DEF|gen	100	+

Last 6 columns for extended features

7: thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.

8: thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

9: itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line.

10: blockCount - The number of blocks (exons) in the BED line.

11: blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

12: blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

VCF

Stands for Variant Call Format. Text file format for storing information about SNPs, insertions and deletions. It is rather complex, see detailed description: http://www.1000genomes.org/node/101

Obtaining variants listed in such form is a multistep procedure, but well standardised. See GATK section below.

Here are the examples from our data:

#simple SNP
LmxM.01	17218	.	G	A	1818.77	.	AC=2;AF=1.00;AN=2;BaseQRankSum=2.761;DP=55

#insertion
LmxM.01	38177	.	C	CTG	367.73	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.082;DP=34

#deletion
LmxM.01	99318	.	GCACACACA	G	1774.73	.	AC=2;AF=1.00;AN=2;DP=30

For using it with IGV, we have to have VCF index, created by GATK.

WIG

A rather line oriented format for storing and displaying numerical information along the genome for intervals. The generic form of WIG:

line defining type of this section, chr_position, start, step and more
number1
number2

The intervals can be regular, as when we compute some measure across the whole genome. This is fixedStep type:

fixedStep chrom=LmxM.01 start=0 step=10000 span=10000
0.572100
0.612700
0.637300
0.628000
0.637200
0.643700
0.642800

Above fixedStep WIG is from GC calculations in a given window in the genome. Procedure described here (we start from "create a N-base wide step file": http://wiki.bits.vib.be/index.php/Create_a_GC_content_track

When the calculated number different from 0 are sparse in our genomic intervals, it is simpler to use variableStep

variableStep	chrom=LmxM.17	span=41
1	0.142857
variableStep	chrom=LmxM.17	span=249
43	0.142857
variableStep	chrom=LmxM.17	span=4
292	0.166667
variableStep	chrom=LmxM.17	span=7
296	0.142857

This variableStep example comes from gem-mapability program from GEM-Mapper.

#index the genome (fix FASTA headers first to one word sequence names)
gem-indexer -i Lmex_genome.fa -o Lmex_genome

#calculate mapability for 100bp fragments
gem-mappability -I Lmex_genome.gem -o Lmex_genome -l 100 

#convert to WIG
gem-2-wig -I Lmex_genome.gem -i Lmex_genome.mappability -o Lmex_genome.mappability

#Result (loadable in IGV):
Lmex_genome.mappability.wig

Mapping Illumina reads to the genome

basic mapping steps

indexing

Before we can use the genome for mapping we have to transform it into a format specific for each of the mappers allowing for much faster search and lower memory usage. This is often called indexing, but to make things worse indexing fasta with samtools is not the same as indexing with bwa, bowtie etc.

mapping

This is often the longest step, with options specific for each mapper

postprocessing

The output of the mappers is seldom directly usable by downstream programs, which often use sorted and indexed BAM files. So we need to transform the mapper output (often SAM, but sometimes different format (MAF for LAST, MAP for GEM) to get such BAM files.

bwa

BWA is a the default mapper used by state of the art SNP calling GATK pipeline. There are some mappers which on some statistics may be better or equal but faster than BWA, but it is still a safe choice for doing genetic mapping. The main problem of BWA is mapping of paired reads: once one read is mapped to a good location, the second read seems to be placed close to this read (taking into account the insert size) even if the mapping would be very doubtful. This may not be a problem for GATK, since mapping qualities and flags are being accounted for, but one should keep this in mind when doing any analysis of the mapping results on your own. Currently BWA can use 3 different algorithms, each one with some limits and strong points. Here is the overview:

Illumina reads up to 100bp: BWA-BACKTRACK (the legacy bwa)

commands: "bwa aln" followed by bwa samse/sampe

sequences from 70bp up to 1Mbp:
- Illumina data: "BWA-MEM"

(seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW))

command: bwa mem

- very long reads/contigs: "BWA-SW" (Smith Waterman)

command: bwa bwasw

BWA has two different genome indexing algorithms:

IS (compatible with aln and mem), the default indexing (or: "bwa index -a is"
SW (for SW only): "bwa index bwtsw"

Please note that SW indexing (and therefore mapping using SW by BWA) does not work for short genomes

#creating genome index
bwa index -p ref.bwa_is  ref.fa

#mapping single end reads using MEM algorithm
bwa mem ref.bwa_is  reads.fq > reads.bwa_mem.sam

#mapping paired end reads using MEM algorithm
bwa mem ref.bwa_is reads_1.fq reads_2.fq > reads_12.bwa_mem.sam

#mapping single and reads
bwa aln ref.bwa_is short_read.fq > short_read.bwa_aln.sai
bwa samse ref.bwa_is short_read.bwa_aln.sai short_read.fq > short_read.bwa_aln.sam

#mapping paired reads
bwa aln ref.bwa_is  short_read_1.fq > short_read_1.bwa_aln.sai
bwa aln ref.bwa_is  short_read_2.fq > short_read_2.bwa_aln.sai
bwa sampe ref.bwa_is  short_read_1.bwa_aln.sai  short_read_2.bwa_aln.sai   short_read_1.fq   short_read_2.fq >  short_read_12.bwa_aln.sam

#mapping long reads using bwasw algorithm
bwa index -p ref.bwa_sw -a bwtsw ref.fa
bwa bwasw ref.bwa_sw long_read.fq > long_read.bwa_sw.sam

The mode currently recommended for mapping by BWA manual and the leading SNP calling software called GATK is MEM.

To create usable BAM files we can process SAM files using Picard's SortSam

java -jar /path/to/SortSam.jar \
I=reads_vs_reference.bwa.unsorted.sam \
O=reads_vs_reference.bwa.sorted.bam \
SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true

For subsequent processing the mapping files with GATK (SNP calling) it is easier to introduce necessary information at the mapping stage, than run an extra step using picard. What is required by GATK is so called reads group info. We will cover it later, but at this stage is good to know that bwa can be run with extra parameters saving us one extra step.

#below is the example read group info needed to be passed to bwa on the command line: 
@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1 

#here is the mapping step where in the place of string in <> we put group info from above.
#different samples should have different group info, like this:

bwa mem -M -R '@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1' ref_gen.bwa_is myreads_1.fq myreadst_2.fq > myreads_12_vs_refgen.bwa_mem.rg.sam

last

web site: http://last.cbrc.jp/

current version: 475 (Sep 2014)

This is less popular but sometimes quite useful mapper reporting unique mappings only. It can handle large number of mismatches and it simply remove the non-matching parts of the read, as long as what is left is sufficient to secure unique mapping. It can also be used to map very long reads, and even genome to genome (but then one has to index the genome differently). Standard usage:

#create samtools fasta index used to insert FASTA header sequence info in SAM 2 BAM. Creates ref_genome.fa.fai
samtools faidx ref_genome.fa

#index ref_genome for last, with a preference for short, exact matches
lastdb -m1111110  ref_genome.lastdb ref_genome.fa

#map short reads with Sanger (Q1) quality encoding, with the alignment score 120 (e120), then filter the output for 150 threshold (s150). See the http://last.cbrc.jp/doc/last-map-probs.txt for more info 
lastal -Q1 -e120 ref_genome.lastdb  input_reads.fastq  | last-map-probs.py -s150 > input_reads_vs_ref_genome.last.maf

#convert from MAF to SAM format
maf-convert.py sam  input_reads_vs_ref_genome.last.maf >  input_reads_vs_ref_genome.last.sam 

#convert SAM to BAM inserting header
samtools view -but ref_genome.fa.fai  input_reads_vs_ ref_genome.last.sam -o input_reads_vs_ref_genome.last.unsorted.bam 

#sort BAM
samtools sort input_reads_vs_ref_genome.last.unsorted.bam input_reads_vs_ ref_genome.last.sorted

#create BAM index (input_reads_vs_ ref_genome.last.sorted.bam.bai)
samtools index input_reads_vs_ref_genome.last.sorted.bam

CAVEAT: this LAST pipeline does not compute mapping quality values comparable with other mappers.

Genomes comparisons using LAST

The proper multiple genome alignment is a complex procedure often mildly successful with fragmented assemblies. Quick but less accurate solution is to compare to our genome of interest, L.mexicana, few other Leishmania genomes to get the information about level of conservation/SNPs. The end goal is to compare short read mapping results with other genomes alignment.

#create LAST genome index 
lastdb Lmex_genome.last Lmex_genome.fa

#genome 2 genome alignment
lastal   Lmex_genome.last Lmaj_genome.fa |  last-split > Lmaj_genome_v_Lmex80.lst475.maf

#convert from MAF format to SAM
maf-convert.py sam Lmaj_genome_v_Lmex80.lst475.maf > Lmaj_genome_v_Lmex80.lst475.sam

#fix missing SAM header with Lmex_genome.fa.fai file and samtools creating BAM
samtools view -but Lmex_genome.fa.fai  Lmaj_genome_v_Lmex80.lst475.sam -o Lmaj_genome_v_Lmex80.lst475.us.bam

#sort BAM
samtools sort Lmaj_genome_v_Lmex80.lst475.us.bam Lmaj_genome_v_Lmex80.lst475.sorted 

#index BAM
samtools index Lmaj_genome_v_Lmex80.lst475.sorted.bam

We can repeat this procedure with other genomes, but only for the similar genomes this makes sense. In our case, despite that L.amazonensis and L.enriettii belong to the Leishmania mexicana species complex (according to NCBI taxonomy), L.enriettii seems to be more distant to L.mexicana than L.major.

SNP discovery

GATK pipeline

Genome Analysis Toolkit (GATK) from Broad is the de facto standard for detecting Single Nucleotide Polymorphisms (SNPs). There are very good and extensive manuals available on their site: http://www.broadinstitute.org/gatk/index.php

This is the step by step procedure to follow their best practice.

Prerequisites

reference genome fasta file
reference genome index

samtools faidx Lmex_genome.fa

Result: Lmex_genome.fa.fai

reference genome dictionary

java -jar ~/soft/picard_1.119/CreateSequenceDictionary.jar R=Lmex_genome.fa O=Lmex_genome.dict

Result: Lmex_genome.dict

BAM file preferably bwa mapped (from previous steps)

CAVEAT: GATK requires meta information in BAM files to work. This information is often not there after performing the mappings. If we have not done it during the mapping with bwa, we can still fix it easily with AddOrReplaceReadGroups from picard:

 
java -jar  ~/soft/picard_1.119/AddOrReplaceReadGroups.jar \
I=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.bam \
O=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam \
RGID=ERR307343 RGPL=Illumina RGLB=ERR307343_lib  RGPU=ERR307343_pu RGSM=ERR307343_sm \
VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE SO=coordinate 


Result: 
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam 
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bai

Mark duplicates

At this stage we have mapped reads with group info as BAM. The next step is to mark duplicate reads (~PCR artifacts) in this file. We can almost always use CREATE_INDEX=true, so we do not need to run extra indexing when using some picard utilities

java -jar ~/soft/picard_1.118/MarkDuplicates.jar \
I=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam \
O=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam \
METRICS_FILE=LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam.metrics CREATE_INDEX=true



Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bai
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.bam.metric

Realign read around indels

After getting marked duplicated reads, the next step is to realign read around indels. This is being done in two steps. Also at this stage it becomes more and more cumbersome to execute these steps as commands on the command line. The solution is to cut and paste them into script files, then change the script permission and execute them instead.

CAVEAT: notice the small -o in the command

#computing interval in which read will be realigned

java -jar  ~/soft/GATK_3.2.2/GenomeAnalysisTK.jar  -T RealignerTargetCreator \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam \
-o  LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.target.intervals


Result:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.target.intervals

#realignment
java -jar  ~/soft/GATK_3.2.2/GenomeAnalysisTK.jar  -T IndelRealigner \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.bam \
-targetIntervals  LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.target.intervals  \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam  \
--filter_bases_not_stored 

Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bai

The recommended Best Practices step here is to run base re-calibration, meaning that the base quality is being re-estimated after taking into account mapping results. It adds several steps, and while it may be worthwhile, Illumina got better at estimating base qualities of it reads, so the results may not justify the extra complexity.

Another optional (by Best Practices) step is to reduce the complexity of the BAM. Since it is not necessary, we will skip it this time, but it is recommended to run it when dealing with multiple / large data sets.

Call mutations (SNPs)

#SNP only:

java -jar  ~/soft/GATK_3.2.2/GenomeAnalysisTK.jar  -T UnifiedGenotyper \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam   \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.vcf 

Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.vcf
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp.vcf.idx


#SNP + indels
java -jar  ~/soft/GATK_3.2.2/GenomeAnalysisTK.jar  -T UnifiedGenotyper \
-R Lmex_genome.fa \
-I LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.bam   \
-o LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.vcf \
 --genotype_likelihoods_model BOTH

Results:
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.vcf
LmxM.01_ERR307343_12.Lmex.bwa_mem.Lmex.rg.md.realigned.snp_indel.vcf.idx

Not used instructions

#lets use pet sized one-chromosome genome:
LmxM.01_single.fa
LmxM.01_single.fa.fai
LmxM.01_single.gff

#check the section of main page, and use these files as input

Load BAM, VCF, WIG and IGV files into browser

In 000_igv and 02_sambam directories these is a bunch of files compatible with LmxM.01 _single chromosome. Check which files can be loaded. We can see GC content across the chromosome in 100bp window, "mappability" of 100bp long reads to different parts of the chromosome, mutations, and BAMs containing genomic mapped genomic DNA reads. Scroll along the chromosome, look at the differences. Note places enriched in SNPs, gaps in mapping, indels.

Close the IGV, go back to the command line.

User talk:Darek Kedra/sandbox 30