Wikiomics:EMBO Tunis 1
Before we start
Teacher
Darek Kedra (darek dot kedra at gmail dot com)
I work at CRG (Centre for Genomic Regulation) in Barcelona, Spain, in the group of Cedric Notredame (author of the t-coffee aligner). Our workplace looks like this: [CRG image]
My current research topics include RNASeq, genome assembly and annotation, SNP calling, and lncRNA discovery. As for parasites, I work with sequences from L. donovani, L. major and T. brucei.
Organization etc.
- let's turn off / silence the cell phones
- some exposure to command line interface / Unix is assumed (cut and paste of commands will be OK)
- in case of Unix command line problems check the page from the other course [here]
- this page will stay online after the course "forever", and we will improve it over time (old versions will be accessible through the History link at the top)
- you are welcome to register as a wiki user and comment on the Talk page / fix any errors / ask for enhancements
- in the boxed parts of the page, lines starting with # are comments and the lines below them are example commands. After typing their equivalents in your terminal, press Enter
Why Next Generation Sequencing
Typical use cases of NGS data include:
- de novo genome assembly
- mapping reads to a known genome
- whole genome random reads (DNA)
- targeted resequencing: regions selected by long range PCR/hybridisation
- exome mapping: DNA fragments are preselected using microarrays/beads with attached exonic sequences, so only fragments hybridising to them are selected
- RNASeq: just RNA, but specific protocols may target only the 5'- or 3'-ends of transcripts, or miRNAs via small-size selection. RNASeq is also the application in which stranded libraries are used, e.g. to differentiate between sense and antisense transcripts
- ChIP-Seq: chromatin immunoprecipitation combined with sequencing of DNA fragments bound by histones/transcription factors.
- bisulfite sequencing (non-methylated cytosines are converted to uracil and read as Ts, while methylated cytosines remain Cs)
- metagenomics: sequencing populations of microorganisms to investigate the diversity of species/strains
Whole Genome Sequencing
Tips before you start
- if possible, use a haploid genome (see the honey bee genome project, where haploid drones were used)
- next best thing: highly inbred, nearly homozygous lines
Sometimes you have unexpected treasures: there is probably an (almost) homozygous cow: http://en.wikipedia.org/wiki/Chillingham_cattle
- there are cases where extra-chromosomal DNA contributes a non-trivial portion of the total DNA (plant chloroplast genomes are often in the 150kb range, with 40-70 chloroplasts per cell and 40-70 genome copies per organelle). Select tissues/stages with less multi-copy DNA
Technologies to consider
- Simple bacteria can be sequenced using just Illumina data (still preferably paired 2x100bp or more).
- If you sequence anything in the ~30Mb range (fungi/kinetoplastids/etc.), genomes often contain repeats exceeding the read length, and short reads alone are unlikely to give you a satisfactory assembly (we tried it with 2x100bp at >700x coverage => shattered genome, unusable for annotation)
- the likely solution is to use much larger read lengths (even 454 @400bp can help, but longer sequences are much better)
- PacBio is the current leader, giving you the bulk of your reads in the 3-4kb range, good enough for smaller genomes
- Moleculo from Illumina (not sure about availability in your region) is able to produce fragments up to about 10kb using a modification of Illumina technology
- on the low cost front, one can:
- optimize the Illumina library preparation protocol for, e.g., a low-GC genome (Plasmodium falciparum, 25% GC)
- use an Illumina library with overlapping read ends (e.g. 2x100bp reads with 180bp inserts)
- the longest Illumina reads to date come from the lower-throughput MiSeq machine:
protocol V3, 2 x 300bp, ~55 hrs, 13.2-15 Gb
- it is possible to greatly improve scaffolding (determining which contig comes after which, and in which orientation) using lower-coverage 2kb, 5kb or 10kb insert libraries, or by sequencing fosmid ends
- one can try to reduce complexity of the assembly by subselecting parts of the genome
- use flow sorted chromosomes as input DNA
- sequence pools of fosmid clones, ideally derived from different parts of the genome
Bioinformatics of genome assembly
The results of Assemblathon 2, an attempt to evaluate the performance of a number of assemblers and teams, were rather unsatisfactory. There was no clear winner performing best on all 3 species used (a fish, a parrot and a snake), and the programs seem to make different trade-offs between N50 and accuracy of the assembly. You cannot have both at this time.
This reflects the rather poor quality of multiple sequenced genomes, from the 30Mbp Leishmania to vertebrates. And this is often made worse by over-enthusiastic annotation (the "hypothetical unlikely" genes, retrotransposons, long ORFs in ribosomal DNA).
The idea is to perform multiple assemblies using different k-mer lengths, assembly programs and error correction programs, and then evaluate what makes most sense. Keep in mind that simply getting a longer assembly with a longer N50 may come at the price of removing some coding parts, as was the case with one "improvement" of the highly polymorphic sea squirt Ciona intestinalis genome.
A good check from the biology point of view is determining to what level a set of extremely conserved genes is present in the different assemblies. This was pioneered by Ian Korf with a program called CEGMA, using ca. 400 genes supposedly conserved in all Eukaryotes. Probably an even better approach would be to create sets of lineage-specific conserved genes (i.e. genes having clear orthologues in all sequenced fungi).
There is an ongoing expansion of the tasks performed by "assemblers", which often do their own read filtering, sometimes error correction, and scaffolding. Since these may expect intact FASTQ files as input, there is no single clear path for what you should do with your sequences prior to assembly.
An interesting approach is to normalize the k-mer frequencies present in your reads prior to assembly, as proposed by Titus Brown with his khmer program: http://khmer.readthedocs.org/en/v1.1/
The existing assemblers often fail in regions showing either very high or very low coverage, and making the k-mer coverage more uniform seems to improve some assemblies.
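A minimal sketch of digital normalization with khmer (the k-mer size, coverage cutoff and memory settings below are illustrative only; check the khmer documentation for recommended values):
#discard reads once their median 20-mer coverage exceeds 20
#output goes to reads.fq.keep by default
normalize-by-median.py -k 20 -C 20 -N 4 -x 2e9 reads.fq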
According to http://bioinformatics.oxfordjournals.org/content/early/2014/07/08/bioinformatics.btu391.full the recommended approach is to use:
- ABySS assembler with Illumina PE data
- ALLPATHS-LG or SPAdes assembler with Illumina PE-MP
Just check the exact library requirements of these assemblers (ALLPATHS-LG requires an overlapping-ends library plus long-insert libraries).
NGS file formats overview
There are multiple file formats used at various stages of NGS data processing. We can divide them into two basic types:
- text based (FASTA, FASTQ, SAM, GTF/GFF, BED, VCF, WIG)
- binary (BAM, BCF, SFF(454 sequencer data))
In principle, we can view and manipulate text-based formats without special tools, but we will need them to access and view binary formats. To make things a bit more complicated, the text-based formats are often compressed to save space, making them de facto binary, but still easy to read by eye using standard Unix tools. Also, even though one can read values in several columns and from tens of rows by eye, we still need dedicated programs to make sense of millions of rows and, e.g., encoded columns.
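For example, a gzipped annotation file can be inspected without creating an uncompressed copy on disk (the file name is just an illustration):
#view a gzipped GFF
zcat annotations.gff.gz | less
#or simply
zless annotations.gff.gz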
On top of these data/results files, some programs require a companion file (often called an index) for faster access. See e.g.
- FASTA (.fa & .fai)
- BAM and BAI formats (suffixes .bam & .bai),
- VCF (.vcf & .vcf.idx).
The important thing for all files containing positions on chromosomes (mappings, features) is their coordinate numbering. To make things complicated, we have formats where numbering starts at 1 and others where it starts at 0.
- 1-based formats: GFF/GTF, SAM/BAM, WIG
- 0-based formats: BED, CN (copy number data, not covered)
Programs automatically deduce the correct numbering scheme, but whenever you convert from one format to another, watch out for this "feature".
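As an illustration, converting GFF (1-based, inclusive) to BED (0-based, half-open) coordinates means subtracting 1 from the start; a minimal awk sketch with hypothetical file names:
#GFF keeps seqname/start/end in columns 1/4/5; print them as BED columns 1/2/3,
#using the feature type (column 3) as the BED name
awk 'BEGIN{OFS="\t"} !/^#/ {print $1, $4-1, $5, $3}' annotations.gff > annotations.bed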
More info on file formats used by IGV: http://www.broadinstitute.org/igv/?q=book/export/html/16
FASTA (just a few tricks)
#count number of sequences in a multi-fasta file
grep -c "^>" multi_fasta.fa
#list sequence names in a fasta file, displayed screen by screen, with "q" to escape:
grep "^>" multi_fasta.fa | less
While mature NGS programs will most likely handle complex FASTA sequence names correctly, you may sometimes run into problems with various scripts. To fix this and keep just the first, hopefully unique, word of the sequence name:
#!/usr/bin/env python
"""
name: fix_fasta_header.py
usage: ./fix_fasta_header.py input_fasta.fa > output_fasta.fa
CAVEAT: it does not check if the resulting sequence names are unique
"""
import sys

fn = sys.argv[1]
for line in open(fn).readlines():
    if line[0] == ">":
        line = line.split()[0]
        print line
    else:
        print line,
For reformatting sequences so that all sequence lines have 70 columns, you can use this program from the exonerate package: https://github.com/nathanweeks/exonerate
fastareformat input.fa > output.fa
For extracting sequences from your FASTA file you can use the pyfasta library (or samtools, see below):
#single contig/chromosome
pyfasta extract --header --fasta Lmex_genome.sort.fa LmxM.01 > LmxM.01_single.fa
#list of sequences in the desired order
#input seqids.txt file (just two lines with sequence names):
LmxM.01
LmxM.00
#command to extract
pyfasta extract --header --fasta Lmex_genome.sort.fa --file seqids.txt > LmxM.two_contigs.fa
We can obtain a fasta index (which contains the contig sizes):
#index
samtools faidx Lmex_genome.sort.fa
#once we created the .fai file for a given .fa file, we can extract sequence(s):
#whole single chromosome:
samtools faidx Lmex_genome.sort.fa LmxM.01 > LmxM.01_single.fa
#some region, with the correct fasta sequence name (>LmxM.01:10000-20000):
samtools faidx Lmex_genome.sort.fa LmxM.01:10000-20000 > LmxM.01_10000-20000.fa
#Above faidx tip thanks to: XY Student (name to be found)
#get the chromosome sizes file:
awk '{print $1"\t"$2}' Lmex_genome.sort.fa.fai > Lmex_genome.sort.sizes
FASTQ
Format and quality encoding checks
Already in the 1990s, when all sequencing was done using the Sanger method, a big breakthrough in genome assembly came when individual bases in the reads (ACTG) were assigned quality values. In short, some parts of sequences have multiple bases with a lower probability of being called right, so it makes sense that matches between high-quality bases are given a higher score, be it during assembly or mapping, than e.g. ends of reads with multiple doubtful/unreliable calls. This concept was adopted by Next Generation Sequencing. While we can hardly read individual bases off a flowgram by eye, it is still possible for the Illumina/454/etc. software to calculate base qualities. The FASTQ format (files usually have the suffix .fq or .fastq) nowadays contains 4 lines per sequence:
- sequence name (should be unique in the file)
- sequence string itself with ACTG and N
- extra line starting with a "+" sign (in the past it repeated the sequence name)
- string of quality values (one letter/character per base), where each letter is translated into a number by downstream programs
Here is how it looks:
@SRR867768.249999 HWUSI-EAS1696_0025_FC:3:1:2892:17869/1
CAGCAAGTTGATCTCTCACCCAGAGAGAAGTGTTTCATGCTAAGTGGCACTGGTGCAGAACAGTTCTGCAATGG
+
IHIIHDHIIIHIIIIIIHIIIDIIHGGIIIEIIIIIIIIIIIIGGGHIIIHIIIIIIBBIEDGGFHHEIHGIGE
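To see what a single quality character means, you can translate it into a Phred score with the same ord[] trick used in the encoding-detection script further below (a minimal sketch, assuming offset 33 / Sanger encoding):
#quality character "I" => Phred score and error probability
echo "I" | awk 'BEGIN{for(i=1;i<=256;i++)ord[sprintf("%c",i)]=i} {q=ord[$0]-33; print "Q=" q, "P(error)=" 10^(-q/10)}'
#output: Q=40 P(error)=0.0001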
CAVEAT: when planning to do SNP calling using GATK, do not modify the part of the name describing the location on the sequencing lane:
HWUSI-EAS1696_0025_FC:3:1:2892:17869
This part is used for detecting optical duplicates.
Unfortunately Solexa/Illumina did not follow the same quality encoding as people doing Sanger sequencing, so there are a few iterations of the standard, with quality encodings using different character ranges. For the inquisitive: http://en.wikipedia.org/wiki/FASTQ_format#Quality
What we need to remember is that we must know which quality encoding our data uses, because mappers require this information, and getting it wrong will make our mappings either impossible (some mappers may quit when encountering a wrong quality value) or, at best, unreliable.
There are two main quality encodings: Sanger and Illumina 1.3+. Two other terms, offset 33 and offset 64, are also used to describe quality encodings:
- offset 33 == Sanger / Illumina 1.9
- offset 64 == Illumina 1.3+ to Illumina 1.7
Therefore, if we do not have direct information from the sequencing facility about which version of the Illumina software was used, we can still find out by investigating the FASTQ files themselves. Instead of going by eye, we use the program FastQC. For the best results / a full report we need to use the whole FASTQ file as input, but for quick and dirty recognition of the quality encoding, 100K reads are enough:
head -400000 my_reads.fastq > 100K_head_my_reads.fastq
fastqc 100K_head_my_reads.fastq
#we got here a 100K_head_my_reads.fastq_fastqc/ directory
grep Encoding 100K_head_my_reads.fastq_fastqc/fastqc_data.txt
#output:
Encoding        Sanger / Illumina 1.9
CAVEAT: all this works only on unfiltered FASTQ files. Once you remove lower-quality bases or the reads containing them, guessing which encoding is present in your files becomes problematic.
Here is a bash script containing an awk one-liner to detect the quality encoding in both gzipped and uncompressed FASTQ files.
#!/bin/bash
file=$1
if [[ $file ]]; then
    command="cat"
    if [[ $file =~ .*\.gz ]]; then
        command="zcat"
    fi
    command="$command $file | "
fi
command="${command}awk 'BEGIN{for(i=1;i<=256;i++){ord[sprintf(\"%c\",i)]=i;}}NR%4==0{split(\$0,a,\"\");for(i=1;i<=length(a);i++){if(ord[a[i]]<59){print \"Offset 33\";exit 0}if(ord[a[i]]>74){print \"Offset 64\";exit 0}}}'"
eval $command
Types of data
- read length
from 35bp in some old Illumina reads to 250+bp on MiSeq. The current sweet spot is between 70 and 150bp.
- single vs paired
Either just one side of the insert is sequenced, or sequencing is done from both ends. Single reads are cheaper and faster to produce, but paired reads allow for more accurate mapping and for detection of large insertions/deletions in the genome.
Most of the time, forward and reverse reads facing each other end-to-end are used.
- insert length
With the standard protocol, the inserts are anywhere between 200-500bp. Sometimes, especially for de novo sequencing, insert sizes can be smaller (160-180bp), so that 100bp-long reads overlap at their ends. This can improve the genome assembly (e.g. the Allpaths-LG assembler requires such reads). Also, with some mappers (LAST), using the longer joined reads used to give better mappings (covering regions not unique enough for shorter reads) than mapping the two ends separately. With paired-end mappings the effects are modest.
A program for combining overlapping reads is FLASH: http://ccb.jhu.edu/software/FLASH/ To run it:
flash SRR611712_1.fastq SRR611712_2.fastq -o SRR611712_12.flash
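With -o setting the output prefix as above, FLASH should produce, following its default naming, files along these lines:
#SRR611712_12.flash.extendedFrags.fastq - successfully merged reads
#SRR611712_12.flash.notCombined_1.fastq - forward reads of pairs that did not overlap
#SRR611712_12.flash.notCombined_2.fastq - reverse reads of pairs that did not overlap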
For improving the assembly or the detection of larger genome rearrangements there are other libraries with various insert sizes, such as 2.5-3kb, 5kb and more. Sequencing yields from such libraries are often lower than from conventional ones.
- stranded vs unstranded (RNASeq only)
We can obtain reads from just a given strand using special Illumina wet-lab kits. This is of great value for subsequent gene calling, since we can distinguish between overlapping genes on opposite strands.
quality checking (FastQC)
It is always a good idea to check the quality of the sequencing data prior to mapping. We can analyze average quality, over-represented sequences, the number of Ns along the read and many other parameters. The program to use is FastQC; it can be run in command-line or GUI mode.
- good quality report:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
- bad quality FastQC report
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
trimming & filtering
Depending on the application, we can try to improve the quality of our data set by removing bad-quality reads, clipping the last few problematic bases, or searching for sequencing artifacts such as Illumina adapters. All this makes much sense for de novo sequencing, where genome assemblies can be improved by data clean-up. It has a lower priority for mapping, especially when we have high coverage: bad-quality reads will simply be discarded by the mapper.
You can read more about quality trimming for genome assembly in the two blog posts by Nick Loman:
http://pathogenomics.bham.ac.uk/blog/2013/04/adaptor-trim-or-die-experiences-with-nextera-libraries/
Trimmomatic
web: http://www.usadellab.org/cms/index.php?page=trimmomatic
version (Sep 2014): 0.32
From the manual:
Paired End:
java -jar /path/to/trimmomatic-0.32.jar PE -phred33 \
    input_forward.fq.gz input_reverse.fq.gz \
    output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
    output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This will perform the following:
- Remove adapters
- Remove leading low quality or N bases (below quality 3)
- Remove trailing low quality or N bases (below quality 3)
- Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15
- Drop reads below 36 bases long
Single End:
java -jar /path/to/trimmomatic-0.32.jar SE -phred33 \
    input.fq.gz output.fq.gz \
    ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This will perform the same steps, using the single-ended adapter file.
Tagdust (for simple unpaired reads)
Tagdust is a program for removing reads that contain Illumina adapter sequences. Such reads, containing 6-8 bases not coming from the genome, will be impossible to map using typical mappers, which often have just a 2-mismatch limit. Tagdust works in unpaired mode, so when using paired reads we have to "mix and match" the two outputs to allow for paired mappings.
tagdust -o my_reads.clean.out.fq -a my_reads.artifact.out.fq adapters.fasta my_reads.input.fq
Error correction
For some applications, like de novo genome assembly, one can correct sequencing errors in the reads by comparing them with other reads of almost identical sequence. One of the programs which perform this, and which is relatively easy to install and get running, is Coral.
Coral
web site: http://www.cs.helsinki.fi/u/lmsalmel/coral/
version: 1.4
It requires a machine with a lot of RAM for correcting individual Illumina files (we ran it on a machine with 96GB RAM).
#Illumina reads
./coral -fq input.fq -o output.fq -illumina
#454 reads
./coral -fq input.454.fq -o output.454.fq -454
#correcting 454 reads with Illumina reads:
#Coral can not use more than one input file, therefore one has to combine
#Illumina & 454 reads into one FASTQ file, noting the number of Illumina reads used.
#To prepare such a file:
cat 10Millions_illumina_reads.fq > input_4_coral.fq
cat some_number_of_454_reads.fq >> input_4_coral.fq
#run coral with the command:
coral -454 -fq input_4_coral.fq -o input_4_coral.corrected.fq -i 10000000 -j 10000000
#This will correct just the 454 reads, not the Illumina ones.
#In real life you have to count the number of reads in your Illumina FASTQ file,
#e.g. (assuming you do not have wrapped sequence/qualities, FASTQ 1 read = 4 lines):
wc -l illumina_reads.fq | awk '{print $1/4}'
#if in doubt, use fastqc to get the numbers
Public FASTQ data: Short Read Archive vs ENA
While we will often have our data sequenced in house or provided by collaborators, we can also reuse sequences made public by others. Nobody does everything imaginable with their data, so it is quite likely we can do something new and useful with already published data, even if just treating it as a control for our pipeline. Also, doing exactly the same thing, say assembling genes from RNASeq data, but with newer versions of the software and/or more data, will likely improve on the results of previous studies. There are two main places to get such data sets:
- NCBI Short Read Archive / Taxonomy Browser:
For both of them it is a good idea to install the Aspera Connect program to reduce download times:
click on Resources, then download and install the web plugin.
* go there
* put "Leishmania major RNA" in the search box
* we get "SRA Experiments 6"
* click on "6"
* one of them is: "Transcriptome Analysis of Leishmania major strain Friedlin", ERX190352
- Taxonomy Browser http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
* go there
* put "Leishmania major" in the search box
* click on the checkbox SRA Experiments
* click on the Display button
* we get 58 public experiments, 7 of which are RNA
* click on RNA (7) on the left
* note "Whole Genome Sequencing of Leishmania major strain Friedlin", Accession: SRX203187
CAVEAT: not everybody submits the right descriptions of their experiments. In case of doubt, download and map.
Extracting FASTQ from an SRA archive is easy with the sra-toolkit installed on your computer. You can get it here: https://github.com/ncbi/sratoolkit
fastq-dump SRR1016916.sra
#Result: SRR1016916.fastq
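For paired-end runs it is worth splitting the two mates into separate files with the --split-files option:
fastq-dump --split-files SRR1016916.sra
#Result: SRR1016916_1.fastq SRR1016916_2.fastq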
You can also get multiple SRA files in one step, without 20 clicks at NCBI, with configured Aspera, by writing a shell script (one line per file/study), preferably generated with a (Python/Perl/Ruby) script from a list of files to get:
ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \
    anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012154/ ./
ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh \
    anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/SRP012/SRP012155/ ./
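A sketch of generating such a script from a list of study accessions (accessions.txt, with one SRPxxxxxx accession per line, is a hypothetical input file):
#one ascp line per study accession; ${acc:0:6} gives the SRP012-style prefix used in the path
while read acc; do
    echo "ascp -QTrk1 -l200m -i /home/me/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByStudy/sra/SRP/${acc:0:6}/${acc}/ ./"
done < accessions.txt > get_all_sra.sh
sh get_all_sra.sh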
Which one to use? ENA may be easier, as you get gzipped FASTQ files directly. But the NCBI tools may at times have a better interface, so you can search for an interesting data set at NCBI, then store the names of the experiments and download the fastq.gz files from ENA.
SAM and BAM file formats
The SAM file format serves to store information about the results of mapping reads to a genome. It starts with a header describing the format version, the sorting order (SO) of the reads, and the genomic sequences to which the reads were mapped. Numbering starts from base 1, and the format is line-oriented (one read alignment per line).
The minimal version looks like this:
@HD  VN:1.0  SO:unsorted
@SQ  SN:1  LN:171001
@PG  ID:bowtie2  PN:bowtie2  VN:2.1.0
It can contain both mapped and unmapped reads (we are mostly interested in the mapped ones). Here is an example:
#unmapped reads:
SRR594632.1  4  *  0  0  *  *  0  0  NTGAGGACAAAGCTCCGCTGCTGCTCGCTTTCCCCT  BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SRR594632.2  4  *  0  0  *  *  0  0  NTGAGCGAACTATTGATATTCGGCATAGAACAAAAA  BQKKHNMKOR___T_TTRWWVTVTVVTTOV___QQQ
SRR594632.3  4  *  0  0  *  *  0  0  NTGTTATTCAATCTGAGCATCGTTATCTTGACCATG  BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
#mapped reads
SRR594632.20  0  LmxM.20  3139684  25  36M  *  0  0  NTGAAAATAAAGAACATCTGAACATTTCTCTGTAAG  BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB  XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:0A0A34
SRR594632.21  16  LmxM.22  312482  25  36M  *  0  0  ATACAAGGCACAAAAATAAAGCAGTCGAGAGAGCAN  BBBBBBBBBBBBBBBB__________RRRRRLLLQB  XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:34T0G0
SRR594632.22  16  LmxM.01  33973  25  36M  *  0  0  AACACTGTGCGTGTGCGCTTGTCTAAGATGAGCCAN  _WTRWWTOTVTTVPTT__________PPLOPJJJKB  XT:A:U  NM:i:2  X0:i:1  X1:i:0  XM:i:2  XO:i:0  XG:i:0  MD:Z:34T0A0
In short, it is a complex format where each line holds detailed information about a mapped read: its quality, mapped position(s), strand, etc. The exact description (together with BAM and examples) takes 15 pages: http://samtools.sourceforge.net/SAMv1.pdf Use BAM instead of SAM files (speed of access, size, compatibility with tools). There are multiple tools to process and extract information from SAM and its compressed form, BAM. A few used on a daily basis:
- samtools
- Picard
- bamtools
- bedtools
Typical tasks:
#view the BAM file
samtools view your_bam | less
#view the header of the BAM file
samtools view -H your_bam | less
#sort the BAM/SAM file
java -jar /path/to/SortSam.jar I=mapped_reads.unsorted.sam O=mapped_reads.sorted.bam \
    SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true
#merge (sorted) BAM files:
java -jar /path/to/MergeSamFiles.jar \
    I=mapped_reads.set1.bam \
    I=mapped_reads.set2.bam \
    I=mapped_reads.set3.bam \
    O=merged_mapped_reads.sets1_2_3.bam \
    SO=coordinate VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true ASSUME_SORTED=true
#extract, as BAM, reads mapped to a certain chromosome (LmxM.01):
bamtools filter -in ERR307343_12.Lmex.bwa_mem.Lmex.bwa.sorted.bam \
    -out chr_1_ERR307343_12.Lmex.bwa_mem.Lmex.bam -region LmxM.01
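Two more everyday samtools commands worth adding to this list:
#index a coordinate-sorted BAM (creates mapped_reads.sorted.bam.bai)
samtools index mapped_reads.sorted.bam
#quick mapping statistics: mapped reads, properly paired reads, duplicates, etc.
samtools flagstat mapped_reads.sorted.bam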
GTF/GFF
The most commonly used sequence annotation format, with a few flavors.
GTF Fields:
seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position of the feature, with sequence numbering starting at 1.
end - End position of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
Sample GFF output from Ensembl export:
X  Ensembl  Repeat       2419108  2419128  42      .  .  hid=trf; hstart=1; hend=21
X  Ensembl  Repeat       2419108  2419410  2502    -  .  hid=AluSx; hstart=1; hend=303
X  Ensembl  Repeat       2419108  2419128  0       .  .  hid=dust; hstart=2419108; hend=2419128
X  Ensembl  Pred.trans.  2416676  2418760  450.19  -  2  genscan=GENSCAN00000019335
X  Ensembl  Variation    2413425  2413425  .       +  .
X  Ensembl  Variation    2413805  2413805  .       +  .
And from TriTrypDB (gff3)
LmjF.01  TriTrypDB  CDS   12073  12642  .  -  0  ID=cds_LmjF.01.0040-1;Name=cds;description=.;size=570;Parent=rna_LmjF.01.0040-1
LmjF.01  TriTrypDB  exon  12073  12642  .  -  .  ID=exon_LmjF.01.0040-1;Name=exon;description=exon;size=570;Parent=rna_LmjF.01.0040-1
LmjF.01  TriTrypDB  gene  15025  17022  .  -  .  ID=LmjF.01.0050;Name=LmjF.01.0050;description=carboxylase%2C+putative;size=1998;web_id=LmjF.01.0050;locus_tag=LmjF.01.0050;size=1998;Alias=321438056,389592315,LmjF1.0050,LmjF01.0050,LmjF.01.0050,LmjF01.0050:pep,LmjF01.0050:mRNA
The detailed description of GFF3: http://www.sequenceontology.org/gff3.shtml
IGV-centered CAVEAT: GFF files downloaded from various sources often contain more annotations than a single IGV track can handle. Typical offenders are chromosome/scaffold lines, which can make gene annotations invisible. Since the authors are quite inventive in naming their features, and key words like "chromosome" may also appear in column 9, you can:
#see what is there:
grep -v "^#" TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff | awk '{print $3}' | sort | uniq -c | sort -n
#result for TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff:
      1 random_sequence
     16 rRNA
     34 chromosome
     83 tRNA
    714 transcript
   8250 mRNA
   8336 CDS
   9063 gene
   9149 exon
 908081
As you can see, we can try to remove just those 34+1 lines (chromosome + random_sequence) and check that we get 908081-35 entries when running the same command on the result. Just remember that we also removed the header, which we may want to keep.
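A sketch of such filtering with awk, this time keeping the header lines (the output file name is just an illustration):
#keep header (#...) lines, drop chromosome and random_sequence features
awk '/^#/ || ($3 != "chromosome" && $3 != "random_sequence")' \
    TriTrypDB-8.0_LmexicanaMHOMGT2001U1103.gff > TriTrypDB-8.0_Lmexicana.no_chrom.gff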
BED
Another text annotation file format from UCSC, compatible with its Genome Browser: http://genome.ucsc.edu/cgi-bin/hgGateway
Full format description: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
CAVEAT: first base in contig/chromosome is numbered as 0
- Essential 3 columns
1: chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
2: chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3: chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
#just regular genomic intervals, e.g. for GC calculations
LmxM.01  0      10000
LmxM.01  10000  20000
LmxM.01  20000  30000
LmxM.01  30000  40000
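Such regular windows do not have to be written by hand; bedtools can generate them from the chromosome sizes file created in the FASTA section above (a sketch, assuming bedtools is installed):
#10kb windows across the whole genome
bedtools makewindows -g Lmex_genome.sort.sizes -w 10000 > Lmex_10kb_windows.bed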
- Often used next 3 columns
4: name - Defines the name of the BED line.
5: score - A score between 0 and 1000.
6: strand - Defines the strand - either '+' or '-'.
#intervals with names, scores and strands
LmxM.01  210   720   Rep1     100   -
LmxM.01  800   1234  ABC|cds  1000  +
LmxM.01  2100  3030  DEF|gen  100   +
- Last 6 columns for extended features
7: thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
8: thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
9: itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line.
10: blockCount - The number of blocks (exons) in the BED line.
11: blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
12: blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
VCF
Stands for Variant Call Format, a text file format for storing information about SNPs, insertions and deletions. It is rather complex; see the detailed description: http://www.1000genomes.org/node/101
Obtaining variants in this form is a multistep but well-standardised procedure; see the GATK section below.
Here are the examples from our data:
#simple SNP
LmxM.01  17218  .  G  A  1818.77  .  AC=2;AF=1.00;AN=2;BaseQRankSum=2.761;DP=55
#insertion
LmxM.01  38177  .  C  CTG  367.73  .  AC=1;AF=0.500;AN=2;BaseQRankSum=0.082;DP=34
#deletion
LmxM.01  99318  .  GCACACACA  G  1774.73  .  AC=2;AF=1.00;AN=2;DP=30
For using it with IGV, we need a VCF index, created e.g. by GATK.
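If no .idx file was produced along the way, igvtools (bundled with IGV) can also create one (a sketch):
#creates our_variants.vcf.idx next to the VCF
igvtools index our_variants.vcf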
WIG
Another line-oriented format, for storing and displaying numerical information along genome intervals. The generic form of WIG:
a declaration line defining the type of this section: chromosome, start position, step, span and more
number1
number2
The intervals can be regular, as when we compute some measure across the whole genome. This is the fixedStep type:
fixedStep chrom=LmxM.01 start=0 step=10000 span=10000
0.572100
0.612700
0.637300
0.628000
0.637200
0.643700
0.642800
The fixedStep WIG above comes from GC calculations in fixed windows along the genome. The procedure is described here (we start from "create a N-base wide step file"): http://wiki.bits.vib.be/index.php/Create_a_GC_content_track
When the calculated values different from 0 are sparse along our genomic intervals, it is simpler to use variableStep:
variableStep chrom=LmxM.17 span=41
1 0.142857
variableStep chrom=LmxM.17 span=249
43 0.142857
variableStep chrom=LmxM.17 span=4
292 0.166667
variableStep chrom=LmxM.17 span=7
296 0.142857
This variableStep example comes from the gem-mappability program from GEM-Mapper.
#index the genome (fix FASTA headers first to one-word sequence names)
gem-indexer -i Lmex_genome.fa -o Lmex_genome
#calculate mappability for 100bp fragments
gem-mappability -I Lmex_genome.gem -o Lmex_genome -l 100
#convert to WIG
gem-2-wig -I Lmex_genome.gem -i Lmex_genome.mappability -o Lmex_genome.mappability
#Result (loadable in IGV): Lmex_genome.mappability.wig