User talk:Darek Kedra/sandbox 26

From OpenWetWare

GFF Comparison

GFF, and GFF3 in particular, is a fairly common tab-delimited text format for storing genomic feature annotations. For a description see:

When annotating a genome with multiple tools, one often needs to compare their outputs, e.g. gene predictions and EST/protein mappings. Given two GFF files (A and B) with gene models, one can compare them on various levels, such as:

  • nucleotide level

how many nucleotides annotated as features (e.g. nucleotides in exons) are present in both sets

  • splice junction level
  • exon level

how many exons on the same strand have exactly the same coordinates in both sets

  • gene level

how many genes are identical
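The nucleotide-level and exon-level counts described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any tool listed on this page. The function names (`read_exons`, `covered_bases`, `compare`) are hypothetical. The sketch assumes standard tab-delimited GFF with the feature type in column 3, coordinates in columns 4 and 5, and strand in column 7, and it assumes that nucleotides on opposite strands should not count as shared.

```python
def read_exons(lines, feature_type="exon"):
    """Collect (seqid, strand, start, end) tuples for one feature type."""
    exons = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2] != feature_type:
            continue  # only compare like with like (see column 3)
        exons.add((cols[0], cols[6], int(cols[3]), int(cols[4])))
    return exons


def covered_bases(exons):
    """Expand exons into (seqid, strand, position) tuples.

    Simple but memory-hungry; suitable for small inputs only.
    """
    bases = set()
    for seqid, strand, start, end in exons:
        # GFF coordinates are 1-based and inclusive
        for pos in range(start, end + 1):
            bases.add((seqid, strand, pos))
    return bases


def compare(lines_a, lines_b):
    """Return shared-nucleotide and identical-exon counts for two GFF inputs."""
    exons_a, exons_b = read_exons(lines_a), read_exons(lines_b)
    bases_a, bases_b = covered_bases(exons_a), covered_bases(exons_b)
    return {
        "shared_nucleotides": len(bases_a & bases_b),
        "identical_exons": len(exons_a & exons_b),
    }
```

To use it on files, one could pass two open file handles, e.g. `compare(open("A.gff"), open("B.gff"))`. Gene-level comparison would additionally require grouping exons by their parent feature, which is left out here.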

For more information read the master's thesis of Evan Keibler (author of Eval):

CAVEAT: the tools listed below are often fairly simple. Some do not take the "type" field (column 3) into account, so one may end up comparing exons from one file with a combined set of genes, exons and introns from another. Some programs smuggle extra information about primary/last exons into the "type" field, so all exons from one file will be compared with only a subset of the exons from the other. Always check that the GFF data are compatible.
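A quick way to check compatibility is to tally the values of the "type" column in each file before comparing them. The sketch below is only an illustration of that check. The function names `count_types` and `types_only_in_one` are hypothetical, and the input is assumed to be standard tab-delimited GFF.

```python
from collections import Counter


def count_types(lines):
    """Count how often each value occurs in the "type" field (column 3)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def types_only_in_one(lines_a, lines_b):
    """Return the feature types found only in A and only in B."""
    types_a = set(count_types(lines_a))
    types_b = set(count_types(lines_b))
    return sorted(types_a - types_b), sorted(types_b - types_a)
```

Types that appear in only one of the two files (for example a non-standard exon subtype) are a hint that the features need to be renamed or filtered before the comparison.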

Perl scripts collection


Tested: (sorts GFF streams by sequence name and startpoint)
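For readers who prefer Python, the same kind of sort (by sequence name, then by start position) can be sketched as follows. This is not a port of the Perl script; `sort_gff` is a hypothetical name, and the sketch assumes well-formed tab-delimited records with the start coordinate in column 4. All comment and directive lines are simply moved to the top, which may not be appropriate for GFF3 files that contain `###` separators or an embedded FASTA section.

```python
def sort_gff(lines):
    """Return GFF lines sorted by sequence name and start position.

    Comment/directive lines (starting with '#') are kept, in their
    original order, ahead of the sorted records.
    """
    comments, records = [], []
    for line in lines:
        if line.startswith("#"):
            comments.append(line)
        elif line.strip():
            records.append(line)

    def sort_key(line):
        cols = line.split("\t")
        return (cols[0], int(cols[3]))  # (seqid, start)

    records.sort(key=sort_key)
    return comments + records
```

Sequence names are compared as plain strings here, so "chr10" sorts before "chr2".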

Python efforts

  • Brad Chapman's GFF parser:

  • GFFutils by Ryan Dale:

  • Pygr

main link:

discussion about gff/annotation parsing:

  • bpbio

  • bx-python


  • BioRuby library:


  • BioJava module:

Stand-alone programs

  • Eval

link: version: 2.2.8

Perl program with GUI.

  • GFPE

GFPE: gene-finding program evaluation Bioinformatics (2003) 19 (13): 1712-1713. doi: 10.1093/bioinformatics/btg216

link: Program in Java.

  • overlap

link: author: Sarah Djebali