Wilke:Parsing Genbank files with Biopython

= The wonders of Biopython =

Virtually all of this information comes from the excellent but tome-like Biopython Tutorial. Consult it to make your wishes come true. This wiki is actively being built up, so don't lose hope if it is barren in some areas. Ask Thomas if you want some areas to be expanded upon.

Parsing Genbank Files
Biopython is an amazing resource if you don't feel like figuring out how to parse a bunch of different idiosyncratic sequence formats (fasta,fastq,genbank, etc).

Here I focus on parsing Genbank files; SeqIO can be used to parse a bunch of different formats, but the structure of the parsed data will vary.

Let's say you want to go through every gene in an annotated genome and pull out all the genes with some specific characteristic (say, we have no idea what they do). This problem is pretty easy once you know how to use Biopython's data structures.

For this example I will be using the E.coli K12 genome, which clocks in at around 13 mbytes.

To begin, we need to load the parser and parse the genbank file. It should only take a couple seconds from Bio import SeqIO genome=SeqIO.read('CP000948.gbk','genbank') #you MUST tell SeqIO what format is being read

Use SeqIO.read if there is only one genome (or sequence) in the file, and SeqIO.parse if there are multiple sequences. Since we're using genbank files, there typically (I think) only be a single giant sequence of the genome.

There are a bunch of data objects associated to the parsed file. The main one of interest will be the features object, which is a list of all the annotated features in the genome file.

>>>genome.features[:10] #prints a short description of the first ten features [SeqFeature(FeatureLocation(ExactPosition(0),ExactPosition(4686137)), type='source', strand=1), SeqFeature(FeatureLocation(ExactPosition(189),ExactPosition(255)), type='gene', strand=1), SeqFeature(FeatureLocation(ExactPosition(189),ExactPosition(255)), type='CDS', strand=1), SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(2799)), type='gene', strand=1), SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(2799)), type='CDS', strand=1), SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(2790)), type='misc_feature', strand=1), SeqFeature(FeatureLocation(ExactPosition(336),ExactPosition(1224)), type='misc_feature', strand=1), SeqFeature(FeatureLocation(ExactPosition(348),ExactPosition(1134)), type='misc_feature', location_operator='order', strand=1), SeqFeature(FeatureLocation(ExactPosition(348),ExactPosition(1035)), type='misc_feature', location_operator='order', strand=1), SeqFeature(FeatureLocation(ExactPosition(447),ExactPosition(717)), type='misc_feature', location_operator='order', strand=1)]

More on Features (ie what's interesting in genbank files)
Features contain all the annotation information that you care about. Biopython has a somewhat confusing object structure, so let's step through what types of information a feature can have.

The easiest way to inspect the structure of some random object I have found is Ipython, which is an awesome python interpreter that also has some nice terminal features (like cd ls mv...etc). After using this interpreter for a year, I hate going back to the vanilla one. Its best feature (for my forgetful mind) is easy access to help files associated with functions, and the objects associated with a class. It also will try to complete a partially typed function or variable name if you press TAB midway through. To make this description more concrete, here's some ipython output.

SeqIO.read?
 * 1) I forgot how to call SeqIO.read, what does it want?
 * 2) All I have to do is type SeqIO.read? ,then press Return

Features themsType:            function Base Class:       String Form: Namespace:       Interactive File:            /usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py Definition:       SeqIO.read(handle, format, alphabet=None) Docstring: Turns a sequence file into a single SeqRecord. - handle  - handle to the file, or the filename as a string (note older verions of Biopython only took a handle). - format  - string describing the file format. - alphabet - optional Alphabet object, useful when the sequence type cannot be automatically inferred from the file itself (e.g. format="fasta" or "tab") MORE STUFF OMITTED genome=SeqIO.read('k12_original.gbk','genbank') feature=genome.features[4]


 * 1) How do I find out what extra stuff is associated with this mythical feature?
 * 2) add a . then TAB
 * 3) You can also get this information in a more annoying (to me) way with dir(feat)

In [48]: feat. feat.__class__         feat.__hash__           feat.__repr__           feat._shift             feat.ref feat.__delattr__       feat.__init__           feat.__setattr__        feat.extract            feat.ref_db feat.__dict__          feat.__module__         feat.__sizeof__         feat.id                 feat.strand feat.__doc__           feat.__new__            feat.__str__            feat.location           feat.sub_features feat.__format__        feat.__reduce__         feat.__subclasshook__   feat.location_operator  feat.type feat.__getattribute__  feat.__reduce_ex__      feat.__weakref__        feat.qualifiers

As you can see, features contain lots of cryptic information. The four most important directly useful are generally type, qualifiers, extract, and location. I will explain each in turn.

Feature Types
Let's see what feature types the E. coli genome contains. These labels will (to my knowledge) apply to similar information in any genbank genome.

In [49]: feats=set

In [50]: for feat in genome.features: ....:    feats.add(feat.type) ....:

In [51]: feats Out[51]: set(['CDS',    'gene',     'mat_peptide',     'misc_feature',     'ncRNA',     'rRNA',     'rep_origin',     'repeat_region',     'source',     'tRNA',     'tmRNA'])

The main one we'll focus on are CDS features, which stands for coding sequences. These are the spliced (introns removed) mRNAs that are translated into function proteins. I believe gene features refer to the unspliced sequence, but don't quote me on that. For prokaryotes there's not really a difference since introns are virtually absent.

Feature Qualifiers
Features have the bulk of their annotation information stored in a dictionary named qualifiers. Typical information will be 'product' (for genes), 'gene' (name), and 'note' for misc. crap. Her's the qualifier dictionary for the first coding sequence (feature.type=='CDS'):

In [54]: feat=genome.features[4]

In [55]: feat.qualifiers Out[55]: {'EC_number': ['2.7.2.4', '1.1.1.3'], 'codon_start': ['1'], 'db_xref': ['GI:1786183', 'ASAP:ABE-0000008', 'UniProtKB/Swiss-Prot:P00561', 'EcoGene:EG10998'], 'experiment': ['N-terminus verified by Edman degradation: PMID 354697,4562989'], 'function': ['enzyme; Amino acid biosynthesis: Threonine', '1.5.1.8 metabolism; building block biosynthesis; amino acids; threonine', '1.5.1.21 metabolism; building block biosynthesis; amino acids; homoserine', '7.1 location of gene products; cytoplasm'], 'gene': ['thrA'], 'gene_synonym': ['ECK0002', 'Hs', 'JW0001', 'thrA1', 'thrA2', 'thrD'], 'locus_tag': ['b0002'], 'note': ['bifunctional: aspartokinase I (N-terminal); homoserine dehydrogenase I (C-terminal); GO_component: GO:0005737 - cytoplasm; More stuff], 'product': ['fused aspartokinase I and homoserine dehydrogenase I'], 'protein_id': ['AAC73113.1'], 'transl_table': ['11'], 'translation': ['MRVLKFGGTSVANA ...blah blah... GNDVTAAGVFADLLRTLSWKLGV']}

How would we use this information in practice? Well, 'product' and 'function' provide the current knowledge of what the gene (is thought to) make and what it (is thought to) do. A simple example for selecting specific types of genes.

predicted_genes=[] for i,feature in enumerate(genome.features): if feature.type=='CDS': if 'product' in feature.qualifiers: #verify it codes for a real protein (not pseudogene) tot_genes+=1 #data in qualifiers are all lists, even if only 1 string, so [0] converts to string #use lower to make sure weirdly capitilized strings get selected as well product=feature.qualifiers['product'][0].lower

if product=='predicted protein': num_predicted+=1 predicted_genes.append(feature)

Getting the actual coding sequence
Grabbing the sequence associated with a feature is now pretty easy. You previously had to do extra work if the gene was on the opposite strand. seq=feature.extract(genome.seq) protein=seq.translate #can also use optional switches like cds=True to raise errors if not a proper coding sequence (lacks start and stop codons)

As of Biopython ???, feature.extract(genome.seq) incorporates strandedness. Thus, older version of Biopython or sequence slices obtained other than the extract function will give garbled information.

With a little extra work you can use the location information associated with each feature to see what to do. Biopython sometimes seems to be designed to emulate a Russian nesting doll, so there are objects within objects that you need to mess with for this part.

sequence=feature.extract(genome.seq) # simple if you just want the gene sequence; works on antisense strand genes location=feature.location start=location.start.position end=location.end.position
 * 1) things get trickier if you want information outside of gene
 * 1) position gives the actual python slice location of the start and end. need to reverse if on negative strand
 * 2) could use to get upstream or downstream regions

Grabbing genomes from Genbank
You can use Biopython's Entrez module to grab individual genomes. You MUST provide your email so Entrez can email you if you start overloading their servers before they block you. The id used can be pretty much any identifier, such as the acession, the accession version, the genbank id, etc.

Before you go hog wild, read the rules!

Here they are, reproduced:

* Run retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays for any series of more than 100 requests. * Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov, not the standard NCBI Web address. * Make no more than 3 requests every 1 second. * Use the URL parameter email, and tool for distributed software, so that we can track your project and contact you if there is a problem. * NCBI's Disclaimer and Copyright notice must be evident to users of your service.

The big one is the first one. Biopython by default complies with rules 2,3 and 4.

The id used can be pretty much any identifier, such as the accession, the accession version, the Genbank id, etc. From the eFetch documentation : Record Identifier Current values:

* NCBI sequence number (GI) * accession * accession.version * fasta * GeneID * genome ID   * seqid

from Bio import Entrez Entrez.email = 'whatev@mail.com' handle=Entrez.efetch(db='nucleotide',id='NC_009925',rettype='gb') # Accession id works, returns genbank format, looks in the 'nucleotide' database local_file=open('NC_009925','w') local_file.write(handle.read) handle.close local_file.close
 * 1) Replace with your real email
 * 1) store locally