Harvard:Biophysics 101/2007/Notebook:CChi/2007-2-6

Code
Modified to
 * Process a different GenBank ID of your choosing
 * Tally stretches of poly-T instead of poly-A
 * Print the translated protein sequence (hint) and its length
 * Create a new NCBIDictionary without a parser and use that to print a raw record


#!/usr/bin/env python

from Bio import GenBank, Seq
from Bio.Seq import Seq,translate

# We can create a GenBank object that will parse a raw record
# This facilitates extracting specific information from the sequences
record_parser = GenBank.FeatureParser()

# NCBIDictionary is an interface to GenBank
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = record_parser)

# If you pass NCBIDictionary a GenBank id, it will download that record
# Rattus norvegicus cDNA clone, mRNA sequence
parsed_record = ncbi_dict['5000000']

print "GenBank id:", parsed_record.id

# Extract the sequence from the parsed_record
s = parsed_record.seq.tostring()
print "total sequence length:", len(s)

max_repeat = 9


# Tally stretches of poly T

# method 1: str.count tallies non-overlapping occurrences
print "method 1"
for i in range(max_repeat):
    substr = ''.join(['T' for n in range(i+1)])
    print substr, s.count(substr)

# method 2: advance one base at a time, so overlapping occurrences are tallied
print "\nmethod 2"
for i in range(max_repeat):
    substr = ''.join(['T' for n in range(i+1)])
    count = 0
    pos = s.find(substr,0)
    while not pos == -1:
        count = count + 1
        pos = s.find(substr,pos+1)
    print substr, count


# Translate Protein Sequence

# Find start codon
start = s.find('ATG')
readingframe = ''
position = start
genelength = 0

# Find open reading frame until stop codon or end
for i in range(len(s)-start):
    readingframe = readingframe + s[position]
    genelength = genelength + 1
    # After each complete codon, look at the next codon if three more bases remain
    if genelength%3 == 0 and position <= len(s)-4:
        codon=s[position+1]+s[position+2]+s[position+3]
        if codon=='TAG' or codon=='TGA' or codon=='TAA':
            readingframe = readingframe + codon
            break
    position = position + 1

protein = translate(readingframe)
print "\nprotein sequence: ", protein
print "protein length:  ", len(protein)


# Create a new NCBIDictionary without a parser and print raw record

newNCBIdict = GenBank.NCBIDictionary('nucleotide','genbank')
rawrecord = newNCBIdict['5000000']
print "\nraw record:      ", rawrecord
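
The two poly-T tallies in the script are not equivalent in general. `str.count` counts non-overlapping matches, while a `find` loop that advances by one position counts overlapping ones. For this record they agree only because the sequence has no run of three or more T's (the tally for TTT is 0). The following is a minimal sketch in current Python 3 syntax; the function names and the test sequence are made up for illustration and are not part of the assignment code.

```python
def count_non_overlapping(seq, motif):
    """Count matches the way str.count does: after a match, resume past its end."""
    return seq.count(motif)


def count_overlapping(seq, motif):
    """Count matches by restarting the search one position after each hit."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count


if __name__ == "__main__":
    example = "GATTTTTACTT"  # made-up sequence with a run of five T's
    for n in range(1, 6):
        motif = "T" * n
        print(motif, count_non_overlapping(example, motif),
              count_overlapping(example, motif))
```

On a sequence with longer runs, the overlapping count is the larger of the two, so which method is appropriate depends on whether a run of five T's should count as two or four occurrences of TT.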

Output
GenBank id: AI710224.1
total sequence length: 352
method 1
T 78
TT 9
TTT 0
TTTT 0
TTTTT 0
TTTTTT 0
TTTTTTT 0
TTTTTTTT 0
TTTTTTTTT 0

method 2
T 78
TT 9
TTT 0
TTTT 0
TTTTT 0
TTTTTT 0
TTTTTTT 0
TTTTTTTT 0
TTTTTTTTT 0

protein sequence: MSWRHRHKDLIVIF*
protein length:   15

raw record:       LOCUS       AI710224                 352 bp    mRNA    linear   EST 04-JUN-1999
DEFINITION  UI-R-AF0-yd-f-08-0-UI.s1 UI-R-AF0 Rattus norvegicus cDNA clone
            UI-R-AF0-yd-f-08-0-UI 3', mRNA sequence.
ACCESSION   AI710224
VERSION     AI710224.1  GI:5000000
KEYWORDS    EST.
SOURCE      Rattus norvegicus (Norway rat)
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Rattus.
REFERENCE   1  (bases 1 to 352)
  AUTHORS   Bonaldo,M.F., Lennon,G. and Soares,M.B.
  TITLE     Normalization and subtraction: two approaches to facilitate gene
            discovery
  JOURNAL   Genome Res. 6 (9), 791-806 (1996)
   PUBMED   8889548
COMMENT     Contact: Soares, MB
            Coordinated Laboratory for Computational Genomics
            University of Iowa
            375 Newton Road, 4156 MEBRF, Iowa City, IA 52242, USA
            Tel: 319 335 8250
            Fax: 319 335 9565
            Email: bento-soares@uiowa.edu
            Oligo-dT track not found, Not I site shown in beginning of
            sequence is likely internal to the message.
            cDNA Library Preparation: M.B. Soares Lab
            Clone distribution: clones will be available through Research
            Genetics (www.resgen.com)
            Seq primer: M13 Forward
            POLYA=No.
FEATURES             Location/Qualifiers
     source          1..352
                     /organism="Rattus norvegicus"
                     /mol_type="mRNA"
                     /strain="Sprague-Dawley"
                     /db_xref="taxon:10116"
                     /clone="UI-R-AF0-yd-f-08-0-UI"
                     /dev_stage="adult"
                     /lab_host="DH10B (Life Technologies)"
                     /clone_lib="UI-R-AF0"
                     /note="Vector: pT7T3D-PacI; Site_1: Not I; Site_2: Eco RI;
                     The UI-R-AF0 library is a non-normalized library
                     constructed from  15 dpc rat atrioventricular (AV) canal.
                     The tag is a string of 5 nucleotides present  between the
                     Not I site and the oligo-dT track.  The library was
                     constructed as described by Bonaldo, Lennon and Soares,
                     Genome Research 6: 791-806, 1996. Tissue provided by Jim
                     Lin, Department of Biology, University of Iowa.
                     TAG_TISSUE=ventricle at 15 dpc
                     TAG_LIB=UI-R-AF0
                     TAG_SEQ=GTGTC"
ORIGIN
        1 cggccgcccc tcacttcnca tctggcagga ctgaagcaaa ccaccaaagg tcatagcaga
       61 gtgtgggtct tctgctcctc aggtcagcct ctgtcgtggt cgccaggtgc tgctcaaggc
      121 aactgatgag ctggagacac cggcacaaag acctcatcgt catcttctag cccttcctcg
      181 attggcttca tcttgggaga ggctcgctgc tgcggggagg acatggggag agaagccgtg
      241 ctggaggagc cccgcaggaa gtggtggagg ccgtcctcga tgtcgctgga gctattgatg
      301 ctgttcctca tctccagcat ggagatctgt gtgctgaggc tcatctgggg ct
//
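
The printed protein can be cross-checked against the raw record without Biopython. The sketch below is an illustration under stated assumptions, not the method used in the script: it takes bases 121 to 180 of the ORIGIN block (uppercased), which contain the start and stop codons of the translated frame, builds the standard genetic code table from the conventional TCAG ordering, and translates from the first ATG to the first in-frame stop. The names `first_orf` and `translate_orf` are hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Bases 121..180 of the ORIGIN block in the raw record, uppercased
FRAGMENT = (
    "aactgatgag" "ctggagacac" "cggcacaaag"
    "acctcatcgt" "catcttctag" "cccttcctcg"
).upper()


def first_orf(seq):
    """Return the bases from the first ATG through the first in-frame stop codon.

    If no stop codon is found, return whole codons up to the end of seq.
    """
    start = seq.find("ATG")
    if start == -1:
        return ""
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return seq[start:pos + 3]
    end = len(seq) - (len(seq) - start) % 3
    return seq[start:end]


def translate_orf(orf):
    """Translate codon by codon; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf), 3)
    )


if __name__ == "__main__":
    orf = first_orf(FRAGMENT)
    protein = translate_orf(orf)
    print("protein sequence:", protein)
    print("protein length:", len(protein))
```

The reported length of 15 includes the `*` stop symbol, so the peptide itself has 14 residues.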