Harvard:Biophysics 101/Notebook:ZS/2007-2-6

Assignment 1, due 2/6/07
Link to Assignment 1

Original
Output of original code (for Homo sapiens topoisomerase (DNA) II binding protein 1, mRNA link here): GenBank id: BC126209.1 total sequence length: 5148 method 1 A 1669 AA 427 AAA 150 AAAA 48 AAAAA 16 AAAAAA 8 AAAAAAA 4 AAAAAAAA 1 AAAAAAAAA 0 method 2 A 1669 AA 589 AAA 218 AAAA 76 AAAAA 29 AAAAAA 13 AAAAAAA 5 AAAAAAAA 1 AAAAAAAAA 0

Different GI
For modified version to display subsection of cyanobacteria PCC7942 genome (GI: 23957787) :

Output: GenBank id: AY120853.2 total sequence length: 74355 method 1 A 16631 AA 3436 AAA 845 AAAA 210 AAAAA 57 AAAAAA 16 AAAAAAA 5 AAAAAAAA 0 AAAAAAAAA 0 method 2 A 16631 AA 4327 AAA 1117 AAAA 288 AAAAA 78 AAAAAA 21 AAAAAAA 5 AAAAAAAA 0 AAAAAAAAA 0

Poly-T instead of Poly-A
For code:

Output:

GenBank id: BC126209.1 total sequence length: 5148 method 1 T 1387 TT 309 TTT 102 TTTT 31 TTTTT 10 TTTTTT 4 TTTTTTT 1 TTTTTTTT 0 TTTTTTTTT 0 method 2 T 1387 TT 418 TTT 144 TTTT 46 TTTTT 15 TTTTTT 5 TTTTTTT 1 TTTTTTTT 0 TTTTTTTTT 0

Translating protein sequence
The tricky part with translating the sequence is that the coding sequence is buried in the data from NCBI; there needs to be a second parsing layer to find the start codon, extract the CDS, and find the end codon at which to terminate the signal. For this, an extra parsing layer needs to be made; the code is generalizable for any sequence with 1 CDS in it.

For code:

Output (sorry for formatting, but it matches NCBI):

MSRNDKEPFFVKFLKSSDNSKCFFKALESIKEFQSEEYLQIITEEEALKIKENDRSLYICDPFSGVVFDHLKKLGCRIVGPQVVIFCMHHQRCVPRAEHPVYNMVMSDVTISCTSLEKEKREEVHKYVQMMGGRVYRDLNVSVTHLIAGEVGSKKYLVAANLKKPILLPSWIKTLWEKSQEKKITRYTDINMEDFKCPIFLGCIICVTGLCGLDRKEVQQLTVKHGGQYMGQLKMNECTHLIVQEPKGQKYECAKRWNVHCVTTQWFFDSIEKGFCQDESIYKTEPRPEAKTMPNSSTPTSQINTIDSRTLSDVSNISNINASCVSESICNSLNSKLEPTLENLENLDVSAFQAPEDLLDGCRIYLCGFSGRKLDKLRRLINSGGGVRFNQLNEDVTHVIVGDYDDELKQFWNKSAHRPHVVGAKWLLECFSKGYMLSEEPYIHANYQPVEIPVSHKPESKAALLKKKNSSFSKKDFAPSEKHEQADEDLLSQYENGSSTVVEAKTSEARPFNDSTHAEPLNDSTHISLQEENQSSVSHCVPDVSTITEEGLFSQKSFLVLGFSNENESNIANIIKENAGKIMSLLSRTVADYAVVPLLGCEVEATVGEVVTNTWLVTCIDYQTLFDPKSNPLFTPVPVMTGMTPLEDCVISFSQCAGAEKESLTFLANLLGASVQEYFVRKSNAKKGMFASTHLILKERGGSKYEAAKKWNLPAVTIAWLLETARTGKRADESHFLIENSTKEERSLETEITNGINLNSDTAEHPGTRLQTHRKTVVTPLDMNRFQSKAFRAVVSQHARQVAASPAVGQPLQKEPSLHLDTPSKFLSKDKLFKPSFDVKDALAALETPGRPSQQKRKPSTPLSEVIVKNLQLALANSSRNAVALSASPQLKEAQSEKEEAPKPLHKVVVCVSKKLSKKQSELNGIAASLGADYRWSFDETVTHFIYQGRPNDTNREYKSVKERGVHIVSEHWLLDCAQECKHLPESLYPHTYNPKMSLDISAVQDGRLCNSRLLSAVSSTKDDEPDPLILEENDVDNMATNNKESAPSNGSGKNDSKGVLTQTLEMRENFQKQLQEIMSATSIVKPQGQRTSLSRSGCNSASSTPDSTRSARSGRSRVLEALRQSRQTVPDVNTEPSQNEQIIWDDPTAREERARLASNLQWPSCPTQYSELQVDIQNLEDSPFQKPLHDSEIAKQAVCDPGNIRVTEAPKHPISEELETPIKDSHLIPTPQAPSIAFPLANPPVAPHPREKIITIEETHEELKKQYIFQLSSLNPQERIDYCHLIEKLGGLVIEKQCFDPTCTHIVVGHPLRNEKYLASVAAGKWVLHRSYLEACRTAGHFVQEEDYEWGSSSILDVLTGINVQQRRLALAAMRWRKKIQQRQESGIVEGAFSGWKVILHVDQSREAGFKRLLQSGGAKVLPGHSVPLFKEATHLFSDLNKLKPDDSGVNIAEAAAQNVYCLRTEYIADYLMQESPPHVENYCLPEAISFIQNNKELGTGLSQKRKAPTEKNKIKRPRVH

Raw record and final code
Final code
 * 1) !/usr/bin/env python

from Bio import GenBank, Seq from Bio.Seq import Seq,translate

record_parser = GenBank.FeatureParser
 * 1) We can create a GenBank object that will parse a raw record
 * 2) This facilitates extracting specific information from the sequences

ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = record_parser)
 * 1) NCBIDictionary is an interface to Genbank

parsed_record = ncbi_dict['116496646']
 * 1) If you pass NCBIDictionary a GenBank id, it will download that record

print "GenBank id:", parsed_record.id

s = parsed_record.seq.tostring print "total sequence length:", len(s)
 * 1) Extract the sequence from the parsed_record

max_repeat = 9

print "method 1" for i in range(max_repeat): substr = ''.join(['T' for n in range(i+1)]) print substr, s.count(substr)

print "\nmethod 2" for i in range(max_repeat): substr = ''.join(['T' for n in range(i+1)]) count = 0 pos = s.find(substr,0) while not pos == -1: count = count + 1 pos = s.find(substr,pos+1) print substr, count


 * 1) translate sequence ==> protein, searching for start codon

start_site = s.find('ATG') #position of start codon totrans ='' #compiling string of CDS

counter = start_site countcodon = 0; for i in range(len(s)-4-start_site): totrans = totrans + s[counter] counter = counter + 1 stoptest = s[counter]+s[counter+1]+s[counter+2] if countcodon == 2: if stoptest == 'TAA' or stoptest == 'TAG' or stoptest == 'TGA': totrans = totrans + s[counter+1] + s[counter+2] break countcodon = -1 countcodon = countcodon + 1 trans = translate(totrans)

print 'protein translation is %s' % trans

print '\n' new_dict = GenBank.NCBIDictionary('nucleotide','genbank') record = new_dict['116496646'] print record <\pre>
 * 1) new ncbi dictionary