Harvard:Biophysics 101/Notebook:ZS/2007-2-6
Assignment 1, due 2/6/07
Original
Output of original code (for Homo sapiens topoisomerase (DNA) II binding protein 1, mRNA link here):
GenBank id: BC126209.1 total sequence length: 5148 method 1 A 1669 AA 427 AAA 150 AAAA 48 AAAAA 16 AAAAAA 8 AAAAAAA 4 AAAAAAAA 1 AAAAAAAAA 0 method 2 A 1669 AA 589 AAA 218 AAAA 76 AAAAA 29 AAAAAA 13 AAAAAAA 5 AAAAAAAA 1 AAAAAAAAA 0
Different GI
For modified version to display subsection of cyanobacteria PCC7942 genome (GI: 23957787) : File:2607 kaiC 1.txt
Output:
GenBank id: AY120853.2 total sequence length: 74355 method 1 A 16631 AA 3436 AAA 845 AAAA 210 AAAAA 57 AAAAAA 16 AAAAAAA 5 AAAAAAAA 0 AAAAAAAAA 0 method 2 A 16631 AA 4327 AAA 1117 AAAA 288 AAAAA 78 AAAAAA 21 AAAAAAA 5 AAAAAAAA 0 AAAAAAAAA 0
Poly-T instead of Poly-A
For code: File:Code1 forT.txt
Output:
GenBank id: BC126209.1 total sequence length: 5148 method 1 T 1387 TT 309 TTT 102 TTTT 31 TTTTT 10 TTTTTT 4 TTTTTTT 1 TTTTTTTT 0 TTTTTTTTT 0 method 2 T 1387 TT 418 TTT 144 TTTT 46 TTTTT 15 TTTTTT 5 TTTTTTT 1 TTTTTTTT 0 TTTTTTTTT 0
Translating protein sequence
The tricky part with translating the sequence is that the coding sequence is buried in the data from NCBI; there needs to be a second parsing layer to find the start codon, extract the CDS, and find the end codon at which to terminate the signal. For this, an extra parsing layer needs to be made; the code is generalizable for any sequence with 1 CDS in it.
For code:File:Code fortranslating.txt
Output (sorry for formatting, but it matches NCBI):
MSRNDKEPFFVKFLKSSDNSKCFFKALESIKEFQSEEYLQIITEEEALKIKENDRSLYICDPFSGVVFDHLKKLGCRIVGPQVVIFCMHHQRCVPRAEHPVYNMVMSDVTISCTSLEKEKREEVHKYVQMMGGRVYRDLNVSVTHLIAGEVGSKKYLVAANLKKPILLPSWIKTLWEKSQEKKITRYTDINMEDFKCPIFLGCIICVTGLCGLDRKEVQQLTVKHGGQYMGQLKMNECTHLIVQEPKGQKYECAKRWNVHCVTTQWFFDSIEKGFCQDESIYKTEPRPEAKTMPNSSTPTSQINTIDSRTLSDVSNISNINASCVSESICNSLNSKLEPTLENLENLDVSAFQAPEDLLDGCRIYLCGFSGRKLDKLRRLINSGGGVRFNQLNEDVTHVIVGDYDDELKQFWNKSAHRPHVVGAKWLLECFSKGYMLSEEPYIHANYQPVEIPVSHKPESKAALLKKKNSSFSKKDFAPSEKHEQADEDLLSQYENGSSTVVEAKTSEARPFNDSTHAEPLNDSTHISLQEENQSSVSHCVPDVSTITEEGLFSQKSFLVLGFSNENESNIANIIKENAGKIMSLLSRTVADYAVVPLLGCEVEATVGEVVTNTWLVTCIDYQTLFDPKSNPLFTPVPVMTGMTPLEDCVISFSQCAGAEKESLTFLANLLGASVQEYFVRKSNAKKGMFASTHLILKERGGSKYEAAKKWNLPAVTIAWLLETARTGKRADESHFLIENSTKEERSLETEITNGINLNSDTAEHPGTRLQTHRKTVVTPLDMNRFQSKAFRAVVSQHARQVAASPAVGQPLQKEPSLHLDTPSKFLSKDKLFKPSFDVKDALAALETPGRPSQQKRKPSTPLSEVIVKNLQLALANSSRNAVALSASPQLKEAQSEKEEAPKPLHKVVVCVSKKLSKKQSELNGIAASLGADYRWSFDETVTHFIYQGRPNDTNREYKSVKERGVHIVSEHWLLDCAQECKHLPESLYPHTYNPKMSLDISAVQDGRLCNSRLLSAVSSTKDDEPDPLILEENDVDNMATNNKESAPSNGSGKNDSKGVLTQTLEMRENFQKQLQEIMSATSIVKPQGQRTSLSRSGCNSASSTPDSTRSARSGRSRVLEALRQSRQTVPDVNTEPSQNEQIIWDDPTAREERARLASNLQWPSCPTQYSELQVDIQNLEDSPFQKPLHDSEIAKQAVCDPGNIRVTEAPKHPISEELETPIKDSHLIPTPQAPSIAFPLANPPVAPHPREKIITIEETHEELKKQYIFQLSSLNPQERIDYCHLIEKLGGLVIEKQCFDPTCTHIVVGHPLRNEKYLASVAAGKWVLHRSYLEACRTAGHFVQEEDYEWGSSSILDVLTGINVQQRRLALAAMRWRKKIQQRQESGIVEGAFSGWKVILHVDQSREAGFKRLLQSGGAKVLPGHSVPLFKEATHLFSDLNKLKPDDSGVNIAEAAAQNVYCLRTEYIADYLMQESPPHVENYCLPEAISFIQNNKELGTGLSQKRKAPTEKNKIKRPRVH
Raw record and final code
Final code
#!/usr/bin/env python
from Bio import GenBank, Seq
from Bio.Seq import Seq,translate
# We can create a GenBank object that will parse a raw record
# This facilitates extracting specific information from the sequences
record_parser = GenBank.FeatureParser()
# NCBIDictionary is an interface to Genbank
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = record_parser)
# If you pass NCBIDictionary a GenBank id, it will download that record
parsed_record = ncbi_dict['116496646']
print "GenBank id:", parsed_record.id
# Extract the sequence from the parsed_record
s = parsed_record.seq.tostring()
print "total sequence length:", len(s)
max_repeat = 9
print "method 1"
for i in range(max_repeat):
    substr = .join(['T' for n in range(i+1)])
    print substr, s.count(substr)
print "\nmethod 2"
for i in range(max_repeat):
    substr = .join(['T' for n in range(i+1)])
    count = 0
    pos = s.find(substr,0)
    while not pos == -1:
        count = count + 1
        pos = s.find(substr,pos+1)
    print substr, count
#translate sequence ==> protein, searching for start codon
start_site = s.find('ATG') #position of start codon
totrans = #compiling string of CDS
counter = start_site
countcodon = 0;
for i in range(len(s)-4-start_site):
    totrans = totrans + s[counter]
    counter = counter + 1
    stoptest = s[counter]+s[counter+1]+s[counter+2]
    if countcodon == 2:
        if stoptest == 'TAA' or stoptest == 'TAG' or stoptest == 'TGA':
            totrans = totrans + s[counter+1] + s[counter+2]
            break
        countcodon = -1
    countcodon = countcodon + 1
        
    
trans = translate(totrans)
print 'protein translation is %s' % trans
# new ncbi dictionary
print '\n'
new_dict = GenBank.NCBIDictionary('nucleotide','genbank')
record =  new_dict['116496646']
print record
<\pre>