Harvard:Biophysics 101/2007/Notebook:Denizkural/2007-2-6

Assignment due February 6

Here is the code for my assignment:


#!/usr/bin/env python

from Bio import GenBank, Seq
from Bio.Seq import translate

record_parser = GenBank.FeatureParser()
# We can create a GenBank object that will parse a raw record
# This facilitates extracting specific information from the sequences

ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = record_parser)
# NCBIDictionary is an interface to GenBank

parsed_record = ncbi_dict['124484046']
# If you pass NCBIDictionary a GenBank id, it will download that record

print "GenBank id:", parsed_record.id

s = parsed_record.seq.tostring()
# Extract the sequence from the parsed_record
print "total sequence length:", len(s)

max_repeat = 9

my_protein = translate(s)
# Translate the sequence into a protein

print "protein length:", len(my_protein)
print 'protein translation is: \n%s' %my_protein

print "\nmethod 1"
for i in range(max_repeat):
    substr = ''.join(['T' for n in range(i+1)])
    print substr, s.count(substr)

print "\nmethod 2"
for i in range(max_repeat):
    substr = ''.join(['T' for n in range(i+1)])
    count = 0
    pos = s.find(substr,0)
    while not pos == -1:
        count = count + 1
        pos = s.find(substr,pos+1)
    print substr, count

print "\nNow we would like to print raw records:"

ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
gb_record = ncbi_dict['124484046']
# Create new dictionary without parser

print '\n%s' %gb_record
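
Methods 1 and 2 in the listing above can report different counts for the same repeat. `str.count` counts non-overlapping matches, whereas the `find` loop advances one position at a time and so counts every start position, including overlapping ones. In the output reported below, this is why `TT` appears 39 times under method 1 and 48 times under method 2. The following is a minimal sketch of the difference on a toy string; the function names `count_nonoverlapping` and `count_overlapping` and the example sequence are illustrative and not part of the assignment code.

```python
def count_nonoverlapping(seq, motif):
    """Count matches that do not share characters (what str.count does)."""
    return seq.count(motif)


def count_overlapping(seq, motif):
    """Count every start position of motif, as the find loop in method 2 does."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        # Resume the search one character later, so matches may overlap.
        pos = seq.find(motif, pos + 1)
    return count


# Toy sequence with a single run of four T's (illustrative only).
example = "ACTTTTGA"
for k in range(1, 5):
    motif = "T" * k  # same string as ''.join(['T' for n in range(k)])
    print(motif + " " + str(count_nonoverlapping(example, motif))
          + " " + str(count_overlapping(example, motif)))
# T 4 4
# TT 2 3
# TTT 1 2
# TTTT 1 1
```

For a run of exactly four T's, the two methods agree on `T` and `TTTT` but differ on `TT` and `TTT`, which is the same pattern seen in the real output.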

And here is the output:

GenBank id: AM491363.1
total sequence length: 1496
protein length: 498
protein translation is:
PSMAFRVHSRNGKSYTFLISSDYERAEWRENIREQQKKCFRSFSLTSVELQMPTNSC
VKLQTVHSIPLTINKEDDESPGLYGFLNVIVHSATGFKQSSNLYCTLEVDSFGYFVN
KAKTRVYRDTAEPNWNEEFEIELEGSQTLRILCYEKCYNKTKIPKEDGESTDRLMGK
GQVQLDPQALQDRDWQRTVIAMNGIEVKLSVKFNSREFSLKRMPSRKQTGVLGVKIA
VVTKRERSKVPYIVRQCVEEIERRGMEEVGIYRVSGVATDIQALKAAFDVKALQRPV
ASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRV
LGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEHLLSSGING
SFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVHHHST
VADGLITTLHYPAPKRNKPSVYGVSPNYDKWEMERTDITMKH

method 1
T 290
TT 39
TTT 9
TTTT 3
TTTTT 0
TTTTTT 0
TTTTTTT 0
TTTTTTTT 0
TTTTTTTTT 0

method 2
T 290
TT 48
TTT 12
TTTT 3
TTTTT 0
TTTTTT 0
TTTTTTT 0
TTTTTTTT 0
TTTTTTTTT 0

Now we would like to print raw records:

LOCUS       AM491363                1496 bp    mRNA    linear   PRI 13-FEB-2007
DEFINITION  Homo sapiens partial mRNA for bcr-abl1 e19a2 chimeric protein.
ACCESSION   AM491363
VERSION     AM491363.1  GI:124484046
KEYWORDS    bcr-abl1 e19a2 chimeric protein; BCR-ABL1 gene.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1
  AUTHORS   Burmeister,T. and Reinhardt,R.
  TITLE     A multiplex PCR for improved detection of all known BCR-ABL fusion
            transcripts
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1496)
  AUTHORS   Burmeister,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-FEB-2007) Burmeister T., Medizinische Klinik III,
            Charite Universitaetsmedizin Berlin, CBF, Hindenburgdamm 30, 12200
            Berlin, GERMANY
FEATURES             Location/Qualifiers
     source          1..1496
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /cell_type="leukocyte"
                     /note="fusion of BCR exon 19 and ABL1 exon 2"
     source          1..835
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /map="22q11"
     source          836..1496
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /map="9q34"
     gene            <1..>1496
                     /gene="BCR-ABL1 e19a2"
     CDS             <1..>1496
                     /gene="BCR-ABL1 e19a2"
                     /function="tyrosine kinase, oncogene"
                     /codon_start=1
                     /product="bcr-abl1 e19a2 chimeric protein"
                     /protein_id="CAM33013.1"
                     /db_xref="GI:124484047"
                     /translation="PSMAFRVHSRNGKSYTFLISSDYERAEWRENIREQQKKCFRSFS
                     LTSVELQMPTNSCVKLQTVHSIPLTINKEDDESPGLYGFLNVIVHSATGFKQSSNLYC
                     TLEVDSFGYFVNKAKTRVYRDTAEPNWNEEFEIELEGSQTLRILCYEKCYNKTKIPKE
                     DGESTDRLMGKGQVQLDPQALQDRDWQRTVIAMNGIEVKLSVKFNSREFSLKRMPSRK
                     QTGVLGVKIAVVTKRERSKVPYIVRQCVEEIERRGMEEVGIYRVSGVATDIQALKAAF
                     DVKALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSI
                     TKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEHL
                     LSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAEL
                     VHHHSTVADGLITTLHYPAPKRNKPSVYGVSPNYDKWEMERTDITMKH"
     variation       158
                     /gene="BCR-ABL1 e19a2"
                     /note="T->C"
                     /replace="t"
     variation       667
                     /gene="BCR-ABL1 e19a2"
                     /note="C->T"
                     /replace="c"
     variation       1171
                     /gene="BCR-ABL1 e19a2"
                     /note="T->C"
                     /replace="t"
     variation       1426
                     /gene="BCR-ABL1 e19a2"
                     /note="A->T"
                     /replace="a"
ORIGIN
        1 cccagcatgg ccttcagggt gcacagccgc aacggcaaga gttacacgtt cctgatctcc
       61 tctgactatg agcgtgcaga gtggagggag aacatccggg agcagcagaa gaagtgtttc
      121 agaagcttct ccctgacatc cgtggagctg cagatgccga ccaactcgtg tgtgaaactc
      181 cagactgtcc acagcattcc gctgaccatc aataaggaag atgatgagtc tccggggctc
      241 tatgggtttc tgaatgtcat cgtccactca gccactggat ttaagcagag ttcaaatctg
      301 tactgcaccc tggaggtgga ttcctttggg tattttgtga ataaagcaaa gacgcgcgtc
      361 tacagggaca cagctgagcc aaactggaac gaggaatttg agatagagct ggagggctcc
      421 cagaccctga ggatactgtg ctatgaaaag tgttacaaca agacgaagat ccccaaggag
      481 gacggcgaga gcacggacag actcatgggg aagggccagg tccagctgga cccgcaggcc
      541 ctgcaggaca gagactggca gcgcaccgtc atcgccatga atgggatcga agtaaagctc
      601 tcggtcaagt tcaacagcag ggagttcagc ttgaagagga tgccgtcccg aaaacagaca
      661 ggggtcctcg gagtcaagat tgctgtggtc accaagagag agaggtccaa ggtgccctac
      721 atcgtgcgcc agtgcgtgga ggagatcgag cgccgaggca tggaggaggt gggcatctac
      781 cgcgtgtccg gtgtggccac ggacatccag gcactgaagg cagccttcga cgtcaaagcc
      841 cttcagcggc cagtagcatc tgactttgag cctcagggtc tgagtgaagc cgctcgttgg
      901 aactccaagg aaaaccttct cgctggaccc agtgaaaatg accccaacct tttcgttgca
      961 ctgtatgatt ttgtggccag tggagataac actctaagca taactaaagg tgaaaagctc
     1021 cgggtcttag gctataatca caatggggaa tggtgtgaag cccaaaccaa aaatggccaa
     1081 ggctgggtcc caagcaacta catcacgcca gtcaacagtc tggagaaaca ctcctggtac
     1141 catgggcctg tgtcccgcaa tgccgctgag catctgctga gcagcgggat caatggcagc
     1201 ttcttggtgc gtgagagtga gagcagtcct ggccagaggt ccatctcgct gagatacgaa
     1261 gggagggtgt accattacag gatcaacact gcttctgatg gcaagctcta cgtctcctcc
     1321 gagagccgct tcaacaccct ggccgagttg gttcatcatc attcaacggt ggccgacggg
     1381 ctcatcacca cgctccatta tccagcccca aagcgcaaca agccctctgt ctatggtgtg
     1441 tcccccaact acgacaagtg ggagatggaa cgcacggaca tcaccatgaa gcacaa
//
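
The raw record and the parsed results above can be cross-checked by hand. The sketch below rebuilds a sequence string from ORIGIN lines and translates it in frame 1 with the standard genetic code, ignoring any trailing bases that do not fill a codon. It is a simplified stand-in written in plain Python, not Biopython's `translate`; the names `origin_to_sequence` and `translate_standard` are hypothetical, and only the first ORIGIN line of the record is used as input.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {}
index = 0
for b1 in BASES:
    for b2 in BASES:
        for b3 in BASES:
            CODON_TABLE[b1 + b2 + b3] = AMINO_ACIDS[index]
            index += 1


def origin_to_sequence(origin_lines):
    """Join ORIGIN lines, dropping the leading position number and the spaces."""
    parts = []
    for line in origin_lines:
        fields = line.split()
        parts.extend(fields[1:])  # fields[0] is the base position, e.g. "61"
    return "".join(parts).upper()


def translate_standard(dna):
    """Translate frame 1; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    protein = []
    for start in range(0, usable, 3):
        # Codons with ambiguous bases are reported as "X" in this sketch.
        protein.append(CODON_TABLE.get(dna[start:start + 3], "X"))
    return "".join(protein)


# First ORIGIN line of the record above (60 bases, 20 codons).
origin_lines = [
    "1 cccagcatgg ccttcagggt gcacagccgc aacggcaaga gttacacgtt cctgatctcc",
]
seq = origin_to_sequence(origin_lines)
print(translate_standard(seq))  # PSMAFRVHSRNGKSYTFLIS
print(1496 // 3)                # 498
```

The 20 residues match the start of the `/translation` qualifier, and 1496 bases hold 498 complete codons with two bases left over, which agrees with the reported protein length of 498.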