Harvard:Biophysics 101/2007/Notebook:Kaull/2007-2-6

From OpenWetWare

Programming Assignment, due 2/6/07

Code:

#!/usr/bin/env python

from Bio import GenBank, Seq
# We can create a GenBank object that will parse a raw record
# This facilitates extracting specific information from the sequences

record_parser = GenBank.FeatureParser()

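# NCBIDictionary is a dictionary-like interface to GenBank: indexing it
# by ID downloads that record. (It belongs to older Biopython releases;
# later releases dropped it in favor of Bio.Entrez.)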
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = record_parser)

##########
# Task 1: Use different GenBank ID

# If you pass NCBIDictionary a GenBank id, it will download that record
parsed_record = ncbi_dict['116496648']

print "GenBank id:", parsed_record.id

# Extract the sequence from the parsed_record
s = parsed_record.seq.tostring()
print "total sequence length:", len(s)
max_repeat = 9

##########
# Task 2: Count poly-T sequences

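# Method 1: str.count() counts non-overlapping occurrences, so a run
# of four T's contributes only two matches of "TT".
print "method 1"
for i in range(max_repeat):
    substr = 'T' * (i + 1)
    print substr, s.count(substr)

# Method 2: resume the search one base past each hit, so overlapping
# occurrences are counted too (a run of four T's gives three "TT").
print "\nmethod 2"
for i in range(max_repeat):
    substr = 'T' * (i + 1)
    count = 0
    pos = s.find(substr)
    while pos != -1:
        count += 1
        pos = s.find(substr, pos + 1)
    print substr, count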
    
    
##########
# Task 3: Translate to protein

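from Bio.Seq import translate

# translate() reads codons from base 1 to the end of the mRNA, so the
# result covers the untranslated regions and uses a frame that differs
# from the annotated CDS (108..3278); '*' marks a stop codon.
my_protein = translate(s)
my_protein_length = len(my_protein)

print '\n', "Translation to protein: "

# Print 50 residues per line; a slice that runs past the end is truncated
count = 0
while count < my_protein_length:
    print my_protein[count:count + 50]
    count += 50

print "Length: ", my_protein_length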


##########
# Task 4: Create raw record

# With no parser argument, NCBIDictionary returns the record as raw text
ncbi_dict_raw = GenBank.NCBIDictionary('nucleotide', 'genbank')
raw_record = ncbi_dict_raw['116496648']
print '\n', raw_record
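
Note on Task 2: the two counting methods disagree for runs of two or more T's because str.count() skips past each match, while the find() loop moves forward only one base at a time. The sketch below reproduces the difference on a made-up toy sequence. It is written for Python 3 (the script above is Python 2), and the helper name count_overlapping is hypothetical, not part of Biopython.

```python
def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing matches to overlap."""
    count = 0
    pos = seq.find(sub)
    while pos != -1:
        count += 1
        # Resume one position past the last hit so overlaps are found
        pos = seq.find(sub, pos + 1)
    return count


toy = "ATTTTGTTA"  # made-up sequence, not taken from the GenBank record

for n in range(1, 5):
    sub = "T" * n
    # Columns: pattern, non-overlapping count, overlapping count
    print(sub, toy.count(sub), count_overlapping(toy, sub))
```

For single T's the two columns always agree, which matches the identical "T 1088" lines in both methods.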

Output of the assignment script:

GenBank id: BC126211.1
total sequence length: 4032
method 1
T 1088
TT 251
TTT 71
TTTT 30
TTTTT 12
TTTTTT 2
TTTTTTT 1
TTTTTTTT 1
TTTTTTTTT 0

method 2
T 1088
TT 333
TTT 114
TTTT 45
TTTTT 16
TTTTTT 4
TTTTTTT 2
TTTTTTTT 1
TTTTTTTTT 0

Translation to protein:  
AREYQGDSGPCRPPSPSAPHSAQVRGRALIFWRGPSWRRSQIRLRRRKRR
RGRTSRWW*DADHLIWQSGKLAPIQ**NVILYEKKLVYELEDWLTRAQGK
HTLLIWCLEHLLNRLMFTEVLFVQFWMKLLWAIIALSLRMAKLALEKLLQ
WKVKGHLMKSIPGKRIPWLV*FHVPFIKFLRNLLIMVLNFQSKCLCWRSI
MKSFLIFLIHHLMFLRDYRCLMIPVTREE**LKV*KKLQYTTRMKSIKF*
KRGQQKGQLQLL**MHTLVVPTQFSLLQYI*KKLRLMEKSLLKSES*TWL
ILQEVKTLAVLELLIRELGKLEI*INPC*LWEGSLLPL*KEHLMFLIENL
N*LESSRILLEGVQEHL*LQQFLLHLSILRKL*VHWNMLIEQRTY*ISLK
*IRNSPKKLLLRSIRRR*NV*NEILLQPVRKMECIFLKKILES*VEN*LF
KKSRL*N*LKKLVLLRRS*IGLQSCLWIIKMNLTSVNLTCKIKHKNLKPL
KNICKKLNYNLLKKNISHQLWKVLRRNFMMLPASCLTQLKKLQKMYLVSI
PNWIVRRQLTNTMQKLRIFLAKT*IVCLIIWKN*LRMAAQSKRPC*KYIR
PYLVICCLPVSLH*IPLLQ*HLDLSHLFQKMCLLMFLRFLI*Y*KNNH*Q
QKVKLYYRN*LMYSRLIF*VHWK*FYPQLWCLY*KSIVN*SIFSRLH*QW
PIR*KIKKRN*MAFSVYCVTIYMNYKKIPFVPWLSHKSNVET*LKT*RQ*
SRPIPRNFAS**IFGQRDSVLWRKSVKIYRNHLVVSRKIYSRNLRI*STK
*LFTVKNFVLILMASHRNSEILTKKVQNWLKNL*NTLINSMATWKKYLKR
LNRDVNL*TQEQFIFLNSGYLP*MKGNRNFTTYWRL*ANVVRLQVQTSLR
NQMDVRQLMRNSITFFLIR*LLMKIN**HKI*NLMKP*KLV*LSLIAFWN
RI*NWISQQVRHHRGKVIYTHQHW*ELNHVNISLIS*KGNSLSC**C*TV
QKTTKKRQFRMWM*KRQFWGSILKNL*VKSHL*MLVWIVHQLAGFHFSSI
KNHMEKTKKTEALTHWRGLKWKKLQSTWLQRADYLCEPRSTFNSLGGWQF
YF*RKLKNKT*NPRT*ALCIDFKRIYISAGRGGSCL*SQHFGRLRRVDCL
SPGV*DQPGQRGKTSSLLKISRAWWHTPVIPATGEAEARESLEPRKRGCS
EPKVHHYTPAWATEQDSVSKTKFKKDIRQYCKFS*ILISTHFSVIPIVHF
VLNWVSFGICNVNTYF*FSYKVVLL*QMKSIFLVYY*VMNI*ELYSSQLE
LT*VNITNICP*KGPSHVFFLAMTCVFSCLLPRLPYFAFSSAHF
Length:  1344

LOCUS       BC126211                4032 bp    mRNA    linear   PRI 23-OCT-2006
DEFINITION  Homo sapiens kinesin family member 11, mRNA (cDNA clone MGC:161489
            IMAGE:8991927), complete cds.
ACCESSION   BC126211
VERSION     BC126211.1  GI:116496648
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4032)
  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
            Klausner,R.D., Collins,F.S., Wagner,L., Shenmen,C.M., Schuler,G.D.,
            Altschul,S.F., Zeeberg,B., Buetow,K.H., Schaefer,C.F., Bhat,N.K.,
            Hopkins,R.F., Jordan,H., Moore,T., Max,S.I., Wang,J., Hsieh,F.,
            Diatchenko,L., Marusina,K., Farmer,A.A., Rubin,G.M., Hong,L.,
            Stapleton,M., Soares,M.B., Bonaldo,M.F., Casavant,T.L.,
            Scheetz,T.E., Brownstein,M.J., Usdin,T.B., Toshiyuki,S.,
            Carninci,P., Prange,C., Raha,S.S., Loquellano,N.A., Peters,G.J.,
            Abramson,R.D., Mullahy,S.J., Bosak,S.A., McEwan,P.J.,
            McKernan,K.J., Malek,J.A., Gunaratne,P.H., Richards,S.,
            Worley,K.C., Hale,S., Garcia,A.M., Gay,L.J., Hulyk,S.W.,
            Villalon,D.K., Muzny,D.M., Sodergren,E.J., Lu,X., Gibbs,R.A.,
            Fahey,J., Helton,E., Ketteman,M., Madan,A., Rodrigues,S.,
            Sanchez,A., Whiting,M., Madan,A., Young,A.C., Shevchenko,Y.,
            Bouffard,G.G., Blakesley,R.W., Touchman,J.W., Green,E.D.,
            Dickson,M.C., Rodriguez,A.C., Grimwood,J., Schmutz,J., Myers,R.M.,
            Butterfield,Y.S., Krzywinski,M.I., Skalska,U., Smailus,D.E.,
            Schnerch,A., Schein,J.E., Jones,S.J. and Marra,M.A.
  CONSRTM   Mammalian Gene Collection Program Team
  TITLE     Generation and initial analysis of more than 15,000 full-length
            human and mouse cDNA sequences
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
   PUBMED   12477932
REFERENCE   2  (bases 1 to 4032)
  CONSRTM   NIH MGC Project
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-2006) National Institutes of Health, Mammalian
            Gene Collection (MGC), Bethesda, MD 20892-2590, USA
  REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT     Contact: MGC help desk
            Email: cgapbs-r@mail.nih.gov
            Tissue Procurement: Mike Brownstein, NIMH
            cDNA Library Preparation: British Columbia Cancer Research Center
            cDNA Library Arrayed by: The I.M.A.G.E. Consortium (LLNL)
            DNA Sequencing by: Genome Sequence Centre,
            BC Cancer Agency, Vancouver, BC, Canada
            info@bcgsc.bc.ca
            Martin Hirst, Thomas Zeng, Ryan Morin, Michelle Moksa, Johnson
            Pang, Diana Mah, Jing Wang, Kieth Fichter, Eric Chuah, Allen
            Delaney, Rob Kirkpatrick, Agnes Baross, Sarah Barber, Mabel
            Brown-John, Steve S. Chand, William Chow, Ryan Babakaiff, Dave
            Wong, Corey Matsuo, Jaclyn Beland, Susan Gibson, Luis delRio, Ruth
            Featherstone, Malachi Griffith, Obi Griffith, Ran Guin, Nancy Liao,
            Kim MacDonald,  Mike R. Mayo, Josh Moran, Diana Palmquist, JR
            Santos, Duane Smailus, Jeff Stott, Miranda Tsai, George Yang,
            Jacquie Schein, Asim Siddiqui,Steven Jones, Rob Holt, Marco Marra.
            
            Clone distribution: MGC clone distribution information can be found
            through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov
            Series: IRCB Plate: 7 Row: E Column: 13.
            
            Differences found between this sequence and the human reference
            genome (build 36) are described in misc_difference features below
            and these differences were also compared to chimpanzee genome
            (build 2).
FEATURES             Location/Qualifiers
     source          1..4032
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /clone="MGC:161489 IMAGE:8991927"
                     /tissue_type="Lung, PCR rescued clones"
                     /clone_lib="NIH_MGC_300"
                     /note="Vector: pCR-XL-TOPO; Clone identification sequence
                     tag: CTCCCGCT"
     gene            1..4032
                     /gene="KIF11"
                     /note="synonyms: EG5, HKSP, TRIP5"
                     /db_xref="GeneID:3832"
                     /db_xref="HGNC:6388"
                     /db_xref="MIM:148760"
     misc_difference 10
                     /gene="KIF11"
                     /note="'T' in cDNA is 'G' in the human genome.  The
                     chimpanzee genome agrees with the human genomic sequence
                     and not the cDNA."
     misc_difference 10^11
                     /gene="KIF11"
                     /note="1 base in the human genome, G, is not found in
                     cDNA.  The chimpanzee genome agrees with the human genomic
                     sequence and not the cDNA."
     CDS             108..3278
                     /gene="KIF11"
                     /codon_start=1
                     /product="kinesin family member 11"
                     /protein_id="AAI26212.1"
                     /db_xref="GI:116496649"
                     /db_xref="GeneID:3832"
                     /db_xref="HGNC:6388"
                     /db_xref="MIM:148760"
                     /translation="MASQPNSSAKKKEEKGKNIQVVVRCRPFNLAERKASAHSIVECD
                     PVRKEVSVRTGGLADKSSRKTYTFDMVFGASTKQIDVYRSVVCPILDEVIMGYNCTIF
                     AYGQTGTGKTFTMEGERSPNEEYTWEEDPLAGIIPRTLHQIFEKLTDNGTEFSVKVSL
                     LEIYNEELFDLLNPSSDVSERLQMFDDPRNKRGVIIKGLEEITVHNKDEVYQILEKGA
                     AKRTTAATLMNAYSSRSHSVFSVTIHMKETTIDGEELVKIGKLNLVDLAGSENIGRSG
                     AVDKRAREAGNINQSLLTLGRVITALVERTPHVPYRESKLTRILQDSLGGRTRTSIIA
                     TISPASLNLEETLSTLEYAHRAKNILNKPEVNQKLTKKALIKEYTEEIERLKRDLAAA
                     REKNGVYISEENFRVMSGKLTVQEEQIVELIEKIGAVEEELNRVTELFMDNKNELDQC
                     KSDLQNKTQELETTQKHLQETKLQLVKEEYITSALESTEEKLHDAASKLLNTVEETTK
                     DVSGLHSKLDRKKAVDQHNAEAQDIFGKNLNSLFNNMEELIKDGSSKQKAMLEVHKTL
                     FGNLLSSSVSALDTITTVALGSLTSIPENVSTHVSQIFNMILKEQSLAAESKTVLQEL
                     INVLKTDLLSSLEMILSPTVVSILKINSQLKHIFKTSLTVADKIEDQKKELDGFLSIL
                     CNNLHELQENTICSLVESQKQCGNLTEDLKTIKQTHSQELCKLMNLWTERFCALEEKC
                     ENIQKPLSSVQENIQQKSKDIVNKMTFHSQKFCADSDGFSQELRNFNQEGTKLVEESV
                     KHSDKLNGNLEKISQETEQRCESLNTRTVYFSEQWVSSLNEREQELHNLLEVVSQCCE
                     ASSSDITEKSDGRKAAHEKQHNIFLDQMTIDEDKLIAQNLELNETIKIGLTKLNCFLE
                     QDLKLDIPTGTTPQRKSYLYPSTLVRTEPREHLLDQLKRKQPELLMMLNCSENNKEET
                     IPDVDVEEAVLGQYTEEPLSQEPSVDAGVDCSSIGGVPFFQHKKSHGKDKENRGINTL
                     ERSKVEETTEHLVTKSRLPLRAQINL"
     misc_difference 3988
                     /gene="KIF11"
                     /note="'C' in cDNA is 'A' in the human genome."
ORIGIN      
        1 gcccgagagt accagggaga ctccggcccc tgtcggccgc caagcccctc cgcccctcac
       61 agcgcccagg tccgcggccg ggccttgatt ttttggcggg gaccgtcatg gcgtcgcagc
      121 caaattcgtc tgcgaagaag aaagaggaga aggggaagaa catccaggtg gtggtgagat
      181 gcagaccatt taatttggca gagcggaaag ctagcgccca ttcaatagta gaatgtgatc
      241 ctgtacgaaa agaagttagt gtacgaactg gaggattggc tgacaagagc tcaaggaaaa
      301 catacacttt tgatatggtg tttggagcat ctactaaaca gattgatgtt taccgaagtg
      361 ttgtttgtcc aattctggat gaagttatta tgggctataa ttgcactatc tttgcgtatg
      421 gccaaactgg cactggaaaa acttttacaa tggaaggtga aaggtcacct aatgaagagt
      481 atacctggga agaggatccc ttggctggta taattccacg tacccttcat caaatttttg
      541 agaaacttac tgataatggt actgaatttt cagtcaaagt gtctctgttg gagatctata
      601 atgaagagct ttttgatctt cttaatccat catctgatgt ttctgagaga ctacagatgt
      661 ttgatgatcc ccgtaacaag agaggagtga taattaaagg tttagaagaa attacagtac
      721 acaacaagga tgaagtctat caaattttag aaaagggggc agcaaaaagg acaactgcag
      781 ctactctgat gaatgcatac tctagtcgtt cccactcagt tttctctgtt acaatacata
      841 tgaaagaaac tacgattgat ggagaagagc ttgttaaaat cggaaagttg aacttggttg
      901 atcttgcagg aagtgaaaac attggccgtt ctggagctgt tgataagaga gctcgggaag
      961 ctggaaatat aaatcaatcc ctgttgactt tgggaagggt cattactgcc cttgtagaaa
     1021 gaacacctca tgttccttat cgagaatcta aactaactag aatcctccag gattctcttg
     1081 gagggcgtac aagaacatct ataattgcaa caatttctcc tgcatctctc aatcttgagg
     1141 aaactctgag tacattggaa tatgctcata gagcaaagaa catattgaat aagcctgaag
     1201 tgaatcagaa actcaccaaa aaagctctta ttaaggagta tacggaggag atagaacgtt
     1261 taaaacgaga tcttgctgca gcccgtgaga aaaatggagt gtatatttct gaagaaaatt
     1321 ttagagtcat gagtggaaaa ttaactgttc aagaagagca gattgtagaa ttgattgaaa
     1381 aaattggtgc tgttgaggag gagctgaata gggttacaga gttgtttatg gataataaaa
     1441 atgaacttga ccagtgtaaa tctgacctgc aaaataaaac acaagaactt gaaaccactc
     1501 aaaaacattt gcaagaaact aaattacaac ttgttaaaga agaatatatc acatcagctt
     1561 tggaaagtac tgaggagaaa cttcatgatg ctgccagcaa gctgcttaac acagttgaag
     1621 aaactacaaa agatgtatct ggtctccatt ccaaactgga tcgtaagaag gcagttgacc
     1681 aacacaatgc agaagctcag gatatttttg gcaaaaacct gaatagtctg tttaataata
     1741 tggaagaatt aattaaggat ggcagctcaa agcaaaaggc catgctagaa gtacataaga
     1801 ccttatttgg taatctgctg tcttccagtg tctctgcatt agataccatt actacagtag
     1861 cacttggatc tctcacatct attccagaaa atgtgtctac tcatgtttct cagattttta
     1921 atatgatact aaaagaacaa tcattagcag cagaaagtaa aactgtacta caggaattga
     1981 ttaatgtact caagactgat cttctaagtt cactggaaat gattttatcc ccaactgtgg
     2041 tgtctatact gaaaatcaat agtcaactaa agcatatttt caagacttca ttgacagtgg
     2101 ccgataagat agaagatcaa aaaaaggaac tagatggctt tctcagtata ctgtgtaaca
     2161 atctacatga actacaagaa aataccattt gttccttggt tgagtcacaa aagcaatgtg
     2221 gaaacctaac tgaagacctg aagacaataa agcagaccca ttcccaggaa ctttgcaagt
     2281 taatgaatct ttggacagag agattctgtg ctttggagga aaagtgtgaa aatatacaga
     2341 aaccacttag tagtgtccag gaaaatatac agcagaaatc taaggatata gtcaacaaaa
     2401 tgacttttca cagtcaaaaa ttttgtgctg attctgatgg cttctcacag gaactcagaa
     2461 attttaacca agaaggtaca aaattggttg aagaatctgt gaaacactct gataaactca
     2521 atggcaacct ggaaaaaata tctcaagaga ctgaacagag atgtgaatct ctgaacacaa
     2581 gaacagttta tttttctgaa cagtgggtat cttccttaaa tgaaagggaa caggaacttc
     2641 acaacttatt ggaggttgta agccaatgtt gtgaggcttc aagttcagac atcactgaga
     2701 aatcagatgg acgtaaggca gctcatgaga aacagcataa catttttctt gatcagatga
     2761 ctattgatga agataaattg atagcacaaa atctagaact taatgaaacc ataaaaattg
     2821 gtttgactaa gcttaattgc tttctggaac aggatctgaa actggatatc ccaacaggta
     2881 cgacaccaca gaggaaaagt tatttatacc catcaacact ggtaagaact gaaccacgtg
     2941 aacatctcct tgatcagctg aaaaggaaac agcctgagct gttaatgatg ctaaactgtt
     3001 cagaaaacaa caaagaagag acaattccgg atgtggatgt agaagaggca gttctggggc
     3061 agtatactga agaacctcta agtcaagagc catctgtaga tgctggtgtg gattgttcat
     3121 caattggcgg ggttccattt ttccagcata aaaaatcaca tggaaaagac aaagaaaaca
     3181 gaggcattaa cacactggag aggtctaaag tggaagaaac tacagagcac ttggttacaa
     3241 agagcagatt acctctgcga gcccagatca acctttaatt cacttggggg ttggcaattt
     3301 tatttttaaa gaaaacttaa aaataaaacc tgaaacccca gaacttgagc cttgtgtata
     3361 gattttaaaa gaatatatat atcagccggg cgcggtggct catgcctgta atcccagcac
     3421 tttgggaggc tgaggcgggt ggattgcttg agcccaggag tttgagacca gcctggccaa
     3481 cgtggcaaaa cctcgtctct gttaaaaatt agccgggcgt ggtggcacac tcctgtaatc
     3541 ccagctactg gggaggctga ggcacgagaa tcacttgaac ccaggaagcg gggttgcagt
     3601 gagccaaagg tacaccacta cactccagcc tgggcaacag agcaagactc ggtctcaaaa
     3661 acaaaattta aaaaagatat aaggcagtac tgtaaattca gttgaatttt gatatctacc
     3721 catttttctg tcatccctat agttcacttt gtattaaatt gggtttcatt tgggatttgc
     3781 aatgtaaata cgtatttcta gttttcatat aaagtagttc ttttataaca aatgaaaagt
     3841 atttttcttg tatattatta agtaatgaat atataagaac tgtactcttc tcagcttgag
     3901 cttacatagg taaatatcac caacatctgt ccttagaaag gaccatctca tgtttttttt
     3961 cttgctatga cttgtgtatt ttcttgcctc ctccctagac ttccctattt cgctttctcc
     4021 tcggctcact tt
//
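
Note on Task 3: the record annotates the coding sequence at 108..3278, whereas the Task 3 translation starts at base 1, which is a different reading frame and includes the untranslated regions. That is why the printed protein is full of stop codons (*). The sketch below, written for Python 3, translates the first 140 bases of the ORIGIN block both ways using a small hand-built standard codon table. The names CODON_TABLE, translate and first_140 are illustrative only; this is not Biopython's translate().

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons of an unambiguous DNA string (A, C, G, T only)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Bases 1..140 copied from the ORIGIN block of the record
first_140 = (
    "gcccgagagt accagggaga ctccggcccc tgtcggccgc caagcccctc cgcccctcac"
    "agcgcccagg tccgcggccg ggccttgatt ttttggcggg gaccgtcatg gcgtcgcagc"
    "caaattcgtc tgcgaagaag"
).replace(" ", "")

cds_start = 108  # 1-based start of the CDS feature in the record

print(translate(first_140[:30]))              # frame starting at base 1
print(translate(first_140[cds_start - 1:]))   # frame starting at the CDS
```

The first line reproduces the start of the Task 3 output (AREYQGDSGP), and the second matches the start of the /translation qualifier (MASQPNSSAKK). Under the same assumption, slicing the full sequence as s[107:3278] before translating should give the annotated protein followed by its stop codon.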