
Assignment PDF

  • Download Assignment 1 PDF


Write your code so that it can accept any input file that has the following structure:
  • Please plan to submit one .py file containing the code for both question 1 and question 2, named as For example, for the first assignment, my file would be called
  • Your code should create two output files, one for question 1, called output1.txt, and one for question 2, called output2.txt.
  • NEW! output1.txt should contain only the DNA sequence as a single string.
  • NEW! output2.txt should contain one ORF per line, and nothing else.
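
As a rough sketch of the two output requirements above, the snippet below writes a DNA string and a list of ORFs to output1.txt and output2.txt. The function name `write_outputs` and the `out_dir` parameter are hypothetical, chosen only for illustration; how the sequence and the ORFs are computed is left to your own code.

```python
from pathlib import Path


def write_outputs(dna, orfs, out_dir="."):
    """Write the DNA string and the ORF list to the two required files.

    `write_outputs` and `out_dir` are illustrative names, not part of the
    assignment.
    """
    out = Path(out_dir)
    # output1.txt: only the DNA sequence, as a single string.
    (out / "output1.txt").write_text(dna)
    # output2.txt: one ORF per line, and nothing else.
    (out / "output2.txt").write_text("\n".join(orfs))
```

Calling `write_outputs(dna, orfs)` with no third argument writes both files to the current working directory.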

Submission instructions

  • Please email your .py file to by 5pm Thursday. As stated above, this file will contain the code for question 1 and question 2. You do not need to send the input or output files since we will run your code with our Parts.txt file as the input. (Note capital P in Parts.txt!)
  • On a paper copy of the pset PDF, please handwrite your answers to question 0 as well as the answer to this question: What will this composite part do when placed inside a living bacterium? Please place this paper in the box outside Drew Endy's office (68-564) by 5pm Thursday. This box will be available starting Wednesday at 5pm.
  • Late psets will NOT be accepted.

Questions and Clarifications

  • Note that the stop codon TAA must be in frame, i.e. a multiple of 3 basepairs away from the ATG. For example, ATGxxxxxxTAA would be in frame, but ATGxxxxxTAA would not be. (x is any basepair)
  • Is it significant that the barcode is CAPS and the other parts are lower case?
    • NO/no.
  • Can an ORF be any length over 50, or should its length be a multiple of some small integer?
    • An ORF's length should be a multiple of three, the number of base pairs in a codon.
  • Does the ORF include the start ATG and stop TAA? Suppose the DNA string is "ATG...TAA": is the ORF "..." or "ATG..." or "ATG...TAA" or "...TAA"?
    • The ORF includes the "start" ATG and "stop" TAA.
  • Can ORFs overlap? Suppose the DNA string is "ATG...TAAxxxTAA". The first ORF is obviously (modulo previous question) "ATG...TAA". Is "ATG...TAAxxxTAA" also an ORF? It meets the specification of "a string starting with ATG and ending with TAA". One could imagine a similar situation with overlapping starting tags: "ATG...ATGxxxTAA" might have both "ATG...ATGxxxTAA" and "ATGxxxTAA".
    • Yes, ORFs can overlap.
    • Although "ATG...TAAxxxTAA" has a small chance of occurring in biology, for the purposes of this programming assignment, please end ORFs at the first in-frame TAA.
  • For Q2, ATG...TAA...TAA isn't an ORF, but what if ATG...TAA is less than 50 bp and ATG...TAA...TAA is >50bp?
    • Still not an ORF (assuming the TAA's are in frame). The >50bp is something humans have used as a qualifier to weed out things that are not ORFs, since we've observed that ORFs are usually >50bp. The biology of translation will still see TAA as a stop codon and stop translation at the first TAA, making the sequence less than 50bp.
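
The clarifications above can be sketched in code as follows. This is one possible reading, not the official solution: the function names and the `min_length` parameter are hypothetical, and treating "over 50" as a total length strictly greater than 50 bp, counting the ATG and the TAA, is an assumption.

```python
def first_in_frame_stop(dna, start):
    """Index of the first TAA in frame with the ATG at `start`, or -1 if none.

    Expects `dna` with its codons in upper case.
    """
    # Step one codon (3 bp) at a time so each codon read shares the ATG's frame.
    # Stopping at len(dna) - 2 keeps every 3-bp slice inside the string.
    for i in range(start + 3, len(dna) - 2, 3):
        if dna[i:i + 3] == "TAA":
            return i
    return -1


def find_orfs(dna, min_length=50):
    """Return every ORF longer than `min_length` bp, ATG and TAA included."""
    dna = dna.upper()  # upper vs. lower case is not significant
    orfs = []
    start = dna.find("ATG")
    while start != -1:
        stop = first_in_frame_stop(dna, start)
        if stop != -1:
            orf = dna[start:stop + 3]
            # The ORF always ends at the first in-frame TAA, so a short
            # ATG...TAA is rejected even if a later TAA is also in frame.
            if len(orf) > min_length:
                orfs.append(orf)
        # Resume one base later so overlapping ORFs are found too.
        start = dna.find("ATG", start + 1)
    return orfs


print(first_in_frame_stop("ATGxxxxxxTAA", 0))  # 9: the TAA is in frame
print(first_in_frame_stop("ATGxxxxxTAA", 0))   # -1: the TAA is out of frame
```

Scanning codon by codon from each ATG, rather than searching for the nearest TAA, is what lets the code skip an out-of-frame TAA and keep looking.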

Solutions and General Comments

  • Answers to Q0:
    • False, because == checks for equality
    • 4
    • True, because we previously set a equal to b, so now a and b are equal
    • 1 2 3 4 (each of these will be on a new line)
    • 2 3 4 5 (each of these will be on a new line)

  • Answer to written Q1 section:
    • Under certain conditions, the bacteria will express the mRFP protein encoded by the ORF. (-2pts for not noting that the protein made is mRFP)

  • One possible way of writing the code for assignment 1 is provided here: File:Q1q2code.txt
    • Common mistakes for Q1: not printing the concatenated sequence to the screen or output file as the question asked (-5pts), naming files incorrectly (-5pts), not concatenating by calling the keys in the dictionary (you would get the incorrect answer for any other input file; -10pts)
    • This test file was used to grade Q1: File:Q1tester.txt. Use this txt file as the file you read in for Q1 and compare your answer with the answer found with the Q1q2code.txt provided above.
    • Common mistakes for Q2: not checking for ATGs whose first in-frame TAA is <50bp away and excluding those sequences from your list of ORFs (-3pts), not stopping at the first in-frame TAA (-4pts), finding the first TAA for each ATG and not continuing to look for another TAA if the first is not in-frame, index out of range (-2pts), not printing ORFs to output file correctly (-2pts)
      • This test sequence was used to grade Q2: taaatgxxatgxxxatgxxxxxxxxxxxxxxxxxxxxxxxxxtaaxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxtaaxxtaataa
      • Use this sequence as the input to Q2 (basically, overwrite your concatenated sequence with this one) and compare the ORFs you find with your code to the ORFs found with the Q1q2code.txt code provided above.
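
One way to do the comparison described above is to diff the two ORF lists as sets. The sketch below is only an illustration: `compare_orfs` is a hypothetical helper, and the sample lists stand in for the lines of your own output2.txt and of the output produced by the provided code.

```python
def compare_orfs(mine, reference):
    """Return (missing, extra): ORFs only in `reference`, ORFs only in `mine`."""
    # Normalise case and drop blank lines, since case is not significant.
    mine_set = {orf.strip().upper() for orf in mine if orf.strip()}
    ref_set = {orf.strip().upper() for orf in reference if orf.strip()}
    return sorted(ref_set - mine_set), sorted(mine_set - ref_set)


# Placeholder lists; in practice, read each one from an output file.
missing, extra = compare_orfs(["atgccctaa\n"], ["ATGCCCTAA", "ATGGGGTAA"])
print("missing:", missing)  # ORFs your code failed to report
print("extra:", extra)      # ORFs your code reported but should not have
```

An empty `missing` list and an empty `extra` list mean the two sets of ORFs agree.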

  • If you still have questions after reviewing these solutions and trying out the test sequence, please email the TAs.