Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-2-27

Code and Embedded Comments

 * 1) !/usr/bin/env python


 * 1) PART 1: import the GenBank tools necessary to complete this analysis

import os from Bio import GenBank, Seq, Clustalw from Bio.Align import AlignInfo from sets import Set
 * 1) To import the GenBank and Clustal Modules

seq_parser = GenBank.FeatureParser
 * 1) To parse sequence data from the GenBank Entry

ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = seq_parser)
 * 1) To interface to Genbank


 * 1) PART 2: load our ApoE as a reference string for sequence comparison

parsed_ref = ncbi_dict['178850']
 * 1) Download ApoE's record

ref = parsed_ref.seq.tostring print "ApoE has been loaded as the reference gene"
 * 1) Extract the sequence and save as a string

x = int(raw_input("How many sequences would you like to upload for comparison? ")) comparison_seqs = [] for i in range(0,x): g = int(raw_input("Please enter the GenBank ID for a sequence of interest ")) entry = ncbi_dict[str(g)] entry_seq = entry.seq.tostring comparison_seqs.append(entry_seq)
 * 1) PART 3: Input other sequences for comparison
 * 2) How many sequences to import?
 * 1) Now import them into a list...


 * 1) As I revise this code, I think I may want to rethink conversion of GenBank ID to string
 * 2) to fasta file.  As this website shows (http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html)
 * 3) I can probably go straight from GenBank to Fasta.  I could probably write some algorithm
 * 4) that takes this list of GenBank IDs and processes them into Fasta files.  With my existing code,
 * 5) the only revision I see necessary is the extraction of the ApoE record into string form, rather than
 * 6) a Fasta file.
 * 1) a Fasta file.

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'Apoe.fasta')) cline.set_output('test.aln') alignment = Clustalw.do_alignment(cline)
 * 1) PART 4: Align the Sequences
 * 2) As I continue to revise this script, I will need to change the following line to tolerate new inputs

for i in range(alignment.get_alignment_length): column = alignment.get_column(i) colset = Set for j in range(len(column)): colset.add(column[j]) card = len(colset) if card != 1: if '-' in colset: if '-' in column[0]: print "Insertion at basepair ", i, column else: print "Deletion at basepair ", i, column else: print "Point Mutation at basepair ", i, column
 * 1) PART 5: Analyze and Print the results
 * 2) Determine insertions, deletions, and point mutations in the DNA Sequence


 * 1) I'm not quite sure how I would design a code that would identify an insertion.  An insertion means that you
 * 2) enter a relationship where one strand is lagging behind another one, but there is still sequence conservation
 * 3) between the two sequences.  I'm pretty sure that, in the case of insertion, our reference strand would show a
 * 4) deletion compared to the sequence of interest.  So, to this end, if '-' were the first character inserted into
 * 5) our set, we would know that we'd have an insertion.


 * 1) Also, I have not had the opportunity to learn translation quite yet, so the issue of translational consequences
 * 2) is not addressed in my code.  I will plan to tackle this at the tutorial on Wednesday.  The other issue that
 * 3) remains unresolved is the establishment of a reading frame and length.  This issue as well will be resolved
 * 4) Wednesday.