Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-2-27
From OpenWetWare
Jump to navigationJump to search
Assignment Due 2/27
Code and Embedded Comments
#!/usr/bin/env python
# PART 1: import the GenBank tools necessary to complete this analysis
# To import the GenBank and Clustal Modules
import os
from Bio import GenBank, Seq, Clustalw
from Bio.Align import AlignInfo
from sets import Set
# To parse sequence data from the GenBank Entry
seq_parser = GenBank.FeatureParser()
# To interface to Genbank
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser = seq_parser)
# PART 2: load our ApoE as a reference string for sequence comparison
# Download ApoE's record
parsed_ref = ncbi_dict['178850']
# Extract the sequence and save as a string
ref = parsed_ref.seq.tostring()
print "ApoE has been loaded as the reference gene"
# PART 3: Input other sequences for comparison
# How many sequences to import?
x = int(raw_input("How many sequences would you like to upload for comparison? "))
# Now import them into a list...
comparison_seqs = []
for i in range(0,x):
g = int(raw_input("Please enter the GenBank ID for a sequence of interest "))
entry = ncbi_dict[str(g)]
entry_seq = entry.seq.tostring()
comparison_seqs.append(entry_seq)
#####################################
# As I revise this code, I think I may want to rethink conversion of GenBank ID to string
# to fasta file. As this website shows (http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html)
# I can probably go straight from GenBank to Fasta. I could probably write some algorithm
# that takes this list of GenBank IDs and processes them into Fasta files. With my existing code,
# the only revision I see necessary is the extraction of the ApoE record into string form, rather than
# a Fasta file.
# PART 4: Align the Sequences
# As I continue to revise this script, I will need to change the following line to tolerate new inputs
cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'Apoe.fasta'))
cline.set_output('test.aln')
alignment = Clustalw.do_alignment(cline)
# PART 5: Analyze and Print the results
# Determine insertions, deletions, and point mutations in the DNA Sequence
for i in range(alignment.get_alignment_length()):
column = alignment.get_column(i)
colset = Set()
for j in range(len(column)):
colset.add(column[j])
card = len(colset)
if card != 1:
if '-' in colset:
if '-' in column[0]:
print "Insertion at basepair ", i, column
else:
print "Deletion at basepair ", i, column
else:
print "Point Mutation at basepair ", i, column
# I'm not quite sure how I would design a code that would identify an insertion. An insertion means that you
# enter a relationship where one strand is lagging behind another one, but there is still sequence conservation
# between the two sequences. I'm pretty sure that, in the case of insertion, our reference strand would show a
# deletion compared to the sequence of interest. So, to this end, if '-' were the first character inserted into
# our set, we would know that we'd have an insertion.
# Also, I have not had the opportunity to learn translation quite yet, so the issue of translational consequences
# is not addressed in my code. I will plan to tackle this at the tutorial on Wednesday. The other issue that
# remains unresolved is the establishment of a reading frame and length. This issue as well will be resolved
# Wednesday.