Harvard:Biophysics 101/2007/Notebook:CChi/2007-2-27

Assignments due 2/20 and 2/27
This script compares an input sequences to a reference sequence and reports:

ORFs Deletions and Insertions Mutations
 * gives start and end index in terms of reference sequence
 * does not give nested ORFs
 * index and nucleotide of deletion or insertion
 * whether deletion or insertion caused a frameshift
 * index and nucleotide of mutation
 * amino acids involved
 * whether mutation was silent or not

Problems I ran into (and how I may have attempted to solve them):
 * insertions make the reference sequence different in the .aln output. This causes trouble when comparing pretty much any codon back to the reference, so I had to create a separate, running consensus sequence to reference.
 * I did not like the idea of a nested ORF, so I didn't use them.
 * indexes given for deletions, insertions, and mutations represent the index of the input sequence, not the representative index from the reference sequence.
 * have not yet addressed the specifics of the amino acid changes (hydro philic/phobic, etc.)

Code

 * 1) !/usr/bin/env python

import os from Bio import Clustalw, GenBank, Seq from Bio.Align import AlignInfo from sets import Set from Bio.Seq import Seq,translate

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'Apoe2.fasta')) cline.set_output('test.aln') alignment = Clustalw.do_alignment(cline)

deletions = [] # list of deletion indices insertions = [] # list of insertion indices mutations = [] # list of mutation indices mutcodon = ''  # temp codon with mutation of interest mutconsAA = [] # consensus amino acid where mutation of interest occurred mutAA = []     # amino acid after mutation ORFstarts = [] # list of ORF start sites ORFends = []   # list of ORF end sites ORF = 0        # boolean for whether we are currently in an ORF mut = 0        # boolean for whether the current codon contains a mutation codon = ''     # temp codon variable consensus = [] # consensus sequence without the pain of insertion '-'s nucwithindex = []# nested list within consensus list [index, nucleotide]

for i in range(alignment.get_alignment_length): col = alignment.get_column(i) s = Set # create a new set for c in range(len(col)): s.add(col[c]) # add each column element to the set if col[0] != '-': # add to consensus whenever there's no insertion nucwithindex = [i, col[0]] consensus.append(nucwithindex) if len(s) > 1: # multiple elements in s indicate a mismatch if col[1] == '-': # deletions deletions.append(i) else: if col[0] == '-': # insertions insertions.append(i) else: # mutations mutations.append(i) if ORF == 1 and mut != 1: # if there are 2 mutations in one codon, we only need to consider 1 codon unit mut = 1 else: # append dummy '-' to amino acids to avoid later confusion with indices mutAA.append('-') mutconsAA.append('-') # When not in ORF, we scan triplets for ATG, when in ORF, we look at codons if ORF == 0 and i>1 and col[0] != '-': # look for ORF's       # attempt to look beyond insertions at the consensus if len(consensus) >=2: codon = consensus[len(consensus)-3][1] + consensus[len(consensus)-2][1] + col[0] if codon=='ATG': ORFstarts.append(consensus[len(consensus)-3][0]) ORF = 1 codon = '' print "ORF start", consensus[len(consensus)-3][0] else: # ORF's       if len(codon) != 3: if col[0] != '-': codon = codon + col[0] else: if mut == 1: # add consensus and mutated amino acids to list mutcodon = alignment.get_column(i-1)[1] + alignment.get_column(i-2)[1] + alignment.get_column(i-1)[1] mutAA.append(translate(mutcodon)) mutconsAA.append(translate(codon)) mut = 0 if codon=='TAG' or codon=='TGA' or codon=='TAA': # end ORF ORFends.append(consensus[len(consensus)-3][0]) ORF = 0 print "ORF end", consensus[len(consensus)-3][0] codon = '' inORF = 0 # boolean for whether it's in ORF to help with printing

print "\nDELETIONS" for i in range(len(deletions)): col = alignment.get_column(deletions[i]) for j in range(len(ORFends)): # check whether the deletion was in an ORF if deletions[i] <= ORFends[j] and deletions[i] >= ORFstarts[j]: print "Index: ", deletions[i], col[0], "Frameshift!" inORF = 1 if inORF == 0: print "Index: ", deletions[i], col[0], "deletion in intron" inORF = 0

print "\nINSERTIONS" for i in range(len(insertions)): col = alignment.get_column(insertions[i]) for j in range(len(ORFends)): # check whether the insertion was in an ORF if insertions[i] <= ORFends[j] and insertions[i] >= ORFstarts[j]: print "Index: ", insertions[i], col[1], "Frameshift!" inORF = 1 if inORF == 0: print "Index: ", insertions[i], col[1], "insertion in intron" inORF = 0 print "\nMUTATIONS" for i in range(len(mutations)): col = alignment.get_column(mutations[i]) if mutAA[i] == '-': print "Index: ", mutations[i], col[0], "to", col[1] else: if mutAA[i] == mutconsAA[i]: # check whether mutation was silent print "Index: ", mutations[i], col[0], "to", col[1], "silent mutation, amino acid remains", mutAA[i] else: print "Index: ", mutations[i], col[0], "to", col[1], "changed amino acid", mutconsAA[i], "to", mutAA[i]

Output
ORF start 61 ORF end 1014 ORF start 1034

DELETIONS Index: 17 A deletion in intron Index: 236 A Frameshift! Index: 1108 C deletion in intron

INSERTIONS Index: 34 T insertion in intron Index: 380 G Frameshift! Index: 1093 A insertion in intron

MUTATIONS Index: 534 A to C Index:  535 C to A changed amino acid H to T Index:  537 T to C Index:  538 G to T changed amino acid C to L Index:  549 G to A changed amino acid V to I Index:  660 G to A changed amino acid W to G Index:  797 C to T silent mutation, amino acid remains R Index:  804 A to C changed amino acid W to G Index:  1077 T to A changed amino acid C to R