Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-4-17

Quality Checking Script
Here is a script for detecting mismatches between our sequence and returned SNPs. As I sat down to write this script, I realized that we are missing something important: a parser which can extract sequence from dbSNP. This is too much for me to tackle on my own; we will construct this in class tomorrow. This code will change when we interface it with the rest of our script. We can make this more object oriented at such a point.


 * 1) !/usr/bin/env python


 * 1) The purpose of this script is to detect mis-matches between our test sequence
 * 2) and the returned SNPs from dbSNP.  We will then inform the user if any SNPs
 * 3) had additional mutations that should question the validity of the SNP return.
 * 4) Once this initial construct is up and running, we will work to build in more
 * 5) detailed mis-match analysis so that, as linkage disequalibrium studies yield
 * 6) more information about when a SNP is significant, we can pass the information
 * 7) along to our users.


 * 1) Part 1: Condense test sequence and RS sequence to a single FASTA file

from Bio import GenBank, Seq from Bio import formats

# Assuming a list of RS id's and a test seqeunce as a FASTA.

snp_len = [] snp_parser = NEED SNP PARSER # I have spoken to Xiaodi about how we can # construct a parser to handle this task snp_entry = GenBank.NCBIDictionary('nucelotide', 'genbank', parser = snp_parser) for i in rsid: parsed_snp = snp_entry(rsid[i]) formatter = FormatIO(parsed_snp, formats["genbank"], formats["fasta"]) formatter.convert(parsed_snp, formatted_snp) updateseq = open('example.fasta', 'a') updateseq.write(formatted_snp) updateseq.close snp_len.append(len(formatted_snp))


 * 1) Part 2: Align the FASTA file using Clustal

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta')) cline.set_output('example.aln') alignment = Clustalw.do_alignment(cline) summary_align = AlignInfo.SummaryInfo(alignment) con = summary_align.dumb_consensus


 * 1) Part 3: Figure out where each SNP starts and stops relative to the test
 * 2)         sequence.  This allows us to establish a start and default stop for
 * 3)         comparing individual SNPs with the reference.

size_col = Set size_col = summary_align.get_column(1) snp_count = len(size_col)-1 colset = Set for i in range(snp_count): start = 0 while start == 0: for j in range(len(con)): col=summary_align.get_column(j) if col[i+1] =! '-':               start = j


 * 1) Part 4: Scan for and report mismatches in each SNP. Build in a function to tolerate
 * 2)         insertions/deletions.

print "Scanning %n for mismatches with reference sequence", rsid[i] for j in range(start, start+snp_len[i]): col=summary_align.get_column if col[0] == '-': print "Deletion at basepair ", j       elif col[i+1] == '-': print "Insertion at basepair ", j        elif col[0] =! col[i+1]: print "Mismatch: ", col[i+1], j, col[0] else: "No mismatches found in this SNP"

print "Done screening SNPs for mismatches. Any mismatches may call into question" print "the validity of a returned SNP. Further studies in linkage disequilibrium" print "will offer more information."

Potential Mutations to Consider

 * Insertions
 * Deletions
 * Truncations
 * Additional Point Mutations
 * Immediately Next to SNP
 * In large blocks but not close to SNP
 * Any Combination of these mutations
 * SNPs in the presence of other SNPs