Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-4-17
Tasks due today
Quality Checking Script
Here is a script for detecting mismatches between our sequence and returned SNPs. As I sat down to write this script, I realized that we are missing something important: a parser which can extract sequence from dbSNP. This is too much for me to tackle on my own; we will construct this in class tomorrow. This code will change when we interface it with the rest of our script. We can make this more object oriented at such a point.
#!/usr/bin/env python # The purpose of this script is to detect mis-matches between our test sequence # and the returned SNPs from dbSNP. We will then inform the user if any SNPs # had additional mutations that should question the validity of the SNP return. # Once this initial construct is up and running, we will work to build in more # detailed mis-match analysis so that, as linkage disequalibrium studies yield # more information about when a SNP is significant, we can pass the information # along to our users. # Part 1: Condense test sequence and RS sequence to a single FASTA file from Bio import GenBank, Seq from Bio import formats # Assuming a list of RS id's and a test seqeunce as a FASTA. snp_len =  snp_parser = NEED SNP PARSER() # I have spoken to Xiaodi about how we can # construct a parser to handle this task snp_entry = GenBank.NCBIDictionary('nucelotide', 'genbank', parser = snp_parser) for i in rsid: parsed_snp = snp_entry(rsid[i]) formatter = FormatIO(parsed_snp, formats["genbank"], formats["fasta"]) formatter.convert(parsed_snp, formatted_snp) updateseq = open('example.fasta', 'a') updateseq.write(formatted_snp) updateseq.close() snp_len.append(len(formatted_snp)) # Part 2: Align the FASTA file using Clustal cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta')) cline.set_output('example.aln') alignment = Clustalw.do_alignment(cline) summary_align = AlignInfo.SummaryInfo(alignment) con = summary_align.dumb_consensus() # Part 3: Figure out where each SNP starts and stops relative to the test # sequence. This allows us to establish a start and default stop for # comparing individual SNPs with the reference. size_col = Set() size_col = summary_align.get_column(1) snp_count = len(size_col)-1 colset = Set() for i in range(snp_count): start = 0 while start == 0: for j in range(len(con)): col=summary_align.get_column(j) if col[i+1] =! '-': start = j # Part 4: Scan for and report mismatches in each SNP. Build in a function to tolerate # insertions/deletions. print "Scanning %n for mismatches with reference sequence", rsid[i] for j in range(start, start+snp_len[i]): col=summary_align.get_column if col == '-': print "Deletion at basepair ", j elif col[i+1] == '-': print "Insertion at basepair ", j elif col =! col[i+1]: print "Mismatch: ", col[i+1], j, col else: "No mismatches found in this SNP" print "Done screening SNPs for mismatches. Any mismatches may call into question" print "the validity of a returned SNP. Further studies in linkage disequilibrium" print "will offer more information."
Potential Mutations to Consider
- Additional Point Mutations
- Immediately Next to SNP
- In large blocks but not close to SNP
- Any Combination of these mutations
- SNPs in the presence of other SNPs