Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-4-17
From OpenWetWare
Jump to navigationJump to search
Tasks due today
Quality Checking Script
Here is a script for detecting mismatches between our sequence and returned SNPs. As I sat down to write this script, I realized that we are missing something important: a parser which can extract sequence from dbSNP. This is too much for me to tackle on my own; we will construct this in class tomorrow. This code will change when we interface it with the rest of our script. We can make this more object oriented at such a point.
#!/usr/bin/env python
# The purpose of this script is to detect mis-matches between our test sequence
# and the returned SNPs from dbSNP. We will then inform the user if any SNPs
# had additional mutations that should question the validity of the SNP return.
# Once this initial construct is up and running, we will work to build in more
# detailed mis-match analysis so that, as linkage disequalibrium studies yield
# more information about when a SNP is significant, we can pass the information
# along to our users.
# Part 1: Condense test sequence and RS sequence to a single FASTA file
from Bio import GenBank, Seq
from Bio import formats
# Assuming a list of RS id's and a test seqeunce as a FASTA.
snp_len = []
snp_parser = NEED SNP PARSER() # I have spoken to Xiaodi about how we can
# construct a parser to handle this task
snp_entry = GenBank.NCBIDictionary('nucelotide', 'genbank', parser = snp_parser)
for i in rsid:
parsed_snp = snp_entry(rsid[i])
formatter = FormatIO(parsed_snp, formats["genbank"], formats["fasta"])
formatter.convert(parsed_snp, formatted_snp)
updateseq = open('example.fasta', 'a')
updateseq.write(formatted_snp)
updateseq.close()
snp_len.append(len(formatted_snp))
# Part 2: Align the FASTA file using Clustal
cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta'))
cline.set_output('example.aln')
alignment = Clustalw.do_alignment(cline)
summary_align = AlignInfo.SummaryInfo(alignment)
con = summary_align.dumb_consensus()
# Part 3: Figure out where each SNP starts and stops relative to the test
# sequence. This allows us to establish a start and default stop for
# comparing individual SNPs with the reference.
size_col = Set()
size_col = summary_align.get_column(1)
snp_count = len(size_col)-1
colset = Set()
for i in range(snp_count):
start = 0
while start == 0:
for j in range(len(con)):
col=summary_align.get_column(j)
if col[i+1] =! '-':
start = j
# Part 4: Scan for and report mismatches in each SNP. Build in a function to tolerate
# insertions/deletions.
print "Scanning %n for mismatches with reference sequence", rsid[i]
for j in range(start, start+snp_len[i]):
col=summary_align.get_column
if col[0] == '-': print "Deletion at basepair ", j
elif col[i+1] == '-': print "Insertion at basepair ", j
elif col[0] =! col[i+1]: print "Mismatch: ", col[i+1], j, col[0]
else: "No mismatches found in this SNP"
print "Done screening SNPs for mismatches. Any mismatches may call into question"
print "the validity of a returned SNP. Further studies in linkage disequilibrium"
print "will offer more information."
Potential Mutations to Consider
- Insertions
- Deletions
- Truncations
- Additional Point Mutations
- Immediately Next to SNP
- In large blocks but not close to SNP
- Any Combination of these mutations
- SNPs in the presence of other SNPs