Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-1
From OpenWetWare
Jump to navigationJump to search
Tasks Due Today
Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)
- This code should replace Xiaodi's section, which bears the same name.
# extracts snp data
def extract_snp_data(str):
dom = parseString(str)
variants = dom.getElementsByTagName("Hit")
if len(variants) == 0:
return
parsed = {}
for v in variants:
# now populate the struct
id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
parsed_seq = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
# only consider it a genuine snp if the hit score is above 100
if int(score) > 100:
parsed[id] = parsed_seq
return parsed
- Since this code changes the returned output from a list to a dict, we need to alter the rest of Xiaodi's code to handle this new input. I have altered the code to tolerate SNP ID's in a dict class. Here it is:
# BEGIN ACTUAL PROGRAM
# read sequence from file
sequence = get_sequence_from_fasta("example.fasta")
print "Sequence received; please wait."
# look up data
b = blast_snp_lookup(sequence)
# extract data
e = extract_snp_data(b)
outputlist.append(str(len(e)) + " single nucleotide polymorphism(s) found:\n")
if len(e) > 0:
for i in e.iterkeys():
outputlist.append(i + "\n")
else:
outputlist.append("[none]\n")
sys.exit() # nothing more to be done if no snp found
#print '-' * 40
#print "\n"
print "got a snp... still thinking..."
# do an omim search on each snp
for i in e.iterkeys():
o = omim_snp_search(i)
if len(o) == 0:
outputlist.append("No information found for " + i + "\n")
continue # nothing more to be done if no records can be found
# otherwise, find the allelic variant data
outputlist.append(i + " details:" + "\n")
for j in o:
v = extract_allelic_variant_data(j.read())
if v != None:
for a in v:
outputlist.append(a.name + "\n")
outputlist.append(a.mutation + "\n")
outputlist.append(a.description + "\n")
#print '-' * 40
#print "\n"
Complete code which removes falsely-identified SNPs
# Mike and I are writing this code in tandem. The goal of this code
# is to remove falsely identified SNPs (we will integrate our two halves
# in class tomorrow).
# Parts go as follows:
# 1. take input (dict file with rsID and keyed seq)
# 2. align the SNP with reference sequence,
# 3. find ambiguous base-pair position.
# 4. check to see if the sequence BP is in the space of the SNP BP.
# 5. remove erroneous SNP entries
# 6. return output (updated dict)
# I am doing parts 4-6.
# given pos from part 3
for i in dict_file.iterkeys():
a=seq[pos]
snpseq=dict_file.get(i)
j = 0
k = 0
while j = 0:
if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters:
b=snpseq[j]
j = 1
else: k=k+1
b_values = IUPACData.ambiguous_dna_values[b]
if a not in b_values:
del dict_file[i]
return dict_file
Remaining Documented Code
Here is the remaining code I have written that needs to be integrated. I've adjusted everything to use the appropriate inputs from the dict file. I don't have a preference as to where this might go. We could make it more object-oriented, using it as a definition, or it could stand alone in the actual body of the program. The choice is up to you. That said, it should come after the code that Michael and I are currently writing, which will eliminate false positives from our queries. I think it should come after the OMIM reporting. Inputs are our query sequence and the parsed dict class object. Outputs are a list of iterated mutations, reported per SNP. Here is the code:
# error check for mis-matches
for i in e.iterkeys():
parsed_snp = e[i]
updateseq = open('example.fasta', 'a')
updateseq.write(parsed_snp)
updateseq.close()
snp_len.append(len(formatted_snp))
# Align the FASTA file using Clustal
cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta'))
cline.set_output('example.aln')
alignment = Clustalw.do_alignment(cline)
summary_align = AlignInfo.SummaryInfo(alignment)
con = summary_align.dumb_consensus()
# Figure out where each SNP starts and stops relative to the test
# sequence. This allows us to establish a start and default stop for
# comparing individual SNPs with the reference.
size_col = Set()
size_col = summary_align.get_column(1)
snp_count = len(size_col)-1
colset = Set()
for i in range(snp_count):
start = 0
while start == 0:
for j in range(len(con)):
col=summary_align.get_column(j)
if col[i+1] != '-':
start = j
# Scan for and report mismatches in each SNP. Build in a function to tolerate
# insertions/deletions.
print "Scanning %n for mismatches with reference sequence", e[i]
for j in range(start, start+snp_len[i]):
col=summary_align.get_column
if col[0] == '-': print "Deletion at basepair ", j
elif col[i+1] == '-': print "Insertion at basepair ", j
elif col[0] != col[i+1]: print "Mismatch: ", col[i+1], j, col[0]
else: "No mismatches found in this SNP"
print "Done screening SNPs for mismatches. Any mismatches may call into question"
print "the validity of a returned SNP. Further studies in linkage disequilibrium"
print "will offer more information."