Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-3

Task Due Today
Our goal for today was to establish a fully operational version of the group code. Since Mike wrote extensive code for mismatch analysis, I devoted all of my time to a task I could reasonably complete by this date: eliminating 'false positives', i.e. SNPs reported by the BLAST algorithm that aren't actually found in the query sequence. I did this by adding constraints to the existing code that extracts SNP data. Here is the updated code:

    from xml.dom.minidom import parseString

    # extracts snp data
    # (get_text is the helper already defined elsewhere in the group code)
    def extract_snp_data(str):
        dom = parseString(str)
        variants = dom.getElementsByTagName("Hit")
        if len(variants) == 0:
            return
        parsed = []
        for v in variants:
            # now populate the struct
            hit_def = get_text(v.getElementsByTagName("Hit_def")[0].childNodes)
            # aligned query and hit sequences from the BLAST report
            id_query = get_text(v.getElementsByTagName("Hsp_qseq")[0].childNodes)
            id_hit = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
            score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
            id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)

            # extract position of the SNP from Hit Definition
            lower_bound = hit_def.find("pos=")+4
            upper_bound = hit_def.find("len=")-1
            position = int(hit_def[lower_bound:upper_bound])

            # only consider it a genuine snp if the hit score is above 100,
            # the query/hit sequences are longer than the position of the SNP
            # and the query sequence matches the hit sequence at the SNP position
            if int(score) > 100 and position < len(id_hit) and id_query[position] == id_hit[position]:
                parsed.append(id)
        return parsed
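To make the three constraints easier to test in isolation, here is a minimal, self-contained sketch of the same filter. It is only an illustration under stated assumptions: the `get_text` helper below is my guess at what the group code's helper does, the function names `parse_snp_position`, `is_genuine_snp` and `filter_hits` are hypothetical, and `SAMPLE_XML` is a made-up, heavily simplified stand-in for a real BLAST XML report.

```python
from xml.dom.minidom import parseString

# Made-up, simplified report: two hits against a 7-base alignment.
# The second hit places its SNP at position 9, past the end of the alignment.
SAMPLE_XML = """<BlastOutput>
  <Hit>
    <Hit_def>rs=1|pos=3|len=7</Hit_def>
    <Hit_accession>rs1</Hit_accession>
    <Hsp_score>150</Hsp_score>
    <Hsp_qseq>ACGTACG</Hsp_qseq>
    <Hsp_hseq>ACGTACG</Hsp_hseq>
  </Hit>
  <Hit>
    <Hit_def>rs=2|pos=9|len=19</Hit_def>
    <Hit_accession>rs2</Hit_accession>
    <Hsp_score>150</Hsp_score>
    <Hsp_qseq>ACGTACG</Hsp_qseq>
    <Hsp_hseq>ACGTACG</Hsp_hseq>
  </Hit>
</BlastOutput>"""


def get_text(nodes):
    """Join the text nodes of a DOM node list (assumed behaviour of the helper)."""
    return "".join(n.data for n in nodes if n.nodeType == n.TEXT_NODE)


def parse_snp_position(hit_def):
    """Return the integer that sits between 'pos=' and the separator before 'len='."""
    lower = hit_def.find("pos=") + 4
    upper = hit_def.find("len=") - 1
    return int(hit_def[lower:upper])


def is_genuine_snp(score, position, query_seq, hit_seq):
    """Score above 100, position inside the sequence, bases agree at that position."""
    return (
        int(score) > 100
        and position < len(hit_seq)
        and query_seq[position] == hit_seq[position]
    )


def filter_hits(xml_text):
    """Return the accessions of the hits that pass all three constraints."""
    dom = parseString(xml_text)
    kept = []
    for hit in dom.getElementsByTagName("Hit"):
        def field(tag, hit=hit):
            return get_text(hit.getElementsByTagName(tag)[0].childNodes)

        position = parse_snp_position(field("Hit_def"))
        if is_genuine_snp(field("Hsp_score"), position,
                          field("Hsp_qseq"), field("Hsp_hseq")):
            kept.append(field("Hit_accession"))
    return kept


print(filter_hits(SAMPLE_XML))
```

Because the length check comes before the base comparison, an out-of-range position is rejected without ever indexing past the end of the sequence.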

I used our old friend, Apoe, and ran it through the old and updated versions of the group code. Here's the output from the old code:

    Sequence received; please wait.
    10 single nucleotide polymorphism(s) found:
    rs769455 rs28931578 rs28931577 rs11083750 rs28931576 rs429358 rs769452 rs28931579 rs7412 rs12982192

    No information found for rs769455
    No information found for rs28931578
    No information found for rs28931577
    No information found for rs11083750
    No information found for rs28931576
    No information found for rs429358
    No information found for rs769452
    No information found for rs28931579
    No information found for rs7412
    No information found for rs12982192

Here's the output using the code I updated:

    Sequence received; please wait.
    9 single nucleotide polymorphism(s) found:
    rs769455 rs28931578 rs28931577 rs11083750 rs28931576 rs429358 rs769452 rs28931579 rs7412

    No information found for rs769455
    No information found for rs28931578
    No information found for rs28931577
    No information found for rs11083750
    No information found for rs28931576
    No information found for rs429358
    No information found for rs769452
    No information found for rs28931579
    No information found for rs7412

The removed SNP, rs12982192, was identified by BLAST, but its position lies past the end of the query sequence. The new length check therefore works as intended: it drops an erroneous SNP while keeping the other nine.
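As a quick check, the dropped ID can be read off by comparing the two runs. The snippet below is only a sketch: the two lists are copied by hand from the outputs above, and the variable names are mine rather than anything in the group code.

```python
# rs IDs reported by the old code (10) and by the updated code (9),
# copied from the two outputs above.
old_run = [
    "rs769455", "rs28931578", "rs28931577", "rs11083750", "rs28931576",
    "rs429358", "rs769452", "rs28931579", "rs7412", "rs12982192",
]
new_run = [
    "rs769455", "rs28931578", "rs28931577", "rs11083750", "rs28931576",
    "rs429358", "rs769452", "rs28931579", "rs7412",
]

# IDs present in the old run but filtered out by the new constraints.
still_reported = set(new_run)
removed = [rs for rs in old_run if rs not in still_reported]

print(removed)  # ['rs12982192']
```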