Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-3
From OpenWetWare
Revision as of 04:59, 3 May 2007
Tasks Due Today
Our goal for today was to establish a fully operational version of the group code. Since Mike wrote extensive code for mismatch analysis, I dedicated all of my time to a task I could reasonably complete by this date: eliminating 'false positives', that is, SNPs reported by the BLAST search that are not actually present in the query sequence. I did this by placing additional constraints on the existing code that extracts SNP data. Here is the code I updated:
# extracts snp data
from xml.dom.minidom import parseString

def extract_snp_data(xml_str):
    dom = parseString(xml_str)
    variants = dom.getElementsByTagName("Hit")
    if len(variants) == 0:
        return
    parsed = []
    for v in variants:
        # now populate the struct (get_text is the helper defined in the group code)
        hit_def = get_text(v.getElementsByTagName("Hit_def")[0].childNodes)
        # Hsp_qseq is the aligned query sequence, Hsp_hseq the aligned hit sequence
        id_query = get_text(v.getElementsByTagName("Hsp_qseq")[0].childNodes)
        id_hit = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
        score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
        snp_id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
        # extract position of the SNP from Hit Definition
        lower_bound = hit_def.find("pos=")+4
        upper_bound = hit_def.find("len=")-1
        position = int(hit_def[lower_bound:upper_bound])
        # only consider it a genuine snp if the hit score is above 100,
        # the query/hit sequences are longer than the position of the SNP
        # and the query sequence matches the hit sequence at the SNP position
        # (position is used directly as an index into the aligned sequences;
        # alignment offsets and gaps are not accounted for)
        if int(score) > 100 and position < len(id_query) and position < len(id_hit):
            if id_query[position] == id_hit[position]:
                parsed.append(id_hit)
    return parsed
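To make the three constraints easier to check by hand, here is a minimal, self-contained sketch of the same filtering idea run on a small hand-written input. Everything in it is an assumption made for illustration: the sample XML is a simplified stand-in rather than real BLAST output, the rs_demo identifiers are made up, the helper names (text_of, snp_position, filter_hits) are hypothetical and not part of the group code, and pos= is treated as a zero-based index into the aligned sequences (subtract 1 if it turns out to be one-based). Unlike the function above, it collects the accession of each surviving hit.

```python
import re
from xml.dom.minidom import parseString

# Hand-written, simplified stand-in for BLAST XML (NOT real output; ids are made up).
SAMPLE_XML = """<BlastOutput>
<Hit><Hit_def>demo pos=2|len=5</Hit_def><Hit_accession>rs_demo_1</Hit_accession>
  <Hsp_score>150</Hsp_score><Hsp_qseq>ACGTA</Hsp_qseq><Hsp_hseq>ACGTA</Hsp_hseq></Hit>
<Hit><Hit_def>demo pos=2|len=5</Hit_def><Hit_accession>rs_demo_2</Hit_accession>
  <Hsp_score>150</Hsp_score><Hsp_qseq>ACTTA</Hsp_qseq><Hsp_hseq>ACGTA</Hsp_hseq></Hit>
<Hit><Hit_def>demo pos=2|len=5</Hit_def><Hit_accession>rs_demo_3</Hit_accession>
  <Hsp_score>80</Hsp_score><Hsp_qseq>ACGTA</Hsp_qseq><Hsp_hseq>ACGTA</Hsp_hseq></Hit>
<Hit><Hit_def>demo pos=9|len=20</Hit_def><Hit_accession>rs_demo_4</Hit_accession>
  <Hsp_score>150</Hsp_score><Hsp_qseq>ACGTA</Hsp_qseq><Hsp_hseq>ACGTA</Hsp_hseq></Hit>
</BlastOutput>"""


def text_of(node, tag):
    """Return the stripped text of the first <tag> element under node."""
    child = node.getElementsByTagName(tag)[0].firstChild
    return child.data.strip() if child is not None else ""


def snp_position(hit_def):
    """Pull the integer after 'pos=' out of a hit definition, or None if absent."""
    match = re.search(r"pos=(\d+)", hit_def)
    return int(match.group(1)) if match else None


def filter_hits(xml_text, min_score=100):
    """Keep accessions of hits that pass the score, length and base-match checks."""
    kept = []
    for hit in parseString(xml_text).getElementsByTagName("Hit"):
        pos = snp_position(text_of(hit, "Hit_def"))
        score = int(text_of(hit, "Hsp_score"))
        query_seq = text_of(hit, "Hsp_qseq")
        hit_seq = text_of(hit, "Hsp_hseq")
        if pos is None or score <= min_score:
            continue  # low-scoring hit or no position in the definition
        # ASSUMPTION: pos is a zero-based index into both aligned sequences.
        if pos < len(query_seq) and pos < len(hit_seq):
            if query_seq[pos] == hit_seq[pos]:
                kept.append(text_of(hit, "Hit_accession"))
    return kept


print(filter_hits(SAMPLE_XML))
```

In this sample only the first hit survives: the second differs from the query at the SNP position, the third scores below the cutoff, and the fourth reports a position beyond the end of the aligned sequences.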
I used our old friend, Apoe,