Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-1

Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)
def extract_snp_data(str): dom = parseString(str) variants = dom.getElementsByTagName("Hit") if len(variants) == 0: return parsed = {} for v in variants: # now populate the struct id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes) score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes) parsed_seq = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes) # only consider it a genuine snp if the hit score is above 100 if int(score) > 100: parsed[id] = parsed_seq return parsed sequence = get_sequence_from_fasta("example.fasta") print "Sequence received; please wait." b = blast_snp_lookup(sequence) e = extract_snp_data(b) outputlist.append(str(len(e)) + " single nucleotide polymorphism(s) found:\n") if len(e) > 0: for i in e.iterkeys: outputlist.append(i + "\n") else: outputlist.append("[none]\n") sys.exit # nothing more to be done if no snp found print "got a snp... still thinking..."
 * This code should replace Xiaodi's section, which bears the same name.
 * 1) extracts snp data
 * Since this code changes the returned output from a list to a dict, we need to alter the rest of Xiaodi's code to handle this new input. I have altered the code to tolerate SNP ID's in a dict class.  Here it is:
 * 1) BEGIN ACTUAL PROGRAM
 * 2) read sequence from file
 * 1) look up data
 * 1) extract data
 * 1) print '-' * 40
 * 2) print "\n"

for i in e.iterkeys: o = omim_snp_search(i) if len(o) == 0: outputlist.append("No information found for " + i + "\n") continue # nothing more to be done if no records can be found # otherwise, find the allelic variant data outputlist.append(i + " details:" + "\n") for j in o:		v = extract_allelic_variant_data(j.read) if v != None: for a in v:				outputlist.append(a.name + "\n") outputlist.append(a.mutation + "\n") outputlist.append(a.description + "\n") #print '-' * 40 #print "\n"
 * 1) do an omim search on each snp

Complete code which removes falsely-identified SNPs

 * 1) Mike and I are writing this code in tandem.  The goal of this code
 * 2) is to remove falsely identified SNPs (we will integrate our two halves
 * 3) in class tomorrow).


 * 1) Parts go as follows:
 * 2) 1. take input (dict file with rsID and keyed seq)
 * 3) 2. align the SNP with reference sequence,
 * 4) 3. find ambiguous base-pair position.
 * 5) 4. check to see if the sequence BP is in the space of the SNP BP.
 * 6) 5. remove erroneous SNP entries
 * 7) 6. return output (updated dict)


 * 1) I am doing parts 4-6.


 * 1) given pos from part 3

for i in dict_file.iterkeys: a=seq[pos] snpseq=dict_file.get(i) j = 0 k = 0 while j = 0: if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters: b=snpseq[j] j = 1 else: k=k+1 b_values = IUPACData.ambiguous_dna_values[b] if a not in b_values: del dict_file[i]

return dict_file

Remaining Documented Code
Here is the remaining code I have written that needs to be integrated. I've adjusted everything to use the appropriate inputs from the dict file. I don't have a preference as to where this might go. We could make it more object-oriented, using it as a definition, or it could stand alone in the actual body of the program. The choice is up to you. That said, it should come after the code that Michael and I are currently writing, which will eliminate false positives from our queries. I think it should come after the OMIM reporting. Inputs are our query sequence and the parsed dict class object. Outputs are a list of iterated mutations, reported per SNP. Here is the code:

for i in e.iterkeys: parsed_snp = e[i] updateseq = open('example.fasta', 'a') updateseq.write(parsed_snp) updateseq.close snp_len.append(len(formatted_snp))
 * 1) error check for mis-matches

# Align the FASTA file using Clustal

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta')) cline.set_output('example.aln') alignment = Clustalw.do_alignment(cline) summary_align = AlignInfo.SummaryInfo(alignment) con = summary_align.dumb_consensus

# Figure out where each SNP starts and stops relative to the test # sequence. This allows us to establish a start and default stop for # comparing individual SNPs with the reference.

size_col = Set size_col = summary_align.get_column(1) snp_count = len(size_col)-1 colset = Set for i in range(snp_count): start = 0 while start == 0: for j in range(len(con)): col=summary_align.get_column(j) if col[i+1] != '-': start = j

# Scan for and report mismatches in each SNP. Build in a function to tolerate # insertions/deletions.

print "Scanning %n for mismatches with reference sequence", e[i] for j in range(start, start+snp_len[i]): col=summary_align.get_column if col[0] == '-': print "Deletion at basepair ", j       elif col[i+1] == '-': print "Insertion at basepair ", j        elif col[0] != col[i+1]: print "Mismatch: ", col[i+1], j, col[0] else: "No mismatches found in this SNP"

print "Done screening SNPs for mismatches. Any mismatches may call into question" print "the validity of a returned SNP. Further studies in linkage disequilibrium" print "will offer more information."