Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-1
From OpenWetWare
Jump to navigationJump to search
Tasks Due Today
Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)
- This code should replace Xiaodi's section, which bears the same name.
# extracts snp data def extract_snp_data(str): dom = parseString(str) variants = dom.getElementsByTagName("Hit") if len(variants) == 0: return parsed = {} for v in variants: # now populate the struct id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes) score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes) parsed_seq = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes) # only consider it a genuine snp if the hit score is above 100 if int(score) > 100: parsed[id] = parsed_seq return parsed
- Since this code changes the returned output from a list to a dict, we need to alter the rest of Xiaodi's code to handle this new input. I have altered the code to tolerate SNP ID's in a dict class. Here it is:
# BEGIN ACTUAL PROGRAM # read sequence from file sequence = get_sequence_from_fasta("example.fasta") print "Sequence received; please wait." # look up data b = blast_snp_lookup(sequence) # extract data e = extract_snp_data(b) outputlist.append(str(len(e)) + " single nucleotide polymorphism(s) found:\n") if len(e) > 0: for i in e.iterkeys(): outputlist.append(i + "\n") else: outputlist.append("[none]\n") sys.exit() # nothing more to be done if no snp found #print '-' * 40 #print "\n" print "got a snp... still thinking..." # do an omim search on each snp for i in e.iterkeys(): o = omim_snp_search(i) if len(o) == 0: outputlist.append("No information found for " + i + "\n") continue # nothing more to be done if no records can be found # otherwise, find the allelic variant data outputlist.append(i + " details:" + "\n") for j in o: v = extract_allelic_variant_data(j.read()) if v != None: for a in v: outputlist.append(a.name + "\n") outputlist.append(a.mutation + "\n") outputlist.append(a.description + "\n") #print '-' * 40 #print "\n"
Complete code which removes falsely-identified SNPs
# Mike and I are writing this code in tandem. The goal of this code # is to remove falsely identified SNPs (we will integrate our two halves # in class tomorrow). # Parts go as follows: # 1. take input (dict file with rsID and keyed seq) # 2. align the SNP with reference sequence, # 3. find ambiguous base-pair position. # 4. check to see if the sequence BP is in the space of the SNP BP. # 5. remove erroneous SNP entries # 6. return output (updated dict) # I am doing parts 4-6. # given pos from part 3 for i in dict_file.iterkeys(): a=seq[pos] snpseq=dict_file.get(i) j = 0 k = 0 while j = 0: if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters: b=snpseq[j] j = 1 else: k=k+1 b_values = IUPACData.ambiguous_dna_values[b] if a not in b_values: del dict_file[i] return dict_file
Remaining Documented Code
Here is the remaining code I have written that needs to be integrated. I've adjusted everything to use the appropriate inputs from the dict file. I don't have a preference as to where this might go. We could make it more object-oriented, using it as a definition, or it could stand alone in the actual body of the program. The choice is up to you. That said, it should come after the code that Michael and I are currently writing, which will eliminate false positives from our queries. I think it should come after the OMIM reporting. Inputs are our query sequence and the parsed dict class object. Outputs are a list of iterated mutations, reported per SNP. Here is the code:
# error check for mis-matches for i in e.iterkeys(): parsed_snp = e[i] updateseq = open('example.fasta', 'a') updateseq.write(parsed_snp) updateseq.close() snp_len.append(len(formatted_snp)) # Align the FASTA file using Clustal cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta')) cline.set_output('example.aln') alignment = Clustalw.do_alignment(cline) summary_align = AlignInfo.SummaryInfo(alignment) con = summary_align.dumb_consensus() # Figure out where each SNP starts and stops relative to the test # sequence. This allows us to establish a start and default stop for # comparing individual SNPs with the reference. size_col = Set() size_col = summary_align.get_column(1) snp_count = len(size_col)-1 colset = Set() for i in range(snp_count): start = 0 while start == 0: for j in range(len(con)): col=summary_align.get_column(j) if col[i+1] != '-': start = j # Scan for and report mismatches in each SNP. Build in a function to tolerate # insertions/deletions. print "Scanning %n for mismatches with reference sequence", e[i] for j in range(start, start+snp_len[i]): col=summary_align.get_column if col[0] == '-': print "Deletion at basepair ", j elif col[i+1] == '-': print "Insertion at basepair ", j elif col[0] != col[i+1]: print "Mismatch: ", col[i+1], j, col[0] else: "No mismatches found in this SNP" print "Done screening SNPs for mismatches. Any mismatches may call into question" print "the validity of a returned SNP. Further studies in linkage disequilibrium" print "will offer more information."