Harvard:Biophysics 101/2007/Notebook:Christopher Nabel/2007-5-1

From OpenWetWare
Jump to navigationJump to search

Tasks Due Today

Change parse code to condense rsIDs and SNP seq lists from XML into a single dict file (appropriate input for code removing false SNPs)

  • This code should replace Xiaodi's section, which bears the same name.
# extracts snp data
def extract_snp_data(str):
	dom = parseString(str)
	variants = dom.getElementsByTagName("Hit")
	if len(variants) == 0:
	parsed = {}
	for v in variants:
		# now populate the struct
		id = get_text(v.getElementsByTagName("Hit_accession")[0].childNodes)
		score = get_text(v.getElementsByTagName("Hsp_score")[0].childNodes)
		parsed_seq = get_text(v.getElementsByTagName("Hsp_hseq")[0].childNodes)
		# only consider it a genuine snp if the hit score is above 100
		if int(score) > 100:
			parsed[id] = parsed_seq
	return parsed
  • Since this code changes the returned output from a list to a dict, we need to alter the rest of Xiaodi's code to handle this new input. I have altered the code to tolerate SNP ID's in a dict class. Here it is:
# read sequence from file
sequence = get_sequence_from_fasta("example.fasta")
print "Sequence received; please wait."
# look up data
b = blast_snp_lookup(sequence)
# extract data
e = extract_snp_data(b)
outputlist.append(str(len(e)) + " single nucleotide polymorphism(s) found:\n")
if len(e) > 0:
	for i in e.iterkeys():
		outputlist.append(i + "\n")
	sys.exit() # nothing more to be done if no snp found
#print '-' * 40
#print "\n"
print "got a snp... still thinking..."

# do an omim search on each snp
for i in e.iterkeys():
	o = omim_snp_search(i)
	if len(o) == 0:
		outputlist.append("No information found for " + i + "\n")
		continue # nothing more to be done if no records can be found
	# otherwise, find the allelic variant data
	outputlist.append(i + " details:" + "\n")
	for j in o:
		v = extract_allelic_variant_data(j.read())
		if v != None:
			for a in v:
				outputlist.append(a.name + "\n")
				outputlist.append(a.mutation + "\n")
				outputlist.append(a.description + "\n")
	#print '-' * 40
	#print "\n"

Complete code which removes falsely-identified SNPs

# Mike and I are writing this code in tandem.  The goal of this code
# is to remove falsely identified SNPs (we will integrate our two halves
# in class tomorrow).

# Parts go as follows:
# 1. take input (dict file with rsID and keyed seq)
# 2. align the SNP with reference sequence,
# 3. find ambiguous base-pair position.
# 4. check to see if the sequence BP is in the space of the SNP BP.
# 5. remove erroneous SNP entries
# 6. return output (updated dict)

# I am doing parts 4-6.

# given pos from part 3

for i in dict_file.iterkeys():
    j = 0
    k = 0
    while j = 0:
        if snpseq[k] in IUPACData.ambiguous_dna_letters and not in IUPACData.unambiguous_dna_letters:
            j = 1
        else: k=k+1
    b_values = IUPACData.ambiguous_dna_values[b]
    if a not in b_values:
        del dict_file[i]

return dict_file

Remaining Documented Code

Here is the remaining code I have written that needs to be integrated. I've adjusted everything to use the appropriate inputs from the dict file. I don't have a preference as to where this might go. We could make it more object-oriented, using it as a definition, or it could stand alone in the actual body of the program. The choice is up to you. That said, it should come after the code that Michael and I are currently writing, which will eliminate false positives from our queries. I think it should come after the OMIM reporting. Inputs are our query sequence and the parsed dict class object. Outputs are a list of iterated mutations, reported per SNP. Here is the code:

# error check for mis-matches
for i in e.iterkeys():
    parsed_snp = e[i]
    updateseq = open('example.fasta', 'a')

    # Align the FASTA file using Clustal

cline = Clustalw.MultipleAlignCL(os.path.join(os.curdir, 'example.fasta'))
alignment = Clustalw.do_alignment(cline)
summary_align = AlignInfo.SummaryInfo(alignment)
con = summary_align.dumb_consensus()

    # Figure out where each SNP starts and stops relative to the test
    # sequence.  This allows us to establish a start and default stop for
    # comparing individual SNPs with the reference.

size_col = Set()
size_col = summary_align.get_column(1)
snp_count = len(size_col)-1
colset = Set()
for i in range(snp_count):
    start = 0
    while start == 0:
        for j in range(len(con)):
            if col[i+1] != '-':
                start = j

    # Scan for and report mismatches in each SNP. Build in a function to tolerate
    # insertions/deletions.

    print "Scanning %n for mismatches with reference sequence", e[i]
    for j in range(start, start+snp_len[i]):
        if col[0] == '-': print "Deletion at basepair ", j
        elif col[i+1] == '-': print "Insertion at basepair ", j
        elif col[0] != col[i+1]: print "Mismatch: ", col[i+1], j, col[0] 
        else: "No mismatches found in this SNP"

print "Done screening SNPs for mismatches.  Any mismatches may call into question"
print "the validity of a returned SNP.  Further studies in linkage disequilibrium"
print "will offer more information."