Harvard:Biophysics 101/Notebook:ZS/2007-4-22
Tasks for this Tues/Thurs
- Advance CDC prevalence parsing program - though I don't think that needs more work (edit: done below)
- I need a new direction for work - are there any tasks that need to be completed?
- I was thinking of ways to begin tackling the not-in-OMIM case - I can do some code consolidation for sequences that are in a CDS but not documented, and develop the AA change/frameshift/etc. analysis we tackled earlier in class into something coherent.
Update: Tues
After discussion I have decided to push back code consolidation and focus on a more meaningful task that Xiaodi and Katie have been having trouble with - finding out if a sequence is in a CDS or not. I have some ideas for parsing BLAST data (which probably won't work), but also for going through Entrez Gene, where I've been able to determine the gene range WITHOUT parsing the image - I think it may work after some experimentation.
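Once a gene or CDS range is in hand, the membership test itself is a simple interval check. A minimal Python sketch, assuming the ranges have already been retrieved as 1-based inclusive (start, end) pairs; the function names and the coordinates are hypothetical and are not taken from Entrez Gene:

```python
def in_cds(position, cds_ranges):
    """Return True if a 1-based position lies inside any (start, end) range."""
    return any(start <= position <= end for start, end in cds_ranges)


def overlaps_cds(seq_start, seq_end, cds_ranges):
    """Return True if the interval [seq_start, seq_end] overlaps any range."""
    return any(seq_start <= end and start <= seq_end
               for start, end in cds_ranges)


# Hypothetical coordinates, for illustration only.
example_ranges = [(120, 480), (900, 1350)]

print(in_cds(300, example_ranges))             # position inside the first range
print(in_cds(600, example_ranges))             # position between the two ranges
print(overlaps_cds(450, 950, example_ranges))  # interval spanning both ranges
print(overlaps_cds(500, 800, example_ranges))  # interval falling in the gap
```

The hard part remains obtaining the ranges; this only covers the comparison once they are known.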
Slight modification to prev. program
#ICD9_prevalance.py
#Zachary Sun
#4.23.07, Biophysics 101

#This code does a couple of things:
#1) Enables lookup of ICD9 ID numbers when given a search term
#   *This is particularly useful as all data (WHO, CDC) is based
#   *on the ICD9 ID, which is a method of classifying all known
#   *diseases. This code is able to query a website, http://icd9cm.chrisendres.com
#   *which has the database of diseases - it returns the best hits
#   *in dirty HTML, which can then be parsed to obtain ICD9 #'s and description.
#2) Enables lookup of prevalence data in databases
#   *Currently, I only have it hooked up to the State of CA prevalence data from
#   *http://www.oshpd.ca.gov/hqad/PatientLevel/ICD9_Codes/index.htm, it does a
#   *lookup based on #1 and returns prevalence data. As soon as I find more databases
#   *I can extend the search; I haven't found much good though save the main CDC db
#   *at http://wonder.cdc.gov/.

#To do: clean up output

import os
import string
import urllib

disease_name = "asthma" #INSERT DISEASE NAME HERE FOR TESTING
ICD9code = []
found = 0

####
#querying http://icd9cm.chrisendres.com for code lookup
####
queue_name = 'http://icd9cm.chrisendres.com/index.php?action=search&srchtype=diseases&srchtext='
queue_name = queue_name + disease_name
code_lookup = urllib.urlopen(queue_name).read() #send query to site, returns dirty html

out = open("index.txt", "w") #write dirty html to file
out.write(code_lookup)
out.close()

readCode = open("index.txt", "r") #read dirty html
lookup_line = readCode.readline()

print "***ICD9 and hits, arranged by importance***\n"

while lookup_line:
    w= lookup_line.find("<div class=dlvl>") #the unique marker before the disease
    if w != -1: #if it is found
        tempCode = lookup_line[32:40] #code in this section
        tempCode = string.split(tempCode, ' ') #split into number
        ICD9code.append(tempCode[0])
        print lookup_line[32:len(lookup_line)-7] #disable in final version - shows the hits found
    lookup_line = readCode.readline()

####
#searching incidence data in CA data file
#note: returns class of hits
####
print "\n\n***Prevalance data per IDC9 above***"
print "Note: returns incidence data/yr, including subclasses\n"

fh = open(os.path.join(os.curdir, "dx05.txt")) #the CA data file

for code in ICD9code:
    fh.seek(0)
    line = fh.readline()
    totalIncidence = 0
    sumCount = 0
    while line: #for every line in the data
        line = line[:-1] #remove \n
        lineVector = string.split(line, ',') #split to vector
        if lineVector[0].find(code) != -1: #if disease hit
            totalIncidence += int(lineVector[1])
            sumCount += 1
        else:
            if sumCount > 0: #a run of hits just ended, so report it
                print "ID: ", code, "incidence: ", totalIncidence
                totalIncidence = 0
                sumCount = 0
        line = fh.readline()
    if sumCount > 0: #report a run of hits that reaches the end of the file
        print "ID: ", code, "incidence: ", totalIncidence
output:
***ICD9 and hits, arranged by importance***

493.0 Extrinsic asthma
493.1 Intrinsic asthma
493.9 Asthma, unspecified
495.8 Other specified allergic alveolitis and pneumonitis
493.2 Chronic obstructive asthma
V17.5 Asthma</div>
493.8 Other forms of asthma</div>
493.82 Cough variant asthma</div>
507.8 Due to other solids and liquids
786.07 Wheezing


***Prevalance data per IDC9 above***
Note: returns incidence data/yr, including subclasses

ID: 493.0 incidence: 13694
ID: 493.1 incidence: 148
ID: 493.9 incidence: 150445
ID: 495.8 incidence: 38
ID: 493.2 incidence: 50978
ID: V17.5 incidence: 1766
ID: 493.8 incidence: 368
ID: 493.82 incidence: 36
ID: 507.8 incidence: 232
ID: 786.07 incidence: 932
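For the "clean up output" to-do, here is a minimal Python 3 sketch of the same two steps. It is not the program above: `clean_hit` and `tally_incidence` are hypothetical helper names, the data rows are made up rather than taken from dx05.txt, and codes are matched by prefix instead of by substring search. It strips leftover tags such as the `</div>` seen on some hits and keeps one running total per code.

```python
import csv
import io
import re


def clean_hit(text):
    """Strip leftover HTML tags and split 'CODE description' into a pair."""
    plain = re.sub(r"<[^>]+>", "", text).strip()
    code, _, description = plain.partition(" ")
    return code, description


def tally_incidence(rows, codes):
    """Sum the count column of every row whose code starts with a given code."""
    totals = {code: 0 for code in codes}
    for row in rows:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        row_code, count = row[0].strip(), row[1].strip()
        for code in codes:
            if row_code.startswith(code):
                totals[code] += int(count)
    return totals


# Hit strings copied from the output above.
hits = [
    "493.0 Extrinsic asthma",
    "493.1 Intrinsic asthma",
    "493.8 Other forms of asthma</div>",
    "786.07 Wheezing",
]
codes = [clean_hit(hit)[0] for hit in hits]

# Made-up rows in an assumed "code,count" layout; not real dx05.txt values.
sample = io.StringIO("493.00,10\n493.01,5\n493.10,7\n493.81,4\n786.07,3\n")
totals = tally_incidence(csv.reader(sample), codes)

for code, total in totals.items():
    print("ID:", code, "incidence:", total)
```

Because the totals live in a dictionary, each code is reported exactly once regardless of where its rows sit in the file.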