Harvard:Biophysics 101/Notebook:ZS/2007-4-22

Tasks for this Tues/Thurs

 * 1) Advance the CDC prevalence parsing program, though I don't think it needs more work (edit: done below)
 * 2) I need a new direction for work. Are there any tasks that need to be completed?
 * 3) I was thinking of ways to begin tackling the not-in-OMIM case. I can do some code consolidation for sequences that are CDS but not documented, and turn the AA change/frameshift/etc. analysis we tackled earlier in class into something coherent.

Update: Tues
After discussion, I have decided to push back code consolidation and focus on a more meaningful task that Xiaodi and Katie have been having trouble with: finding out if a sequence is in a CDS or not. I have some ideas for parsing BLAST data (which probably won't work), but also for going through Entrez Gene, where I've been able to determine the gene range WITHOUT parsing the image. I think it may work after some experimentation.
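Once a gene's coding ranges are known (for example, from an Entrez Gene record), the in-CDS question reduces to an interval check. Here is a minimal sketch, assuming the ranges are already available as 1-based inclusive (start, end) pairs. The function names and the example coordinates are hypothetical and not taken from any real record:

```python
def in_cds(position, cds_ranges):
    """Return True if a 1-based position lies inside any (start, end) range."""
    for start, end in cds_ranges:
        if start <= position <= end:
            return True
    return False


def overlaps_cds(seq_start, seq_end, cds_ranges):
    """Return True if the interval [seq_start, seq_end] overlaps any range."""
    for start, end in cds_ranges:
        if seq_start <= end and start <= seq_end:
            return True
    return False


# Hypothetical coding ranges, for illustration only.
example_ranges = [(100, 250), (400, 520)]
print(in_cds(120, example_ranges))
print(overlaps_cds(260, 390, example_ranges))
```

This leaves open how the ranges are obtained, which is the part that still needs experimentation.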

Slight modification to previous program

#ICD9_prevalance.py
#Zachary Sun
#4.23.07, Biophysics 101

#This code does a couple of things:
#1) Enables lookup of ICD9 ID numbers when given a search term
#   *This is particularly useful as all data (WHO, CDC) is based
#   *on the ICD9 ID, which is a method of classifying all known
#   *diseases. This code is able to query a website, http://icd9cm.chrisendres.com,
#   *which has the database of diseases - it returns the best hits
#   *in dirty HTML, which can then be parsed to obtain ICD9 #'s and descriptions.
#2) Enables lookup of prevalence data in databases
#   *Currently, I only have it hooked up to the State of CA prevalence data from
#   *http://www.oshpd.ca.gov/hqad/PatientLevel/ICD9_Codes/index.htm; it does a
#   *lookup based on #1 and returns prevalence data. As soon as I find more databases
#   *I can extend the search; I haven't found much good, though, save the main CDC db
#   *at http://wonder.cdc.gov/.

#To do: clean up output

import os
import string
import urllib

disease_name = "asthma" #INSERT DISEASE NAME HERE FOR TESTING
ICD9code = []
found = 0

#querying http://icd9cm.chrisendres.com for code lookup
queue_name = 'http://icd9cm.chrisendres.com/index.php?action=search&srchtype=diseases&srchtext='
queue_name = queue_name + disease_name

code_lookup = urllib.urlopen(queue_name).read() #send query request to site, returns dirty html
out = open("index.txt", "w") #write dirty html to file
out.write(code_lookup)
out.close()

readCode = open("index.txt", "r") #read dirty html
lookup_line = readCode.readline()
print "***ICD9 and hits, arranged by importance***\n"
while lookup_line:
    w = lookup_line.find(" ") #the unique marker before the disease
    if w != -1: #if it is found
        tempCode = lookup_line[32:40] #code in this section
        tempCode = string.split(tempCode, ' ') #split into number
        ICD9code.append(tempCode[0])
        print lookup_line[32:len(lookup_line)-7] #disable in final version - shows the hits found
    lookup_line = readCode.readline()
readCode.close()

print "\n\n***Prevalance data per IDC9 above***"
print "Note: returns incidence data/yr, including subclasses\n"

#searching incidence data in CA data file
#note: returns class of hits
fh = open(os.path.join(os.curdir, "dx05.txt")) #the CA data file

for code in ICD9code:
    fh.seek(0) #rewind the data file for each code
    line = fh.readline()
    totalIncidence = 0
    sumCount = 0
    while line: #for every line in the data
        line = line.rstrip('\n') #remove \n
        lineVector = string.split(line, ',') #split to vector
        if lineVector[0].find(code) != -1: #if disease hit
            totalIncidence += int(lineVector[1])
            sumCount += 1
        else:
            if sumCount > 0: #a run of hits just ended, so report it
                print "ID: ", code, "incidence: ", totalIncidence
                totalIncidence = 0
                sumCount = 0
        line = fh.readline()
    if sumCount > 0: #report a run of hits that reaches the end of the file
        print "ID: ", code, "incidence: ", totalIncidence
fh.close()

output:

***ICD9 and hits, arranged by importance***

493.0 Extrinsic asthma
493.1 Intrinsic asthma
493.9 Asthma, unspecified
495.8 Other specified allergic alveolitis and pneumonitis
493.2 Chronic obstructive asthma
V17.5 Asthma
493.8 Other forms of asthma
493.82 Cough variant asthma
507.8 Due to other solids and liquids
786.07 Wheezing

***Prevalance data per IDC9 above***
Note: returns incidence data/yr, including subclasses

ID: 493.0 incidence:  13694
ID: 493.1 incidence:  148
ID: 493.9 incidence:  150445
ID: 495.8 incidence:  38
ID: 493.2 incidence:  50978
ID: V17.5 incidence:  1766
ID: 493.8 incidence:  368
ID: 493.82 incidence:  36
ID: 507.8 incidence:  232
ID: 786.07 incidence:  932
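
The to-do item about cleaning up the output could be approached by collecting totals in a dictionary and printing aligned columns. Here is a minimal sketch, assuming the data file keeps the comma-separated layout the program relies on (code in the first field, count in the second). `tally_incidence` and `format_report` are hypothetical helpers, and prefix matching with `startswith` is swapped in here for the substring search used in the program. The sample rows are made up for illustration and are not taken from dx05.txt.

```python
def tally_incidence(lines, codes):
    """Sum counts per code over rows whose first field starts with the code."""
    totals = dict((code, 0) for code in codes)
    for line in lines:
        fields = line.strip().split(',')
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        try:
            count = int(fields[1])
        except ValueError:
            continue  # skip header rows or non-numeric counts
        for code in codes:
            if fields[0].startswith(code):
                totals[code] += count
    return totals


def format_report(codes, totals):
    """Return one aligned 'code count' line per code, in the given order."""
    return ["%-8s %8d" % (code, totals[code]) for code in codes]


# Made-up rows in the assumed "code,count" layout (not real dx05.txt data).
sample_rows = ["493.00,10", "493.01,5", "493.90,7", "786.07,3"]
sample_codes = ["493.0", "493.9", "786.07"]
sample_totals = tally_incidence(sample_rows, sample_codes)
for row in format_report(sample_codes, sample_totals):
    print(row)
```

Compared with the loop in the program, this reads the data once for all codes, and the totals do not depend on matching rows being adjacent in the file.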