Harvard:Biophysics 101/Notebook:ZS/2007-4-22
From OpenWetWare
Tasks for this Tues/Thurs
- Advance CDC prevalence parsing program - though I don't think that needs more work (edit: done below)
- I need a new direction for work - are there any tasks that need to be completed?
- I was thinking of ways to begin tackling the not-in-OMIM case: I can consolidate the code for sequences that are in a CDS but not documented, and develop the AA change/frameshift/etc. analysis we tackled earlier in class into something coherent.
 
Update: Tues
After discussion, I have decided to push back code consolidation and focus on a more meaningful task that Xiaodi and Katie have been having trouble with: finding out whether a sequence is in a CDS or not. I have some ideas for parsing BLAST data (which probably won't work), but also for going through Entrez Gene, where I've been able to determine the gene range WITHOUT parsing the image. I think it may work after some experimentation.
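Once a gene or CDS range is in hand, the in-CDS check itself reduces to an interval test. Below is a minimal sketch in Python 3 (the program further down is Python 2), assuming the ranges have already been obtained somehow as 1-based inclusive (start, end) pairs. The function names and the coordinates in `example_cds` are hypothetical and made up for illustration; nothing here queries Entrez Gene or BLAST.

```python
def in_cds(position, cds_ranges):
    """Return True if a 1-based position lies inside any inclusive (start, end) range."""
    return any(start <= position <= end for start, end in cds_ranges)


def overlaps_cds(hit_start, hit_end, cds_ranges):
    """Return True if the inclusive span [hit_start, hit_end] overlaps any range."""
    return any(hit_start <= end and start <= hit_end for start, end in cds_ranges)


# Hypothetical coordinates, for illustration only (not taken from any real gene record).
example_cds = [(100, 250), (400, 520)]

if __name__ == "__main__":
    print(in_cds(120, example_cds))             # position inside the first range
    print(overlaps_cds(260, 390, example_cds))  # span that falls between the two ranges
```

A single-position test suits a point mutation, while the overlap test suits a sequence hit that spans several bases.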
Slight modification to previous program
#ICD9_prevalance.py
#Zachary Sun
#4.23.07, Biophysics 101
#This code does a couple of things:
#1) Enables lookup of ICD9 ID numbers when given a search term
#    *This is particularly useful as all data (WHO, CDC) is based
#    *on the ICD9 ID, which is a method of classifying all known
#    *diseases. This code queries a website, http://icd9cm.chrisendres.com,
#    *which has the database of diseases - it returns the best hits
#    *in dirty HTML, which can then be parsed to obtain ICD9 #'s and descriptions.
#2) Enables lookup of prevalence data in databases
#    *Currently, I only have it hooked up to the State of CA prevalence data from
#    *http://www.oshpd.ca.gov/hqad/PatientLevel/ICD9_Codes/index.htm; it does a
#    *lookup based on #1 and returns prevalence data. As soon as I find more databases
#    *I can extend the search; I haven't found many good ones, though, save the main CDC db
#    *at http://wonder.cdc.gov/.
#To do: clean up output
import os
import string
import urllib
disease_name = "asthma" #INSERT DISEASE NAME HERE FOR TESTING
ICD9code = []
found = 0
####
#querying http://icd9cm.chrisendres.com for code lookup
####
queue_name = 'http://icd9cm.chrisendres.com/index.php?action=search&srchtype=diseases&srchtext='
queue_name = queue_name + urllib.quote(disease_name) #URL-encode so multi-word search terms work

code_lookup = urllib.urlopen(queue_name).read() #send query to site, returns dirty html
out = open("index.txt", "w") #write dirty html to file
out.write(code_lookup)
out.close()
readCode = open("index.txt", "r") #read dirty html
lookup_line = readCode.readline()
print "***ICD9 and hits, arranged by importance***\n"
while lookup_line:
    w = lookup_line.find("<div class=dlvl>") #the unique marker before each disease hit
    if w != -1: #if it is found
        tempCode = lookup_line[32:40] #the code is read from a fixed offset in this line
        tempCode = string.split(tempCode, ' ') #split the code off from the description
        ICD9code.append(tempCode[0])
        print lookup_line[32:len(lookup_line)-7] #disable in final version - shows the hits found
    lookup_line = readCode.readline()
readCode.close()
####
#searching incidence data in CA data file
#note: returns class of hits
####
print "\n\n***Prevalance data per IDC9 above***"
print "Note: returns incidence data/yr, including subclasses\n"
fh = open(os.path.join(os.curdir, "dx05.txt")) #the CA data file
for code in ICD9code:
    fh.seek(0)
    line = fh.readline()
    totalIncidence = 0
    sumCount = 0
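    while line: #for every line in the data
        line = line.rstrip("\r\n") #remove trailing newline
        lineVector = string.split(line, ',') #split to vector
        if lineVector[0].find(code) != -1: #if disease hit
            totalIncidence += int(lineVector[1])
            sumCount += 1
        else:
            if sumCount > 0: #a run of hits just ended, so report its total
                print "ID: ", code, "incidence: ", totalIncidence
                totalIncidence = 0
                sumCount = 0
        line = fh.readline()
    if sumCount > 0: #report a run of hits that reaches the end of the file
        print "ID: ", code, "incidence: ", totalIncidence
fh.close()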
Output:
***ICD9 and hits, arranged by importance***

493.0 Extrinsic asthma
493.1 Intrinsic asthma
493.9 Asthma, unspecified
495.8 Other specified allergic alveolitis and pneumonitis
493.2 Chronic obstructive asthma
V17.5 Asthma</div>
493.8 Other forms of asthma</div>
493.82 Cough variant asthma</div>
507.8 Due to other solids and liquids
786.07 Wheezing


***Prevalance data per IDC9 above***
Note: returns incidence data/yr, including subclasses

ID: 493.0 incidence: 13694
ID: 493.1 incidence: 148
ID: 493.9 incidence: 150445
ID: 495.8 incidence: 38
ID: 493.2 incidence: 50978
ID: V17.5 incidence: 1766
ID: 493.8 incidence: 368
ID: 493.82 incidence: 36
ID: 507.8 incidence: 232
ID: 786.07 incidence: 932
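Several hits in the output above still carry a trailing `</div>`, which suggests the fixed-offset slicing does not fit every line. One possible way to tidy this up is to strip any leftover tags and pull the code out with a regular expression. This is a minimal sketch in Python 3 (the program above is Python 2); `parse_hit` and both patterns are hypothetical helpers, and the code pattern is an assumption about what ICD9 codes look like (digits, optionally prefixed with V or E, with optional decimals), not something taken from the site.

```python
import re

# Assumed shape of an ICD9 code at the start of a hit: an optional V or E prefix,
# two or three digits, then optionally a dot and one or two more digits.
CODE_RE = re.compile(r"^([VE]?\d{2,3}(?:\.\d{1,2})?)\s+(.*)$")
# Any leftover HTML tag, such as a stray closing </div>.
TAG_RE = re.compile(r"<[^>]+>")


def parse_hit(text):
    """Return (code, description) for one hit, or None if no code is found."""
    cleaned = TAG_RE.sub("", text).strip()
    match = CODE_RE.match(cleaned)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


if __name__ == "__main__":
    # Sample hits copied from the output above.
    for hit in ["493.0 Extrinsic asthma", "V17.5 Asthma</div>", "493.82 Cough variant asthma</div>"]:
        print(parse_hit(hit))
```

Matching on the code's shape rather than on a character offset means the description length and any trailing markup no longer affect which characters are read as the code.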