Harvard:Biophysics 101/Notebook:ZS/2007-4-22

From OpenWetWare

Tasks for this Tues/Thurs

  1. Advance the CDC prevalence parsing program, though I don't think it needs more work (edit: done below)
  2. I need a new direction for work. Are there any tasks that need to be completed?
    • I was thinking of ways to begin tackling the not-in-OMIM case: I can consolidate the code for sequences that are in a CDS but not documented, and develop the AA change/frameshift/etc. analysis we tackled earlier in class into something coherent.
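
As a starting point for the AA change/frameshift analysis, a minimal sketch could compare a reference CDS against a variant CDS. It assumes both are given as in-frame DNA strings beginning at the start codon, and it uses the standard genetic code. The names `translate`, `first_stop` and `classify_change` are hypothetical and are not taken from the class code.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an in-frame DNA string; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def first_stop(protein):
    """Index of the first stop ('*'), or the protein length if there is none."""
    index = protein.find("*")
    return index if index != -1 else len(protein)


def classify_change(ref_cds, var_cds):
    """Coarse label for the difference between a reference and a variant CDS."""
    length_change = len(var_cds) - len(ref_cds)
    if length_change % 3 != 0:
        return "frameshift"
    if length_change != 0:
        return "in-frame indel"
    ref_protein, var_protein = translate(ref_cds), translate(var_cds)
    if ref_protein == var_protein:
        return "synonymous"  # also returned when the sequences are identical
    if first_stop(var_protein) < first_stop(ref_protein):
        return "nonsense"  # the variant introduces an earlier stop codon
    return "missense"  # any other amino acid difference, including a lost stop
```

The labels are deliberately coarse. A real analysis would also need to handle splice sites and changes outside the CDS, which this sketch ignores.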

Update: Tues

After discussion, I have decided to push back the code consolidation and focus on a more meaningful task that Xiaodi and Katie have been having trouble with: finding out whether a sequence is in a CDS or not. I have some ideas for parsing BLAST data (which probably won't work), but also for going through Entrez Gene, from which I've been able to determine the gene range WITHOUT parsing the image. I think it may work after some experimentation.
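
Once CDS coordinates are in hand (for example, read from an Entrez Gene record), the "is it in a CDS" question reduces to an interval test. The sketch below is only an illustration of that idea. The function names `in_cds` and `overlaps_cds` and the list-of-(start, end)-tuples layout are assumptions, not something taken from Entrez Gene or from Xiaodi and Katie's code.

```python
def in_cds(position, cds_ranges):
    """True if a single position lies inside any (start, end) range, inclusive."""
    return any(start <= position <= end for start, end in cds_ranges)


def overlaps_cds(seq_start, seq_end, cds_ranges):
    """True if the span seq_start..seq_end overlaps any (start, end) range."""
    return any(seq_start <= end and start <= seq_end for start, end in cds_ranges)


# Made-up coordinates, purely for illustration.
example_ranges = [(100, 250), (400, 520)]
print(in_cds(120, example_ranges))             # True: 120 is inside 100..250
print(in_cds(300, example_ranges))             # False: 300 falls between the two ranges
print(overlaps_cds(240, 410, example_ranges))  # True: the span touches both ranges
```

All coordinates must use the same numbering convention (for example, 1-based and inclusive) as the source they were read from, or the comparisons can be off by one.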

Slight modification to the previous program

#ICD9_prevalance.py
#Zachary Sun
#4.23.07, Biophysics 101

#This code does a couple of things:
#1) Enables lookup of ICD9 ID numbers when given a search term
#    *This is particularly useful as all data (WHO, CDC) is based
#    *on the ICD9 ID, which is a method of classifying all known
#    *diseases. This code queries a website, http://icd9cm.chrisendres.com,
#    *which has the database of diseases - it returns the best hits
#    *in dirty HTML, which can then be parsed to obtain ICD9 #'s and descriptions.
#2) Enables lookup of prevalence data in databases
#    *Currently, I only have it hooked up to the State of CA prevalence data from
#    *http://www.oshpd.ca.gov/hqad/PatientLevel/ICD9_Codes/index.htm; it does a
#    *lookup based on #1 and returns prevalence data. As soon as I find more databases
#    *I can extend the search; I haven't found many good ones, though, save the main CDC db
#    *at http://wonder.cdc.gov/.

#To do: clean up output

import os
import string
import urllib


disease_name = "asthma" #INSERT DISEASE NAME HERE FOR TESTING
ICD9code = []
found = 0


####
#querying http://icd9cm.chrisendres.com for code lookup
####
queue_name = 'http://icd9cm.chrisendres.com/index.php?action=search&srchtype=diseases&srchtext='
queue_name = queue_name + urllib.quote_plus(disease_name) #URL-encode so multi-word terms work

code_lookup = urllib.urlopen(queue_name).read() #send query to site, returns dirty html
out = open("index.txt", "w") #write dirty html to file
out.write(code_lookup)
out.close()

readCode = open("index.txt", "r") #read dirty html
lookup_line = readCode.readline()
print "***ICD9 and hits, arranged by importance***\n"
while lookup_line:
    w = lookup_line.find("<div class=dlvl>") #the unique marker before the disease
    if w != -1: #if it is found
        tempCode = lookup_line[32:40] #code sits at a fixed offset in this section
        tempCode = string.split(tempCode, ' ') #split the code off from the description
        ICD9code.append(tempCode[0])
        print lookup_line[32:len(lookup_line)-7] #disable in final version - shows the hits found
    lookup_line = readCode.readline()
readCode.close()


####
#searching incidence data in CA data file
#note: returns class of hits
####
print "\n\n***Prevalance data per IDC9 above***"
print "Note: returns incidence data/yr, including subclasses\n"

fh = open(os.path.join(os.curdir, "dx05.txt")) #the CA data file

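for code in ICD9code:
    fh.seek(0)
    line = fh.readline()
    totalIncidence = 0
    sumCount = 0
    while line: #for every line in the data
        line = line.rstrip('\r\n') #remove trailing newline
        lineVector = string.split(line, ',') #split to vector
        if lineVector[0].find(code) != -1: #if disease hit (substring match, so subclasses count too)
            totalIncidence += int(lineVector[1])
            sumCount += 1
        else:
            if sumCount > 0: #a run of hits just ended, so report it
                print "ID: ", code, "incidence: ", totalIncidence
                totalIncidence = 0
                sumCount = 0
        line = fh.readline()
    if sumCount > 0: #report a run of hits that reaches the end of the file
        print "ID: ", code, "incidence: ", totalIncidence

fh.close()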

output:

***ICD9 and hits, arranged by importance***


493.0 Extrinsic asthma
493.1 Intrinsic asthma
493.9 Asthma, unspecified
495.8 Other specified allergic alveolitis and pneumonitis
493.2 Chronic obstructive asthma
V17.5 Asthma</div>
493.8 Other forms of asthma</div>
493.82 Cough variant asthma</div>
507.8 Due to other solids and liquids
786.07 Wheezing


***Prevalance data per IDC9 above***
Note: returns incidence data/yr, including subclasses

ID:  493.0 incidence:  13694
ID:  493.1 incidence:  148
ID:  493.9 incidence:  150445
ID:  495.8 incidence:  38
ID:  493.2 incidence:  50978
ID:  V17.5 incidence:  1766
ID:  493.8 incidence:  368
ID:  493.82 incidence:  36
ID:  507.8 incidence:  232
ID:  786.07 incidence:  932
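
A few of the hits above still carry a trailing `</div>`, because the script trims a fixed number of characters from the end of each line. One possible cleanup, toward the "clean up output" to-do, is to strip any leftover tags with a regular expression and then split the code from the description. This is only a sketch: the helper name `clean_hit` is hypothetical, and it assumes each hit has the "code, space, description" shape seen in the output above.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # matches any leftover HTML tag such as </div>


def clean_hit(raw_hit):
    """Split a raw hit string into (ICD9 code, description), dropping HTML tags."""
    text = TAG_RE.sub("", raw_hit).strip()
    code, _, description = text.partition(" ")
    return code, description.strip()


print(clean_hit("493.82 Cough variant asthma</div>"))  # ('493.82', 'Cough variant asthma')
print(clean_hit("493.0 Extrinsic asthma"))             # ('493.0', 'Extrinsic asthma')
```

Removing tags by pattern does not depend on how many characters follow the description, which is what the fixed-length slice gets wrong for some lines.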