Harvard:Biophysics 101/2007/Notebook:Resmi Charalel/2007-5-3

Progress

In class on Tuesday and afterwards, Cynthia and I worked together to fully parse out the MeSH terms (all terms as well as just the major ones) and to make our individual code more efficient. We were able to speed up our look-up methods slightly, but we are still working out how to search PubMed only once so that the search is much faster (we do not have any code for this part yet). We then combined the non-redundant parts to produce the code below, which returns MeSH terms from two main sources (outlined below).
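One way we might approach the single-lookup goal, offered only as a sketch and not as code we have written: collect the PMIDs from both sources first, then fetch each distinct PMID once. The names `fetch_once` and `fake_fetch` are hypothetical, the PMIDs are made up, and the sketch is in Python 3 with a stand-in fetcher instead of a real PubMed call.

```python
def fetch_once(pmid_lists, fetch):
    """Fetch each distinct PMID a single time.

    pmid_lists: iterable of lists of PMID strings (one list per source).
    fetch: function taking a PMID and returning its record; a hypothetical
           stand-in for the real PubMed lookup.
    Returns a dict mapping PMID -> record.
    """
    records = {}
    for pmids in pmid_lists:
        for pmid in pmids:
            if pmid not in records:  # skip PMIDs that were already fetched
                records[pmid] = fetch(pmid)
    return records


# Stand-in fetcher that logs every call (no network access).
calls = []


def fake_fetch(pmid):
    calls.append(pmid)
    return {"pmid": pmid}


omim_pmids = ["111", "222"]    # made-up PMIDs from the OMIM route
search_pmids = ["222", "333"]  # made-up PMIDs from the rs-number search
records = fetch_once([omim_pmids, search_pmids], fake_fetch)
print(sorted(records))  # ['111', '222', '333']
print(len(calls))       # 3, because the shared PMID "222" is fetched once
```

With the real lookup passed in as `fetch`, a PMID that appears in both sources would cost one request instead of two.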

I have also been working on the documentation of our program and its functions, and I am still updating it as the fully compiled code comes together. I will continue to work on it past tomorrow so that it is complete and as up to date as possible.

Annotation

The following code combines the work that Cynthia and I have both done to return MeSH terms (all MeSH terms as well as just the major ones) derived from two sources:
1) Parsing OMIM for PMIDs and returning the MeSH terms of those PMIDs
2) Searching PubMed for the rs number and returning the MeSH terms of the articles found
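To make the all/major distinction concrete, here is a small self-contained Python 3 sketch of the same clean-up that the code below performs: MEDLINE marks major topics with a `*` and separates subheadings with `/`. The function name `split_mesh_headings` and the sample headings are assumptions made for illustration only.

```python
def split_mesh_headings(headings):
    """Return (all_terms, major_terms) from MEDLINE-style MeSH headings."""
    all_terms = []
    major_terms = []
    for heading in headings:
        is_major = "*" in heading  # '*' flags a major topic
        # Drop the '*' marker and turn the '/' subheading separator into a space.
        term = heading.replace("*", "").replace("/", " ")
        all_terms.append(term)
        if is_major:
            major_terms.append(term)
    return all_terms, major_terms


# Made-up sample headings in the MEDLINE style.
sample = [
    "Humans",
    "Macular Degeneration/*genetics",
    "*Polymorphism, Single Nucleotide",
]
all_terms, major_terms = split_mesh_headings(sample)
print(all_terms)
print(major_terms)
```

Every heading ends up in the first list, and only the starred ones also land in the second.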

Code

from Bio.EUtils import DBIdsClient
import xml.dom.minidom
from xml.dom.minidom import parse, parseString

# queries OMIM for a dbSNP rs number and returns all matching records in XML format
def omim_snp_search(dbsnp_id):
    client = DBIdsClient.DBIdsClient()
    query = client.search(dbsnp_id, "omim")
    records = [i.efetch(rettype="xml") for i in query]
    return records

# basic text extraction from XML; based on http://docs.python.org/lib/dom-example.html
def get_text(node_list):
    rc = ""
    for node in node_list:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

# returns the PMIDs of the references in an OMIM XML record, or None if there are none
def extract_allelic_variant_pmid(xml_str):
    dom = parseString(xml_str)
    ids = []
    for ref in dom.getElementsByTagName("Mim-reference"):
        uid_nodes = ref.getElementsByTagName("Mim-reference_pubmedUID")
        if len(uid_nodes) == 0:
            continue  # this reference has no PubMed ID
        ids.append(get_text(uid_nodes[0].childNodes))
    if len(ids) == 0:
        return None
    return ids

from Bio import PubMed
from Bio import Medline
import string

# cleans a mesh term: removes the '*' major-topic marker (if is_major) and turns '/' into a space
def parse_term(term, is_major):
    parsed_term = term
    if is_major:
        parsed_term = parsed_term.replace('*', '')
    parsed_term = parsed_term.replace('/', ' ')
    return parsed_term

# parses a list of mesh headings
# returns a nested list: [all terms, major terms only]
def parse_mesh(mesh_headings):
    all_mesh_terms = []
    major_mesh_terms = []
    for heading in mesh_headings:
        # a '*' marks the heading as a major topic of the article
        major = heading.find('*') != -1
        mesh_term = parse_term(heading, major)
        all_mesh_terms.append(mesh_term)
        if major:
            major_mesh_terms.append(mesh_term)
    return [all_mesh_terms, major_mesh_terms]


rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

all_mesh = []
all_mesh_terms = []
major_mesh_terms = []

for i in omim_snp_search("rs11200638"):
    p = extract_allelic_variant_pmid(i.read())
    if p:
        # only the first PMID (p[0]) of each OMIM record is looked up
        cur_record = medline_dict[p[0]]
        # print '\n', cur_record.title, cur_record.authors, cur_record.source
        mesh_headings = cur_record.mesh_headings
        if len(mesh_headings) != 0:
            all_mesh = parse_mesh(mesh_headings)
            all_mesh_terms.extend(all_mesh[0])
            major_mesh_terms.extend(all_mesh[1])

print '\n', "All mesh terms from OMIM PMIDs:  ", all_mesh_terms, '\n', "Major mesh terms from OMIM PMIDs:  ", major_mesh_terms

article_ids = PubMed.search_for("rs11200638")

all_mesh = []
all_mesh_terms = []
major_mesh_terms = []
for did in article_ids[0:5]:
    cur_record = medline_dict[did]
    #print '\n', cur_record.title, cur_record.authors, cur_record.source
    mesh_headings = cur_record.mesh_headings
    if len(mesh_headings) != 0:
        all_mesh = parse_mesh(mesh_headings)
        all_mesh_terms.extend(all_mesh[0])
        major_mesh_terms.extend(all_mesh[1])

print '\n', "All mesh terms from rs number:  ", all_mesh_terms, '\n', "Major mesh terms from rs number:  ", major_mesh_terms

#rest of code returns review articles on topic of interest by searching pubmed
disease = "Age-related Macular Degeneration" #should put a.name here when combined with Xiaodi's previous code
search_term = "Review[ptyp] "+disease
#print search_term

review_ids = PubMed.search_for(search_term)

count = 1

for did in review_ids[0:3]:
    cur_record = medline_dict[did]
    print '\n', count, ')  ', string.rstrip(cur_record.title), cur_record.authors, string.strip(cur_record.source)
    count=count+1

Documentation

How to Use Our Program:

Enter a sequence of interest into our web interface in FASTA format and click “Search” (subject to change). The program then compares your sequence against all known sequences in NCBI BLAST and returns an rs number for each unique SNP found (if one exists). Along with the rs number and SNP, details including any known phenotypic associations and relevant references are printed in a paragraph derived from OMIM. These references, review article references from PubMed, and other related disease information from the CDC and news agencies are also returned for further reading on the subject.

How Our Program Works:

Our program is written in Python, a programming language that, through BioPython, gives access to NCBI BLASTsnp, PubMed, and OMIM as well as other biologically relevant databases. The program enters the given sequence into BLASTsnp and parses the resulting XML output to retrieve the rs numbers of all known SNPs in the sequence that match those in the database. These rs numbers are then entered into OMIM to search for any phenotypic relevance of the SNP, and all returned entries are parsed to return the allelic variant name, mutation, and description (including relevant references). The rs number and allelic variant name are then entered into NCBI PubMed, and references specific to the allelic variant and the general phenotypic result are returned for further reading. In addition, MeSH terms are parsed from these article entries and used to search through news databases and CDC data. This last search returns relevant links and incidence data.
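As an illustration of the OMIM parsing step, the following Python 3 sketch pulls PMIDs out of a small XML fragment. The tag names `Mim-reference` and `Mim-reference_pubmedUID` are the ones our code looks for; the wrapper element, the sample XML, and the function name `extract_pmids` are made up for this example and are not real OMIM output.

```python
from xml.dom.minidom import parseString

# Made-up fragment; real OMIM records contain many more fields.
SAMPLE_XML = """<Mim-entry>
  <Mim-reference>
    <Mim-reference_pubmedUID>111</Mim-reference_pubmedUID>
  </Mim-reference>
  <Mim-reference></Mim-reference>
  <Mim-reference>
    <Mim-reference_pubmedUID>222</Mim-reference_pubmedUID>
  </Mim-reference>
</Mim-entry>"""


def extract_pmids(xml_text):
    """Return the PubMed IDs found in the references of an XML record."""
    dom = parseString(xml_text)
    pmids = []
    for ref in dom.getElementsByTagName("Mim-reference"):
        uid_nodes = ref.getElementsByTagName("Mim-reference_pubmedUID")
        if not uid_nodes:
            continue  # reference without a PubMed ID
        # Join the text nodes inside the first pubmedUID element.
        text = "".join(
            node.data
            for node in uid_nodes[0].childNodes
            if node.nodeType == node.TEXT_NODE
        ).strip()
        if text:
            pmids.append(text)
    return pmids


print(extract_pmids(SAMPLE_XML))  # ['111', '222']
```

References that carry no PubMed ID are skipped, so one incomplete reference does not stop the rest from being collected.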

Potential Applications:

Our program can be used for a variety of applications. In particular, it is able to locate single nucleotide polymorphisms (SNPs) in an entered DNA sequence and return any corresponding phenotypic correlations listed in Online Mendelian Inheritance in Man (OMIM). Using the disease correlation, the program then prints a paragraph describing the phenotypes associated with the SNP and returns a series of references to related peer-reviewed articles. Currently, it is possible to enter a given disease-correlated SNP and return information about the disease itself and about the disease risk of the individual whose sequence was entered. In the future, with the rise of personal genomics and more expansive databases containing sequence information, such an application will become more useful.
