Harvard:Biophysics 101/2007/Notebook:Resmi Charalel/2007-5-3

Progress
In class on Tuesday and afterwards, Cynthia and I worked together to finish parsing the MeSH terms (all terms as well as just the major ones) and to make our individual code more efficient. We slightly sped up our look-up methods, but we are still working out how to search PubMed only once so that the search is much faster (we do not have any code for this part yet). We then combined the non-redundant parts into the code below, which returns MeSH terms from two main sources (outlined below).

I have also been working on the documentation of our program and its functions, and I am updating it as the fully compiled code comes together. I will continue working on it beyond tomorrow so that it is complete and as up to date as possible.

Annotation

 * The following code combines the work that Cynthia and I have done to return MeSH terms (all MeSH terms as well as just the major ones) derived from two sources:
 * 1) Parsing OMIM for PMIDs and returning the MeSH terms of those PMIDs
 * 2) Searching PubMed for the rs number and returning the MeSH terms of the articles found
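Because both sources ultimately yield PMIDs, the same article can turn up twice. The following is a minimal sketch (written for Python 3, and not part of our program) of merging the two PMID lists and fetching each record only once. `fetch_record` is a hypothetical stand-in for whatever performs the actual PubMed lookup, and the PMIDs in the usage lines are made up.

```python
def merge_pmids(*pmid_lists):
    """Merge several PMID lists, dropping duplicates but keeping first-seen order."""
    seen = set()
    merged = []
    for pmid_list in pmid_lists:
        for pmid in pmid_list:
            if pmid not in seen:
                seen.add(pmid)
                merged.append(pmid)
    return merged


def fetch_once(pmids, fetch_record, cache=None):
    """Fetch each PMID at most once, reusing anything already in `cache`.

    `fetch_record` is a hypothetical callable mapping a PMID to its record.
    """
    if cache is None:
        cache = {}
    for pmid in pmids:
        if pmid not in cache:
            cache[pmid] = fetch_record(pmid)
    return cache


# Usage with made-up PMIDs and a fake lookup that only records its calls.
omim_pmids = ["111", "222"]      # as if taken from the OMIM route
search_pmids = ["222", "333"]    # as if taken from the rs-number search
calls = []


def fake_fetch(pmid):
    calls.append(pmid)
    return {"pmid": pmid}


records = fetch_once(merge_pmids(omim_pmids, search_pmids), fake_fetch)
print(sorted(records), calls)
```

Passing the same `cache` dictionary to later calls would let a second search reuse records that were already retrieved.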

Code
from Bio.EUtils import DBIdsClient
import xml.dom.minidom
from xml.dom.minidom import parse, parseString

# C-style struct to pass parameters
class PubmedID:
    pass

# queries the database and returns all info in an XML format
def omim_snp_search(dnsnp_id):
    client = DBIdsClient.DBIdsClient()
    query = client.search(dnsnp_id, "omim")
    records = [i.efetch(rettype="xml") for i in query]
    return records

# basic text extraction from XML; based on http://docs.python.org/lib/dom-example.html
def get_text(node_list):
    rc = ""
    for node in node_list:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

# returns the PMIDs of all references in an OMIM XML record (None if there are no references)
def extract_allelic_variant_pmid(str):
    dom = parseString(str)
    pmids = dom.getElementsByTagName("Mim-reference")
    if len(pmids) == 0:
        return
    ids = []
    for p in pmids:
        i = PubmedID()
        i.pmid = get_text(p.getElementsByTagName("Mim-reference_pubmedUID")[0].childNodes)
        ids.append(i.pmid)
    return ids

from Bio import PubMed
from Bio import Medline
import string

# parses a mesh term to remove * and /
def parse_term(str, bool):
    parsed_term = str
    if(bool):
        parsed_term = parsed_term.replace('*', '')
    if str.find('/') != -1:
        parsed_term = parsed_term.replace('/', ' ')
    return parsed_term

# parses list of mesh terms
# returns embedded list: one list with all terms and one with the major terms
def parse_mesh(list):
    all_mesh_terms = []
    major_mesh_terms = []
    mesh_term = ''
    for i in range(len(list)):
        major = False
        if list[i].find('*') == -1:
            mesh_term = parse_term(list[i], major)
            all_mesh_terms.append(mesh_term)
        else:
            major = True
            mesh_term = parse_term(list[i], major)
            major_mesh_terms.append(mesh_term)
            all_mesh_terms.append(mesh_term)
    all_mesh = [all_mesh_terms, major_mesh_terms]
    return all_mesh

rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

all_mesh = []
all_mesh_terms = []
major_mesh_terms = []

for i in omim_snp_search("rs11200638"):
    p = extract_allelic_variant_pmid(i.read())
    if p != None:
        # for s in p:
        #     print p[0]
        cur_record = medline_dict[p[0]]
        # print '\n', cur_record.title, cur_record.authors, cur_record.source
        mesh_headings = cur_record.mesh_headings
        if len(mesh_headings) != 0:
            all_mesh = parse_mesh(mesh_headings)
            all_mesh_terms.extend(all_mesh[0])
            major_mesh_terms.extend(all_mesh[1])

print '\n', "All mesh terms from OMIM PMIDs: ", all_mesh_terms, '\n', "Major mesh terms from OMIM PMIDs:  ", major_mesh_terms

article_ids = PubMed.search_for("rs11200638")

all_mesh = []
all_mesh_terms = []
major_mesh_terms = []
for did in article_ids[0:5]:
    cur_record = medline_dict[did]
    # print '\n', cur_record.title, cur_record.authors, cur_record.source
    mesh_headings = cur_record.mesh_headings
    if len(mesh_headings) != 0:
        all_mesh = parse_mesh(mesh_headings)
        all_mesh_terms.extend(all_mesh[0])
        major_mesh_terms.extend(all_mesh[1])

print '\n', "All mesh terms from rs number: ", all_mesh_terms, '\n', "Major mesh terms from rs number:  ", major_mesh_terms

# rest of code returns review articles on topic of interest by searching pubmed
disease = "Age-related Macular Degeneration"  # should put a.name here when combined with Xiaodi's previous code
search_term = "Review[ptyp] " + disease
# print search_term

review_ids = PubMed.search_for(search_term)

count = 1

for did in review_ids[0:3]:
    cur_record = medline_dict[did]
    print '\n', count, ') ', string.rstrip(cur_record.title), cur_record.authors, string.strip(cur_record.source)
    count = count + 1
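To illustrate what `parse_mesh` in the code above returns, here is a self-contained sketch of the same splitting logic written for Python 3. The function name `split_mesh_headings` and the sample headings are illustrative assumptions; they are not output from an actual PubMed record.

```python
def split_mesh_headings(headings):
    """Return (all_terms, major_terms) from MEDLINE-style MeSH heading strings.

    A '*' marks a major topic and '/' separates a heading from its qualifier.
    The '*' is removed and the '/' becomes a space, as parse_term does.
    """
    all_terms = []
    major_terms = []
    for heading in headings:
        term = heading.replace('*', '').replace('/', ' ')
        all_terms.append(term)
        if '*' in heading:
            major_terms.append(term)
    return all_terms, major_terms


# Made-up headings in the style of the mesh_headings field.
sample = [
    "Humans",
    "Macular Degeneration/*genetics",
    "*Polymorphism, Single Nucleotide",
]
all_terms, major_terms = split_mesh_headings(sample)
print(all_terms)
print(major_terms)
```

Every heading ends up in the first list, and only the starred ones in the second, so the major terms are always a subset of all terms.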

Documentation
How to Use Our Program:
 * Enter a sequence of interest into our web interface in FASTA format and click “Search” (subject to change). The program will then compare your sequence against all known sequences in NCBI BLAST and return an rs number for each unique SNP found (if one exists). Along with the rs number and SNP, details including any known phenotypic associations and relevant references are printed in a paragraph derived from OMIM. These references, review article references from PubMed, and other related disease information from the CDC and news agencies are also returned for further reading on the subject.
How Our Program Works:
 * Our program is written in Python, which, through BioPython, gives access to NCBI BLASTsnp, PubMed, and OMIM as well as other biologically relevant databases. The program enters the given sequence into BLASTsnp and parses the resulting XML output to retrieve the rs numbers of all known SNPs in the sequence that match entries in the database. These rs numbers are then entered into OMIM to search for any phenotypic relevance of the SNP, and all returned entries are parsed to return the allelic variant name, mutation, and description (including relevant references). The rs number and allelic variant name are then entered into NCBI PubMed, and references specific to the allelic variant and to the general phenotypic result are returned for further reading. In addition, MeSH terms are parsed from these article entries and used to search news databases and CDC data. This last search returns relevant links and incidence data.
Potential Applications:
 * Our program can be used for a variety of applications. In particular, it is able to locate single nucleotide polymorphisms (SNPs) in an entered DNA sequence and return any corresponding phenotypic correlations listed in Online Mendelian Inheritance in Man (OMIM). Using the disease correlation, the program will then print a paragraph describing the phenotypes associated with the SNP and return a series of references to related peer-reviewed articles. Currently, it is possible to enter a given disease-correlated SNP and get back information about the disease itself and about the risk of that disease for the individual whose sequence was entered. In the future, with the rise of personal genomics and more expansive databases containing sequence information, such an application will become more useful.
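As an illustration of the OMIM parsing step described under How Our Program Works, the sketch below (Python 3) pulls PubMed IDs out of a small XML string with `xml.dom.minidom`. The element names `Mim-reference` and `Mim-reference_pubmedUID` are the ones our code looks for; the rest of the sample document, including the `Mim-entry` root, the `Mim-reference_number` element, and the ID values, is made up for the example.

```python
from xml.dom.minidom import parseString

# Made-up XML fragment; only the two Mim-reference* element names come from our code.
SAMPLE_XML = """<Mim-entry>
  <Mim-reference><Mim-reference_pubmedUID>12345</Mim-reference_pubmedUID></Mim-reference>
  <Mim-reference><Mim-reference_number>2</Mim-reference_number></Mim-reference>
  <Mim-reference><Mim-reference_pubmedUID>67890</Mim-reference_pubmedUID></Mim-reference>
</Mim-entry>"""


def reference_pmids(xml_text):
    """Return the PubMed IDs of all references that carry one."""
    dom = parseString(xml_text)
    pmids = []
    for ref in dom.getElementsByTagName("Mim-reference"):
        uid_nodes = ref.getElementsByTagName("Mim-reference_pubmedUID")
        if not uid_nodes:
            continue  # this reference has no PubMed ID, so skip it
        text = "".join(
            node.data
            for node in uid_nodes[0].childNodes
            if node.nodeType == node.TEXT_NODE
        )
        if text.strip():
            pmids.append(text.strip())
    return pmids


print(reference_pmids(SAMPLE_XML))
```

Skipping references without a PubMed ID avoids the IndexError that indexing an empty result with `[0]` would raise.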