Harvard:Biophysics 101/2007/Notebook:Michael Wang/2007-3-15

 Step 1 

The first thing I did was identify the ORFs in the sequence. I could have used my ORF function from the last assignment, but I did it manually and found a start codon (ATG) at position 62 and a stop codon (TAA) at positions 83-85, which gives a 7 amino acid ORF (MPPTTTF). Pretty short. The ORF is marked with parentheses in the sequence below.

>example1
CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACG
C(ATGCCACCCACAACAACTTTTTAA)AAGAATCAGACGTGTGAAGGATTCTATTCGAATTA
CTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTC
CGCGGACGCTGCCTTCGTCCAGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGT
ACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCG
CAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCC
GGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC
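The manual scan can also be reproduced with a short script. This is only a minimal sketch and not the ORF function from the last assignment (which is not shown on this page); the names `find_orfs`, `CODON_TABLE` and `EXAMPLE1_PREFIX` are made up for illustration. It assumes the forward strand only, the standard genetic code, and 1-based inclusive coordinates where the end position is the last base of the stop codon. Every ATG with an in-frame stop is reported, so nested ORFs that share a stop codon would each show up.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_aa: int = 1) -> List[Tuple[int, int, str]]:
    """Return (start, end, peptide) for each ATG that has an in-frame stop.

    start is the first base of the ATG and end is the last base of the
    stop codon, both 1-based. Only the forward strand is scanned.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        peptide = []
        for pos in range(start, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOPS:
                if len(peptide) >= min_aa:
                    orfs.append((start + 1, pos + 3, "".join(peptide)))
                break
            # Unknown codons (e.g. containing N) are translated as X.
            peptide.append(CODON_TABLE.get(codon, "X"))
    return orfs


# First 120 bases of example1, with the parentheses removed.
EXAMPLE1_PREFIX = (
    "CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACG"
    "CATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTA"
)

print(find_orfs(EXAMPLE1_PREFIX))
```

On the first 120 bases of example1 this prints `[(62, 85, 'MPPTTTF')]`, i.e. the ORF in parentheses above.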

 Step 2 

This is useless without some comparison, so I ran it through BLAST. It matched a sequence on human chromosome 10 with a single mismatch (one SNP) at query position 201, where my sequence has an A and the reference has a G.

>ref|NT_030059.12|Hs10_30314 Homo sapiens chromosome 10 genomic contig, reference assembly
Length=44617998

Features flanking this part of subject sequence:
   3895 bp at 5' side: hypothetical protein
   425 bp at 3' side: HtrA serine peptidase 1

Score = 787 bits (397),  Expect = 0.0
Identities = 400/401 (99%), Gaps = 0/401 (0%)
Strand=Plus/Plus

Query 1         CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACG  60
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 42968870  CACCCTCGCCAGTTACGAGCTGCCGAGCCGCTTCCTAGGCTCTCTGCGAATACGGACACG  42968929

Query 61        CATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTA  120
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 42968930  CATGCCACCCACAACAACTTTTTAAAAGAATCAGACGTGTGAAGGATTCTATTCGAATTA  42968989

Query 121       CTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTC  180
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 42968990  CTTCTGCTCTCTGCTTTTATCACTTCACTGTGGGTCTGGGCGCGGGCTTTCTGCCAGCTC  42969049

Query 181       CGCGGACGCTGCCTTCGTCCAGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGT  240
                |||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||
Sbjct 42969050  CGCGGACGCTGCCTTCGTCCGGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGT  42969109

Query 241       ACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCG  300
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 42969110  ACCGGGGGCAGAACCAGCGCGTGACCGGGGTCCGCGGTGCCGCAACGCCCCGGGTCTGCG  42969169

Query 301       CAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCC  360
                ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 42969170  CAGAGGCCCCTGCAGTCCCTGCCCGGCCCAGTCCGAGCTTCCCGGGCGGGCCCCCAGTCC  42969229

Query 361       GGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC  401
                |||||||||||||||||||||||||||||||||||||||||
Sbjct 42969230  GGCGATTTGCAGGAACTTTCCCCGGCGCTCCCACGCGAAGC  42969270
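Picking the mismatch out of the alignment by eye is error-prone, so here is a small sketch that does it from a pair of aligned rows. The function name `find_mismatches` is hypothetical, and it assumes an ungapped alignment (true here, since BLAST reports Gaps = 0/401) with the row start coordinates copied from the BLAST output.

```python
from typing import List, Tuple


def find_mismatches(query: str, subject: str,
                    query_start: int, subject_start: int
                    ) -> List[Tuple[int, int, str, str]]:
    """Compare two rows of an ungapped alignment base by base.

    Returns (query_pos, subject_pos, query_base, subject_base) for every
    column where the two rows differ. Positions are 1-based and are
    computed from the start coordinates of the rows.
    """
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    hits = []
    for offset, (q, s) in enumerate(zip(query, subject)):
        if q != s:
            hits.append((query_start + offset, subject_start + offset, q, s))
    return hits


# The only alignment row above that contains a mismatch
# (query 181-240, subject 42969050-42969109).
QUERY_ROW = "CGCGGACGCTGCCTTCGTCCAGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGT"
SBJCT_ROW = "CGCGGACGCTGCCTTCGTCCGGCCGCAGAGGCCCCGCGGTCAGGGTCCCGCGTGCGGGGT"

print(find_mismatches(QUERY_ROW, SBJCT_ROW, 181, 42969050))
```

For this row the script prints `[(201, 42969070, 'A', 'G')]`: an A in the query against a G in the reference contig.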

 Step 3 

This SNP does not fall within the ORF from Step 1, so at first glance there is nothing to be worried about. However, it is worth checking it against OMIM. Searching for HtrA serine peptidase 1 on chromosome 10 gives one SNP:

http://www.ncbi.nlm.nih.gov/SNP/snp_ss.cgi?subsnp_id=16082056

This SNP matches the one from the BLAST comparison. Apparently it increases the risk of age-related macular degeneration, which has been verified in both a Hong Kong population and a Utah-based population. If I were a physician, I would recommend that the patient seek the opinion of an optometrist and that relatives have genetic testing done to see if they are also at risk.

A Python implementation would basically follow these same steps: start with the ORF identifier from the previous assignment, then automate a BLAST search, then run an OMIM query on any features that BLAST identifies. To allow batch comparisons, the program should be able to read in multiple sequences from various files (implemented in previous programs).
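A rough skeleton of that program might look like the following. It is a sketch under assumptions, not the finished implementation: `read_fasta`, `read_fasta_files` and `analyze_batch` are hypothetical names, and the ORF finder, BLAST search and OMIM lookup are passed in as plain callables because their real implementations (and the network calls behind them) are not shown on this page. Only the batch FASTA reading is actually worked out here.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (name, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines.

    Spaces inside sequence lines are dropped and bases are upper-cased,
    so sequences written in blocks of ten bases are handled as well.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.replace(" ", "").upper())
    if name is not None:
        yield name, "".join(chunks)


def read_fasta_files(paths: Iterable[str]) -> Iterator[Record]:
    """Read every record from every file, for batch comparisons."""
    for path in paths:
        with open(path) as handle:
            yield from read_fasta(handle)


def analyze_batch(records: Iterable[Record],
                  find_orfs: Callable[[str], list],
                  blast_search: Callable[[str], list],
                  lookup_annotations: Callable[[object], object]
                  ) -> List[Dict[str, object]]:
    """Run the three steps on each record and collect one report per record.

    The three callables are placeholders for the ORF finder, the BLAST
    search and the OMIM lookup; their signatures are assumptions.
    """
    reports = []
    for name, seq in records:
        hits = blast_search(seq)
        reports.append({
            "name": name,
            "length": len(seq),
            "orfs": find_orfs(seq),
            "blast_hits": hits,
            "annotations": [lookup_annotations(hit) for hit in hits],
        })
    return reports


# Tiny in-memory demo with stand-in steps instead of real BLAST/OMIM calls.
demo_lines = [">seqA", "ACGT ACGT", "acgt", ">seqB", "TTTT"]
demo_records = list(read_fasta(demo_lines))
demo_reports = analyze_batch(demo_records,
                             find_orfs=lambda seq: [],
                             blast_search=lambda seq: ["hit"],
                             lookup_annotations=lambda hit: hit.upper())
print(demo_records)
```

Passing the steps in as functions keeps the file handling testable without network access; the real BLAST and OMIM steps can be swapped in later without touching the batch loop.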

Here's my test sequence:

>ExampleM
CGTGGGCTGC TTCTTTCCCC AGGCGAAGCT CAACTTCCTC CCATTGTTCT GAACCTCTGT
GTGGACATCT TCTTTCTTCA AACGCACCAC GGTAAAATTC TCGCCTGCCT CGAAACCCCG
CCTACCTCTG AGATCTGAGG ACGGATACTA AACGCTGGAC TTAAGGCAAT GTACACATGT
AAGCAGGCTC TGTAGGCACT CACTCCGCCC AGGTGCGCGC GTGGCGGAGG GGGAACAGAG
AAGCAGGACA GCTCTCCATC CTTCCCGTGT TCAGTCGTGG GAGACAACAA GAGAGGTCAC
AGCCTGGCGA CCAAAAAGTG CGGCTAACTT CCCTGCCCAA GCTGACTTTC TCTGCAGGGT
TCAAGGTTAA TTGTGAGGAT TTACATTCGC ATGGCACACC CGCATCCCCC TCTACGTGGA
AATATGTCTT AACTTTCATA ACTGCCTTGC CAGCAGGGTA TTTTTCGCTA GGGGCGAAGC
GTCCTTCGCA AGCCACCCAG CTGACCGGCA G