User talk:Daniel M Jordan

From OpenWetWare
Jump to: navigation, search

User Page

Biophysics 101 Assignments

My Contribution to the Project

I've been a little lax in updating my wiki page (as you can see below and in my edit history), but that doesn't mean I haven't been contributing to the project! Here is my quick attempt to catalogue what I've been doing.

I am a member of the modeling group. You can read about the contributions of the group as a whole on the group's page. As the member of the group with the most biological background, I acted as a sort of biological sanity check on the rest of the group's ideas in the early stages of the project. In the later stages, it became clear that I also had the most experience programming in Python, so I had the clearest idea of what was and wasn't possible in Python. Once coding actually began, I gave some impromptu lessons in Python and generally helped proofread and clean up code.

If I had to name my own material contributions to the project as an individual rather than a member of the modeling group, I would say:

  • I contributed greatly to the formula for the original nonlinear regression model proposed by the modeling group, though we later discarded that particular model in favor of models proposed in literature and found by the biology group.
  • I did most of the debugging of the code to infer models from input data, and pinpointed the problem of having data that fits the model too exactly (briefly documented here).
  • I was largely responsible for the insight that our Python scripts could generate other Python scripts that could be uploaded as models to the infrastructure group's framework.
  • I wrote most of the code to convert the model parameters to a standalone Python script that can be uploaded to Trait-o-matic.

OMIM/SNPedia/GeneTests assignment

We briefly described what we did here. What we actually did was to look up loci implicated in type II diabetes on OMIM, then cross-index them with other disease mutations found in the same genes. A possible way to expand this would be to look also at a resource like STRING to find related genes.

Here is the list of diseases we found to be related to Type II Diabetes, by category:

  • Metabolic
    • Obesity (susceptibility)
    • Insulin resistance
    • Lipodystrophy
    • MODY — types I, II, III, IV, and IX
    • Diabetes mellitus — gestational, permanent neonatal, transient neonatal, ketosis-prone, insulin-dependent
    • Hyperinsulemic hypoglycemia
    • Hepatic lipase deficiency
    • HDL cholesterol level variation
  • Cardiovascular
    • Coronary artery disease (susceptibility)
    • Carotid intimal medial thickness
    • Generalized arterial calcification of infancy
    • Intracranial hemorrhage (susceptibility)
    • Insulin resistance-related hypertension (susceptibility)
  • Neoplastic
    • Glioblastoma (susceptibility)
    • Kaposisarcoma (susceptibility)
    • Hepatic adenoma (susceptibility)
    • Renal cell carcinoma (susceptibility)
  • Syndromic
    • Wolfram syndrome
    • Sensorineural low-frequency hearing loss
    • Renal cysts and diabetes syndrome
  • Autoimmune
    • Systemic juvenile rheumatoid arthritis
    • Chron disease-associated growth failure
  • Developmental
    • Pancreatic agenesis
  • Other
    • Ossification of posterior longitudinal ligament of spine

We haven't specifically tried it, but this same approach would probably work quite well with variants from SNPedia, HGMD, and GeneCards. The procedure is the same for each of them: search for the disease, compile the list of genes that returned hits, then reverse the search and compile the list of diseases for each gene. I don't know to what extent these different sources are in sync with each other -- that is, I don't know how much overlap there would be in the data returned. Maybe it would be ideal to run the search simultaneously on all of these tools, and compile all the results together.

Thoughts on Human 2.0

I think one of the most interesting things that's been mentioned is Bioweathermap. When sequencing is cheap enough that we can routinely produce human genomes, could we create the same kind of database for the human population, or some sample of it? What would we do with it? It could be used for public health purposes, to track the incidence of genetic diseases or the susceptibility of the population to various diseases, but I think a more exciting use would be to track the evolution of Homo sapiens in real time. After all, if we want to be able to control the future of our evolution, we will need some tools for monitoring the course of our evolution. With a large enough (and well-chosen) sample of genomes, we could identify rapidly changing and highly conserved loci. This would tell us about what selection pressures the species is facing, which in turn could suggest directions in both research and public policy. I think it would be interesting to develop some ideas about what information we will be able to extract from a data set like this. A possible short-term goal would be to determine how many genomes with what kind of distribution we would need to get useful information of this sort, and develop a plan for collecting them.

Some references:

  • The current state of the art in monitoring human evolution, which gives us essentially no time resolution: doi:10.1371/journal.pbio.0040072
  • A paper (from a lab I worked in this summer) that uses simulation to address the related question of how many genomes are needed to make interesting discoveries about human traits: doi:10.1073/pnas.0812824106

Python Code for Assignment 3

# encoding: utf-8
(Biophysics 101 homework)
d m jordanus scripsit

from __future__ import division
import random

p53seg  = "cggagcagctcactattcacccgatgagaggggaggagagagagagaaaatgtcctttag"
p53seg += "gccggttcctcttacttggcagagggaggctgctattctccgcctgcatttctttttctg"
p53seg += "gattacttagttatggcctttgcaaaggcaggggtatttgttttgatgcaaacctcaatc"
p53seg += "cctccccttctttgaatggtgtgccccaccccccgggtcgcctgcaacctaggcggacgc"
p53seg += "taccatggcgtagacagggagggaaagaagtgtgcagaaggcaagcccggaggcactttc"
p53seg += "aagaatgagcatatctcatcttcccggagaaaaaaaaaaaagaatggtacgtctgagaat"
p53seg += "gaaattttgaaagagtgcaatgatgggtcgtttgataatttgtcgggaaaaacaatctac"
p53seg += "ctgttatctagctttgggctaggccattccagttccagacgcaggctgaacgtcgtgaag"
p53seg += "cggaaggggcgggcccgcaggcgtccgtgtggtcctccgtgcagccctcggcccgagccg"
p53seg += "gttcttcctggtaggaggcggaactcgaattcatttctcccgctgccccatctcttagct"
p53seg += "cgcggttgtttcattccgcagtttcttcccatgcacctgccgcgtaccggccactttgtg"
p53seg += "ccgtacttacgtcatctttttcctaaatcgaggtggcatttacacacagcgccagtgcac"
p53seg += "acagcaagtgcacaggaagatgagttttggcccctaaccgctccgtgatgcctaccaagt"
p53seg += "cacagacccttttcatcgtcccagaaacgtttcatcacgtctcttcccagtcgattcccg"
p53seg += "accccacctttattttgatctccataaccattttgcctgttggagaacttcatatagaat"
p53seg += "ggaatcaggatgggcgctgtggctcacgcctgcactttggctcacgcctgcactttggga"
p53seg += "ggccgaggcgggcggattacttgaggataggagttccagaccagcgtggccaacgtggt"

standard = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
             'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
             'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
             'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',

             'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
             'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
             'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
             'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',

             'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
             'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
             'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
             'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',

             'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
             'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
             'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
             'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'

complement = {'a' : 't', 't' : 'a', 'g' : 'c', 'c' : 'g'}

# problem 1: determine GC content
print "1."
gc_count = p53seg.count('g') + p53seg.count('c')
total_count = len(p53seg)
percentage = gc_count / total_count * 100
print "GC count: {0}%".format(percentage)

# problem 2: find reverse complement sequence
print "2."
rev_compl_list = [complement[nucl] for nucl in reversed(p53seg)]
rev_compl_str = "".join(rev_compl_list)
print "reverse complement sequence:", rev_compl_str

# problem 3: find protein sequences
print "3."
def protein_sequence(sequence, offset):
    seq_iter = iter(sequence)
    for i in range(offset):
    aa_seq = ""
        while True:
            nuc1 = next(seq_iter)
            nuc2 = next(seq_iter)
            nuc3 = next(seq_iter)
            codon = nuc1 + nuc2 + nuc3
            aa_seq += standard[codon]
    except StopIteration:
        return aa_seq

plus1_sequence = protein_sequence(p53seg, 0) # saved for later comparison
print "+1 frame", plus1_sequence
print "+2 frame", protein_sequence(p53seg, 1)
print "+3 frame", protein_sequence(p53seg, 2)
print "-1 frame", protein_sequence(rev_compl_str, 0)
print "-2 frame", protein_sequence(rev_compl_str, 1)
print "-3 frame", protein_sequence(rev_compl_str, 2)

# problem 4: random mutation
print "4."
def mutate(sequence):
    new_sequence = ""
    mut_count = 0
    for character in sequence:
        new_char = character
        if random.random() < 0.01:
            mut_count += 1
            while new_char == character:
                new_char = random.choice(["a","c","t","g"])
        new_sequence += new_char
    print "made {0} nucleotide mutations".format(mut_count)
    return new_sequence

for i in range(4):
    print "mutation {0}".format(i + 1)
    mutated_sequence = protein_sequence(mutate(p53seg), 0)
    print mutated_sequence
    print "amino acid changes:"
    for j, (mutated, original) in enumerate(zip(mutated_sequence, plus1_sequence)):
        if mutated != original:
            print "{0}{1}{2}".format(original, j + 1, mutated)

print """
In about 1000 nucleotides, we expect about 10 nucleotide mutations.  Many of
these are silent when translated into amino acids, and every mutant typically
has at least one silent mutation.  Premature stop codons are rarer but 
definitely present.  I think the majority of mutants have no premature stops,
but many have 1 or 2.  Higher numbers are rarer.
(I wrote this note based on previous runs of this code; whatever random 
numbers it happens to have drawn this time may disagree with what I said.)

My Observations on Exponentials (Due Tue 15-Sep-2009)

  • The exponentials for values near k = 1 look like what I expect exponentials to look like, including the fact that they are inverted when k < 1 and get steeper as k gets higher above 1. The logistic functions at the same values look like s-shaped curves, also inverted for k < 1 (though there may not actually be an s in that graph, it's difficult to tell) and getting steeper as k gets higher.
  • For k = 3 and above, they start to look less like smooth curves and more like sharp spikes, but still with the same general shape. It's also very difficult to tell the three curves with high k apart by eye. By contrast, the logistic curves at high k have totally different behavior from the ones at low k and are easy to tell apart from each other — they rise up to what would be the top of the s-curve and then start oscillating, with wider oscillations for higher k.
  • When the oscillations are large enough to take the population down to 0, the oscillation stops dead, even though it was at 1 a moment before. This models the fact that a population that's gone extinct will never come back — an obvious conclusion if we're thinking in terms of populations.
  • The exponential function we're modeling seems to be kt, where t is time. The logistic function, according to google, is (1 + e-t)-1, but I can't figure out where the k comes in (which would have made writing it in python easier).