User talk:Kelly Brock

Final Reflections and Contributions
To begin with, allow me to express what a pleasure it has been to take this class. Ever since a high school research job in biophysics and bioinformatics, I have been fascinated by the intersection of biology, physics, and computer science. I want to attend a graduate program in computational/systems biology, and Biophysics 101 provided a great overview of current research fields and allowed me to experience working as part of a team. If possible, I would love to continue working on this project next semester.

For my contribution to the project, I am consolidating and summarizing all of the information from the project into one unified, succinct document, since the biology, modeling, and infrastructure groups still have separate pages and the relevant information has not yet been unified. I edited the wiki for the Project page and added this compendium to the last section (bulletin 12 on the Project talk page). Furthermore, my more recent research has been aimed at finding a way to translate between rs IDs and other forms of identification to further modularize Trait-o-Matic. I'm very sorry I haven't made more progress on this front - I thought the deadline was at the end of finals period, so I was planning to finish this over the weekend. My other tasks over the course of the class have included Python programming, setup (most notably getting a VM node and undergoing HETHR training), and contributing to class and project discussion.

A copy of the summary report is included below:

Final Report Summary
Harvard Biophysics 101: Computational Biology Fall 2009

Expansion of Trait-o-Matic

Introduction
The Human Genome Project and the advent of affordable, large-scale computing both came to fruition at the end of the millennium. The natural next step was to apply the rapidly evolving field of computer science to the enormous amount of new information about the human genetic sequence. This correlated rise of two normally disjoint fields allowed researchers to explore mathematical possibilities not available from wet-lab experiments alone.

Furthermore, the Internet has had a profound impact on society in general, and the same is true for biomedical research. Online databases put an enormous amount of raw information directly at researchers’ fingertips, and Web tools like Trait-o-Matic have evolved to let researchers search for links between genotypes and phenotypes with a single algorithm. Trait-o-Matic was first envisioned and implemented as part of a Harvard undergraduate course on computational biology; the current (Fall 2009) incarnation of the class set out to expand the utilities of Trait-o-Matic as a term project. The overarching aim was to further modularize Trait-o-Matic, with the long-term goal of allowing the algorithm to handle polygenic inheritance.

Methods and Objectives
For these purposes, the class was split into three teams. Although each subgroup was autonomous, overlap between groups, joint meetings, and multiple memberships were far from uncommon. The main goals and accomplishments of each group are listed below.

Biology
The primary objective of the Biology group was to provide the scientific background for expanding Trait-o-Matic and to serve as a link between class aspirations and real-world possibilities. The first part of their role was to conduct an exhaustive literature search for potential examples of epistatic interactions in polygenic traits. After encountering difficulties obtaining authentic data from the Rotterdam study associated with a paper on eye color and gene influence on phenotype (1), the Biology group generated three model data sets for use by the other groups as concrete development tools. Each set consisted of a matrix detailing the presence or absence of heterozygous, homozygous-dominant, or homozygous-recessive single nucleotide polymorphisms associated with eye color, with 1 indicating that a particular combination was present and 0 indicating that it was not. Mathematical relations between matrix values were then used to calculate whether the “subject” in the data set would have blue, brown, or intermediate eye color.

Members: Ridhi Tariyal, Jackie Nkuebe, Anugraha Raman, Anna Turetsky
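As a rough illustration of what such a model data set looks like, here is a minimal sketch in Python. The SNP column names and the scoring rule are invented for illustration and are not the Biology group's actual values:

```python
# Each row is one simulated subject; each column flags whether that
# subject carries a particular genotype at an eye-color SNP
# (1 = present, 0 = absent). Column names below are hypothetical.
columns = ["rs1_hom_rec", "rs1_het", "rs2_hom_rec", "rs2_het"]
data = [
    [1, 0, 1, 0],   # subject 1
    [0, 1, 0, 1],   # subject 2
    [0, 0, 0, 0],   # subject 3
]

def eye_color(row):
    """Toy rule: enough recessive genotypes -> blue; none -> brown."""
    score = 2 * (row[0] + row[2]) + (row[1] + row[3])
    if score >= 3:
        return "blue"
    elif score >= 1:
        return "intermediate"
    return "brown"

print([eye_color(r) for r in data])   # -> ['blue', 'intermediate', 'brown']
```

The point of such toy matrices was simply to give the Modeling and Infrastructure groups a concrete input format to develop against.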

Modeling
The Modeling group, in turn, was tasked with taking information from the Biology group’s efforts and producing a mathematical algorithm to predict phenotypic response. For this purpose, genetic expression was divided into three main categories: continuous, discrete, and binary. A continuous trait can take on any value within a certain range, with height as a primary example. A discrete trait has several independent possibilities, like eye color. Finally, a binary trait has only two possibilities, like the presence or absence of a certain disease. Both logistic and non-linear regression models were developed to predict phenotypic responses. The logistic regression model was implemented in Python for binary traits and had some success; it became the primary model because of its expandability from binary to continuous traits and its higher computational efficiency. Artificial neural networks were also explored as a future way to implement the model, but time and resource constraints made them impractical during this project. Future work includes using neural networks and designing the software to be used with large-scale, genuine data sets.

Members: Ben Leibowicz, Zach Frankel, Alex Lupsasca, Joe Torella, Azari Hossein
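A minimal sketch of the kind of logistic regression model described above, in plain Python. The toy SNP data, learning rate, and resulting weights here are invented for illustration; this is not the Modeling group's actual implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(genotype, weights, bias):
    """P(binary trait present) for a genotype coded as minor-allele counts."""
    z = bias + sum(w * g for w, g in zip(weights, genotype))
    return sigmoid(z)

def train(data, labels, lr=0.5, epochs=200):
    """Plain stochastic gradient ascent on the log-likelihood."""
    weights, bias = [0.0] * len(data[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            err = y - predict(x, weights, bias)
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

# Invented toy data: two SNPs coded 0-2, binary trait label
data = [[0, 0], [1, 0], [0, 1], [2, 1], [2, 2], [1, 2]]
labels = [0, 0, 0, 1, 1, 1]
w, b = train(data, labels)
print(predict([2, 2], w, b) > 0.5)   # high-risk genotype -> True
```

Logistic regression's appeal here is exactly what the report notes: the same weighted-sum-plus-link-function structure generalizes from binary to continuous responses, and training is cheap compared with a neural network.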

Infrastructure
The Infrastructure group's job was to bring technical expertise to the project and to implement everything on the Trait-o-Matic framework. Several freelogy accounts were obtained to enable Trait-o-Matic modification and testing, most notably filip.freelogy.org. Time was spent becoming familiar with the tool’s coding scheme in CodeIgniter and learning how to integrate new implementations with the existing software. Beyond technical improvements, the main achievement was developing a tool to take models and incorporate them into Trait-o-Matic. These models can either be based on the work of the Modeling group or be taken from an individual paper; examples of both were developed and uploaded to the site. We are optimistic about the future of this project and about further modularizing and expanding Trait-o-Matic.

Members: Filip Zembowicz, Alex Ratner, Brett Thomas, Kelly Brock, Mindy Shi
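The actual integration mechanism lives in the Trait-o-Matic codebase, but the idea of plugging a phenotype model into a framework can be sketched as a registry of callables. The registry, function names, and one-line eye-color rule below are hypothetical simplifications (rs12913832 is a real eye-color-associated SNP, though no single-SNP rule is this clean in practice):

```python
# Hypothetical plug-in registry; nothing here is Trait-o-Matic's real API.
MODELS = {}

def register_model(name):
    def wrap(fn):
        MODELS[name] = fn
        return fn
    return wrap

@register_model("eye_color_toy")
def eye_color_model(genotypes):
    # genotypes: dict mapping rs ID -> number of minor alleles (0-2)
    return "blue" if genotypes.get("rs12913832", 0) == 2 else "brown/other"

print(MODELS["eye_color_toy"]({"rs12913832": 2}))   # -> blue
```

The design choice this illustrates is the one the report describes: a model, whether from the Modeling group or transcribed from a paper, becomes a self-contained unit that the framework can call without knowing its internals.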

Results and Conclusion
Overall, this class has made definite strides towards the goal of enabling Trait-o-Matic to model multigenic effects. Using data and information from the Biology group, the Modeling group developed a simplified algorithm to predict certain phenotypes, and the Infrastructure group added this functionality to Trait-o-Matic. The different subgroups did an excellent job of bridging different fields of expertise, ultimately coming together to deliver a final product. However, it must be understood that this project is still in its infancy. We have a great deal of room to expand, with an enormous “wish-list” of tasks ranging from implementing a neural-network model to modularizing our product enough that it will accept genotypic indicators other than rs IDs. In the future, we hope to continue our work and utilize the new biological and technical advances that the future will undoubtedly bring – perhaps even the full implementation of Human 2.0.

Literature Cited
1. Liu, F., et al. “Eye Color and the Prediction of Complex Phenotypes from Genotypes.” Current Biology 19(5): R192–R193 (2009).

Update for Last Class
For the final project, I'm trying to find out how to further modularize the project in terms of OMIM and SNPedia as part of the Infrastructure group. While SNPedia is categorized by rs IDs - meaning its entries are officially registered in the NIH's public database of single nucleotide polymorphisms - OMIM uses a different categorization scheme, classifying its information under six-digit MIM numbers. I'm trying to find out how to translate between the two classification schemes in order to more fully integrate OMIM into our application of Trait-o-Matic, along with studying how different databases could be integrated into the system.

I've found that the OMIM site also has a genetic map displaying where its entries fall in the genome. This suggests there should be a way to translate that positional information into the rs IDs already used.
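Assuming such a positional mapping can be extracted, the translation layer itself would be straightforward - essentially two dictionary hops. The rs IDs, positions, and MIM numbers in this sketch are placeholders, not real database entries:

```python
# Toy lookup tables; real mappings would come from dbSNP and the
# OMIM genetic map. All identifiers below are invented.
rsid_to_position = {
    "rs0000001": ("chr15", 1234567),
    "rs0000002": ("chr9", 7654321),
}
position_to_mim = {
    ("chr15", 1234567): "123456",
}

def rsid_to_mim(rsid):
    """Translate an rs ID to a six-digit MIM number via genomic position."""
    pos = rsid_to_position.get(rsid)
    return position_to_mim.get(pos) if pos is not None else None

print(rsid_to_mim("rs0000001"))   # -> 123456
print(rsid_to_mim("rs0000002"))   # no entry at that position -> None
```

The hard part, of course, is populating the tables, not the lookup itself.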

I'm not sure when the project is due, but barring any complications I plan on figuring this out and implementing it this weekend. The infrastructure group has been talking about meeting over this time span, so we should be able to present a polished final project. I would also like to keep working on this for the remainder of final exam period.

Also, I would definitely be up for helping anybody document their stuff, as a side note.

Yay - a general Project Idea Exists!
If we're looking at gene interactions, I'm not sure how much information is available biologically. The first step would be to decide how we're going to go about the research. Should we look at gene networks themselves and see which metabolites coexist in a certain process, à la flux balance analysis (FBA), or should we go through the literature and make a list of which genes have been shown to interact with which other genes? There will be a lot of biology to go through before we get to the hard coding. That being said, I would rather work on coding details than biology details, although I'm looking forward to doing both!

Thoughts on Class Meeting October 22 2009
I really liked the idea of exploring a genealogy-type experiment where we randomly pair males and females in our database, predict their offspring, and recurse to see how the population would change over time. This tool could help evolutionary biologists test how likely a specific mutation is to still be present after k generations. Furthermore, this project would encompass SNPcupid - we would have to write the functionality to simulate offspring genomes, after all - and it would also give a more research-oriented focus to the project. Then everybody's happy! We could make the different parts available individually so that couples who wanted to learn about any potential time bombs in their potential child's DNA could get a preliminary screening through our program. It's the best of both worlds! The biggest issue, I think, would be computing time, since we would potentially be dealing with an entire database's worth of human DNA sequences, or at least the coding regions.

Flow of program:

Have options for:
- How many generations you want to simulate, how many offspring per couple, etc.
- The parents and any parameters (i.e. just test two, randomly select pairs in the database, how many pairs, etc.)
- How many random mutations are introduced into the genome
- The option to look at the entire genome or just a portion of it
- Could highlight "interesting" portions of the genome that have changed
- Might also get into predicting traits based on genes, another idea that was discussed (i.e. Trait-o-Matic) - could use those handy-dandy databases for that
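The flow above can be sketched as a toy simulation. For simplicity this treats each genome as a single haploid string rather than paired chromosomes, and the population size, genome length, and mutation rate are arbitrary:

```python
import random

random.seed(0)
BASES = "acgt"

def child(mom, dad, mut_rate=0.01):
    """Inherit one base from either parent per site, with rare mutation."""
    out = []
    for a, b in zip(mom, dad):
        allele = random.choice((a, b))
        if random.random() < mut_rate:
            allele = random.choice(BASES.replace(allele, ""))
        out.append(allele)
    return "".join(out)

def simulate(population, generations, offspring_per_couple=2):
    """Randomly pair individuals each generation and recurse."""
    for _ in range(generations):
        random.shuffle(population)
        nxt = []
        for mom, dad in zip(population[::2], population[1::2]):
            nxt.extend(child(mom, dad) for _ in range(offspring_per_couple))
        population = nxt
    return population

pop = ["".join(random.choice(BASES) for _ in range(30)) for _ in range(8)]
final = simulate(pop, generations=3)
print(len(final), len(final[0]))   # population size stays 8, length stays 30
```

Tracking how often a marked allele survives across generations would then just be a matter of counting its frequency in `final` - which is where the computing-time concern for full human genomes comes in.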

Over and out!

Kelly

Assignment 5: 10/13/2009 and 10/20/2009
I'm not sure exactly what the assignment is referring to - do we have to specifically use the databases? Also, I'm not sure if this is due today...

So Anugraha's and my algorithm would give an output something like:

Genes in Common:
AACTG....GTA    location    trait in common
ATGACCT...TA    location    trait in common
...
AGTCA.....AA    location    trait in common

Then we could use this data structure to look up the gene in OMIM. OMIM defines its genes by a six-digit code, so we would first have to find the translation key between a position and the gene it corresponds to. For example, if we identify a sequence associated with susceptibility to malaria, then we would cross-check it against the OMIM database and actively search for that gene to see whether the trait has already been documented. Traits are also listed under six-digit codes, so we would need translation services for that component as well. I am sure that this information - which codes correspond to which locations - is floating around on the internet. Everything's on the internet.
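The output format sketched above maps naturally onto a list of (sequence, location, trait) tuples, with the location-to-code translation key as a dictionary. Every sequence, location, and six-digit code in this sketch is an invented placeholder:

```python
# Hypothetical "Genes in Common" entries: (sequence, location, trait)
genes_in_common = [
    ("aactg...gta", ("chr11", 1234567), "malaria susceptibility"),
    ("agtca...aa", ("chr2", 7654321), "unknown trait"),
]

# Hypothetical translation key: genomic location -> six-digit code
location_to_code = {
    ("chr11", 1234567): "123456",
}

def lookup_codes(entries):
    """Attach a six-digit code (or None) to each gene-in-common entry."""
    return [(trait, location_to_code.get(loc)) for _, loc, trait in entries]

print(lookup_codes(genes_in_common))
```

Entries that come back as None would be exactly the ones worth searching the literature for by hand.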

Update: Actually, when I was looking at GeneCards, the database that links all "known and predicted human genes" and includes disease relationship information, I found that it might actually be a better choice than OMIM. HGMD, a different database, also lists gene locations up front, which could definitely help in cross-correlating between different data sets if one set is incomplete (or help modularize so that individual gene name designations are not as necessary).

For the information part of the assignment:

QTL refers to quantitative trait locus, which describes polygenic traits and how each gene might contribute to the overall phenotype; epistasis is "the interaction between genes" - more precisely, when the effect of one gene depends on, or is masked by, another. I didn't change anything on the wiki - I figured my classmates pretty much had it covered, and I didn't want to be a source of incorrect knowledge. For the record, our algorithm can - conceivably, at least - work with epistasis!
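For the record, here is a tiny example of what epistasis means computationally - one locus masking another, in the style of the classic coat-color examples. The rule below is illustrative, not from any real data set:

```python
# Toy epistasis: locus B masks locus A entirely when no dominant
# B allele is present. Genotypes coded as counts of dominant alleles.
def phenotype(a_dominant_alleles, b_dominant_alleles):
    if b_dominant_alleles == 0:      # bb blocks pigment regardless of A
        return "white"
    return "black" if a_dominant_alleles > 0 else "brown"

print(phenotype(2, 1), phenotype(0, 2), phenotype(2, 0))   # black brown white
```

A purely additive model can never capture that third case, which is why simple per-SNP scoring breaks down for epistatic traits.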

Is this what the assignment's asking for?

Kelly

Assignment 4: 10/08/09
Anugraha and I met and talked about implementing the Phenomenal Pheno-matic, a program to generate hypotheses based on data. We could search our databases for genes that were overexpressed to try to find correlations between different gene types and their phenotypes. This would involve use of OMIM, GeneTests, SNPedia, and PGP, and would be a good interdisciplinary project. For the longer, more detailed description, please see the edited Project talk page.

Assignment 3. Brainstorming Human 2.0
Looking through OMIM, GeneTests, and SNPedia was definitely an adventure - I've never felt so paranoid about my own genes before! (Unless you include that time I went to the fourth grade in enormous bell-bottom pants with cartoon characters drawn on them.) I think the most surprising gene from SNPedia was Rs3057, which has been linked to having perfect pitch.

If we want to create a Human 2.0, I wonder if we should start by focusing on artistic abilities. Are painters more likely to have a certain mutation, for example? In my MCB80 course, we talked about how a not-insignificant number of famous painters have had poor depth perception - maybe seeing the world as flat helps them translate their visions onto flat paper. We could document musicians with sequenced genomes and try to find correlations between their genotypes and musical phenotypes. As a side project, we could also study how big a role the environment plays in determining somebody's artistic abilities by seeing how dissimilar the genes are.

Alternatively, another cool project would be to use genetic information to determine a person's risk for becoming addicted to things like nicotine and alcohol. We could do this by refreshing a list of genes currently thought to be associated with these diseases and implementing a fast search algorithm to go through a given genome. Gaining large data sets and comparing our predicted results with reality would be an interesting assignment as well.
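The fast search part could be as simple as a hash lookup per called SNP, so the scan stays linear in the size of the genome file. The risk rs IDs below are placeholders, not real association results:

```python
# Sketch: screen a genome's called SNPs against a curated risk list.
# These rs IDs and trait labels are invented for illustration.
risk_snps = {"rs111": "nicotine dependence", "rs222": "alcohol dependence"}

def screen(called_genotypes):
    """called_genotypes: dict mapping rs ID -> genotype string, e.g. 'AG'."""
    hits = {}
    for rsid, genotype in called_genotypes.items():
        if rsid in risk_snps:            # O(1) dict lookup per SNP
            hits[risk_snps[rsid]] = genotype
    return hits

print(screen({"rs111": "AG", "rs999": "CC"}))   # -> nicotine-dependence hit
```

Keeping the risk list refreshed from current literature would be the real maintenance burden; the lookup itself is trivial.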

Assignment 2. Python Epicness
I organized the Python code by questions 1, 2, 3, and 4 as indicated in the comments. I redirected the output stream into a txt file, which is what I turned in on Thursday for my answers. This was by far my favorite homework assignment this week! I like Python as a language because it seems like a really good mix of C, Scheme (especially the lists and dictionaries), and Matlab. As far as the experiment itself goes, part 4 was the most interesting to me because it modeled actual mutations instead of providing intrinsic data about the sequence. I hope I did the "get 1% and randomly mutate them" like y'all wanted!


# Kelly Brock
# File: BiophysAsst3P1.py
# Answers the four parts of Assignment 3

import random

# For part 4, when we have to do multiple experiments
TRIALS = 6

# Input genetic sequence into memory
sequence  = "cggagcagctcactattcacccgatgagaggggaggagagagagagaaaatgtcctttag"
sequence += "gccggttcctcttacttggcagagggaggctgctattctccgcctgcatttctttttctg"
sequence += "gattacttagttatggcctttgcaaaggcaggggtatttgttttgatgcaaacctcaatc"
sequence += "cctccccttctttgaatggtgtgccccaccccccgggtcgcctgcaacctaggcggacgc"
sequence += "taccatggcgtagacagggagggaaagaagtgtgcagaaggcaagcccggaggcactttc"
sequence += "aagaatgagcatatctcatcttcccggagaaaaaaaaaaaagaatggtacgtctgagaat"
sequence += "gaaattttgaaagagtgcaatgatgggtcgtttgataatttgtcgggaaaaacaatctac"
sequence += "ctgttatctagctttgggctaggccattccagttccagacgcaggctgaacgtcgtgaag"
sequence += "cggaaggggcgggcccgcaggcgtccgtgtggtcctccgtgcagccctcggcccgagccg"
sequence += "gttcttcctggtaggaggcggaactcgaattcatttctcccgctgccccatctcttagct"
sequence += "cgcggttgtttcattccgcagtttcttcccatgcacctgccgcgtaccggccactttgtg"
sequence += "ccgtacttacgtcatctttttcctaaatcgaggtggcatttacacacagcgccagtgcac"
sequence += "acagcaagtgcacaggaagatgagttttggcccctaaccgctccgtgatgcctaccaagt"
sequence += "cacagacccttttcatcgtcccagaaacgtttcatcacgtctcttcccagtcgattcccg"
sequence += "accccacctttattttgatctccataaccattttgcctgttggagaacttcatatagaat"
sequence += "ggaatcaggatgggcgctgtggctcacgcctgcactttggctcacgcctgcactttggga"
sequence += "ggccgaggcgggcggattacttgaggataggagttccagaccagcgtggccaacgtggtg"

# Part 1 - CG content
print "Kelly Brock\n"
print "Biophysics 101 Asst 3\n"
print "Part I\n\n"

# Variable to keep track of how many c's and g's we've encountered
count = 0

# Check each character in our genetic string
for i in range(0, len(sequence)):
    if (sequence[i] == 'g') or (sequence[i] == 'c'):
        count += 1

# Compute the fraction of total characters equal to c or g
answer = count * 1.0 / len(sequence)
print "CG fraction is: " + str(answer) + "\n"

# Part 2 - Find reverse complement
print "\nPart II\n"

# Make a list holding the reversed sequence
RevSeq = list(sequence[::-1])

# Change all values to their complements
for i in range(0, len(RevSeq)):
    if RevSeq[i] == 'c':
        RevSeq[i] = 'g'
    elif RevSeq[i] == 'g':
        RevSeq[i] = 'c'
    elif RevSeq[i] == 't':
        RevSeq[i] = 'a'
    elif RevSeq[i] == 'a':
        RevSeq[i] = 't'

# Recast our sequence back into a string
RevSeq = "".join(RevSeq)
print "Reverse Complement Sequence"
print RevSeq

# Part 3 - Determining protein sequence
print "\nPart III\n"

# Hardcode the codon-to-amino-acid dictionary
standard = {
    'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
    'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
    'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
    'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',

    'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
    'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
    'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
    'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',

    'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
    'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
    'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
    'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',

    'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
    'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
    'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
    'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'}

def proteinabbr(sequence, RevSeq, posneg):
    # Find the one-letter protein abbreviations for each reading frame.
    # sequence is the forward genetic sequence, RevSeq is the reverse
    # complement, and posneg indicates whether we want both strands (2)
    # or just the three positive frames (1).
    protein = []     # abbreviations for the proteins in the current frame
    totalprot = []   # accumulated translations for every frame
    # Top loop chooses the + frames (l == 0) or the - frames (l == 1)
    for l in range(0, posneg):
        # Use the + frames with the normal sequence
        if l == 0:
            sign = " + "
            seq = sequence
        # The second time through, use the reverse complement sequence
        else:
            sign = " - "
            seq = RevSeq
        # There are 3 possible reading frames for each strand
        for m in range(0, 3):
            print "\nFrame" + sign + str(m + 1)
            # Go through each triple in our frame
            for i in range(m, len(seq), 3):
                # Prevents an error if the frame is not evenly divisible by 3
                if (i + 2) < len(seq):
                    # Look up the codon in the dictionary and append it
                    protein.append(standard[seq[i:(i + 3)]])
            # Print the result as a string and clear the list
            print "".join(protein)
            totalprot.append(protein)
            protein = []
    return totalprot

# Store the translation of the original, unmutated sequence
original = proteinabbr(sequence, RevSeq, 2)

# Part 4
print "\nPart IV\n"

# Repeat the mutation experiment for a certain number of trials
for j in range(0, TRIALS):
    # List of random positions to mutate - 1% of the total length
    mutspot = random.sample(range(0, len(sequence) - 1), len(sequence) / 100)

    # Make a mutable copy of the sequence
    mutseq = list(sequence)

    # Possibilities for mutation for each character
    A = ['c', 'g', 't']
    C = ['a', 'g', 't']
    G = ['a', 'c', 't']
    T = ['a', 'c', 'g']

    # Choose another nucleotide for each spot assigned a mutation
    for i in mutspot:
        if mutseq[i] == 'a':
            mutseq[i] = random.choice(A)
        elif mutseq[i] == 'c':
            mutseq[i] = random.choice(C)
        elif mutseq[i] == 'g':
            mutseq[i] = random.choice(G)
        else:
            mutseq[i] = random.choice(T)

    print "\nMutated Protein Sequence for Frames +1,2,3 in Trial " + str(j)
    print "\nMUTATED SEQUENCE"
    print "".join(mutseq)

    # Translate the mutated string into 3-frame protein abbreviations
    mutprotseq = proteinabbr("".join(mutseq), RevSeq, 1)
    print "\nNumber of immature stop codons: "

    # Go through each reading frame we computed
    for k in range(0, 3):
        # Count how many amino-acid changes were introduced
        countmut = 0
        # Count how many new stop codons (*) appear relative to the
        # original sequence
        countstop = 0
        # Compare each position within the frame
        for l in range(0, len(mutprotseq[k])):
            # Is it a mutation?
            if mutprotseq[k][l] != original[k][l]:
                countmut += 1
                # Did it introduce a new stop codon?
                if mutprotseq[k][l] == '*':
                    countstop += 1
        print "\nThe total number of protein mutations was " + str(countmut)
        print "Of these, " + str(countstop) + " were incorrect stop codons."

Assignment 1. Python and Excel
I'm currently having technical difficulties getting Python to run - it doesn't want to recognize the matplotlib or numpy   libraries. However, I did complete the Excel graphs - with increasing k for the first equation, the function values also increased as expected, resulting in different endpoints of the curve. For the second equation, the curve decreased to very negative numbers, like a reflection of a normal exponential curve. For the third graph, I got that all values were zero since (k* e^x * (1-e^x)) would always be <= 0, and the max would automatically choose zero.