User talk:Brett Thomas


Thoughts on a data model

I've been thinking about what type of data the Personal Genome Project could collect, and what an ideal researcher-facing tool would look like. I put down some thoughts on this page: Harvard:Biophysics_101/2009:Data_Plan. I don't think this will be too useful to others, but it was really helpful for me to organize my thoughts and gave me a better (perhaps) idea of what our end goal could be.

Brett's Impassioned Plea

I think that helping improve the mechanism for GWAS within the PGP will be more valuable toward discovering gene interactions than the proposed synthetic approach.

Case Study: I think a VERY common mistake with software design is not considering the end use at every step. So, I want to start with a case study of a problem that would be great to solve. Suppose it's 2012 and we have 100,000 genomes. One of you has a kid (i.e. not me) and we want to know what the probability is that they develop late-onset Alzheimer's - and, more importantly, what you can do to prevent it. Read up on genetics/Alzheimer's [here].

  1. APOE is one gene that definitely has some correlation, particularly the A4 allele. But there are clearly other genes, too, and almost certainly environmental factors.
  2. So in 2012 we want trait-o-matic to output the probability of Alzheimer's. How would its ability to do this improve under the current project, or under my proposal?
  3. Current project: We'd scan the databases to find all genes associated with proteins that are involved in the process, then find out which "make sense" to cause AD given our model. I think this is pretty hard...
  4. Under my project: We'd say "20% of these people had late-onset AD, let's run a correlation to see what alleles might be associated." Say we get 100 hits. Researchers could investigate each of the genes and figure out if they make sense.
    1. How this relates to our project: If two genes are related to the cholesterol pathway, we can infer that it is likely that they work together...and then take on the task of figuring out why. I imagine that in a vast majority of cases, multiple genes will need to be considered.
    2. BUT - multiple phenotypes are also important. What if said pathway is that "people with APOE4 are much more likely to get AD if they have physical damage to their head, so don't let your kid box or play football"? I think it would take much longer to discover this prediction by inferring metabolic pathways. It would be easier to say "90% of people with APOE4 AND a physical athletic background get AD." This is a basic statistical analysis!
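The "basic statistical analysis" above can be sketched in a few lines of Python. The participant records, gene, and exposure here are entirely invented for illustration; a real analysis would also need significance testing.

```python
# Hypothetical sketch: does the APOE4 + head-impact group show an
# elevated AD rate? All participant records below are made up.

# Each record: (has_apoe4, contact_sport_history, developed_ad)
participants = [
    (True,  True,  True),
    (True,  True,  True),
    (True,  True,  False),
    (True,  False, False),
    (False, True,  False),
    (False, False, False),
    (True,  False, True),
    (False, True,  True),
]

def ad_rate(records, apoe4, contact_sport):
    """AD rate among participants matching both criteria."""
    group = [ad for (a, c, ad) in records if a == apoe4 and c == contact_sport]
    return sum(group) / len(group) if group else 0.0

# Compare the gene+environment group against a baseline group
risk_both = ad_rate(participants, apoe4=True, contact_sport=True)
print(f"AD rate, APOE4 + contact sports: {risk_both:.0%}")
```

The point of the sketch: once genotype and phenotype observations live in one queryable dataset, this kind of cross-tabulation is trivial.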

A note about how this relates to our current project:

  1. The current project relies on three leaps:
    1. Gene to protein
    2. Protein to metabolic pathway
    3. Metabolic pathway to disease (environment goes here!)
  2. I think that, in essence, the ability of our current proposal to be successful will depend on a robust instance of the tool I am proposing.
  3. Because of this I think it will be easier to create a useful map from phenotype to gene given the databases we have now.
  4. With the current map, false negatives will be prevalent. We aren't going to discover metabolic pathways, as Joe has argued. Supporting evidence: if it were possible to look at the genes associated with the proteins in the metabolic pathway for HDL cholesterol production and infer the cause, somebody would have already done this.
  5. False positives will need to be tested with my proposed mechanism. Otherwise, there is no way to quantify error.
  6. With my proposal, false negatives on collected phenotypes will be nearly impossible.
  7. False positives can be quantified with an error prediction
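To make the false-positive point concrete, here is a small Python sketch. The numbers (a 5% threshold, 100 candidate alleles) are just illustrative, but the arithmetic is the standard multiple-testing argument: scanning many candidates guarantees some spurious hits unless you correct for it.

```python
# If no allele is truly associated, a scan at significance level alpha
# over n independent tests still expects roughly alpha * n false hits.
# Bonferroni correction tightens the per-test threshold to compensate.

def expected_false_positives(alpha, n_tests):
    """Expected spurious hits under the null hypothesis for every test."""
    return alpha * n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold keeping the family-wise error rate at alpha."""
    return alpha / n_tests

# Scanning 100 candidate alleles at alpha = 0.05:
print(expected_false_positives(0.05, 100))   # ~5 hits by chance alone
print(bonferroni_threshold(0.05, 100))       # much stricter per-test cutoff
```

This is the "error prediction" idea in point 7: the false-positive burden is quantifiable as a function of the number of hypotheses scanned.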

October 13th: My Thoughts On The Project

High Level: I keep returning to the question of how we can use artificial intelligence and user community input to make a gene expression tool self-learning. I think this would add another layer of analysis that traditional gene databases are missing.

The Problem(s): The central mechanism that each of these tools is trying to improve is the following: 1) identify a person's gene; 2) identify what non-genomic information can affect how this gene is expressed; 3) give the person information about future probabilities of outcome. It seems that the resources we've seen try to map step 1) to step 3) and on the whole do a poor job accounting for step 2).

This is actually a combination of two problems. The first is that information is not personalized enough. Companies like 23andme and Navigenics provide the above diagnostic tool, but it doesn't seem that they ask a person about their lifestyle. They give a bucket of information of the form: "You have gene 1 out of 3. If you have lifestyle A, you'll be susceptible to outcome X; and if you have lifestyle B, you'll be susceptible to outcome Y." It would be better if they provided targeted information of the form: "[First ask user their lifestyle] -> Since you have lifestyle B, you'll be susceptible to outcome Y. If you switch to lifestyle A, you'll transfer to outcome [somewhere between X and Y]" This is a subtle difference, and in these simple examples it doesn't seem important. But as research improves and environmental information becomes more targeted, I propose that users will begin to demand more and more targeted information.
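Here is a toy Python sketch of the "ask lifestyle first" report described above. The genotypes, lifestyle categories, and risk numbers are all made up; the point is only the shape of the interaction.

```python
# Generic bucket report vs. personalized report: the personalized version
# asks for lifestyle first, then conditions the risk estimate on it.
# All entries in this table are invented for illustration.

RISK_TABLE = {
    # (genotype, lifestyle) -> outcome probability (hypothetical)
    ("variant_1", "sedentary"): 0.30,
    ("variant_1", "active"):    0.12,
    ("variant_2", "sedentary"): 0.08,
    ("variant_2", "active"):    0.05,
}

def personalized_report(genotype, lifestyle):
    """Report the user's current risk and what switching would change it to."""
    current = RISK_TABLE[(genotype, lifestyle)]
    other = "active" if lifestyle == "sedentary" else "sedentary"
    alternative = RISK_TABLE[(genotype, other)]
    return (f"With lifestyle '{lifestyle}', your risk is {current:.0%}. "
            f"Switching to '{other}' would change it to {alternative:.0%}.")

print(personalized_report("variant_1", "sedentary"))
```

The generic report would dump the whole table; the personalized one returns one conditional sentence, which is exactly the difference argued for above.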

How can research improve to make this information more targeted? That reveals the second problem: analyzing observational data is crucial. It seems there are two ways to figure out a gene/environment relationship: 1) do lab research to figure out the chemical mechanism that's happening in the cell; or 2) do a population study to figure out a correlation and try to determine cause and effect. It seems to me that (2) is much more promising in the near term. But the academic research method is insufficient - it'd take decades to acquire all the information we want through academic papers of the sort referenced in SNPedia.

The Solution(s): So the two problems are: 1) Need more targeted information; and 2) Need a way to expedite gene expression research. I propose that these two problems can be solved together by making a gene expression engine that is self-learning on the environmental observations of users.

The mechanism would work as follows:

  • User's genes are stored in a database along with their observations
  • User is given a secure web portal to record observations when they are triggered
  • Researchers can query this database for anonymized data through an API. This already improves the value of research.
  • In addition to researchers, an engine is built on top of the database that mines data for potentially statistically significant relationships. The standard model is to map two observational data pieces, and see if any genes seem to affect the outcome. For example, the engine could map an obesity question to a diet question, and scan the gene database to see if any genes cause diet to affect obesity differently. Statistically, this engine can follow the logarithmic sample size -> potential outcomes to test relationship that we saw in the last lecture.
  • What data is collected? A research expert is assigned to manage what information is requested from what users. Over time, researchers (and maybe users?) can request new information when they think it could be relevant. It is key that this information be targeted. Going on the previous example, a targeted diet question can be asked of all obese people that identifies diets high in salt/protein/etc. Then, researchers and the engine can go to town trying to identify more causal relationships.
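The mining engine in the bullets above could be sketched like this in Python. The gene names, records, and scoring rule are invented for illustration; a real engine would use proper statistical tests rather than raw rate differences.

```python
# Toy mining engine: for each gene, check whether the environment->outcome
# association differs between carriers and non-carriers (an interaction).
# Records and genes are made up for illustration.

def rate(records, gene, carrier, env):
    """Outcome rate among records matching a genotype and environment."""
    rows = [r["outcome"] for r in records
            if r["genes"][gene] == carrier and r["env"] == env]
    return sum(rows) / len(rows) if rows else 0.0

def interaction_score(records, gene):
    """How much carrying the gene changes the env effect on the outcome."""
    effect_carriers = (rate(records, gene, True, True)
                       - rate(records, gene, True, False))
    effect_others = (rate(records, gene, False, True)
                     - rate(records, gene, False, False))
    return abs(effect_carriers - effect_others)

records = [
    {"genes": {"g1": True,  "g2": True},  "env": True,  "outcome": True},
    {"genes": {"g1": True,  "g2": False}, "env": True,  "outcome": True},
    {"genes": {"g1": True,  "g2": True},  "env": False, "outcome": False},
    {"genes": {"g1": True,  "g2": False}, "env": False, "outcome": False},
    {"genes": {"g1": False, "g2": True},  "env": True,  "outcome": False},
    {"genes": {"g1": False, "g2": False}, "env": True,  "outcome": False},
    {"genes": {"g1": False, "g2": True},  "env": False, "outcome": False},
    {"genes": {"g1": False, "g2": False}, "env": False, "outcome": False},
]

# Rank candidate genes by interaction strength for researchers to vet
scores = {g: interaction_score(records, g) for g in ("g1", "g2")}
print(sorted(scores, key=scores.get, reverse=True))  # prints ['g1', 'g2']
```

In this toy data, the environment only predicts the outcome among g1 carriers, so g1 surfaces as the candidate worth investigating, which is the researcher-vetting loop described above.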

Our Project: Obviously this is way beyond the scope of a final project, but instead a vision of how these things will work in the long run. How could we make something useful and/or insightful from it? I think we can start small and build one of these components. More to come...

Relationships to Current Resources:

Asst. 3: Project Ideas

The concept: One common theme in the resources we've looked at is linking DNA data to personal data. One of the lessons I took from the last discussion was that dynamic data collection in the PGP would allow an important new layer of data analysis. I returned to this idea after looking at the gene identifier sites assigned this week - they were trying to link traits and genes by working around what I see as the most basic way to do this: looking at people's genes and then asking them if they have a certain trait.

Within the Personal Genome Project, I think such a mechanism would work as follows: researchers propose that a certain gene is associated with a certain trait. Researchers pose the question so it can be mapped to a discrete data set, and then send the questions to a targeted set of PGP-ers to get responses.

Implementation: I think this could be implemented as an extension to the PGP site. I think it'd take three infrastructure pieces:

  • Researcher facing engine: a platform that allows researchers to create questionnaires and specify which users they'll go to. Will also email users to say "we want to ask you another question."
  • User facing: a secure site for users to log in to quickly answer questions. Could be an app on the PGP site or standalone, depending on which aligns with the current rules.
  • Data: an extension to the current PGP data storage system to store data that is collected. It could either be directly integrated or a separate relational database with linked tables.
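For the "separate relational database with linked tables" option, here is one possible shape, sketched with Python's sqlite3 module. All table and column names are invented; the only point is the questionnaire-to-response link back to a PGP participant record.

```python
# Hypothetical schema sketch: questionnaires created by researchers,
# responses linked both to a questionnaire and to a PGP participant id.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questionnaires (
        id INTEGER PRIMARY KEY,
        researcher TEXT,
        question TEXT
    );
    CREATE TABLE responses (
        id INTEGER PRIMARY KEY,
        questionnaire_id INTEGER REFERENCES questionnaires(id),
        participant_id TEXT,   -- links back to the PGP genome record
        answer TEXT
    );
""")
conn.execute("INSERT INTO questionnaires (researcher, question) VALUES (?, ?)",
             ("researcher_x", "Do you have a history of contact sports?"))
conn.execute("INSERT INTO responses (questionnaire_id, participant_id, answer) "
             "VALUES (1, 'PGP-0001', 'yes')")

# Join responses back to the question text, as a researcher query would
row = conn.execute("""
    SELECT q.question, r.participant_id, r.answer
    FROM responses r JOIN questionnaires q ON r.questionnaire_id = q.id
""").fetchone()
print(row)
```

An anonymizing API layer would sit between researchers and these tables, stripping or bucketing `participant_id` before results leave the system.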


  • Accounting for privacy: I think privacy would definitely be the biggest obstacle, particularly if we allow data to be cross tabulated, as many PGP-ers would be easy to uniquely identify.
  • What data to collect: I think the most important initial research would be to identify exactly what data researchers would want to collect. That work
  • API: I think a natural extension is to provide an API for the public to use. This would allow other gene sites to submit a (user, gene, trait) triplet. This would be an extensive undertaking, but may be worthwhile if such a service doesn't already exist.
  • Third party platforms: Another thought is that we could take advantage of a third party platform to create a quick app, like Google Health, Healthvault, or (the company that I worked for this summer) Keas.

Asst. 2: Modelling Gene Mutations

Here is a link to my code. It's pretty long (I did way too much), so I didn't want to clutter this page.

Asst. 1: Modeling Exponential Growth

I have some experience with python and excel, so the programming part of this asst wasn't very time consuming for me. I'm just going to throw out a few random notes:

  • The Model I was actually pretty confused by this model. These functions are some variant of current pop * constant factor. It seems like a more appropriate general model would be current pop raised to the power of a constant factor. I just realized this a couple minutes ago; I'm sure I'll reconcile the difference before class.
  • Slide Without thinking, I copy/pasted Dr. Church's functions into excel as written. Then when I did the coding, I took it to mean linear growth with A2 representing the independent variable and A3 representing the output. This was dumb...and made the python coding like 10X harder too :)
  • Practicality When I actually understood what we were doing, I was able to analyze the biological component. In short: I really don't think exponential growth is a very practical model on either a population or evolutionary scale.
  • Population It seems that there have to be thousands of feedback loops when analyzing growth in a population. In the rabbit example, the true growth was probably only exponential for a short time before food enforced a negative feedback. On the other hand, if the first rabbits crowded out competitors, that would have caused a positive feedback. The more I think about such examples, the more I think that exponential growth is more a corner case than a model.
  • Evolution Exponential growth makes even less sense to me when discussing evolutionary progress, because it seems evolution of a species would "conquer" the lowest hanging fruit first. What I mean by this is that increases in brain size that were most effective probably came first, and then brain evolution would become subject to diminishing returns. If brain size is an indicator of progress, this would contradict the hypothesis from Slide 10: one has to be wrong.
  • Evolution vs. Technology Since I'm skeptical of the exponential model of evolution, the analogy to Moore's Law becomes more interesting. Why should evolutionary vs. technological innovation be different? I wish I had the time to give this more thought, and hope we can in class today. One idea is that the pressures are different: transistor technology is measured absolutely, whereas in evolution a relative advantage is probably more important than an absolute advantage. Another is that it is more difficult for evolution to adjust the fundamental building blocks of a species, while Intel can easily switch from silicon to graphite transistors if they can be abstracted to the same old x86 standards.
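Going back to the Model bullet above: the two candidate growth rules behave very differently, and a quick Python comparison makes the difference concrete. The function names and starting values are mine.

```python
# Two candidate per-generation growth rules from the notes:
#   1) multiply the population by a constant factor (classic exponential)
#   2) raise the population to a constant power (double-exponential blowup)

def multiplicative(pop, factor, steps):
    """pop * factor each generation, i.e. pop_n = pop_0 * factor**n."""
    for _ in range(steps):
        pop = pop * factor
    return pop

def power_law(pop, exponent, steps):
    """pop ** exponent each generation, i.e. pop_n = pop_0 ** (exponent**n)."""
    for _ in range(steps):
        pop = pop ** exponent
    return pop

print(multiplicative(2, 2, 10))   # 2 * 2**10 = 2048
print(power_law(2, 2, 10))        # 2**(2**10), astronomically larger
```

The multiplicative rule is what's normally called exponential growth; the power rule grows double-exponentially, so the two models diverge almost immediately, which may be the difference worth reconciling before class.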