Harvard:Biophysics 101/2009:Data Plan

I've been thinking a lot about what types of data the Personal Genome Project could collect, and how we could make that data useful for researchers, so I've put some of my thoughts here. Below is an attempt at categorizing the data, then a sketch of a query language we could use to draw conclusions from the data once it is collected, and finally some high-level observations of mine. I'm not sure whether this will be useful for the project, but I've learned a ton thinking about this stuff and would be curious to hear any thoughts you have.

Types Of Data
Here is an attempt to categorize the different types of data we could collect from a representative participant:
 * 1) Genome sequence: Obviously we will have a full genome. See the section on primary/derived attributes for why I think this could be kind of complicated.
 * 2) Physical Profile: Basic information, e.g. height, weight, hair and eye color
 * 3) Health History: Health-related information, e.g. blood pressure, blood type, whether they have had cancer
 * 4) Medical History: Types of medicines used. Different from health history because it focuses on treatments
 * 5) Psychological Profile: Personality test data, SAT scores
 * 6) Lifestyle data: Stress level of job, whether they play physical contact sports
 * 7) "When I was growing up" data: Separate section for history of upbringing. Though this could also be decomposed into the other categories, I think people treat this as a fundamentally different data set. Critical for nature/nurture research. Stuff like school history, whether they lived with both parents, income level growing up.

Organizing This Data
I think these data pieces can be classified like this:
 * 1) Type: 1) Numeric; 2) Boolean; or 2A) Discrete (things like race that people think of as "which of the above" but that can be reduced to a set of boolean attributes)
 * 2) Duration: A data piece represents either a one time event (heart attack) or continuous characteristic (blood pressure)
 * 3) Primary or Derived: A data piece can be either a primary attribute, which is collected directly from the participant, or a derived attribute, which is calculated from one or more primary attributes. Here is what I mean by a primary attribute:
 * 4) Primary Genotype Attributes: The only "primary" genotype attribute is the actual sequence data, stored in 2-bit format.
 * 5) Primary Phenotype Attributes: The data in each of the above categories are primary attributes themselves.
 * 6) Derived Attributes: Important: I think that derived attributes can be calculated from both genotype and phenotype data, as well as from other derived attributes. Here's what I mean:
 * 7) Derived attributes that come from genotype alone include alleles, but also the length of a promoter region, the amount of junk DNA, etc. The key is that each of these is nothing more than an analysis tool: it can be derived with no information other than the underlying sequence.
 * 8) More interesting derived attributes are those that include both genotype and phenotype data. An example is heart attack risk. An ideal risk model likely considers both genotype and phenotype.
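
To make the classification above a bit more concrete, here is a rough Python sketch of how attributes might be represented. Everything in it is an assumption made for illustration: the class and function names, the placeholder motif, and especially the `toy_risk` rule, which is invented and is not a real risk model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Union

Value = Union[float, bool, str]
Record = Dict[str, Value]


class Kind(Enum):
    NUMERIC = "numeric"
    BOOLEAN = "boolean"
    DISCRETE = "discrete"  # "which of the above"; reducible to booleans


class Duration(Enum):
    EVENT = "event"            # one-time event, e.g. heart attack
    CONTINUOUS = "continuous"  # ongoing characteristic, e.g. blood pressure


@dataclass(frozen=True)
class Attribute:
    name: str
    kind: Kind
    duration: Duration


@dataclass(frozen=True)
class DerivedAttribute(Attribute):
    # Computed from a participant's record rather than collected directly.
    compute: Callable[[Record], Value]


def discrete_to_booleans(name: str, value: str, options: List[str]) -> Dict[str, bool]:
    """Reduce a discrete value to one boolean attribute per option."""
    return {f"{name}={option}": value == option for option in options}


def evaluate(primary: Record, derived: List[DerivedAttribute]) -> Record:
    """Add derived attributes in order, so later ones can build on earlier ones."""
    record = dict(primary)
    for attribute in derived:
        record[attribute.name] = attribute.compute(record)
    return record


# Genotype-only derived attribute. The motif is a placeholder, not a real marker.
carries_motif = DerivedAttribute(
    "carries_motif", Kind.BOOLEAN, Duration.CONTINUOUS,
    lambda r: "GATTACA" in r["sequence"],
)

# Mixed derived attribute. The scoring rule is invented for illustration only.
toy_risk = DerivedAttribute(
    "toy_risk", Kind.NUMERIC, Duration.CONTINUOUS,
    lambda r: float(r["carries_motif"]) + float(r["systolic_bp"] > 140),
)

participant = {"sequence": "ACGATTACAGG", "systolic_bp": 150.0}
record = evaluate(participant, [carries_motif, toy_risk])
```

Note that `toy_risk` reads `carries_motif`, which is itself derived, so the order of the list passed to `evaluate` matters. A real system would probably want to track those dependencies explicitly.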

Research Query Language
I think that all GWAS studies can essentially be broken down into research on new derived attributes. A basic "research query" involves taking a known model and trying to improve it. Suppose you are investigating the relationship between heart disease risk and athletics. There is already a heart disease risk model that incorporates the current "conventional wisdom" about the causes of heart disease, and it doesn't include athletics. You basically want to run a regression that finds the correlation between the current heart disease risk derived attribute and whatever measure you have for athletics.


 * 1) So the basic query is a correlation: CORR(Attribute 1, Attribute 2)
 * 2) Attributes can be numeric, discrete or boolean, so we have to find a way to measure correlation between different types.
 * 3) In an API, some other queries will be helpful:
 * 4) % of participants with a given attribute
 * 5) % similarity between two genomes
 * 6) % similarity with reference genome
 * 7) I'm sure there are others...
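
Here is a minimal Python sketch of what a few of these queries could look like. The function names are hypothetical, and I'm making two simplifying assumptions: booleans are coded as 0/1 so that a plain Pearson correlation works across numeric and boolean attributes (a discrete attribute would first be split into one boolean per option), and "% similarity" is just position-by-position identity between two sequences that are already aligned and of equal length. Real genome comparison would need an alignment step first.

```python
from math import sqrt
from typing import Sequence, Union

Number = Union[float, bool]


def corr(xs: Sequence[Number], ys: Sequence[Number]) -> float:
    """Pearson correlation; booleans are treated as 0.0 / 1.0."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    a = [float(x) for x in xs]
    b = [float(y) for y in ys]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when an attribute never varies")
    return cov / sqrt(var_a * var_b)


def percent_with(values: Sequence[bool]) -> float:
    """Percentage of participants for whom a boolean attribute is true."""
    if not values:
        raise ValueError("no participants")
    return 100.0 * sum(bool(v) for v in values) / len(values)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and the same length")
    return 100.0 * sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)


# Made-up data: a boolean attribute against a numeric one.
plays_sports = [True, False, True, False]
risk_score = [3.0, 1.0, 3.0, 1.0]
r = corr(plays_sports, risk_score)
share = percent_with(plays_sports)
identity = percent_identity("ACGT", "ACGA")
```

Comparing against a reference genome would be the same `percent_identity` call with the reference as one of the two arguments.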

My Two Cents: Some Random Observations

 * 1) I think phenotype data should be collected as time-series data (all data tagged with a date), for two reasons:
 * 2) Identifying causal relationships in phenotype data may require identifying changes in phenotype. Suppose the secret of how the genome causes height is the length of a promoter region that turns on during puberty. The correlation between promoter length and adult height would be interesting. But the correlations between promoter length and duration of puberty, total height increase in puberty, and the age at which puberty started seem like they would be more valuable.
 * 3) As society develops more robust medical records, this data will be easier to get. In 18 years, every student graduating high school will probably have a profile of their height each year from birth that could be linked directly to the PGP.
 * 4) When it comes to research value added, I actually think that there is NOT a clear dichotomy between genotype and phenotype data. Here's what I mean, using the example of Alzheimer's. Suppose we use the PGP to test how likely you are to develop late-onset Alzheimer's given two factors: 1) whether you have APOE E4; and 2) whether you are a football player.
 * 5) Regardless of our conclusions, a regression of Alzheimer's on whether someone played football would be fascinating.
 * 6) For complicated phenotypes (heart attack risk, height) there are likely so many genes and so many phenotypes playing a causal role that the distinction becomes trivial.
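
As a rough illustration of the time-series idea in the first few points above, here is a Python sketch where every phenotype value is stored with a date, and change-based attributes are then derived from the dated series. The names are hypothetical and the heights are made up. The "fastest growth interval" is only a crude stand-in for something like the timing of puberty, not a clinical definition, and it assumes no two observations share a date.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    attribute: str
    when: date
    value: float


def series(observations: List[Observation], attribute: str) -> List[Observation]:
    """All observations of one attribute, oldest first."""
    return sorted(
        (o for o in observations if o.attribute == attribute),
        key=lambda o: o.when,
    )


def total_change(obs: List[Observation], start: date, end: date) -> float:
    """Change in value between the first and last observation in a date window."""
    window = [o for o in obs if start <= o.when <= end]
    if len(window) < 2:
        raise ValueError("need at least two observations in the window")
    return window[-1].value - window[0].value


def fastest_growth_interval(obs: List[Observation]) -> Tuple[date, date]:
    """Consecutive pair of dates with the largest increase per day."""
    if len(obs) < 2:
        raise ValueError("need at least two observations")
    first, second = max(
        zip(obs, obs[1:]),
        key=lambda pair: (pair[1].value - pair[0].value)
        / (pair[1].when - pair[0].when).days,
    )
    return first.when, second.when


# Made-up yearly height measurements in cm, deliberately out of order.
observations = [
    Observation("height_cm", date(2002, 1, 1), 155.0),
    Observation("height_cm", date(2000, 1, 1), 140.0),
    Observation("height_cm", date(2003, 1, 1), 158.0),
    Observation("height_cm", date(2001, 1, 1), 145.0),
]
heights = series(observations, "height_cm")
gain = total_change(heights, date(2000, 1, 1), date(2003, 1, 1))
spurt = fastest_growth_interval(heights)
```

The point is that `gain` and `spurt` are themselves derived attributes, so they could be correlated against a genotype-derived attribute like promoter length in the same way as any single-value phenotype.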