Harvard:Biophysics 101/2009:Modeling

Documentation PDF // 12-18 (Alex L)
Please note that I have made written in Latex a synthesis of the modeling group's work, which amounts to a Documentation of our software and model. It can be found on my talk page or directly here:

Documentation & Final Report Hope you enjoy! -- Alex L :)

(Note that this is in the process of being integrated with this page! The code section has already been merged with the PDF.)

Note on Documentation // 12-11
As I posted on my talk page I am starting to use this page to store documentation of the work the modeling group has done - I would like to encourage all members of the modeling team to collaboratively edit the documentation I started below - this means (Also, after we finish the documentation someone please delete this post from the page) I have started a skeleton with sections I believe we should include so lets all fill them in, and if there are sections to add by all means do so - also if you have any comments on documentation but do not want to make explicit changes to the entire documentation below, please post thoughts/ideas with your name within this section. Cheers!-- Zach

Motivation / Project Overview
We proceed to explain the underlying motivation of the project of modeling, what our goals were and how we implemented them.

The Modeling Group's task, as part of the class project, was to produce software that would take as input information about a person's genotype, and that would then output a prediction about the person's phenotype.

This initial idea later evolved in several ways. Perhaps the most important one is the form taken by the output, which is actually not a prediction, but rather a model for making predictions. The major advantage of this approach is that these models need only be generated once (when the data is first obtained), and then other software such as Trait-o-Matic can call upon the models to make trait predictions based on available genotypic information. The upshot of all this is the high computational efficiency and modularity of the code.

Overview
We proceed to provide an overview of the types of traits we wish to model. In the course of the project, we encountered various types of phenotypic data, each which is best modeled in a certain way. Below we detail each of these types, provide examples of them, and explain the demands they put on modeling relative to each other. Methods described here were researched in various sources including an introduction to generalized linear models

Continuous(observable)
Continuous data in general is any data or information that is measurable on a continuous scale. In the context of something observable we could consider measuring height as a continous trait, as we could measure the height of all subjects to high precision. However, in practice, most supposedly continuous data is expressed in discrete terms. For instance, we round heights to the nearest inch or weights to the nearest pound. Thus we end up with discrete data, even if it contains many values. Continuous data has different demands in terms of modeling. In particular, it does not lend itself well to a logistic regression and instead may require non-linear methods to approach it. There is some evidence that artificial neural networks may be useful in dealing with such data.

Discrete/Nominal
Nominal classifications of data are discrete separations of data points based on non-quantitative classifications - for instance the measurement of eye color as blue, brown, or hazel. Often there is more quantitative data underlying such information ( to continue with the eye-color example one could consider pigment counts in the eyes ) - however there are methods for approaching this kind of data, particularly when ti is ordinal - ie. when the classifications have a natural order (such as young, middle-aged, old).

Binary
A binary or dichotomous variable is a nominal categorization where there are only two categories. For instance, contracting a disease or not contracting a disease, alive or dead, male or female. This too intersects with ordinal data as there can be an inherent order to such classifications.

Appropriate methods for each type
Based on our research, we have methods to approach intersections of these data types. In particular we consider combination of explanatory and response variables and provide the theoretically viable methods for each. The first entry corresponds to the explanatory, and the second corresponds to the response variable (taken from Ch 1. of an introduction to generalized linear models).


 * 1) Binary-Binary: logistic regression or log-linear models
 * 2) Binary-Nominal/Discrete: Contingency tables and log-linear models
 * 3) Binary-Continuous: t-tests
 * 4) Nominal-Binary: Generalized logistic regression and log-linear models
 * 5) Nominal-Nominal: Contingency tables and log-linear models
 * 6) Nominal-Continuous: Analysis of variance
 * 7) Continuous-Binary: Dose-response models including logistic regression
 * 8) Continuous-Binary: Multiple regression

Types of Models (and evaluations)
We provide brief descriptions of some of the models we both considered and implemented. In each section you will find both a description of the model as well as an evaluation of its strengths and weaknesses.

Linear Regression
Linear regression is the most basic modeling method in which one considers a model of the form $$ y = X\beta + e$$ ie. it is modeling the relationship between a response and a vector of explanatory variables by fitting a linear equation for the response to observed explanatory data. For a more detailed review please see this explanation. Unfortunately, this model while simple does not provide a strong framework from which to approach questions of genetic interaction. First, the relationship between genetic factors and phenotype is clearly nonlinear (particularly in cases of dichotomous results such as disease risk) and second it has no means of measuring interactions.

Logistic Regression
Logistic regression is a parametric statistical method that is often used in the genetic study of diseases. One of its advantages is that it can test for both genetic and environmental predictors, which it relates to via the logit function. It is also particularly well suited for binary or dichotomous responses - ie. whether or not one has a genetic disease. Moreover, it has been proven effective in case-control studies. However, it is not without its disadvantages. First, it performs very poorly when faced with dimensionality problems (an issue we will certainly face as we consider more and more factors) and it also is prone to false positives. Finally, it has shown weaknesses in detecting interactions of various effects. For further detail please refer to [[Media:snpanalysis.pdf | this paper]] on it use in genetic epidemiology and these notes for a rigorous mathematical description. We implemented this regression method and found some limited success, however at the time of this documentation we had not yet received an appropriate training data-set. Note if you are dealing with non-dichotomous response variables then one can consider the method of ordinal regression where you consider orders to the various nominal categories and hence can take probabilities of a response being less than a certain threshold (and accordingly summing probabilities) or one can consider the multinomial method where you consider relative probabilities of one response or another but no absolutes (though starting with one absolute probability others can be calculated).

Non-linear Regression
This method is completely analogous to the linear regression except we no longer constrain the explanatory relationship to be linear but can instead consider anything. This has the clear advantage of allowing us to specify possible interactions on which to regress. For example see the original non-linear model we produced see User_talk:Zachary_Frankel#Thoughts_on_Today.2FWork_.2F.2F_11-15. However, the unfortunate drawback of this approach is that it is computationally very intensive and requires enormous datasets. Moreover, note that the logistic regression models are implicitly non-linear regressions. In particular we are moving to a linear domain by a suitable transformation mechanism and thus allow for non-linear interactions even though we solve the model in a linear fashion.

Computer Learning and Neural Networks
First, note that all models are a form of computer learning - namely we take dynamic datasets and use them to train the model on and hence this is a form of learning. However, a more explicit form of this is the use of artificial neural networks to recognize patterns in observed genomic data. In these models we use input layers(usually just one), hidden layers, and an output layer. Here a layer is something composed of nodes, and each connects to the subsequent layer. These connections have assigned weights from which subsequent values are calculated. This is called a feed-forward structure because information flows from the input to the output layer through the nodes of the hidden layer. In this process we use explanatory variable data for the input layer. In training the network adjusts weights to reduce error as much as possible. A clear strength of this approach is that this method is more general than any particular model as we can use it to construct equivalent models. For instance, a neural network with a logistic transfer function on one hidden layer is the same as our logistic regression. Additionally, it does not suffer from the same dimensionality problems as the logistic regression. This approach is also more flexible and likely more accurate in general. Moreover, it is better able to detect gene-gene interactions. However, there is not extensive literature on this method but we plan to investigate it more in the future.

Project Code / Implementation (integrated with PDF documentation)
In this section you will find complete documentation, as well as links to source code on all of the programming we implemented in the context of the project.

Overview
Our software implementation model fulfills, at least partially, the initial goal we had set for ourselves: given a model-building set of individuals, it generates the vector I=(p_1,p_2,...,p_k) of learned probabilities and outputs a model in a form which is understood by Trait-o-Matic (as per the Infrastructure Group's specifications). In turn, that model takes in an arbitrary genotype vector x and predicts the likelihood (x . I) of that individual expressing phenotype y.

Walkthrough
Our implementation was done in Python, which is a very flexible and powerful high-level language. It was chosen for its ease of use and modularity -- indeed, we tried to make our code highly readable (by including plenty of comments) and easily modifiable (by implementing different aspects of the model in different modules).

As such, our program consists of three parts: logregoutput.py, dataread.py and ols.py. The function of each of these parts will now be described; as to the code itself, it shall not be examined in this documentation (we reiterate that is heavily commented and should be easy to read, and understand).

The observed data (i.e. the set of all vectors d corresponding to the individuals in the model-building set) is written into a simple .csv file. This stands for ``comma-separated values'' and is the simplest format for storing arrays of information such as the one from the preceding section. It is also recognized by Microsoft Excel, which makes it easy to share and modify. A sample data set testdata.csv would thus take the following form: where snp} is the name of SNP 1 and represents the binary value of x1, and so on. Similarly, phenotype is the name of the binary trait y under consideration.

Now, the main routine is logregoutput.py. It can be called at the command line with a data file such as the above as argument: for instance, provided that a file testdata.csv of the above type is located in the same directory as the program, one would run the software with the command The program will then call the routine dataread.py to read in the data and produce the previously described array, together with the associated probability vector P. logregoutput.py will then proceed to compute the logit of the probabilities, and then will call on ols.py to actually perform the linear regression. Note that this file was not written by us; it is a standard library that is made freely available online.

Finally, logregoutput.py will print to screen the following output:

rsid:snp1 rsid:snp2 rsid:snp3 This is an automatically generated model file created by logistic regression on the SNPs listed above. from numpy import * from sys import argv genotypes = argv[1:] genotype_list = [1] + [int(genotype) for genotype in genotypes] genotype_array = array(genotype_list) PVALS GO HERE pvals = array([ 0.00589634, 0.00456349,  0.27948336,  0.0756931 ]) val = -1*inner(genotype_array, pvals) risk = 1/(1+exp(val)) print risk <\code>

Output and interface with Trait-o-Matic
An output of the type seen above is itself a model. Its format obeys the specifications passed on to us by the Infrastructure Group, and is understandable by Trait-o-Matic. Note that the name of SNP 1, snp1 becomes an rsid tag for Trait-o-Matic; this allows for the automation of the genomic information retrieval process.

Another advantadge is that this type of output can also be easily generated by humans. Furthermore, should a human wish to modify the output of our model (in the case that our automated output suffers from an unforeseen problem, for instance), then they could easily do so in a text editor.

Finally, the model can be run either in Trait-o-Matic, or directly at the command line. Assuming that the above output model was stored in a file model.py, we could compute the likelihood of an individual expressing phenotypic trait y as follows: supposing that they have genotype x=(1,0,1) (that is, they have SNPs 1 and 3, but not SNP 2), we would call python model.py 1 0 1 <\code>

This would output something of the form 0.521524920396 <\code>

which indicates a probability of 52\% of expressing the trait.

Future Direction / Other Ideas
Here we discuss various ideas we generated throughout the course which we did not have either the time or resources to pursue as well as a discussion of various directions which we would like to take the project in the future.

Methods of Training
We see the issue of training data as one of the major hurdles for the project and just as important as the models themselves. In this regard we would long-term like to integrate the use of PGP-type data in training these models automatically rather than requiring uploaded CSV files from biologists. While currently there is not enough data available to allow for meaningful training in this way we can imagine a process in which
 * 1) Many people have their genomes sequenced and also have a phenotypic profile taken of themselves
 * 2) This information is stored in a publicly (or researcher) accessible database where genomes match up with traits
 * 3) Whenever a model is being trained for predictive purposes it identifies the SNPs relevant in the genomes of the database as well as the corresponding phenotype, and then uses this for training
 * 4) A model is outputted based on this training data

Such a process seems far more powerful in the long term than manually curated datasets on individual studies.

Future Models to Include
We plan to include variants of the logistic regression. In particular we need to provide options for predictions of continuous traits by automating the process of threshholding. These threshold logistic regressions would be flexible enough for continuous and multi-valued nominal traits. In this regard we would implement a multinomial logistic regression in addition to an ordinal one. We also would like to include a Fuzzy C-Means Clustering model, a strong Neural Networks model, and a Classification Tree model.

Parental Origin of SNPS
A recent paper in Nature entitled [[Media:parental.pdf | parental origin of sequence variants associated with complex diseases]] presents a compelling case for the importance of parental origin of SNPs in determining their effect on complex traits such as disease risk. The examples they present include SNPs that are protective when inherited from your mother but are harmful when inherited from your father. For more on this see Zach's on the subject. In this regard we would like to build into the project in the future capabilities to find such associations. This first requires datasets which include parental origin - finding such data would be facilitated by genotyping entire families and looking at the children for such aspects. We could then generalize our methods to consider maternal and paternal SNPs as separate variables and all of our infrastructure would still work for them. Thus we would have
 * 1) New datasets with parental information to facilitate training
 * 2) Prediction would require submitting familial infromation (parental sequences perhaps) but would not require a strong difference from our current model in terms of mathematical infrastructure.

Explanatory Methods
The focus of the project so far has been on predicting phenotype based on multiple genes, but has not focused at all on understanding these interactions. That is to say, we treat interaction as a black box in our models and attempt to determine what will come out. However, from a biological perspective (in particular if considering treatments) we are interested in how genes interact and what the nature of their epistasis is. This question is addressed in this [[Media:epipaper.pdf | paper]] which is further explored in Zach's post. We would like in the future to develop methods as in the paper to determine what interactions actually are (which is what we attempted to do in our first failed non-linear approach). Thus we could display information on the trait-o-matic page on how various genes interact and perhaps even visualize it. This owuld be a useful tool for researchers attempting to understand various pathways or for people seeking to understand what is actually happening biologically when they see their traits.

Lessons Learned
In this project we learned a great deal not only about statistics and biology, but also about the way they interact. Some key points we would like to pass on to anyone continuing this project or looking into something similar


 * 1) Explicit non-linear regressions, while ideal in terms of explanatory power, are not computationally feasible and require datasets which are too large. In particular, the explanatory power we wish to find from such a model exceeds the information going into it.
 * 2) Logistic Regression and Neural Networks are the two most promising means of predicting phenotypes based on genotypes. Neural networks in the long term are likely preferable given their flexibility and superior ability to deal with interactions of genes and avoid false positives ( a problem we found in test data sets for the logistic regression )
 * 3) It is important to approach these problems after considering biological nuances, and different types of data will demand different sorts of models. An appreciation of the biological intricacies is necessary for setting up a framework to distinguish between such things.
 * 4) Large and realistic data sets are essential.  Without them, more complicated methods of model inference can fail to see even associations that we find obvious.  This includes the presence of noise in simulated data sets — logistic regression actually failed to infer several simple models because of the absence of noise in our data set.

Update // 12-6 (Zach)
Just finished implementing the basics of the logistic regression in python. A nice feature is that this is all modular so we can reuse the code quite easily even if our model changes a bit or the way we wish to interface is different. We will make sure to provide a rigorous discussion of why we chose to use the model we did. Clearly this regression is best for binary traits but it is nonetheless a nice start, I believe. So as of now we have


 * 1) A model for continuous traits with explanatory power - this is not implemented but we will provide a discussion of its merits, notes on its implementation, and possible future plans in our writeup
 * 2) A logistic regression model for binary traits - I will post this code and explicate a bit further in my writeup
 * 3) A clear extension of the logistic regression into threshholded traits

Hello World // 12-4
Hello all - this is Zach. I have been keeping track of my work on my talk page/the project page but to better organize our efforts leading up to the end of hte course I wanted to make sure we have a page for modeling! I'll post some stuff on here soon, but I wanted a centralized hub.