User talk:The Biology Group

Final Progress and Contributions
Please see the main-page.

From Meeting on November 9th
Our discussion yielded three ‘models’ for reverse engineering datasets, based on various observations and readings. These are by no means meant to be exhaustive, but they should provide a good starting point for the toy datasets the model needs to be able to deal with. We broke these down into three initial groups:

1. Complete Epistasis/Dominance - in various settings, one gene completely dominates (epistatically) the other. For example, baldness is completely epistatic to whether or not one has a widow’s peak. Certainly, this type of epistasis should be recognized by the model. As such we have the following model for eye pigment. Consider genes x and y, with x being whether or not you have eyes and y being a pigment score between 0 and 1, i.e. x ∈ {0, 1}, y ∈ (0, 1]. We claim the relationship between these two in determining your trait is x * y. That is, if x is 0 you have no eyes no matter what y is, whereas if x is 1 you get whatever score y gives you. For simplicity (in consultation with the math team) we would only consider 10 discrete eye colors, at .1 through 1.0.

2. Multiplicative effects – here we consider an additional gene for eye color, call it ‘z’, which can divide the amount of pigment one has. In particular it takes the values 1, .5, .3333, or .25, and the relationship is x*z. In other words, the model must recognize the same relationship as in 1, but in a setting where neither value can be zero.

3. Additive effects – Here we consider an effect which is completely additive. The example we decided on was inspired by the paper on height posted by Prof. Church. We wish to consider genes a and b which regulate height. For simplicity, let a and b each lie in (-10, 10) and map to the number of inches above or below the mean height for your group. Suppose that a is linked to leg length and b is linked to torso size. Their effects then combine, and we wish to recover the relationship a+b as the final trait determinant.

Clearly there is more work to be done, but we want to make sure the modeling can handle these cases and that its success can be demonstrated on reverse-engineered data.
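To make the three toy models concrete, here is a minimal Python sketch of a generator for reverse-engineered data. The function names, the dictionary layout and the random sampling are all assumptions made for illustration, not something agreed with the math team. One further assumption: the notes write model 2 as x*z, but since x can be 0 in model 1, the sketch multiplies z by the pigment score y so that neither factor is ever zero.

```python
import random

# Ten discrete pigment levels, .1 through 1.0, as proposed for model 1.
PIGMENT_LEVELS = [k / 10 for k in range(1, 11)]
# Values the pigment-dividing gene z may take in model 2.
Z_VALUES = [1.0, 0.5, 0.3333, 0.25]


def complete_epistasis(x, y):
    """Model 1: x (0 or 1) completely masks the pigment score y."""
    return x * y


def multiplicative(pigment, z):
    """Model 2: product of two factors, neither of which is zero."""
    return pigment * z


def additive(a, b):
    """Model 3: inches above or below the group mean height."""
    return a + b


def make_toy_dataset(n, seed=0):
    """Return n (genotype, traits) pairs drawn at random (hypothetical layout)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        genotype = {
            "x": rng.choice([0, 1]),
            "y": rng.choice(PIGMENT_LEVELS),
            "z": rng.choice(Z_VALUES),
            "a": rng.uniform(-10, 10),
            "b": rng.uniform(-10, 10),
        }
        traits = {
            "eye_pigment": complete_epistasis(genotype["x"], genotype["y"]),
            # Assumption: the non-zero first factor is taken to be y.
            "divided_pigment": multiplicative(genotype["y"], genotype["z"]),
            "height_offset": additive(genotype["a"], genotype["b"]),
        }
        rows.append((genotype, traits))
    return rows


if __name__ == "__main__":
    for genotype, traits in make_toy_dataset(3):
        print(genotype, traits)
```

A prediction method could then be handed only the genotype and trait columns and scored on whether it recovers x*y, the non-zero product, and a+b.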

Questions for the Math Modeling Folks
3. How will the language of our program account for varying susceptibility loci among different racial groups?

4. Upstream of the black box of the prediction, one can imagine that each of the genes we're analyzing is undergoing some epistasis within the pathways it occupies. How do we quantify this and account for it in the prediction?

5. As with (4), in making a numerical, quantitative assertion in our prediction we may run into the problem of pathway redundancy in risk assessment: [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0002663 Sanjuan et al.] show that, in small networks with multifunctional nodes, lack of redundancy, and absence of alternative pathways, epistasis is antagonistic on average. In contrast, lack of multi-functionality, high connectivity, and redundancy favor synergistic epistasis. They suggest that the former kind of epistasis is often present in simple organisms (viruses and bacteria), but that the latter is present in higher eukaryotes. They also observed that the effect of single mutations decreased with increasing pathway connectivity or redundancy, and that this trend was accompanied by a shift from multiplicativity to synergistic epistasis. So the point is that we need to think about pathway/genetic redundancy...

Possible Test Traits...
1. Rheumatoid Arthritis:  Investigating the viability of genetic screening/testing for RA susceptibility using combinations of five confirmed risk loci

2. Psoriasis: Identification of a novel psoriasis susceptibility locus at 1p and evidence of epistasis between PSORS1 and candidate loci

Epistasis--what it means, what it doesn't mean...
Cordell Epistasis 2002

Questions to Consider thus Far
1. Two-loci modeling, three-loci, four-loci... How complex do we want to get? (Answer: every potentially deleterious locus)

2. How do we model compensatory protein effects due to SNPs and their downstream effect on phenotype/risk assessment? (Potential answer: Question of Machine Learning)

Disease Investigations for Method Validation
Type 2 Diabetes: A late-onset disease that may be of interest, as it is both polygenic and includes behavioral/environmental risk. Janssens and van Duijn point out that, rather than being predictive, genes contributing to heart disease and diabetes can lead to behavioral changes that aim to lower the risk of developing the disease [1].

Prior to this, Weedon et al. showed that having multiple allele copies increases risk in accordance with a multiplicative model [2] (this type of statistical information can be used in affirming the effectiveness of our modeling). However, other studies such as here and here found that lifestyle/phenotypic factors and family history were more predictive than genetics of whether someone would actually develop diabetes.
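As a rough illustration of what a multiplicative model means here, the sketch below multiplies per-allele odds ratios across loci, raising each to the number of risk-allele copies carried. The function name, locus names and odds-ratio values are hypothetical placeholders and are not taken from Weedon et al.

```python
def combined_odds_ratio(risk_allele_counts, per_allele_odds_ratios):
    """Multiply per-allele odds ratios, one factor per risk-allele copy.

    risk_allele_counts: dict of locus name -> copies carried (0, 1 or 2).
    per_allele_odds_ratios: dict of locus name -> odds ratio per copy.
    """
    combined = 1.0
    for locus, count in risk_allele_counts.items():
        combined *= per_allele_odds_ratios[locus] ** count
    return combined


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    odds_ratios = {"locus_A": 1.2, "locus_B": 1.5}
    counts = {"locus_A": 2, "locus_B": 1}
    print(combined_odds_ratio(counts, odds_ratios))
```

Under this model a person carrying no risk alleles has a combined odds ratio of 1, and each extra copy scales the odds by the same factor regardless of what is carried at other loci.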

As a side-note, I was slightly amused that a Google Scholar search for "highly predictive polygenic disease" turns up zero hits. Hopefully this will change in the years to come...





Type I Diabetes: From the interacting chromosomal regions explored by Bergholdt et al., the WDR1, LMO7, HNRPLL and RPS15A genes are potential T1D candidate genes. These genes are involved in transcriptional regulation, DNA binding, RNA binding, ion channel activity, ATP synthesis, actin binding, natural killer cell mediated cytotoxicity and cell proliferation. Other networks with TNFA (a gene proposed to be essential to the onset of T1D because of its locus near HLA) include genes involved in signal transduction, regulation of transcription, protein biosynthesis and folding, histone activity, ubiquitin-protein ligase activity, as well as response to oxidative stress, also of potential relevance in T1D pathogenesis.

In their analysis, Bergholdt et al. obtained protein interaction data from the databases BIND, MINT, IntAct, KEGG annotated protein-protein interactions (PPrel), KEGG Enzymes involved in neighboring steps (ECrel) and Reactome proteins involved in the same complex, indirect complex or same or indirect reaction. These databases could be useful to us in potentially determining SNP effects on protein-protein interactions, and we may be able to incorporate their method of visualization in our own program.

Here's their paper: Integrative analysis for finding genes and networks involved in diabetes and other complex diseases

SNPedia lists these 12 SNPs (from a recently published large, multi-lab consortium's effort, [PMID 17554300]) as essential to the onset of T1D:
 * rs6679677
 * rs9272346
 * rs11171739
 * rs17696736
 * rs12708716
 * rs2639703
 * rs17388568
 * rs2544677
 * rs17166496
 * rs2104286
 * rs11052552
 * rs2542151
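A first step toward using this list could be simply reporting which of the 12 SNPs are typed in an input genome. The sketch below assumes the genome arrives as a dictionary mapping rsID to a genotype string; that input format, the function name and the sample genotypes are assumptions for illustration. It does not assess risk, since the risk alleles are not recorded in these notes.

```python
# The 12 rsIDs listed above.
T1D_SNPS = [
    "rs6679677", "rs9272346", "rs11171739", "rs17696736",
    "rs12708716", "rs2639703", "rs17388568", "rs2544677",
    "rs17166496", "rs2104286", "rs11052552", "rs2542151",
]


def find_listed_snps(genotypes, panel=T1D_SNPS):
    """Return the panel SNPs present in `genotypes`, in panel order.

    genotypes: dict of rsID -> genotype string (hypothetical input format).
    """
    return {rsid: genotypes[rsid] for rsid in panel if rsid in genotypes}


if __name__ == "__main__":
    # Made-up genotypes, only to show the call.
    sample = {"rs2104286": "AG", "rs0000001": "CC"}
    print(find_listed_snps(sample))
```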

Other Projects for the Trait-O-Matic Add-on
Modeling SNP Interactions
It's clear from the reports listed below that the combined effect of several SNPs varies on a case-by-case basis, and that simple epistasis models (Bliss independence, Loewe additivity), combined with reasonable assumptions about typical epistatic interactions, may not be a practical way to approach gene interactions.
 * Cumulative Association of Five Genetic Variants with Prostate Cancer
 * Combination of Multiple Genetic Risk Factors Is Synergistically Associated With Carotid Atherosclerosis in Japanese Subjects With Type 2 Diabetes
 * Proof of principle of potential clinical utility of multiple SNP analysis for prediction of recurrent venous thrombosis
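For reference, Bliss independence predicts the combined fractional effect of independent factors as one minus the product of the unaffected fractions, which for two factors is Ea + Eb - Ea*Eb. The sketch below computes that expectation and the excess of an observed effect over it; the function names are made up for illustration, and Loewe additivity is not covered.

```python
def bliss_expected(effects):
    """Expected combined effect under Bliss independence.

    effects: iterable of fractional effects, each between 0 and 1.
    """
    unaffected = 1.0
    for effect in effects:
        if not 0.0 <= effect <= 1.0:
            raise ValueError("each effect must lie between 0 and 1")
        unaffected *= 1.0 - effect
    return 1.0 - unaffected


def bliss_excess(observed, effects):
    """Observed minus expected; positive values suggest synergy."""
    return observed - bliss_expected(effects)


if __name__ == "__main__":
    print(bliss_expected([0.2, 0.5]))
    print(bliss_excess(0.7, [0.2, 0.5]))
```

A large positive or negative excess in real data would be one sign that a simple independence model is not adequate for that set of SNPs.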

Looking into Pharmacogenetics


 * I found this cool article, Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics, that would really help out with the pharmacogenetics project :D


 * If we look at table 1, we find a list of polymorphisms of genes important to drug metabolism, and how they would affect different phenotypes. We could start immediately by scanning the genomes entered as input for these specific polymorphisms, and thus be able to readily spew out a phenotype.


 * Perhaps in order to make our searching method more efficient, we could first look for genes involved in the greatest number of pathways, such as CYP3A4, look for mutations in those, and then work our way from most common to least common. It is nice that in this picture we can start looking at genes in terms of frequency.


 * Another interesting find in this article was that pharmacogenetic polymorphisms differ in frequency among ethnic and racial groups. So now we would know to include these as a primary criterion when we choose to look at external factors.
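The search order suggested in the bullets above could be sketched as follows: sort a panel of known polymorphisms by how many pathways each gene takes part in, then report the ones found in the input genome. Everything except the gene name CYP3A4 is a placeholder here; the variant IDs, pathway counts and phenotype labels are hypothetical and are not values from table 1 of the article.

```python
from dataclasses import dataclass


@dataclass
class PanelEntry:
    gene: str
    variant_id: str       # placeholder identifier, not a real variant name
    pathway_count: int    # hypothetical number of pathways for the gene
    phenotype: str        # placeholder phenotype label


def scan_panel(observed_variants, panel):
    """Return panel entries found in the input, most-connected genes first."""
    ordered = sorted(panel, key=lambda entry: entry.pathway_count, reverse=True)
    return [entry for entry in ordered if entry.variant_id in observed_variants]


if __name__ == "__main__":
    # Hypothetical panel, for illustration only.
    panel = [
        PanelEntry("GENE_B", "variant_b", 2, "placeholder phenotype B"),
        PanelEntry("CYP3A4", "variant_a", 5, "placeholder phenotype A"),
    ]
    for hit in scan_panel({"variant_a", "variant_b"}, panel):
        print(hit.gene, hit.phenotype)
```

Frequency by ethnic or racial group could later be added as another field on each entry and used as a second sort key.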

Further Reading which has helped me (Ridhi) understand some of the factors / variables that will impact a mathematical model:
I am holding off on finding polygenic diseases with well-defined predictive models and am instead trying to develop a framework to understand the different components that can affect gene / protein expression and collectively impact or bring about a disease or a condition. Some primary literature and Google searches produced the following articles, which I am still reviewing. Some of the information is helpful in providing an overarching framework. I hope that we can eventually build a systematic and modular biological model which can then be more cleanly translated into math and eventually code. Articles below.