Harvard:Biophysics 101/2007/ProjectIdeas

=Project Ideas=
 * Linked from the Project page.

Project ideas that came up in class on February 22 are posted [here]

Application1 - ApoE
Alzheimer's disease

Input Data
ApoE sequences

Data Characterization and analysis
Identify variation in the sequences, then search OMIM for similar variants and their relationship to disease
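A minimal sketch of the "identify variation" step, assuming the subject's ApoE sequence has already been aligned to a reference of the same length. The function name and the toy sequences are hypothetical and are not real ApoE data; the OMIM lookup is left out entirely.

```python
def find_variants(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Toy sequences for illustration only (not real ApoE data).
reference = "ATGAAGGTTCTG"
sample = "ATGAAGGCTCTG"
for position, ref_base, alt_base in find_variants(reference, sample):
    print(f"position {position}: {ref_base} -> {alt_base}")
```

Each reported variant could then be used as a search term against OMIM or a similar resource.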

Action
Suggest clinical testing actions and lifestyle changes

Identifying Common Genetic Motifs in Disease
We can write a script that cross-references input genotypes with disease phenotypes. (Note: we don't have to look specifically for motifs common to disease, although that seems the most practical choice to me. Any phenotype will do.)

Input Data
Since this script would theoretically cross-reference genotype and phenotype, we would need:
 * Genotypic Inputs
 * 1) database of personalized genome/partial genomes from which to determine a consensus/which base pairs are of interest in a specific disease
 * 2) personal genome of the subject we want to analyze
 * Phenotypic Inputs
 * 1) database like OMIM which includes many entries of alleles and the corresponding phenotype
 * 2) large study such as Framingham or Nurses Health Study data which may be able to correlate family history with phenotype
 * 3) medical history of subject in the form of either objective medical tests or a questionnaire

Data Characterization and analysis
I think we could design an algorithm to go through and scan for varying numbers of motifs of varying lengths found in specific population subsets, but absent in others. Are there any significant patterns found in a diseased group of people? Significant motifs present in sick populations? Significant motifs absent?
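One way such a scan might start is sketched below: collect every subsequence of length k (a k-mer) and keep those found in every sequence of the affected group but in none of the control group. The function names and toy sequences are hypothetical, and a real analysis would need a statistical test instead of this all-or-nothing rule.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_specific_kmers(cases, controls, k):
    """Return k-mers present in every case sequence and absent from all controls."""
    case_sets = [kmers(s, k) for s in cases]
    shared = set.intersection(*case_sets) if case_sets else set()
    seen_in_controls = set()
    for s in controls:
        seen_in_controls |= kmers(s, k)
    return shared - seen_in_controls


# Toy sequences for illustration only.
cases = ["AAGATTACA", "CCGATTAGG"]
controls = ["AAACCCGGG", "TTTACAGGA"]
print(group_specific_kmers(cases, controls, 5))
```

Swapping the two arguments gives the motifs that are present in the healthy group but missing from the affected one.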

We will certainly have to perform quality control. Perhaps we can use well-characterized conditions such as cystic fibrosis, color blindness, and sickle cell disease as test cases to optimize our detection methods.

In addition to potentially identifying disease motifs, I think we could design a program that assesses the available information and estimates the likelihood, as a percentage, of developing a specific disease or phenotype of interest.
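As a rough illustration of how such a percentage could be computed, the sketch below applies Bayes' theorem to a single variant. The function name and all input numbers are hypothetical; real estimates would need measured prevalence and allele frequencies, and would have to combine many variants.

```python
def posterior_risk(prevalence, freq_in_cases, freq_in_controls):
    """P(disease | variant) via Bayes' theorem.

    prevalence       -- P(disease) in the population
    freq_in_cases    -- P(variant | disease)
    freq_in_controls -- P(variant | no disease)
    """
    for p in (prevalence, freq_in_cases, freq_in_controls):
        if not 0 <= p <= 1:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    numerator = prevalence * freq_in_cases
    denominator = numerator + (1 - prevalence) * freq_in_controls
    if denominator == 0:
        raise ValueError("variant is never observed; risk is undefined")
    return numerator / denominator


# Made-up numbers, for illustration only.
risk = posterior_risk(prevalence=0.01, freq_in_cases=0.5, freq_in_controls=0.1)
print(f"estimated risk given the variant: {100 * risk:.1f}%")
```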

Action
How can we use these data to help people? Any identified motifs could certainly direct our research efforts, implicating new sites and players in the molecular mechanisms of disease. I'm a little confused by the recommendation on the project page of 'medical/dietary action'. Certainly we could use our data to inform someone of their risk for disease (note: this information could also be abused). Perhaps that would better inform their lifestyle choices? Prevention is the ideal solution to disease, but for the unavoidable genetic ones, we can direct research towards therapy and towards subverting the identified molecular mechanisms.

I think the medical/dietary action that would be helpful and feasible is a change in lifestyle to avoid developing a specific disease, once someone finds that they are genetically predisposed to it. It may also be possible to include environmental factors and recommendations by searching HapMap and combining that information with OMIM, to work out whether higher disease prevalence is due to genetic factors alone or whether environmental factors contribute as well. I am not sure that we could determine the exact environmental factors, but I think we could identify regions that may carry some environmental predisposition to a certain phenotype or disease.

Input Data
I am imagining a future where a billion-dollar project can sequence a billion genomes. For now, these data are limited.
 * Whole genomic data of organisms, spanning the whole tree of life
 * Localization data
 * What phage goes with which cell/organism, and the genomes for both of them
 * Geographic localization: the source of the genome (extra-terrestrial?)
 * Information about the genome source

Data Characterization and analysis
Think of the whole genomic space. It is equal to an infinite-dimensional space of natural numbers (mod 4). In nature, however, very few of these points actually occur, and furthermore, they all had to 'evolve' from a common ancestor. Thus, provided we pick the right metric, the space is somewhat continuous: the natural library was not designed, but evolved. Most of this 'continuity' has been provided by the classical evolutionary mechanisms we already know about, but some similarity and convergence is due to similar environmental pressures 'directing' evolution in certain ways, and some is a result of the newly emerging concept of horizontal transfer of genomic information. We would like to know which subset of the whole genomic space actually occurs in nature, and what forces constrained nature to explore that subset and not others. Are there repeating patterns, limited by chemistry, physics, or biological constraints? Are there particular bottlenecks, or design constraints?
 * For example, a design constraint for a brain is coming up with changes one mutation at a time while keeping the brain operational. I don't remember where I've heard the analogy, but it is similar to changing wheels on a moving car. Similar arguments go for the eye, etc.
 * Data analysis would involve a range of tools for visualization, statistics, clustering, and comparison
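To make the idea of a 'metric' on genomic space concrete, here is one possible choice, sketched under assumptions: an alignment-free distance based on shared k-mers (the Jaccard distance between k-mer sets). This is only one candidate metric, picked because it works on sequences of different lengths; the function name and the toy sequences are hypothetical.

```python
def kmer_jaccard_distance(seq_a, seq_b, k):
    """Jaccard distance between the k-mer sets of two sequences.

    0.0 means identical k-mer content, 1.0 means no k-mers in common.
    """
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0  # both sequences shorter than k: treat as indistinguishable
    return 1.0 - len(kmers_a & kmers_b) / len(union)


# Toy sequences for illustration only.
print(kmer_jaccard_distance("AAAC", "AACC", 2))
```

A matrix of such pairwise distances could feed the clustering and visualization tools mentioned above.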

Action
We could use this for some remarkable applications
 * Synthetic biology: Having an idea of what the 'natural library' looks like, we can make connections and transfer information between nodes of the space that otherwise would not have communicated. In other words, we can short-cut nature because we do not have to obey the design constraints mentioned above. The natural library would give us a better understanding of what we are doing (we can literally map our new 'synthetic' contribution), and would also be a tremendous inspiration (we can copy tricks from nature)
 * It would clarify relationships between many organisms, why we evolved the way we did, and so on. It is the ultimate 'comparative genomics' platform -- mapping the whole genomic space of life.

BioWeather(ish): Influenza
Wouldn't it be cool if we could track mutations in influenza viruses and determine when and where they occurred, what changes in virulence resulted, and what mutations (and patterns of spread for the mutated viruses) are likely to occur in the future, with our algorithm updated from real-time epidemiological information?

Input Data
Sequences from TIGR Influenza A pages, and possibly real-time WHO data on (roughly characterized, if not genomically so) strains and spread rates.

Data Characterization and analysis

 * These are just some qualities we could analyze, but...


 * What are the differences in sequence, as tracked over time and location?
 * What are the physical meanings of those differences (ie. protein changes)?
 * What changes, if any, are predictable in certain regions
 * (Note: I think this paper is a significant contribution to the influenza field and could inform this project -- CSN)
 * I'm currently unsure, but I've heard that extra-virulent strains may occur when there's a mix of different human strains, of human strains and animal (bird, pig, etc.?) strains, or of different animal strains
 * So perhaps we may spot where spatial distances between strains are getting smaller and make hypotheses about new hybrid strain creation (and virulence) based on that
 * Real-time inputs will allow prediction of spread characteristics as well as, possibly, predictions of virulence
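The idea of watching strains from different places grow more similar over time could be sketched as follows. The record format (year, location, aligned sequence), the function names, and the toy data are all assumptions made for illustration; real TIGR or WHO data would first need parsing and alignment.

```python
from collections import defaultdict
from itertools import product


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def divergence_by_year(records, loc_a, loc_b):
    """Mean Hamming distance between strains from two locations, per year.

    records is an iterable of (year, location, aligned_sequence) tuples.
    Years with no strain from one of the locations are skipped.
    """
    by_year = defaultdict(lambda: defaultdict(list))
    for year, location, seq in records:
        by_year[year][location].append(seq)
    result = {}
    for year in sorted(by_year):
        pairs = list(product(by_year[year][loc_a], by_year[year][loc_b]))
        if pairs:
            result[year] = sum(hamming(a, b) for a, b in pairs) / len(pairs)
    return result


# Toy records for illustration only (not real influenza data).
records = [
    (2005, "X", "AAAA"), (2005, "Y", "TTTT"),
    (2006, "X", "AATT"), (2006, "Y", "ATTT"),
]
print(divergence_by_year(records, "X", "Y"))
```

A falling value from one year to the next would flag a pair of regions whose strains are converging and may deserve a closer look.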

Action
If we can predict where some particularly virulent strain will hit, and what its genomic characteristics will be, perhaps we can avert it with vaccines or quarantine measures.

Graphics-based visualization of polymorphism data
Inspired by the brilliant work at gapminder.org, which is looking at data of a different sort.

Input Data
HapMap data, genomic data, etc.

Data Characterization and analysis
Visualize the data as points in two- or three-dimensional space, then use a combination of graphics and genomics algorithms to find points of interest. For example, haplotypes could be plotted with loci along one axis, individuals along another, and some other factor along a third. Recombination frequency data could be gathered for various SNPs, for example, and any that stand out when compared with a theoretical model would be points of interest. Individual genomes could be binned by some sort of graphics algorithm that orders people along the axis so as to minimize chaos.
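One simple reading of "ordering people along the axis to minimize chaos" is to place similar haplotypes next to each other. The sketch below uses a greedy nearest-neighbour ordering by Hamming distance; this is a stand-in heuristic, not an optimal ordering, and the function names and toy haplotypes (coded as strings of 0 and 1) are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length haplotypes."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def greedy_order(haplotypes):
    """Order individuals so each one is followed by its nearest unplaced neighbour.

    haplotypes maps an individual's name to a string of allele codes.
    Starts from the alphabetically first name; ties are broken alphabetically.
    """
    names = sorted(haplotypes)
    if not names:
        return []
    order = [names[0]]
    remaining = names[1:]
    while remaining:
        last = haplotypes[order[-1]]
        nearest = min(remaining, key=lambda n: (hamming(last, haplotypes[n]), n))
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Toy haplotypes for illustration only.
haplotypes = {"a": "0000", "b": "1111", "c": "0001", "d": "0111"}
print(greedy_order(haplotypes))
```

Plotting rows in this order tends to group similar individuals into visible blocks, which is the kind of structure the visualization is meant to expose.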

Action
Through this analysis, we should be able to gain an understanding of which alleles 'work together'; not only would this help elucidate certain protein-protein interactions, we would also be able to locate potentially hazardous combinations of alleles in each personal genome and suggest therapeutic methods to address the phenotypes that result.

Agroplanning
The idea of this project is to determine the optimal use of land and resources to grow grasses as a feedstock for ethanol, an alternative energy source. Although the technology is not there yet, it should eventually be possible to convert cellulose to ethanol efficiently; when that happens, prairie grasses have high potential to be a much more viable ethanol source than conventional sources such as corn. However, different grasses require different growing conditions and have different theoretical ethanol yields; the challenge will be to optimize growth areas and conditions.

Bkg:

Input Data

 * Grass types
 * Land information, such as population density, soil pH, weather, etc.

Data Characterization and analysis

 * Figure out how much ethanol each grass type could theoretically produce
 * Create a model which maximizes ethanol production from grass while minimizing land usage
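A very simplified version of such a model is sketched below, assuming each parcel of land has a known per-hectare ethanol yield for each grass type. It picks the best grass for each parcel, then fills a production target from the most productive parcels first, which uses the least land under these assumptions. The function name, the parcel format, and every yield figure are hypothetical.

```python
def plan_land_use(parcels, target_litres):
    """Allocate land to meet an ethanol target using as few hectares as possible.

    parcels is a list of (name, area_ha, {grass: litres_per_ha}) tuples.
    Returns a list of (parcel name, grass, hectares used).
    """
    best = []
    for name, area, yields in parcels:
        grass = max(yields, key=yields.get)  # most productive grass on this parcel
        best.append((yields[grass], name, grass, area))
    best.sort(reverse=True)  # highest yield per hectare first

    allocation = []
    remaining = target_litres
    for rate, name, grass, area in best:
        if remaining <= 0 or rate <= 0:
            break
        used = min(area, remaining / rate)  # the last parcel may be partly used
        allocation.append((name, grass, used))
        remaining -= used * rate
    if remaining > 1e-9:
        raise ValueError("target exceeds what the available land can produce")
    return allocation


# Made-up parcels and yields, for illustration only.
parcels = [
    ("north", 10.0, {"switchgrass": 100.0, "miscanthus": 150.0}),
    ("south", 10.0, {"switchgrass": 200.0, "miscanthus": 50.0}),
]
for name, grass, hectares in plan_land_use(parcels, 2500.0):
    print(f"{name}: plant {grass} on {hectares:.2f} ha")
```

A fuller model would add constraints such as population density, water use, and transport cost, which would likely call for a proper linear programming solver.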

Action

 * Output is the above model, which can be used for agroplanning

Disease-Host Coevolution
Our pathogens live in a genetically determined environment - us. Their genetic polymorphisms will be shaped as a response to that environment. By correlating the genomes of pathogen and host, it may be possible to identify new loci involved in pathogenesis and disease resistance, and potential new drug targets.

Input Data

 * Unfortunately, this is the limiting factor - anybody have a good source?
 * In the future, genetic testing at hospitals might be useful here. For now, there may be a published, geographically linked map of HIV strains that would at least give us that distribution.
 * On the host side of things (human), if we're able to intelligently identify genes that are highly likely to be relevant to disease, maybe we can use OMIM or something like MutationDiscovery.com to do some sort of meta-analysis of the geographic distribution of a small number of genes and their variations.
 * This could be an excellent tool for the coming flood of genomes, given we could correlate pathogen genomes with personal genomes
 * For now, we could just use HapMap and the geographic localization of diseases?

Data Characterization and analysis

 * For a simple case, we could imagine taking a known resistance trait - like CCR5 for HIV - and identifying the adaptations that allow HIV to infect resistant hosts.
 * With more data from host genomes, we could find new resistance traits.
 * A graphics-based visualization of polymorphism data (idea 5) could be helpful. We could plot mutant strain on one axis, individual on another, etc.

Action

 * Suggest strongest candidates for empirical testing.
 * If targets are confirmed, investigate for potential as drug target, ease of evasion by pathogen, level in population, etc.

Evolution of Regulatory Networks
Gene chips are becoming common; there's a lot of microarray-based transcriptome data begging to be analyzed. There's software out there that will take in large sets of microarray data and give you clusters of co-expressed genes (see this 2001 Science paper for an example and a nifty visualization as well).

Input Data
Microarray data, from NCBI-GEO and/or culled from individual papers

Data Characterization and analysis
Project 1: Using these clusters, it should be possible to identify motifs that predict what cluster nearby genes belong in. This could be a route to identifying regulatory elements. We could then back-correlate to find where and how these motifs are used.
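A first pass at Project 1 might look like the sketch below: take the upstream sequences of the genes in one co-expression cluster, and rank k-mers by how much more often they appear there than in a background set. The function names and toy sequences are hypothetical, and the score (a simple difference in the fraction of sequences containing the k-mer) is a placeholder for a proper enrichment statistic.

```python
from collections import Counter


def kmer_presence(seqs, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for s in seqs:
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def enriched_kmers(cluster_seqs, background_seqs, k, top=5):
    """Rank k-mers by (fraction of cluster seqs) - (fraction of background seqs)."""
    if not cluster_seqs or not background_seqs:
        raise ValueError("both sequence sets must be non-empty")
    in_cluster = kmer_presence(cluster_seqs, k)
    in_background = kmer_presence(background_seqs, k)
    scores = {
        kmer: n / len(cluster_seqs) - in_background[kmer] / len(background_seqs)
        for kmer, n in in_cluster.items()
    }
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy upstream sequences for illustration only.
cluster = ["TTGACGTCAA", "GGTGACGTTT"]
background = ["AAAAAAAAAA", "CCCCCCCCCC"]
print(enriched_kmers(cluster, background, 6, top=1))
```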

Project 2: These clusters probably evolve. Performing cluster analysis on a broad range of species, we could track these changes - over evolutionary time, do regulatory clusters adapt by merging and splitting, or capturing single genes from other clusters, or what? This information would tell us a lot about genomic evolution. Maybe we could even correlate significant changes with the phylogeny. (This one's cooler, but rates high on the nontrivial scale...)
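For Project 2, one way to detect merging and splitting is to compare the gene membership of clusters between two species, assuming genes have already been mapped to shared ortholog identifiers. The sketch below lists, for each cluster in one species, the overlapping clusters in the other, scored by Jaccard similarity. The function names, cluster labels, and gene identifiers are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clusters(species_a, species_b):
    """For each cluster in species A, list overlapping clusters in species B.

    Both arguments map a cluster name to a set of ortholog identifiers.
    One cluster overlapping several clusters in the other species suggests
    a split (or, read in the other direction, a merge).
    """
    matches = {}
    for name_a, genes_a in species_a.items():
        hits = [
            (name_b, jaccard(genes_a, genes_b))
            for name_b, genes_b in species_b.items()
            if genes_a & genes_b
        ]
        matches[name_a] = sorted(hits, key=lambda h: (-h[1], h[0]))
    return matches


# Toy clusters for illustration only.
species_a = {"A1": {"g1", "g2", "g3", "g4"}}
species_b = {"B1": {"g1", "g2"}, "B2": {"g3", "g4"}, "B3": {"g5"}}
print(match_clusters(species_a, species_b))
```

Here cluster A1 overlaps both B1 and B2 equally, the pattern one would expect if a single regulatory cluster had split in two.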

Action
Understanding the transcriptome better will give us a leg up when it's time to apply all this data - whether designing new drugs, new genomes, etc. It could also lead to some metrics for evolution and divergence that aren't limited to the single gene level.