Harvard:Biophysics 101/2007/Project: Difference between revisions

From OpenWetWare

Revision as of 16:03, 9 March 2007

Biophysics 101: Genomics, Computing, and Economics

Project Ideas

Project ideas that came up in class on February 22 are posted [here]

Application1 - ApoE

Alzheimer's disease

Input Data

ApoE sequences

Data Characterization and analysis

Identify variation, then search OMIM for similar variation and its relationship to disease
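
As a minimal sketch of the "identify variation" step, the snippet below lists the positions where a subject's sequence differs from a reference. It assumes the two sequences are already aligned and of equal length; the function name find_variants and the toy sequences are hypothetical and are not real ApoE data. Looking each variant up in OMIM would be a separate step that is not shown.

```python
def find_variants(reference, sample):
    """Return (position, ref_base, sample_base) for each mismatch.

    Positions are 1-based. Assumes the sequences are already aligned and
    of equal length (substitutions only, no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Toy sequences for illustration only; these are not real ApoE data.
reference = "ATGAAGGTTCTG"
sample = "ATGAAGGTCCTG"
print(find_variants(reference, sample))
```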

Action

Suggest clinical testing actions and lifestyle changes

Identifying Common Genetic Motifs in Disease

We can write a script to interface all input genotypes with phenotypes for disease (note: we don't specifically have to look for motifs common to disease, but that seems pretty practical to me. Any phenotype will do, though).

Input Data

Since this script would theoretically cross-reference genotype and phenotype, we would need:

  • Genotypic Inputs
    • 1) database of personal genomes/partial genomes from which to determine a consensus and which base pairs are of interest in a specific disease
    • 2) personal genome of the subject we want to analyze
  • Phenotypic Inputs
    • 1) database like OMIM which includes many entries of alleles and the corresponding phenotype
    • 2) large study such as Framingham or Nurses Health Study data which may be able to correlate family history with phenotype
    • 3) medical history of subject in the form of either objective medical tests or a questionnaire

Data Characterization and analysis

I think we could design an algorithm to go through and scan for varying numbers of motifs of varying lengths found in specific population subsets, but absent in others. Are there any significant patterns found in a diseased group of people? Significant motifs present in sick populations? Significant motifs absent?
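
One simple way to start on the motif scan is to treat every substring of length k (a k-mer) as a candidate motif and keep those present in every sequence of one group but absent from all sequences of another. The sketch below does this for a few values of k. The function names, the choice of exact k-mers as "motifs", and the toy sequences are all assumptions for illustration; a real analysis would also need a statistical test of significance.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def group_specific_motifs(cases, controls, k_values=(3, 4)):
    """For each k, list k-mers found in every case and in no control."""
    if not cases:
        raise ValueError("need at least one case sequence")
    result = {}
    for k in k_values:
        in_all_cases = set.intersection(*(kmers(s, k) for s in cases))
        in_any_control = set().union(*(kmers(s, k) for s in controls))
        result[k] = sorted(in_all_cases - in_any_control)
    return result


# Toy sequences for illustration only.
cases = ["ACGTTGCA", "TTGCAACG"]
controls = ["ACGTACGT", "GCAACGTA"]
print(group_specific_motifs(cases, controls))
```

Swapping the two arguments gives the motifs that are present in the healthy group but absent in the diseased one.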

We will certainly have to perform quality control, and perhaps we can model cystic fibrosis, color blindness, sickle cell anemia, etc. to optimize our detection methods.

I think in addition to potentially identifying motifs of disease, we could design a program that would assess the available information and estimate the likelihood, as a percentage, of developing a specific disease or phenotype of interest.

Action

How can we use these data to help people? Any identified motifs could certainly direct our research efforts, implicating new sites and players in the molecular mechanisms of disease. I'm a little confused by the recommendation on the project page of 'medical/dietary action'. Certainly we could use our data to inform someone of their risk for disease (note: this information could also be abused). Perhaps that would better inform their lifestyle choices? Prevention is an ideal solution to disease, but, for the inevitable genetic ones, we direct research towards therapy and subversion of the identified molecular mechanisms.

I think the medical/dietary action that would be helpful and feasible would include changing one's lifestyle choices to avoid the development of a specific disease if one finds that they have a genetic predisposition to developing it. I think it may also be possible to include environmental recommendations/factors by searching through HapMap and combining that information with OMIM to figure out if potentially higher disease prevalences are due to genetic factors or if there are environmental factors that contribute to them as well. I am not sure that we could determine the exact environmental factors, but I think we could identify regions that may include some environmental disposition to developing a certain phenotype/disease.

Mapping the Natural Genetic Library

Input Data

I am imagining a future where a billion dollar project can sequence a billion genomes. This data is limited as of now.

  • Whole genomic data of organisms, spanning the whole tree of life
  • Localization data
    • What phage goes with which cell/organism, and the genomes for both of them
    • Geographic localization: the source of the genome (extra-terrestrial?)
    • Information about the genome source

Data Characterization and analysis

Think of the whole genomic space. It is equal to an infinite-dimensional space of natural numbers (mod 4). In nature, however, very few of these actually occur, and furthermore, they all had to 'evolve' from a common ancestor. Thus given that we pick the right metric, the space is somewhat continuous - the natural library was not designed, but evolved. Most of this 'continuity' has been provided by the classical evolutionary mechanisms we've known about, but some similarity and convergence is due to similar environmental pressures 'directing' evolution in certain ways, and some are a result of the newly emerging concept of horizontal transfer of genomic information. We would like to know which subset of the whole genomic space actually occurs in nature, and what forces constrained nature to explore that subset and not others. Are there repeating patterns, limited by chemistry, physics, or biological constraints? Are there particular bottlenecks, or design constraints?

  • For example, a design constraint for a brain is coming up with changes one mutation at a time while keeping the brain operational. I don't remember where I've heard the analogy, but it is similar to changing wheels on a moving car. Similar arguments go for the eye, etc.
  • Data analysis would involve a range of tools for visualization, statistics, clustering, and comparison
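
To make the "genomic space" and "metric" ideas concrete, here is a small sketch under strong simplifying assumptions: genomes are equal-length strings over {A, C, G, T}, so for a fixed length n the space contains 4^n possible sequences, and the metric is the Hamming distance (number of differing positions). Real genomes differ in length and would need alignment or another metric; the organism names and sequences are hypothetical.

```python
from itertools import combinations


def sequence_space_size(n):
    """Number of possible sequences of length n over a 4-letter alphabet."""
    return 4 ** n


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(genomes):
    """Map each unordered pair of names to the distance between them."""
    return {
        (p, q): hamming(genomes[p], genomes[q])
        for p, q in combinations(sorted(genomes), 2)
    }


# Toy "genomes" for illustration only.
genomes = {"org_a": "ACGTACGT", "org_b": "ACGTACGA", "org_c": "TTGTACCA"}
print(sequence_space_size(8))
print(pairwise_distances(genomes))
```

The resulting distance table is the kind of input that clustering and visualization tools would take to show which regions of the space are actually occupied.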

Action

We could use this for some remarkable applications

  • Synthetic biology: Having an idea of what the 'natural library' looks like, we can make connections and transfer information between nodes of the space that otherwise would not have communicated. In other words, we can short-cut nature because we do not have to obey the design constraints mentioned above. The natural library would give us a better understanding of what we are doing (we can literally map our new 'synthetic' contribution), and also be a tremendous inspiration (we can copy tricks from nature)
  • It would clarify relationships between many organisms, why we evolved the way we did, and so on. It is the ultimate 'comparative genomics' platform -- mapping the whole genomic space of life.


BioWeather(ish): Influenza

Wouldn't it be cool if we could track mutations in influenza viruses and determine when and where they occurred, what virulence changes resulted, and what likely mutations (and properties of spread of these mutated viruses) might occur in the future, as our algorithm is updated with real-time epidemiological information?

Input Data

Sequences from TIGR Influenza A pages, and possibly real-time WHO data on (roughly characterized, if not genomically so) strains and spread rates.

Data Characterization and analysis

  • These are just some qualities we could analyze, but...
    • What are the differences in sequence, as tracked over:
      • Time
      • Location
    • What are the physical meanings of those differences (ie. protein changes)?
    • What changes, if any, are predictable in certain regions
    • (Note: I think this paper is a significant contribution to the influenza field and could inform this project -- CSN)
      • I'm currently unsure, but I've heard that extra-virulent strains may occur when there's a mix of:
        • Different human strains, or
        • Human strains and animal (bird, pig, etc.?) strains
        • Different animal strains
      • So perhaps we may spot where spatial distances between strains are getting smaller and make hypotheses about new hybrid strain creation (and virulence) based on that
    • Real-time inputs will allow prediction of spread characteristics as well as, possibly, predictions of virulence
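
The first two questions in the list above (sequence differences over time and location, and what they mean at the protein level) can be sketched as follows. The code compares each isolate to a reference coding sequence, counts nucleotide differences, and reports amino acid changes using the standard genetic code. It assumes in-frame, equal-length coding sequences; the record layout, function names, locations, and toy sequences are hypothetical and are not TIGR or WHO data.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def translate(nt):
    """Translate an in-frame DNA coding sequence, ignoring a partial codon."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


def amino_acid_changes(ref_nt, strain_nt):
    """List changes such as 'K2R' (reference residue, position, new residue)."""
    pairs = zip(translate(ref_nt), translate(strain_nt))
    return [f"{r}{i + 1}{s}" for i, (r, s) in enumerate(pairs) if r != s]


def summarize(reference, isolates):
    """Return (year, location, nucleotide diffs, protein changes) per isolate."""
    rows = []
    for iso in sorted(isolates, key=lambda r: (r["year"], r["location"])):
        nt_diffs = sum(a != b for a, b in zip(reference, iso["seq"]))
        changes = amino_acid_changes(reference, iso["seq"])
        rows.append((iso["year"], iso["location"], nt_diffs, changes))
    return rows


# Toy data for illustration only.
reference = "ATGAAAGGT"
isolates = [
    {"year": 2006, "location": "site_b", "seq": "ATGAGAGGT"},
    {"year": 2005, "location": "site_a", "seq": "ATGAAAGGC"},
]
for row in summarize(reference, isolates):
    print(row)
```

In this toy case one isolate carries a silent change and the other a lysine-to-arginine substitution, which is the distinction the "physical meaning" question is after.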

Action

If we can predict where some particularly virulent strain will hit, and what its genomic characteristics will be, perhaps we can avert it with vaccines or quarantine measures.

Graphics-based visualization of polymorphism data

Inspired by the brilliant work at gapminder.org, which is looking at data of a different sort.

Input Data

HapMap data, genomic data, etc.

Data Characterization and analysis

Visualize data as data points in two- or three-dimensional space, and then, using a combination of graphics and genomics algorithms, process this data to find points of interest. For example, haplotypes could be plotted with loci along one axis, individuals on another, and some other factor on a third. Recombination frequency data could be gathered for various SNPs, for example, and any that stand out as compared with a theoretical model would be points of interest. Individual genomes could be binned by some sort of graphic algorithm that orders people along the axis in such a way as to minimize chaos.
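
One possible reading of "orders people along the axis in such a way as to minimize chaos" is to place similar haplotypes next to each other. The sketch below uses a greedy nearest-neighbor ordering on Hamming distance: start from one individual and repeatedly append the most similar remaining one. This is a heuristic chosen for illustration, not necessarily the best ordering, and the individual labels and haplotype strings are hypothetical.

```python
def hamming(a, b):
    """Number of loci at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def order_individuals(haplotypes):
    """Greedy ordering that keeps similar haplotypes on adjacent rows.

    Starts from the alphabetically first individual; ties are broken
    alphabetically so the result is deterministic.
    """
    names = sorted(haplotypes)
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = haplotypes[order[-1]]
        nxt = min(sorted(remaining), key=lambda n: hamming(last, haplotypes[n]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy haplotypes (one character per locus) for illustration only.
haplotypes = {"ind1": "AAGT", "ind2": "CCGA", "ind3": "AAGA", "ind4": "CCGT"}
for name in order_individuals(haplotypes):
    print(name, haplotypes[name])
```

The ordered rows can then be handed to any plotting tool with loci on one axis and individuals on the other.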

Action

Through this analysis, we should be able to gain an understanding of which alleles 'work together'; not only would this help elucidate certain protein-protein interactions, we would also be able to locate in each personal genome potentially hazardous combinations of alleles, etc. and suggest therapeutic methods to address the phenotypes that result.

Agroplanning

The idea of this project is to determine the optimal usage of land and resources to grow grasses which can be used as an ethanol alternative energy source. Although the technology is not there yet, eventually it should be possible to convert cellulose to ethanol efficiently; when that occurs, prairie grasses have high potential to be a much more viable ethanol source than conventional sources such as corn. However, different grasses require different conditions and have different theoretical ethanol yields; a challenge will be to optimize growth areas and conditions.

Bkg: [1]

Input Data

  • Grass types
  • Land information, such as population density /soil pH / weather / etc...

Data Characterization and analysis

  • Figure out the amount of ethanol theoretically possible to be produced per grass
  • Create a model which maximizes ethanol production from grass while minimizing land usage
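
A very rough sketch of such a model, under stated assumptions: each parcel of land has an area and a single soil type, each grass has a theoretical ethanol yield per hectare on each soil type, and the goal is to reach a production target using as little land as possible. The code picks the best grass per parcel, then takes parcels in order of yield per hectare until the target is met. This greedy rule is a heuristic when whole parcels must be used. All names and numbers (grass_a, grass_b, soil types, yields) are made up for illustration.

```python
def plan(parcels, yields, target_liters):
    """Choose parcels and grasses to meet a target with little land.

    parcels: {parcel_name: (area_ha, soil_type)}
    yields:  {(grass, soil_type): liters of ethanol per hectare}
    Assumes every soil type has at least one grass with a known yield.
    Returns (list of (parcel, grass), total liters, total hectares).
    """
    best = []
    for name, (area, soil) in parcels.items():
        # Best grass for this parcel's soil type.
        grass, per_ha = max(
            ((g, y) for (g, s), y in yields.items() if s == soil),
            key=lambda t: t[1],
        )
        best.append((per_ha, name, grass, area))
    best.sort(reverse=True)  # most productive land first

    chosen, total, land = [], 0, 0
    for per_ha, name, grass, area in best:
        if total >= target_liters:
            break
        chosen.append((name, grass))
        total += per_ha * area
        land += area
    return chosen, total, land


# Hypothetical inputs for illustration only.
parcels = {"north": (10, "loam"), "south": (20, "clay"), "east": (5, "loam")}
yields = {
    ("grass_a", "loam"): 3000, ("grass_a", "clay"): 2000,
    ("grass_b", "loam"): 3500, ("grass_b", "clay"): 1500,
}
print(plan(parcels, yields, target_liters=50000))
```

Further land information such as population density, soil pH, or weather could enter either as extra keys in the yield table or as constraints on which parcels are eligible.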

Action

  • Output is the above model, which can be used for agroplanning

Disease-Host Coevolution

Our pathogens live in a genetically determined environment - us. Their genetic polymorphisms will be shaped as a response to that environment. By correlating the genomes of pathogen and host, it may be possible to identify new loci involved in pathogenesis and disease resistance, and potential new drug targets.

Input Data

  • Unfortunately, this is the limiting factor - anybody have a good source?
    • Maybe in the future, genetic testing at hospitals might have some usefulness here, but for now there might at least be some published, geographically linked HIV strain map that gives that distribution.
    • On the host side of things (human), if we're able to intelligently identify genes that are highly likely to be relevant to disease, maybe we can use OMIM or something like MutationDiscovery.com to do some sort of meta-analysis of the geographic distribution of a small number of genes and their variations.
  • This could be an excellent tool for the coming flood of genomes, given we could correlate pathogen genomes with personal genomes
  • For now, we could just use HapMap and the geographic localization of diseases?

Data Characterization and analysis

  • For a simple case, we could imagine taking a known resistance trait - like CCR5 for HIV - and identifying the adaptations that allow HIV to infect resistant hosts.
  • With more data from host genomes, we could find new resistance traits.
  • A graphics-based visualization of polymorphism data (idea 5) could be helpful. We could plot mutant strain on one axis, individual on another, etc.
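
If paired data were available (a host genotype and the pathogen variant isolated from that host), the simplest form of "correlating the genomes of pathogen and host" is a 2x2 contingency table and an odds ratio. The sketch below assumes such pairs exist; the labels "resistant", "susceptible", "variant_A", and "variant_B" and all counts are hypothetical. Adding 0.5 to each cell (the Haldane-Anscombe correction) is one common way to avoid division by zero and is a choice made here, not something prescribed by the project.

```python
from collections import Counter


def contingency(pairs, host_allele, pathogen_variant):
    """Count hosts by (has allele?, carries variant?) and return (a, b, c, d).

    a: allele and variant      b: allele, other variant
    c: no allele, variant      d: no allele, other variant
    """
    counts = Counter(
        (h == host_allele, p == pathogen_variant) for h, p in pairs
    )
    return (
        counts[(True, True)],
        counts[(True, False)],
        counts[(False, True)],
        counts[(False, False)],
    )


def odds_ratio(a, b, c, d):
    """Odds ratio with 0.5 added to every cell to avoid division by zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Hypothetical paired observations for illustration only.
pairs = (
    [("resistant", "variant_B")] * 6
    + [("resistant", "variant_A")] * 2
    + [("susceptible", "variant_B")] * 3
    + [("susceptible", "variant_A")] * 9
)
table = contingency(pairs, "resistant", "variant_B")
print(table, round(odds_ratio(*table), 2))
```

An odds ratio well above 1 would flag the pathogen variant as a candidate adaptation to the host allele, which would then need a significance test and empirical follow-up.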

Action

  • Suggest strongest candidates for empirical testing.
  • If targets are confirmed, investigate for potential as drug target, ease of evasion by pathogen, level in population, etc.