User:Pakpoom Subsoontorn/Notebook/Genetically Encoded Memory/2008/10/14

{| width="800"
 * style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]] Project name
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|

Quantitative functional profiles of site-specific recombinases

 * Compile known information about recombinases
 * Focus on information that will be useful for in vivo DNA manipulation. For now, let's focus on simple model organisms such as E. coli and S. cerevisiae
 * Two major sources of information:
 * mining from literature
 * large scale, parallel experiment

Informatics challenges

 * A site-specific recombination system generally consists of a group of enzymes that cut and paste DNA at very specific sequences. As far as I know, hundreds of such systems exist, both found in nature and developed in the laboratory. They have been exploited extensively for making transgenic organisms and in gene therapy.
 * There is a large body of literature reporting such information. The problem is that people have used different assays, different host cells, and different interpretations. It would be great if we could organize this information in one place. We would like to be able to update it, compare entries, and systematically reference the source of each existing piece of information.
 * Moreover, we would like to know "how much we already know" and "how we came to know it." Say I have two stored pieces of information: enzyme X has an N-terminal catalytic domain and a C-terminal binding domain, while enzyme Y has intermixed catalytic and binding domains. I would like a systematic way to go back and check whether these conclusions came from totally different methods, from two different research groups, and were reported 15 years apart!
 * The information of interest include:
 * The sequences of the recombinase enzyme and of the target DNA. We can probably get the protein sequence from GenBank. However, as far as I know, there is no public database for the target sequences yet. Determining the minimal target sequences is still a subject of intensive research, even for the best-known recombinase systems.
 * Structural information, in particular functional domains: DNA-binding domain, catalytic domain, etc.
 * Mutation/chimera studies.
 * Efficiency in different host cells. What are the rates of recombination? How specific is the recombination? Is the enzyme toxic to the host cells?
 * Different assays used for the studies.

Database/method features

 * Key features of the database:
 * Quantitative description
 * Capability to compare and contrast
 * Standardized description
 * Database size can grow without disturbing the core structure
 * Literature cross-referencing and updating
 * Scalable experiments/measurements

Quantitative functional information
The list of information fields (need to decide: mining VS experimenting, set priority, black-box):
 * Enzyme name
 * Natural source of enzyme, host
 * Coding sequence for enzyme itself
 * Enzyme structure: amino acid sequence, functional domains
 * Target DNA sequence: natural target, specificity (sequence wobble)
 * Natural/synthetic topologies of the target DNA (inter/intramolecular, distance, coiling)
 * Required auxiliary factors:
 * Recombination efficiency in standard host, natural host, etc.
 * Recombination speed
 * Studies of mutated enzymes vs. target sequences
 * Mechanism:
 * Toxicity
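The field list above could be held in a structure like the following. This is only a minimal sketch: each value carries a tag saying whether it was mined from the literature or measured experimentally (the "mining vs. experimenting" decision). The class names, field keys, and example values are hypothetical placeholders, not real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Allowed provenance tags, following the "mining vs. experimenting" split.
SOURCES = ("literature", "experiment")

@dataclass
class FieldValue:
    """One value for one information field, with where it came from."""
    value: str
    source: str                      # "literature" or "experiment"
    reference: Optional[str] = None  # e.g. a PubMed ID; None if not recorded

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")

@dataclass
class RecombinaseProfile:
    """All recorded values for one enzyme, keyed by field name."""
    enzyme_name: str
    fields: Dict[str, List[FieldValue]] = field(default_factory=dict)

    def add(self, field_name: str, fv: FieldValue) -> None:
        self.fields.setdefault(field_name, []).append(fv)

    def missing(self, wanted: List[str]) -> List[str]:
        """Fields from the wanted list that have no value yet."""
        return [name for name in wanted if not self.fields.get(name)]

# Hypothetical placeholder entry; the values are illustrative only.
profile = RecombinaseProfile("enzyme X")
profile.add("natural_source", FieldValue("placeholder host", "literature"))
todo = profile.missing(["natural_source", "toxicity", "recombination_speed"])
print(todo)
```

Listing the fields that are still empty is one way to set priorities between mining and experimenting.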

Standard system
 * What should our standard be:
 * Host cells
 * Plasmid
 * System for expressing enzyme

Parameters we want to tune
 * Enzyme concentration
 * Expression time
 * How long does the enzyme need to be present?
 * Inter/intramolecular assay
 * Intramolecular distance
 * Directionality of sites
 * Target sequence (wobble test)
 * Enzyme sequence (mutation study)
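If several of these parameters are varied together, the set of conditions is simply the Cartesian product of their levels. The sketch below enumerates such a grid; the level names ("low", "direct", and so on) are assumptions chosen for illustration, since the standard system has not been decided yet.

```python
from itertools import product

# Hypothetical levels for each tunable parameter; real values would come
# from the chosen standard system, which is still undecided in these notes.
levels = {
    "enzyme_concentration": ["low", "medium", "high"],
    "expression_time": ["short", "long"],
    "assay": ["intermolecular", "intramolecular"],
    "site_orientation": ["direct", "inverted"],
}

def condition_grid(levels):
    """Return every combination of parameter levels as a list of dicts."""
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

conditions = condition_grid(levels)
print(len(conditions))  # 3 * 2 * 2 * 2 combinations
```

The count grows multiplicatively with each added parameter, which is one argument for the large-scale, parallel experiments mentioned earlier.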

Note:
 * Type of information
 * Quantitative: for example, sequences, rate, yield, boolean
 * Qualitative: for example, mechanism

Database Design Phase-1
 * For each piece of information, link to the original article and the specific info (figure or sentences)

(experiment-oriented) Enzyme-in-action: one paper generally has more than one experiment. We should also use keywords in the fields of interest to accommodate detailed differences
 * experimental-enzyme-sequence (like lambda integrase without the last 20 aa; name or seq)
 * original-enzyme-sequence (like lambda integrase; name or seq)
 * experimental-host (like E.coli K-12 without lacZ gene, with plasmid...)
 * original-host (like E.coli K-12)
 * auxiliary factor
 * expression switch
 * expression time
 * Intended target (sequence used for target, plasmid platform)
 * Readout method (like GFP signal, etc.)
 * Readout result
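The "Enzyme-in-action" entry above maps naturally onto one record per experiment. The following is a sketch under stated assumptions: the class name, the `pubmed_id` link back to the paper, and all placeholder values are mine; the example enzyme and host strings are the ones given in the list.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class EnzymeInAction:
    """One experiment from one paper; a paper may yield several of these."""
    pubmed_id: str               # assumed link back to the source article
    experimental_enzyme: str     # name or sequence of the variant used
    original_enzyme: str         # name or sequence of the parent enzyme
    experimental_host: str
    original_host: str
    auxiliary_factor: Optional[str]
    expression_switch: Optional[str]
    expression_time: Optional[str]
    intended_target: str
    readout_method: str
    readout_result: str

    def is_variant(self) -> bool:
        """True when the enzyme used differs from its parent enzyme."""
        return self.experimental_enzyme != self.original_enzyme

# Placeholder record; the ID, target, and result are made up for illustration.
record = EnzymeInAction(
    pubmed_id="PMID-PLACEHOLDER",
    experimental_enzyme="lambda integrase without the last 20 aa",
    original_enzyme="lambda integrase",
    experimental_host="E. coli K-12 without lacZ gene",
    original_host="E. coli K-12",
    auxiliary_factor=None,
    expression_switch=None,
    expression_time=None,
    intended_target="placeholder target on placeholder plasmid",
    readout_method="GFP signal",
    readout_result="placeholder",
)

# Round-trip through JSON to show the record can be stored as plain text.
serialized = json.dumps(asdict(record), sort_keys=True)
restored = EnzymeInAction(**json.loads(serialized))
```

Keeping the experimental and original enzyme (and host) as separate fields lets all variants of one parent enzyme be grouped later.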

Fields
 * Enzyme name
 * Natural source of enzyme, host
 * Coding sequence for enzyme itself
 * Enzyme structure: amino acid sequence, functional domains
 * Target DNA sequence: natural target
 * Natural/synthetic topologies of the target DNA
 * Required auxiliary factors:
 * Recombination Efficiency in standard host, natural host, etc.
 * Recombination speed
 * Studies of mutated enzymes vs. target sequences
 * Mechanism:
 * PubMed ID reference
 * Abstract
 * Sentence
 * Raw Data
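One possible layout for these fields is sketched below, assuming SQLite and a generic "observation" table in which each row holds one field value together with its reference and supporting sentence. New fields then become new rows rather than new columns, so the core structure stays untouched as the database grows. Table names, column names, and sample values are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE enzyme (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE observation (
    id         INTEGER PRIMARY KEY,
    enzyme_id  INTEGER NOT NULL REFERENCES enzyme(id),
    field      TEXT NOT NULL,   -- e.g. 'mechanism', 'recombination_speed'
    value      TEXT NOT NULL,
    pubmed_id  TEXT,            -- reference to the source article
    sentence   TEXT             -- supporting sentence or figure pointer
);
"""

def add_observation(conn, enzyme, field, value, pubmed_id=None, sentence=None):
    """Insert one field value, creating the enzyme row if needed."""
    conn.execute("INSERT OR IGNORE INTO enzyme (name) VALUES (?)", (enzyme,))
    (enzyme_id,) = conn.execute(
        "SELECT id FROM enzyme WHERE name = ?", (enzyme,)
    ).fetchone()
    conn.execute(
        "INSERT INTO observation (enzyme_id, field, value, pubmed_id, sentence) "
        "VALUES (?, ?, ?, ?, ?)",
        (enzyme_id, field, value, pubmed_id, sentence),
    )

def values_for(conn, enzyme, field):
    """Return (value, pubmed_id) pairs recorded for one field of one enzyme."""
    return conn.execute(
        "SELECT o.value, o.pubmed_id FROM observation o "
        "JOIN enzyme e ON e.id = o.enzyme_id "
        "WHERE e.name = ? AND o.field = ? ORDER BY o.id",
        (enzyme, field),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows; the values and reference IDs are not real data.
add_observation(conn, "enzyme X", "mechanism", "placeholder A", "ref1")
add_observation(conn, "enzyme X", "mechanism", "placeholder B", "ref2")
rows = values_for(conn, "enzyme X", "mechanism")
```

Two conflicting values for the same field can coexist here, each with its own reference, which is what comparing and contrasting sources requires.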

Tools
Informatics tools
 * Automated system for updating the list of information in the public domain.
 * The list of enzymes and the reference scientific literature (e.g., from PubMed)
 * Roughly split up information according to fields:
 * Standard information such as sequence, structure, ontology, etc. from GenBank or PDB; keep it up to date
 * Decouple methods (mutant, assay, etc.) and implications (sequence, structure, models) from the literature
 * Need effective ways to make references. Even better to refer to the experimental paper
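For the automated literature update, one option is the NCBI E-utilities ESearch endpoint, which returns PubMed IDs for a query term. The sketch below builds the query URL, parses the JSON response, and picks out IDs not yet in our list. The helper names and the example search term are assumptions, and the network call is defined but not run here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=20):
    """Build an ESearch URL asking PubMed for IDs matching a query term."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

def parse_ids(payload):
    """Pull the list of PubMed IDs out of a decoded ESearch JSON response."""
    return payload.get("esearchresult", {}).get("idlist", [])

def fetch_ids(term, retmax=20):
    """Run the query over the network (defined for completeness, not called)."""
    with urlopen(esearch_url(term, retmax)) as response:
        return parse_ids(json.load(response))

def new_ids(known, fetched):
    """IDs in the fresh result that are not yet in our list, order preserved."""
    seen = set(known)
    return [pmid for pmid in fetched if pmid not in seen]

# Example term only; any enzyme name from our list could be used.
url = esearch_url("phiC31 integrase")
print(url)
```

Running `fetch_ids` on a schedule and passing the result through `new_ids` would give the papers that still need to be split up by field.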

Molecular Biology tools
 * Standard cassettes on plasmids so that we can try a wide variety of recombination sites and recombinases
 * Tunable promoters and reporter tags that report the level of recombinase
 * Method for timing the recombination process. At what time point is the recombination of DNA complete? This time scale could be much shorter than the time until the reporter of the new DNA configuration becomes observable. Can we pause/terminate recombination at different time points?
 * Some mechanism to accommodate the fact that the substrate (sites on genomic DNA) is present at very low copy number.

Notes/questions

 * Maybe we should build an entity around the experiment
 * It would be nice to decouple "claims" and "evidence", for example,
 * Claim: recombinase X has a DNA-binding domain at the C-terminus and a catalytic domain at the N-terminus
 * Evidence:
 * phylogenetic
 * reference:
 * ref1
 * ref2
 * activities of mutant
 * X-ray crystallography
 * Even just classifying the raw data is already cool. For example, given recombinase X, I can search for
 * the list of all mutant/chimeric enzymes that have been made and studied
 * the list of all alternative target sites that have been tested
 * the list of all crystal structures that have been reported
 * the list of all studies in different hosts
 * the list of all activity assays that have been done
 * Note: Usually, many publications give information in the same category (e.g., two different papers might both report part of the crystal structure of an enzyme). One publication usually gives information that belongs to more than one category (e.g., the same publication might report both an activity assay and a crystal structure).
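The claim/evidence split and the category search above can be sketched together. In this minimal example each piece of evidence records its reference, category, research group, and year, so a claim can report how independent its support is, and an index maps each category to the references that report it. All names and sample values are invented placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass(frozen=True)
class Evidence:
    """One piece of support for a claim, tied to one publication."""
    reference: str   # e.g. a PubMed ID
    category: str    # e.g. "crystal structure", "mutant activity"
    group: str       # research group that reported it
    year: int

@dataclass
class Claim:
    statement: str
    evidence: List[Evidence] = field(default_factory=list)

    def summary(self) -> Dict[str, int]:
        """Count distinct methods and groups, and the span of years."""
        if not self.evidence:
            return {"categories": 0, "groups": 0, "year_span": 0}
        years = [e.year for e in self.evidence]
        return {
            "categories": len({e.category for e in self.evidence}),
            "groups": len({e.group for e in self.evidence}),
            "year_span": max(years) - min(years),
        }

def references_by_category(claims: List[Claim]) -> Dict[str, Set[str]]:
    """Map each evidence category to the set of references reporting it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for claim in claims:
        for e in claim.evidence:
            index[e.category].add(e.reference)
    return dict(index)

# Placeholder data: references, groups, and years are invented.
claim = Claim("recombinase X has a C-terminal DNA-binding domain")
claim.evidence += [
    Evidence("ref1", "phylogenetic", "group A", 1990),
    Evidence("ref2", "mutant activity", "group B", 2005),
    Evidence("ref2", "crystal structure", "group B", 2005),
]
report = claim.summary()
index = references_by_category([claim])
```

Here `ref2` appears under two categories, matching the note that one publication often contributes to more than one category.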


 * Which public databases should we focus on?
 * Which public databases are similar to this one? How are they organized? How were they created?
 * I would suggest that we start with the systems for which we have strong references: lambda and phiC31


|}