User:Pakpoom Subsoontorn/Notebook/Genetically Encoded Memory/2008/10/14

Quantitative functional profiles of site-specific recombinase

Compile known information about site-specific recombinases.
Focus on information that will be useful for in vivo DNA manipulation. For now, let's focus on simple model organisms such as E. coli and S. cerevisiae.
Two major sources of information:
- mining the literature
- large-scale, parallel experiments

Informatics challenges

A site-specific recombination system generally consists of a group of enzymes that can cut and paste DNA at very specific sequences. As far as I know, hundreds of such systems exist, both found in nature and developed in the laboratory. They have been exploited extensively for making transgenic organisms and in gene therapy.
There is a large body of literature reporting this kind of information. The problem is that people have used different assays, different host cells, and different interpretations. It would be great if we could organize this information in one place. We would like to be able to update, compare, and systematically refer back to the source of existing information.
Moreover, we would like to know "how much we already know" and "how did we know it." Say I have two stored pieces of information: enzyme X has an N-terminal catalytic domain and a C-terminal binding domain, while enzyme Y has intermixed catalytic and binding domains. I would like a systematic way to go back and check whether these conclusions came from totally different methods, from two different research groups, and were reported 15 years apart!
The information of interest includes:
- The sequence of the recombinase enzyme and its target DNA sequence. We can probably get the protein sequence from GenBank. However, as far as I know, there is no public database for the target sequences yet. Determining the minimal target sequences is still a subject of intensive research, even for the best-known recombinase systems.
- Structural information, in particular functional domains: DNA binding domain, catalytic domain, etc.
- Mutation/chimeric studies.
- Efficiency in different host cells. What are the rates of recombination? How specific is the recombination? Is it toxic to the host cells?
- Different assays used for the studies.

Database/method features

Key features of the database:
- Quantitative description
- Capabilities to compare and contrast
  - Standardized description
- Expandable database size without disturbing the core structure
- Literature cross-referencing and updating
- Scalable experiments/measurements

Quantitative functional information

The list of information fields (need to decide: mining VS experimenting, set priority, black-box):

Enzyme name
Natural source of enzyme, host
Coding sequence for enzyme itself
Enzyme structure: amino acid sequence, functional domain,
Target DNA sequence: Natural target, specificity (sequence wobble)
Natural/synthetic topologies of the target DNA (inter/intramolecule, distance, coiling)
Required auxiliary factors:
Recombination Efficiency in standard host, natural host, etc.
Recombination speed
Studies in mutated enzyme VS target sequences
Mechanism:
Toxicity

Standard system

What should be our standard:
- Host cells
- Plasmid
- System for expressing enzyme

Parameters we want to tune

Enzyme concentration
Expression time
- How long do we have to have enzyme around?
Inter/intramolecular assay
- Intramolecular distance
- directionality of sites
Target sequence (wobble test)
Enzyme sequence (mutation study)

Note:

Type of information
- Quantitative: for example, sequences, rate, yield, boolean
- Qualitative: for example, mechanism

Database Design Phase-1

For each piece of information, link to the original article and the specific info (figure or sentences).

(experiment oriented) Enzyme-in-action: one paper generally has more than one experiment. We should also use some keywords in the fields of interest to accommodate detailed differences.

experimental-enzyme-sequence (like lambda integrase without the last 20 aa; name or seq)
original-enzyme-sequence (like lambda integrase; name or seq)
experimental-host (like E.coli K-12 without lacZ gene, with plasmid...)
original-host (like E.coli K-12)
auxiliary factor
expression switch
expression time
Intended target (sequence used for target, plasmid platform)
Readout method (like GFP signal, etc.)
Readout result
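
A minimal sketch of how one "enzyme-in-action" record might look as a data structure, just to check that the field list above hangs together. The class name, attribute names, and the `is_modified_enzyme` helper are all made up for illustration; the example values are the ones given in the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnzymeInAction:
    """One experiment from one paper (all names here are hypothetical)."""
    experimental_enzyme: str            # name or sequence actually used
    original_enzyme: str                # parent enzyme it derives from
    experimental_host: str              # strain used, including modifications
    original_host: str                  # unmodified parent strain
    auxiliary_factors: List[str] = field(default_factory=list)
    expression_switch: Optional[str] = None
    expression_time: Optional[str] = None
    intended_target: Optional[str] = None   # target sequence, plasmid platform
    readout_method: Optional[str] = None
    readout_result: Optional[str] = None
    keywords: List[str] = field(default_factory=list)  # free-text detail tags

    def is_modified_enzyme(self) -> bool:
        """True when the experiment used a variant of the original enzyme."""
        return self.experimental_enzyme != self.original_enzyme


record = EnzymeInAction(
    experimental_enzyme="lambda integrase without the last 20 aa",
    original_enzyme="lambda integrase",
    experimental_host="E. coli K-12 without lacZ gene",
    original_host="E. coli K-12",
    readout_method="GFP signal",
)
print(record.is_modified_enzyme())
```

Keeping the experimental and original enzyme/host as separate attributes makes it possible to group all variants of the same parent enzyme later without parsing free text.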

Fields

Enzyme name
Natural source of enzyme, host
Coding sequence for enzyme itself
Enzyme structure: amino acid sequence, functional domain,
Target DNA sequence: Natural target,
Natural/synthetic topologies of the target DNA
Required auxiliary factors:
Recombination Efficiency in standard host, natural host, etc.
Recombination speed
Studies in mutated enzyme VS target sequences
Mechanism:
Pubmed ID reference
Abstract
Sentence
Raw Data
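
One possible way to store these fields so that every value carries its own reference, sketched with Python's built-in `sqlite3`. The table names, column names, and helper functions are assumptions for illustration only, and the values inserted at the end are placeholders, not real data.

```python
import sqlite3

# Hypothetical two-table layout: one row per enzyme, one row per sourced value.
SCHEMA = """
CREATE TABLE enzyme (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE field_value (
    id        INTEGER PRIMARY KEY,
    enzyme_id INTEGER NOT NULL REFERENCES enzyme(id),
    field     TEXT NOT NULL,  -- e.g. 'mechanism', 'recombination_speed'
    value     TEXT NOT NULL,
    pubmed_id TEXT,           -- source article
    sentence  TEXT            -- supporting sentence or figure label
);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_value(conn, enzyme, field, value, pubmed_id=None, sentence=None):
    """Store one value for one field, together with where it came from."""
    conn.execute("INSERT OR IGNORE INTO enzyme(name) VALUES (?)", (enzyme,))
    (enzyme_id,) = conn.execute(
        "SELECT id FROM enzyme WHERE name = ?", (enzyme,)
    ).fetchone()
    conn.execute(
        "INSERT INTO field_value(enzyme_id, field, value, pubmed_id, sentence) "
        "VALUES (?, ?, ?, ?, ?)",
        (enzyme_id, field, value, pubmed_id, sentence),
    )


def values_for(conn, enzyme, field):
    """Return every (value, pubmed_id) pair recorded for an enzyme's field."""
    return conn.execute(
        "SELECT fv.value, fv.pubmed_id FROM field_value fv "
        "JOIN enzyme e ON e.id = fv.enzyme_id "
        "WHERE e.name = ? AND fv.field = ? ORDER BY fv.id",
        (enzyme, field),
    ).fetchall()


conn = open_db()
# Placeholder entries only; no real measurements or PubMed IDs.
add_value(conn, "enzyme X", "mechanism", "placeholder value A", "PMID-PLACEHOLDER-1")
add_value(conn, "enzyme X", "mechanism", "placeholder value B", "PMID-PLACEHOLDER-2")
rows = values_for(conn, "enzyme X", "mechanism")
print(rows)
```

Storing one row per (field, value, reference) instead of one column per field lets new fields be added without changing the core structure, and keeps conflicting reports from different papers side by side.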

Tools

Informatics tools

Automated system for updating the list of information in the public domain:
- The list of enzymes and the reference scientific literature (i.e. from PubMed)
- Roughly split up information according to fields
- Standard information such as sequence, structure, ontology, etc. from GenBank or PDB; keep it up to date
Decouple methods (mutant, assay, etc.) and implications (sequence, structure, models) from the literature.
Need effective ways to make references. Even better to refer to the experimental paper.
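
As a starting point for the automated literature list, here is a rough sketch that only builds a PubMed search URL for the NCBI E-utilities `esearch` endpoint; it does not send any request. The function name and the way the search term is composed (enzyme name plus the word "recombinase") are my assumptions, and a real query would likely need a more careful search term.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(enzyme_name, retmax=20):
    """Build (but do not fetch) an esearch URL listing PubMed IDs for an enzyme.

    The search term format is an illustrative assumption.
    """
    params = {
        "db": "pubmed",
        "term": f"{enzyme_name} recombinase",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = pubmed_search_url("phiC31")
print(url)
```

Running such a query on a schedule for each enzyme in the list, and comparing the returned IDs with the stored ones, would be one simple way to spot new papers.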

Molecular Biology tools

Standard cassettes on plasmids so that we can try a wide variety of recombination sites and recombination enzymes
Tunable promoter and some reporter tags that report the levels of recombinase
Method for timing the recombination process. At what time point is the recombination of DNA complete? This time scale could be much shorter than the time until the reporter of the new DNA configuration becomes observable. Can we pause/terminate recombination at different time points?
Some mechanism to accommodate the fact that the substrate (sites on genomic DNA) is present at very low copy number.

Notes/questions

Maybe we should build an entity around the experiment.
It would be nice to decouple "claim" and "evidence", for example,
- Claim: recombinase X has a DNA binding domain at the C-terminus and a catalytic domain at the N-terminus
- Evidence:
  - phylogenetic
    - reference:
      1. ref1
      2. ref2
  - activities of mutants
  - X-ray crystallography
Even just the classification of raw data alone is already cool. For example, given recombinase X, I can search for
- the list of all mutant/chimeric enzymes that have been made and studied
- the list of all alternative target sites that have been tested
- the list of all crystal structures that have been reported
- the list of all studies in different hosts
- the list of all activity assays that have been done
Note: Usually, many publications give information in the same category (e.g., two different papers might both report part of the crystal structure of an enzyme). One publication usually gives information that belongs to more than one category (e.g., the same publication might report both activity assays and a crystal structure).
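
The claim/evidence split and the category search could be sketched roughly as follows. All class and function names are hypothetical, and `ref1`, `ref2`, `ref3` are placeholder citation keys, not real references.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    category: str          # e.g. "phylogenetic", "crystal structure"
    references: List[str]  # placeholder citation keys


@dataclass
class Claim:
    enzyme: str
    statement: str
    evidence: List[Evidence] = field(default_factory=list)


def index_by_category(claims):
    """Map (enzyme, evidence category) to the sorted list of references."""
    index = defaultdict(set)
    for claim in claims:
        for ev in claim.evidence:
            index[(claim.enzyme, ev.category)].update(ev.references)
    return {key: sorted(refs) for key, refs in index.items()}


claim = Claim(
    enzyme="recombinase X",
    statement="DNA binding domain at C-terminus, catalytic domain at N-terminus",
    evidence=[
        Evidence("phylogenetic", ["ref1", "ref2"]),
        Evidence("activities of mutants", ["ref2"]),   # same paper, second category
        Evidence("X-ray crystallography", ["ref3"]),
    ],
)
index = index_by_category([claim])
print(index[("recombinase X", "phylogenetic")])
```

Because references hang off the evidence rather than the claim, one publication can appear under several categories and several publications can share one category, which matches the note above.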

Which public databases should we focus on?
Which public databases are similar to this one? How are they organized? How were they created?
I would suggest that we start with the systems for which we have strong references first: lambda and phiC31
