Revision as of 18:25, 13 February 2008

Data Exchange Standards

This working group aims to define standards for the description of biobricks and formats / technologies for the exchange (or networking) of biobrick-related data.

This breaks down into the following questions (discuss and answer!):

0. Aim / Application scenarios

1. What is a Biobrick?

2. What is the data model needed to describe a biobrick?

3. What is the best format / technology for exchange?

Aim / Application scenarios for this standard

Application scenarios [please discuss]:

data exchange between local / central part registries

_{Example: "We have a local registry and want to publish the finished Biobricks to the MIT registry."
See [[ http://brickit.wiki.sourceforge.net/ | BrickIt project]] for an example local registry system.}

download biobrick data into local computer programs

_{Example: "We want to simulate the behavior of device X and Y with the Gepasi program." or "We want to develop bio-circuit design programs."}

find suitable parts

_{Example: "I need a 10-fold PoPs amplifier (input range 0 - 8 PoPs) that works in S. cerevisiae at 25 C temperature; response time doesn't matter but protein production load needs to stay below 100000 AA consumed; Sub-components must not interfere with the MAPK pathway [enter reactions]."}

distributed annotation of Biobricks

_{Example: "We have measured the toxicity of 1000 BioBricks from MIT and two other registries. Can we cross-link this data with the registry?"}
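The "find suitable parts" scenario above can be sketched as a simple filter over part records. This is only an illustration: the `PartEntry` fields, the `find_amplifiers` helper and the demo entries are all hypothetical, and the temperature and pathway-interference constraints from the example are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartEntry:
    """Hypothetical registry entry; field names are assumptions."""
    part_id: str
    chassis: str
    gain: float       # output PoPs per input PoPs
    input_max: float  # upper end of the supported input range, in PoPs
    aa_load: int      # amino acids consumed by protein production


def find_amplifiers(entries: List[PartEntry], chassis: str, gain: float,
                    input_max: float, max_aa_load: int) -> List[str]:
    """Return the IDs of entries that satisfy all four constraints."""
    return [e.part_id for e in entries
            if e.chassis == chassis
            and e.gain >= gain
            and e.input_max >= input_max
            and e.aa_load < max_aa_load]


# Invented demo data, not real registry content.
registry = [
    PartEntry("demo_amp_1", "S. cerevisiae", 10.0, 8.0, 60_000),
    PartEntry("demo_amp_2", "S. cerevisiae", 10.0, 8.0, 250_000),
    PartEntry("demo_amp_3", "E. coli", 10.0, 8.0, 40_000),
]

hits = find_amplifiers(registry, "S. cerevisiae", 10.0, 8.0, 100_000)
print(hits)
```

A query like this only works if the exchanged data carries such parameters in a machine-readable form, which is what the data model questions below are about.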

What is a Biobrick?

Definition

A final definition is beyond the scope of this group. For data exchange purposes we adopt the following draft:

BioBricks™ are standard DNA parts that encode basic biological functions (see BBF home).
A BioBrick has a unique DNA sequence.
Basic parts are defined by this DNA sequence.
Composite parts are defined as a "sequence" of basic BioBricks, along with intervening "scar" sequences.
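A minimal Python sketch of this draft definition follows. The class names `BasicPart` and `CompositePart`, the demo parts, and the scar string are assumptions made for illustration; the actual scar depends on the assembly standard in use.

```python
from dataclasses import dataclass
from typing import Tuple

# Placeholder scar sequence, assumed for illustration only.
SCAR = "TACTAGAG"


@dataclass(frozen=True)
class BasicPart:
    """A basic part is defined by its DNA sequence."""
    part_id: str
    sequence: str


@dataclass(frozen=True)
class CompositePart:
    """A composite part is an ordered series of basic parts joined by scars."""
    part_id: str
    components: Tuple[BasicPart, ...]
    scar: str = SCAR

    @property
    def sequence(self) -> str:
        # Concatenate the component sequences with a scar between neighbours.
        return self.scar.join(p.sequence for p in self.components)


# Invented demo parts, not real registry entries.
promoter = BasicPart("demo_promoter", "TTGACA")
rbs = BasicPart("demo_rbs", "AGGAGG")
device = CompositePart("demo_device", (promoter, rbs))
print(device.sequence)
```

Deriving the composite sequence from its components keeps the "one part, one sequence" rule intact without storing the full sequence twice.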

Issue: BioBrick formats

(Raik) You can have the "same" Biobrick in different formats, e.g. with prefix/suffix from one of the two suggested protein fusion formats. Now the sequence is exactly the same, but having a sample of biobrick X with biofusion flanks may be of no use if the other biobricks in your freezer are formatted differently. *Does a different prefix / suffix create a different biobrick?* To the assembling experimentalist in the lab it does; to the user of gene synthesis it doesn't really; the system designer or analyst couldn't care less...

Issue: closely related BioBricks

(Mac) Should there be a one-to-one relationship between a part's functional definition and its sequence? What if you introduce a silent mutation into a BioBrick - is there a "different sequence, different part" doctrine, even if the two are functionally equivalent? ... Is this a source code vs. compiled code issue?

(Raik) Right now we seem to follow the unspoken rule that a part is defined by its exact DNA sequence. Any modification creates a new part, which is kind of logical to the experimentalist because it maps a biobrick to exactly one DNA fragment (which you either have in your freezer or not) and vice versa. Options:

keep/fix the sequence-based definition but introduce relations like "ortholog to", "equivalent to", etc.
define "reference biobricks" and link variants to them
find a more abstract definition ... and create the concept of BB 'implementation' or 'instance'.

(Mac) Perhaps we could do both? Assuming a biobrick always has one and only one DNA sequence, perhaps we could build the data model to support organizing biobricks into families or sets of functionally related parts? Each family could have one canonical biobrick associated with it that works, is available, and exemplifies the function that the family is supposed to have.
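The family idea could look roughly like this in Python. It is a sketch under assumptions: `PartFamily`, its methods and the demo part IDs are hypothetical names, and the rule "one part ID, one sequence" is enforced with a simple check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PartFamily:
    """A set of functionally related parts with one canonical member."""
    name: str
    # part ID -> DNA sequence; each part has exactly one sequence
    members: Dict[str, str] = field(default_factory=dict)
    canonical_id: Optional[str] = None

    def add(self, part_id: str, sequence: str, canonical: bool = False) -> None:
        # Refuse to attach a second, different sequence to an existing ID.
        if part_id in self.members and self.members[part_id] != sequence:
            raise ValueError(f"{part_id} already has a different sequence")
        self.members[part_id] = sequence
        if canonical:
            self.canonical_id = part_id

    def variants(self) -> List[str]:
        """IDs of all members other than the canonical one."""
        return sorted(pid for pid in self.members if pid != self.canonical_id)


family = PartFamily("demo reporter")
family.add("demo_A", "ATGGCTAGC", canonical=True)
family.add("demo_B", "ATGGCAAGC")  # silent mutation: GCT -> GCA (both Ala)
print(family.canonical_id, family.variants())
```

Here the silent-mutation variant stays a separate part with its own ID, but the family links it to the canonical part.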

What is the data model needed to describe a biobrick?

Following Ralph's and Barry's mails, Raik suggests splitting this into the following sub-topics (re-organize at leisure).

minimal Biobrick information

The set of minimal information aims to (1) uniquely identify a biobrick, (2) provide sufficient detail for its application and handling in the lab and during assembly, (3) describe its origin/source and references for human study.

unique ID
DNA sequence / basic building blocks
format ??? (see issue above)
short description for humans
long description for humans
target chassis
"collaborating"/complementing biobricks if any
feature annotation

experience flag
? bug tracker ?
? version / supersedes / history ?

source GenBank ID if applicable (with position?)
source organism
source lab/person
references (web / literature)
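As a rough illustration, the less contested items of this list could be collected in a record like the following. The class name, the field names and the choice of which fields count as required are assumptions; the open items (format, experience flag, bug tracker, version history) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MinimalPartRecord:
    # (1) identification
    part_id: str
    sequence: str
    # (2) application and handling
    short_description: str
    long_description: str = ""
    target_chassis: List[str] = field(default_factory=list)
    complementing_parts: List[str] = field(default_factory=list)
    # (3) origin and references
    source_organism: Optional[str] = None
    source_lab: Optional[str] = None
    references: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of the (assumed) required fields that are empty."""
        required = ("part_id", "sequence", "short_description")
        return [name for name in required if not getattr(self, name)]


# Invented demo record with an empty short description.
record = MinimalPartRecord(part_id="demo_0001", sequence="AAACCCGGG",
                           short_description="")
print(record.missing_fields())
```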

Biobrick classification

Categorization and anything that helps (1) fishing this part out of the registry and (2) deciding what extra information may be needed.

Intrinsic Classification

Intrinsic classification covers those aspects of Biobrick classification which are defined by the Biobricks themselves. For these the primary focus is defining the vocabularies used to describe Biobricks to the outside world. Broadly speaking, this can include:

Identifiers
Biobrick taxonomy: defining types or species of Biobricks based on composition, function, etc.

Possible intrinsic classifiers include:

DNA category: [ AA coding, RNA-coding[m-/t-/nc-/mi-/si-], regulatory [promoter,rbs,terminator,enhancer], unknown, ...]
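One way to turn this classifier into a controlled vocabulary is an enumeration with allowed sub-types. The sketch below is hypothetical: the names `DnaCategory`, `SUBTYPES` and `classify` are not part of any agreed standard, and only the categories listed above are included.

```python
from enum import Enum


class DnaCategory(Enum):
    AA_CODING = "AA coding"
    RNA_CODING = "RNA coding"
    REGULATORY = "regulatory"
    UNKNOWN = "unknown"


# Sub-types listed in the draft, keyed by top-level category.
SUBTYPES = {
    DnaCategory.RNA_CODING: ("m", "t", "nc", "mi", "si"),
    DnaCategory.REGULATORY: ("promoter", "rbs", "terminator", "enhancer"),
}


def classify(category: DnaCategory, subtype: str = "") -> str:
    """Return a 'category/subtype' label, rejecting unknown sub-types."""
    allowed = SUBTYPES.get(category, ())
    if subtype and subtype not in allowed:
        raise ValueError(f"{subtype!r} is not a sub-type of {category.value}")
    return f"{category.value}/{subtype}" if subtype else category.value


print(classify(DnaCategory.REGULATORY, "promoter"))
```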

Extrinsic Classification

Extrinsic classification refers to those aspects of Biobrick classification which are attributed to Biobricks from external sources or references. The focus is defining the vocabularies for those aspects of the outside world which are related to biobricks.

Functional Performance Parameters
Function...

Characterization

Quantitative data about the part that is important for the design and implementation of devices containing it:

A) independent of the part's category:

genetic stability
?

B) depending on the part's category:

Static device behavior
Dynamic device behavior
Device compatibility (with other devices, environmental conditions etc.)
Device interactions (including quantitative data)
Device reliability (RNA half-life, protein half-life)
Power requirements of the device

Further annotation

Higher level descriptions for automated design & simulation?
references to High-throughput data ?
references to outside, non-standardized information about this part

What is the best format / technology for exchange?

Once the data model is firmly in place, the format should follow as the one that best implements that data model. For example, if we settle on an RDF-like 'everything is a relationship triplet' approach, then some format that can handle these triplets would be most appropriate. In addition, with a model like this, there are XML-based and more human-readable formats that can both implement the model equally well.

I think that tying ourselves to a format too early will keep us from forming a clear model, and will lead us to hack up the format to fit. It is best to do the model first, then the format.
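To make the "everything is a relationship triplet" idea concrete, here is a small format-independent sketch in Python. The part IDs, the property names and the `match` helper are illustrative assumptions; a real implementation would more likely use full URIs and an existing RDF library.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Invented demo statements about two parts.
triples = {
    ("BBa_0001", "type", "biobrick"),
    ("BBa_0001", "sequence", "AAACCCGGG"),
    ("BBa_0001", "similarTo", "BBa_0003"),
    ("BBa_0003", "type", "biobrick"),
}


def match(store: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in store:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


biobricks = sorted(t[0] for t in match(triples, p="type", o="biobrick"))
print(biobricks)
```

The same set of triples could be written out as XML or as a more readable text notation, which is why the model can be settled before the format.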

Suggestions

Please fill in these sections with details

create a new XML format

adapt existing CellML, SBML XML formats

create a custom file format

use Turtle/N3 notation for semantic web documents

Example of N3

I somewhat share the reservation about completely new file formats, but the non-readability and general nastiness of XML is also an issue. A good solution, IMO, would be to use the Notation3 format developed by the semantic web folks. It is concise, human-readable and editable (I used it myself some years ago) *AND* is equivalent to the XML form of RDF. That means there is a well-defined translation back and forth, and many libraries and tools do the conversion. Being semantic web, it also solves the linking problem (everything is a link).

Quick Example:

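# shortcut definitions for frequently used resources ...
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix bbf: <http://biobricks.org/ontology/1.1/>.
@prefix harvard: <http://harvard.edu/registry/parts#>.
# the empty prefix refers to parts defined in this document
@prefix : <#>.

# define a biobrick hosted at this address
# (several objects of one property are separated by commas, without brackets)
:BBa_0001
       rdf:type        bbf:biobrick;
       bbf:sequence    "AAACCCGGG";
       bbf:similarTo   :BBa_0003, harvard:BBa_J1000, :BBa_00010.

# add information to a biobrick defined elsewhere
# (sameAs belongs to the OWL vocabulary, not to rdf:)
harvard:HBB_J1000
       owl:sameAs      :BBa_0001.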

OK, one can argue about human-readability, but it's at least possible to understand and edit these documents (and much better than the equivalent XML).

The BioBricks Foundation:Standards/Technical: Difference between revisions