The BioBricks Foundation:Standards/Technical/Exchange/Old Discussion

Biobrick Data Exchange Standards: This working group aims to define formats / technologies for the description of biobricks and the exchange (or networking) of biobrick-related data. This document is part of the ongoing discussion on the technical standards mailing list. The main questions to tackle are:


 * Aim -- goal and application scenarios for this standard
 * Biobrick definition -- What is a Biobrick?
 * Data model -- What is the data model needed to describe a biobrick?
 * Technology -- What is the best format / technology for exchange?

= Aim / Application scenarios for this standard =

Concise aims
[please add]

Application scenarios [please discuss]
Example: "We have a local registry and want to publish the finished Biobricks to the MIT registry." See BrickIt project for an example local registry system.
 * data exchange between local / central part registries

Example: "Please send me DNA samples of the Biobricks you have constructed for your recent paper!"
 * biobrick sample exchange

Example: "I need a 10-fold PoPs amplifier (input range 0 - 8 PoPs) that works in S. cerevisiae at 25 C temperature; response time doesn't matter but protein production load needs to stay below 100000 AA consumed; Sub-components must not interfere with the MAPK pathway [enter reactions]."
 * find suitable parts

Example: "I want to make a bioinformatic analysis of all RNA-biobricks in the MIT registry." Example: "I want to write a Biobrick DNA design program."
 * download biobrick data into local computer programs

Example: "We want to simulate the behavior of device X and Y with the Gepasi program." or "We want to develop simulation-based bio-circuit design programs."
 * simulate Biobrick devices & simulation-aided design

Example: "We have measured the toxicity of 1000 BioBricks from MIT and two other registries. Can we cross-link this data with the registry?"
 * distributed annotation of Biobricks

An open-source-style license will most likely need to be anchored in an exact description of a Biobrick or Biobrick device. Copyright may be applicable if this description fulfills certain criteria. This may in turn make it possible to extend the protection into areas that do not fall under copyright.
 * legal protection of open exchange

= What is a Biobrick? =

Biobrick Definition
A final definition is beyond the scope of this group. For data exchange purposes we suggest the following draft:


 * A BioBrick™ is a standardized, continuous DNA sequence encoding a basic biological function (see the BBF home page).
 * A BioBrick has a unique DNA sequence.
 * Basic Biobricks are defined by this DNA sequence.
 * Composite Biobricks are defined as "sequence" of Basic BioBricks, along with intervening "scar" sequences.
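The draft definition above can be sketched as a small data structure. This is only an illustration: the class names, the part IDs and the example sequences are hypothetical, and the default scar is an assumption (the 8-bp scar of standard BioBrick assembly); other formats leave other scars, so it is kept as a parameter.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: scar left by standard BioBrick assembly. Other formats differ,
# so the scar is a field of the composite rather than a fixed constant.
DEFAULT_SCAR = "TACTAGAG"


@dataclass(frozen=True)
class BasicBiobrick:
    part_id: str
    sequence: str  # a basic Biobrick is defined by its DNA sequence


@dataclass(frozen=True)
class CompositeBiobrick:
    part_id: str
    parts: Tuple[BasicBiobrick, ...]  # ordered "sequence" of basic Biobricks
    scar: str = DEFAULT_SCAR

    @property
    def sequence(self) -> str:
        # Join the basic sequences with the intervening scar sequence.
        return self.scar.join(p.sequence for p in self.parts)


# Hypothetical parts, for illustration only.
promoter = BasicBiobrick("BBx_0001", "TTGACA")
rbs = BasicBiobrick("BBx_0002", "AGGAGG")
device = CompositeBiobrick("BBx_0003", (promoter, rbs))
print(device.sequence)  # TTGACATACTAGAGAGGAGG
```

Because the composite stores references to its basic parts rather than a flat string, the same composite can be re-rendered for a different format by changing only the scar.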

Background
-- Part / Standard Biological Part / Biobrick (Reshma Shetty):


 * A biological part is a DNA sequence that encodes a basic biological function.
 * Biological parts that conform to prescribed technical standards are standard biological parts.
 * Parts that conform to the BioBrick assembly standard are BioBrick standard biological parts. Additional technical standards defining BioBrick standard biological parts will be set via an open standards setting process led by The BioBricks Foundation.

(Note: parts do not include either the plasmid that propagates that part or sequences whose function is to help physically compose parts (e.g., the BioBrick prefix and suffix). Parts can come in different formats: the BioBrick standard, the BamHI/BglII standard promulgated by folks at Berkeley, no "scar" sequences whatsoever because of direct synthesis, etc.)

Some confusion arises because in the Registry of Standard Biological Parts, everything with a BioBrick part number is called a part. The Registry distinguishes between basic parts (equivalent to "biological part" definition above) and composite parts (combinations of two or more basic parts). Based on the definitions above, a device can be either a basic part or a composite part. But not all basic or composite parts are devices.

It is not clear whether we need to reconcile the Registry's terminology with that of parts versus devices described above or if both sets of definitions can stand in parallel since they serve slightly different purposes.

Issue: BioBrick formats
(Raik) We can have the "same" Biobrick in different formats, e.g. with prefix/suffix from one of the two suggested protein fusion formats. Now the sequence is exactly the same, but having a sample of biobrick X with biofusion flanks may be of no use if the other biobricks in your freezer are formatted differently. *Does a different prefix / suffix create a different biobrick?* To the assembling experimentalist in the lab it does; to the user of gene synthesis it doesn't really; the system designer or analyst couldn't care less...

Device definition

 * Devices are combinations of one or more parts that have a human-defined function.
 * Some devices can be encoded in a single stretch of DNA (a basic or composite part), others encompass disconnected parts (e.g. encoded in two different locations, possibly even cells).
 * (suggestion Reshma) Devices expose specified interfaces for their functional connection with other devices (example: PoPS)
 * (suggestion Raik) A Biobrick device is defined by a unique combination of unique Biobricks

Reshma Shetty proposes defining devices based upon commonly recognized biochemical signal carriers, writing:

I prefer an even more specific definition in which devices not only have human-defined function but also are functionally composable using common signal carriers such as transcription rate (PoPS), translation rate (RiPS), phosphorylations per second etc. Devices can have an input of a common signal carrier, an output of a common signal carrier or both. For instance, a promoter is both a part (its basic biological function is to initiate transcription) and a device (its human-defined function is a PoPS source and it produces the common signal carrier PoPS). A transcriptional inverter is a device since logical "NOT" is a human-defined operation and an inverter has an input of PoPS and an output of PoPS.

Biobrick & Device families
Suggestion -- Concluding from frequent discussions on the mailing list:


 * Biobricks can be part of one or more Biobrick families
 * Devices can be part of one or more Device families
 * families can be themselves part of one or more families (allowing flexible, open hierarchies)
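A minimal sketch of such an open hierarchy, assuming families are stored as a mapping from each family to its direct parents. All family names and the part ID are hypothetical. Since a family may have several parents, the lookup walks a graph rather than a tree.

```python
from typing import Dict, Set

# family -> direct parent families (hypothetical names)
FAMILY_PARENTS: Dict[str, Set[str]] = {
    "regulatory DNA": set(),
    "yeast parts": set(),
    "promoters": {"regulatory DNA"},
    "constitutive yeast promoters": {"promoters", "yeast parts"},
}

# Biobrick -> families it belongs to directly (hypothetical)
BRICK_FAMILIES: Dict[str, Set[str]] = {
    "BBx_0001": {"constitutive yeast promoters"},
}


def all_families(brick_id: str) -> Set[str]:
    """Return every family a Biobrick belongs to, directly or via parents."""
    seen: Set[str] = set()
    todo = list(BRICK_FAMILIES.get(brick_id, ()))
    while todo:
        family = todo.pop()
        if family not in seen:
            seen.add(family)
            todo.extend(FAMILY_PARENTS.get(family, ()))
    return seen


print(sorted(all_families("BBx_0001")))
# ['constitutive yeast promoters', 'promoters', 'regulatory DNA', 'yeast parts']
```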

(Mac) Should there be a one-to-one relationship between a part's functional definition and its sequence? What if you introduce a silent mutation into a BioBrick - is there a "different sequence, different part" doctrine, even if the two are functionally equivalent? ... Is this a source code vs. compiled code issue?

(Raik) Right now we seem to follow the unspoken rule that a part is defined by its exact DNA sequence. Any modification creates a new part, which is logical to the experimentalist because it maps a biobrick to exactly one DNA fragment (which you either have in your freezer or not) and vice versa. Options:

 * keep/fix the sequence-based definition but introduce relations like "ortholog to", "equivalent to", etc.
 * define "reference biobricks" and link variants to them
 * find a more abstract definition ... and create the concept of BB 'implementation' or 'instance'.

(Mac) Perhaps we could do both? Assuming a biobrick always has one and only one DNA sequence, perhaps we could build the data model to support organizing biobricks into families or sets of functionally related parts? Each family could have one canonical biobrick associated with it that works, is available, and exemplifies the function that the family is supposed to have.

= What is the data model needed to describe a biobrick? =

Following Ralph's and Barry's mails, Raik suggests splitting this into the following sub-topics (re-organize at leisure).

minimal Biobrick information
The set of minimal information aims to (1) uniquely identify a biobrick, (2) provide sufficient detail for its application and handling in the lab and during assembly, (3) describe its origin/source and references for human study.


 * unique ID
 * DNA sequence /sequence of basic building blocks
 * format [specifying: prefix, suffix, scar, name, description]
 * short description for humans
 * long description for humans
 * target chassis


 * experience record (works in ... chassis, in the hands of ...) [to be sorted out]


 * source GenBank ID if applicable (with position?)
 * source organism
 * source lab/person


 * references (web / literature)
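As a rough illustration, the minimal information could be checked with a few lines of code before a record is exchanged. The field names below are hypothetical stand-ins for the items listed above; the optional GenBank ID and the still unsettled experience record are left out of the required set.

```python
# Hypothetical field names for the minimal information listed above.
REQUIRED_FIELDS = (
    "id",
    "sequence",
    "format",
    "short_description",
    "long_description",
    "target_chassis",
    "source_organism",
    "source_lab",
    "references",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# An incomplete example record (all values are made up).
record = {
    "id": "BBx_0001",
    "sequence": "TTGACA",
    "format": {"name": "example format"},
    "short_description": "example promoter",
    "target_chassis": "S. cerevisiae",
}
print(missing_fields(record))
# ['long_description', 'source_organism', 'source_lab', 'references']
```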

some extended fields [useful but less obvious]

 * sequence feature annotation (for GenBank-formatted export)
 * ? bug tracker ?
 * ? version / supersedes / history ?
 * parent device(s)
 * "collaborating"/complementing biobricks if any

Biobrick classification
Categorization and anything that helps (1) fishing this part out of the registry and (2) deciding what extra information may be needed. Depending on the data exchange format, classification may be realized by grouping into families (see above).

suggestion: Biobrick class / family
Biobrick class (example, Promoter): The corresponding RDF terms are in "[]"


 * unique ID (URI in case of RDF)
 * parent classes (example, regulatory DNA / non-coding DNA), [rdfs:subClassOf]
 * human short description
 * human long description
 * obligate/guaranteed property types of member Biobricks (e.g. transcription factor) [owl:Restriction]

Biobrick family (example, constitutive yeast promoters for Zn-finger TF):


 * unique ID (URI in case of RDF)
 * human short description
 * human long description
 * type of family -> Biobrick class [rdf:type]
 * parent Biobrick families -> Biobrick family
 * properties valid for all members (e.g. transcription_factor -> constitutive zinc finger yeast TF)

Biobrick:


 * families -> Biobrick family (example constitutive yeast promoters for Zn-finger TF)

Notes: (1) The same scheme could be used for device classification. (2) Implementation is straightforward for RDF-based formats but verges on nightmarish for a relational database schema. (3) The BBF should define some base classes of families that are used in the central registry. See also mailing list discussion:

(Josh) I think that as a general statement the phrase "basic biological function" is fine, but what would be most helpful in the standards setting is precisely defining standardized basic functions which parts can have, and the properties which must be defined for parts belonging to these functions.

Suggestion: hierarchical IDs 

(Jack) (1) Associate the ID numbers with the precise sequences ... TTL Databook approach to classifying digital circuit packages. The IDs of TTL devices encode their function, format, tolerance, and package in different portions of the ID.

(Raik) This boils down to a Biobrick / Biobrick-family distinction where one can browse the catalog at different levels of the family hierarchy. For the task ahead, I would prefer this family tree to be flexible (multiple inheritance) rather than strictly hierarchical. The last two tiers of families could be:


 * variants of the same (reference) Biobrick
 * versions of each variant
 * format versions of each version

Intrinsic Classification
Intrinsic classification covers those aspects of Biobrick classification which are defined by the Biobricks themselves. For these the primary focus is defining the vocabularies used to describe Biobricks to the outside world. Broadly speaking, this can include:


 * Biobrick taxonomy: defining types or species of Biobricks based on composition, function, etc.

Possible intrinsic classifiers include:


 * DNA category: [ AA coding, RNA-coding[m-/t-/nc-/mi-/si-], regulatory [promoter,rbs,terminator,enhancer], unknown, ...]
 * part category: [ basic, composite ]
 * implementation status: [ planning, building, sequence-verified, function-verified, works ]

Extrinsic Classification
Extrinsic classification refers to those aspects of Biobrick classification which are attributed to Biobricks from external sources or references. The focus is defining the vocabularies for those aspects of the outside world which are related to biobricks.


 * Function: GO identifiers
 * Structure: PFam / Smart protein domains (but see: further annotation)

Biobrick Characterization
Quantitative data about the part, important for (1) design and (2) implementation of devices containing it but also for (3) simulation (+design) in network models.

A) for device implementation

 * Device reliability (RNA half-life, protein half-life)
 * Device stability (genetic stability)
 * Device compatibility (with other devices, environmental conditions etc.)

B) for device simulation & design

 * Static device behavior
 * Dynamic device behavior
 * Device interactions (including quantitative data)
 * Power requirements of the device
 * Device reactions + reaction rates

C) Further annotation

 * references to High-throughput data ?
 * references to outside, non-standardized information about this part

= What is the best format / technology / architecture for exchange? =

Model-first: There is a concern that committing to a format too early will leave us without a clear model in mind and cause us to hack up the format. Model-parallel: On the other hand, the technology choices also matter for the data model discussion -- different technologies imply different possibilities but also different costs.

Proposed Architectures
For reference, I'm considering a piece of web-accessible software, like the MIT Registry or BrickIt, that has BB data in some sort of persistence layer (be it a relational DB, an object DB, an XML store, a hash store like CouchDB/SimpleDB, or a triple store), offers a human-facing UI, and a programmatic interface for 3rd party software integration that allows *read/write access* with authentication and authorization rules. (See section on 'Application Scenarios' above)

XML/DB backend, REST API
The REST architecture has become popular as a clean approach to point-to-point data exchange across the web (see the 'say no to web services' thread). REST is a simpler approach to data access than SOAP; it is easy to work with since it's simply HTTP (GET + POST), and software support is plentiful.


 * Wikipedia article
 * Introduction to REST Web Services
 * Ch 05 of Fielding's thesis (theory behind REST)

This approach involves a layer of abstraction over the persistence layer. The disadvantage, compared to offering a straight-up SQL/etc. interface, is the additional step of writing that layer. However, you'll have to design a layer of abstraction anyhow for the UI (such as a web application serving HTML), and frameworks such as Django and Rails can make it easy to expose alternative content types (XML, JSON) in parallel with your human-consumable HTML data views.


 * Rails resource_controller plugin
 * Django rest interface

Advantage: you get to decouple the internal representation from the public view. The underlying data store (database, schema, etc.) can be modified without breaking the interface that your clients are using. It also allows your application to perform data validation, and allows you to write that in the higher-level language of your application rather than in SQL triggers/keys. Also, you do not have to repeat this validation logic across both your application and the database. It also affords you more power in the authentication/authorization department than simple database logins. This approach (doing validation/auth in the application layer) is that of an Application Database and essentially precludes you from offering a raw SQL interface.
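To make the decoupling concrete, here is a minimal read-only sketch using only the Python standard library instead of Django or Rails. The URL layout (`/parts/<id>`), the in-memory `PARTS` store, and the part ID are all assumptions for illustration; a real registry would put its database behind `get_part`, and clients would never see that layer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the persistence layer (hypothetical data).
PARTS = {"BBx_0001": {"sequence": "TTGACA", "description": "example promoter"}}


def get_part(path):
    """Map a request path such as /parts/BBx_0001 to (status, JSON body)."""
    prefix = "/parts/"
    part_id = path[len(prefix):] if path.startswith(prefix) else None
    if part_id in PARTS:
        return 200, json.dumps({"id": part_id, **PARTS[part_id]})
    return 404, json.dumps({"error": "not found"})


class PartHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The public view is produced here, independent of how PARTS is stored.
        status, body = get_part(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("localhost", port), PartHandler).serve_forever()
```

Calling `serve()` and then requesting `http://localhost:8000/parts/BBx_0001` would return the record as JSON. Write access, validation and authentication would be added in the same layer.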

Triple backend, SPARQL/SPARUL API
If, on the other hand, we elect a triple-based storage format, query languages such as SPARQL and SPARQL/Update (aka SPARUL) offer great power.


 * SPARQL in a nutshell - presentation
 * SPARQL/Update

Note that, with this approach, the tool could expose the underlying RDF as a SPARQL/SPARUL endpoint, and both the application's web interface and the API interface could work against that. The point here is that triples are likely flexible enough to withstand a "schema change" and providing a SPARQL-adhering endpoint is a layer of abstraction that allows you to swap out the underlying triple store if necessary. I am not sure how authentication/authorization and data validation happen in this scenario, as I am less familiar with it.
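The flexibility argument can be illustrated without any triple-store software. The sketch below is not SPARQL; it only mimics the triple-pattern matching that a SPARQL query is built from, and all identifiers and property names are made up. Note that adding a new kind of annotation is just another triple and needs no schema change.

```python
# A toy triple set: (subject, predicate, object). All names are hypothetical.
TRIPLES = {
    ("BBx_0001", "rdf:type", "bbf:biobrick"),
    ("BBx_0001", "bbf:memberOf", "promoters"),
    ("BBx_0002", "rdf:type", "bbf:biobrick"),
    ("BBx_0002", "bbf:toxicity", "low"),  # new property, no schema change needed
}


def match(subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return sorted(
        triple
        for triple in TRIPLES
        if all(want is None or want == got for want, got in zip(pattern, triple))
    )


# Roughly what "SELECT ?s WHERE { ?s rdf:type bbf:biobrick }" asks for.
biobricks = [s for s, _, _ in match(predicate="rdf:type", obj="bbf:biobrick")]
print(biobricks)  # ['BBx_0001', 'BBx_0002']
```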

For rolling up your sleeves and hacking around, you might like to check out object/RDF modeling libraries such as:
 * ActiveRDF (Ruby)
 * Oort (Python)
 * Arc (PHP)

The following articles contain a good deal of discussion on the topic of building web applications for the semantic web:
 * Using RDF on the Web: A Survey
 * Using RDF on the Web: A Vision

any backend, RDF REST
(Raik) A web server serving and consuming RDF/XML/N3 documents would combine a REST architecture with the triple format. It is technically quite easy to implement and would allow the growth of a semantic web around the original biobrick definition. So picture an automatic data exchange between a hypothetical Brickit server in Barcelona (brickit.crg.es) and the MIT registry:

1. update notification
 * parts.mit subscribes to brickit.crg RSS-feed
 * parts.mit receives RSS digest that there is a new biobrick record BBb_F0101 available on brickit.crg

2. Read access
 * parts.mit loads http://brickit.crg/parts.n3#BBb_F0101
 * -> brickit.crg serves the internal record as N3 document (technically no problem, as I discussed on the mailing list before)

3. Write or rather "inverse read"
 * parts.mit parses the document (using rdflib, redland...)
 * parts.mit verifies the ontology/content
 * parts.mit inserts a new record (ignores any properties that are not defined in its own ontology -- the crg people may be experimenting with additional categories/data)
 * parts.mit adds a property "owl:sameAs <http://brickit.crg/parts.n3#BBb_F0101>;" to the new record
 * ...or it may not copy it at all (DRY, Don't Repeat Yourself), but just link it into the appropriate local biobrick families and cache it for faster queries ...

Note: There is no write access in this scenario. That means, there is no authentication needed either. It's up to the receiver to decide whether or not to ignore the RSS and what to do about the new record.
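The "inverse read" step can be sketched as follows, assuming the N3 document has already been fetched and parsed into a plain property dictionary (the parsing itself, e.g. with rdflib, is left out). The property names, the set of locally known properties and the `crg:` example property are assumptions; only the URI is taken from the scenario above.

```python
# Properties defined in the receiver's own ontology (hypothetical).
KNOWN_PROPERTIES = {"rdf:type", "bbf:sequence"}
# Properties a record must carry to be accepted (hypothetical).
REQUIRED_PROPERTIES = {"rdf:type", "bbf:sequence"}


def import_record(source_uri, remote_record, local_store):
    """Verify a remote record, keep the known properties, link back to the source."""
    missing = REQUIRED_PROPERTIES - set(remote_record)
    if missing:
        raise ValueError(f"record lacks required properties: {sorted(missing)}")
    # Ignore anything the local ontology does not define.
    local = {k: v for k, v in remote_record.items() if k in KNOWN_PROPERTIES}
    local["owl:sameAs"] = source_uri
    local_id = source_uri.rsplit("#", 1)[-1]
    local_store[local_id] = local
    return local_id


store = {}
remote = {
    "rdf:type": "bbf:biobrick",
    "bbf:sequence": "AAACCCGGG",
    "crg:experimentalCategory": "test",  # unknown to the receiver, dropped
}
new_id = import_record("http://brickit.crg/parts.n3#BBb_F0101", remote, store)
print(new_id)  # BBb_F0101
print(sorted(store[new_id]))  # ['bbf:sequence', 'owl:sameAs', 'rdf:type']
```

The sender is never written to; the receiver alone decides what to keep, which is why no authentication is involved.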

Discussion
The registry biobrick documents can serve as hooks (unique addresses) to link further information into the developing knowledge graph. However, software tools that can gather and integrate distributed RDF information are not yet really available, it seems. Changes to the ontology are decoupled from the data -- the data model can evolve over time with minimal perturbation of existing data. The graph-model also fits much better with the family tree concept outlined in the Data model and would encourage (1) dense linking between parts, devices and families, (2) outside annotation, (3) data schema evolution (which is a nightmare and fiercely opposed in the other scenarios)

The question is whether this model is feasible with the available tools.

Potential Benchmarks
(Ralph) As the standard evolves it will be necessary to test that it accomplishes what we set out to do with it. Since the standard will likely have many moving parts and will be applied to a wide range of applications, a variety of tests for different features and aspects of the standard will be necessary.

This section attempts to define a variety of potential benchmarks for the standard. Ideally they define a specific problem whose results are clearly interpretable while describing the problem in a manner as technologically neutral as possible. The working group should regard these as rough suggestions or guidelines.

Please add benchmark ideas of your own as you think of them. They may be more technically specific versions of the application scenario questions, restated as more precise tests, or they may be distinct problem descriptions.

Benchmark Descriptions
Each of these descriptions should define the following:
 * A starting set of data and/or conditions
 * A result
 * Some indication of the use and value of the benchmark, which can include:
   * defining some feature of the standard to be tested
   * mentioning some criterion to be evaluated in either the benchmark results or the overall procedure to achieve the benchmark.

Data Tabulation
Start with:
 * Set of all Biobrick descriptions
 * A set of attributes

Result:
 * Table of biobrick data (possibly sparse), rows indexed by BioBrick ID and columns labeled and defined by the given attributes.

Benchmark uses:
 * Evaluate basic annotation features of the standard
 * Evaluate coverage of biobrick annotations
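A sketch of what this benchmark might compute, assuming each Biobrick description is available as a dictionary with an `id` key. The attribute names and records are hypothetical. Missing values stay `None`, which gives the sparse table, and a per-attribute coverage figure falls out directly.

```python
def tabulate(bricks, attributes):
    """Rows keyed by Biobrick ID, one column per attribute, None where unknown."""
    return {b["id"]: [b.get(a) for a in attributes] for b in bricks}


def coverage(table, attributes):
    """Fraction of Biobricks that carry a value for each attribute."""
    rows = list(table.values())
    if not rows:
        return {a: 0.0 for a in attributes}
    return {
        a: sum(row[i] is not None for row in rows) / len(rows)
        for i, a in enumerate(attributes)
    }


# Hypothetical descriptions.
bricks = [
    {"id": "BBx_0001", "sequence": "TTGACA", "target_chassis": "S. cerevisiae"},
    {"id": "BBx_0002", "sequence": "AGGAGG"},
]
attributes = ["sequence", "target_chassis"]
table = tabulate(bricks, attributes)
print(table["BBx_0002"])  # ['AGGAGG', None]
print(coverage(table, attributes))  # {'sequence': 1.0, 'target_chassis': 0.5}
```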

Functional Composition
Start with:
 * Set of all biobrick descriptions

Result:
 * Set of ordered pairs (i,j) where "i" is a Biobrick output and "j" is a BioBrick input, covering all the BioBricks in the set, plus data indicating whether each pairing is functional (i.e. the output can drive the input), and optionally some description of the system thus composed.

Benchmark uses:
 * Demonstrate elementary functional annotation with a trivial system composition task.
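One possible reading of this benchmark, assuming each description annotates the signal carrier a Biobrick consumes and produces. The annotations below are invented for illustration, and "functional" is simplified to "output carrier equals input carrier"; a real test would also need quantitative matching.

```python
from itertools import product

# Hypothetical functional annotations: signal carrier in and out.
BRICKS = {
    "promoter": {"input": None, "output": "PoPS"},
    "inverter": {"input": "PoPS", "output": "PoPS"},
    "rbs": {"input": "PoPS", "output": "RiPS"},
    "reporter": {"input": "PoPS", "output": None},
}


def compose(bricks):
    """Map every ordered pair (i, j) to True if i's output can drive j's input."""
    sources = [name for name, b in bricks.items() if b["output"]]
    sinks = [name for name, b in bricks.items() if b["input"]]
    return {
        (i, j): bricks[i]["output"] == bricks[j]["input"]
        for i, j in product(sources, sinks)
    }


pairs = compose(BRICKS)
print(len(pairs))  # 9
print(pairs[("promoter", "inverter")])  # True
print(pairs[("rbs", "reporter")])  # False
```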

SBML Translation
Start with:
 * An arbitrary Biobrick

Result:
 * An SBML description of the same

Benchmark uses:
 * Evaluate standard data model and semantics by the feasibility or definition of the SBML translation algorithm.
 * Possibly build tools to leverage SBML deployment
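A deliberately naive sketch of such a translation, using only the standard library. Mapping a Biobrick to a single SBML species in a single compartment is an assumption made here to show the shape of the output; deciding what the mapping should really be is exactly what this benchmark is meant to probe. The part ID and names are hypothetical, and the output has not been checked against an SBML validator.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4.
SBML_NS = "http://www.sbml.org/sbml/level2/version4"


def biobrick_to_sbml(part_id, name, compartment="cell"):
    """Render one Biobrick as a one-species SBML model (naive mapping)."""
    root = ET.Element("sbml", {"xmlns": SBML_NS, "level": "2", "version": "4"})
    model = ET.SubElement(root, "model", {"id": f"model_{part_id}"})
    compartments = ET.SubElement(model, "listOfCompartments")
    ET.SubElement(compartments, "compartment", {"id": compartment})
    species = ET.SubElement(model, "listOfSpecies")
    ET.SubElement(
        species,
        "species",
        {"id": part_id, "name": name, "compartment": compartment},
    )
    return ET.tostring(root, encoding="unicode")


xml_text = biobrick_to_sbml("BBx_0001", "example promoter")
print(xml_text)
```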

create a new XML format
REST interfaces (such as the one for Django) can publish XML automatically.

use Turtle/N3 notation for semantic web documents
I somewhat share the reservation about completely new file formats, but the non-readability and general nastiness of XML is also an issue. A good solution, IMO, would be to use the Notation3 format developed by the semantic web folks. It is concise, human-readable and editable (I used it myself some years ago) *AND* is equivalent to RDF/XML. That means there is a well-defined translation back and forth, and many libraries and tools do the conversion. Being semantic web, it also solves the linking problem (everything is a link).

Quick Example (links are not 100% correct) (The MIT server could serve the following document for parts.mit/biobricks):

# shortcut definitions for frequently used resources ...
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix bbf: <http://biobricks.org/ontology/1.1/>.
@prefix harvard: <http://harvard.edu/registry/parts#>.

# define a biobrick hosted at this address
:BBa_0001
    rdf:type       bbf:biobrick;
    bbf:sequence   "AAACCCGGG";
    bbf:similarTo  [:BBa_0003, harvard:BBa_J1000, :BBa_00010].

# add information to a biobrick defined elsewhere
harvard:HBB_J1000 owl:sameAs     :BBa_0001.

... continue for all other biobricks

OK, one can argue about human-readability, but it's at least possible to understand and edit these documents (and much better than the equivalent XML).