PICA Draft Data Model Rationale
Ralph Santos, 2008-03-07
- 1 Introduction
- 2 Foundation Concepts: Abstract Features and Molecules as Terminals
- 3 The Role of Ontologies
- 4 Objects and Relations in Part Declarations
- 5 Appendix: More on the Terminal Vocabulary
This document describes some of the details which inform the rationale behind the data model used by PICA.
The design goals for the data model are:
- The data model shall establish an abstraction for biological parts
- The data model should be relatively easy for a biologist to understand and use for providing annotations
- The data model shall encourage and support extensibility.
Biobricks are a standard for interchangeable parts. The goal is to be able to build complex synthetic biological systems either manually or automatically in a simple manner using a standardized component technology. The key components of that standard are a set of assembly protocols describing a biochemical procedure to ligate individual parts into a functional system, and a component packaging standard describing how to design the ends of a part's DNA sequence to make it compatible with the assembly protocols.
For Biobricks to realize their full potential the expectation is that software tools and other computing systems will be made to assist in exchanging data about parts, building databases to collect part-related information like characterization data, and even helping automate the process of system design using parts. However, in order to realize these goals it is important to have a system abstraction. The focus of a system abstraction is not to describe how a part is composed but rather how that part interacts with the outside world in general and with other parts of a functional system in particular.
Foundation Concepts: Abstract Features and Molecules as Terminals
At the root of the PICA framework's data model are two basic concepts:
- a conventional set of terms to identify sequence features (which, as the reader will see, has been taken "off the shelf" ready-made)
- a convention to define part behavior in terms of the molecular species which are processed and exchanged.
The rest of the document will explore how these two concepts are developed within the PICA framework. It describes how these concepts influence the data model, how they are related to the annotation standard, and how they are expected to relate to language extensions as they are developed. The following material also focuses on certain important design choices, explaining why specific features were chosen and the rationale behind those design decisions.
Abstract Features vs. Sequence Processing
An abstraction may seem like a strange or awkward strategy to some biologists. Standard biological analysis involves being acutely aware of the precise composition of the DNA sequence one is handling, and dealing with it on its own terms. To those individuals an abstraction may seem at best like an artificial barrier, or at worst like a pointless academic device. However, what works for humans often does not work at all well for machines, and vice versa. One could imagine building a system that uses exclusively the tools of sequence analysis to perform the various querying, analysis and assembly tasks to be accomplished within the Biobricks framework, but such a strategy complicates or even misses some important points:
- Even when one understands the complete sequence of everything one is handling, the question of how the proteins it encodes behave remains open.
- Tools which manage to be sophisticated enough to infer gene structures from raw sequence and understand the effects of all the possible manipulations and edits people could perform would be highly elaborate and complicated.
In considering what a viable part abstraction would require, one must think about the questions different people might ask about a part. One of the first places to look is sequence analysis, whose tools are naturally designed to pick out those aspects of sequence which interest biologists. However, sequence analysis offers only one perspective on sequence, one biased toward discovering features from unknown or uncharacterized sequence. While this perspective is important, it is not equally suited to every scenario, particularly those in which one approaches the construction of a biological system as an engineering problem:
- Querying for parts based on functional attributes by definition involves specifically describing behavioral characteristics while having no prior notion of a particular sequence composition in mind.
- Those wishing to build a biological system with computer assistance will need a way to describe a potential system in terms of interactions between undefined or partially defined high-level features, leaving the computer to search for parts that have the required behavioral features.
In these and other cases one deals with biological parts not as sequences but as abstract features of undefined composition with defined behaviors and relationships. Obviously the complete process of constructing a synthetic biological system will at some point involve having to establish, analyze and manipulate specific DNA sequences. Nonetheless, for the whole concept of Biobricks to realize its potential as a basis for building complex systems there must be not only an assembly standard but also a functional abstraction for both parts and systems. The functional abstraction allows parts to be viewed in uniform terms, providing an understanding of the part's behavior without having to describe the specifics of the part's composition. For systems the functional abstraction not only provides a means of understanding how each part relates to each other part but also provides a basis for broadly describing the total system behavior.
The task of creating such a logical construction would at first glance seem hopelessly vague and dauntingly complex. In fact it is, because one is talking about a logical construction that encompasses not only a great deal of our understanding of DNA sequence and genetic structure, but also molecular biology and biochemistry. Those who despair of ever accomplishing such a task should take heart, because the heavy lifting has already been done.
The Role of Ontologies
While de novo creation of a logical abstraction of DNA structures and molecular biology might seem like a Quixotic task, logical constructions adequate to the task at hand already exist in the form of biological ontologies. The biological ontologies provide ready-made vocabularies which have been vetted and shaped by broad panels of experts in the field. In them one finds ready-to-use sets of terms describing important biological features connected by regular, well-defined relations chosen to capture the essential relations between these objects. In effect, one has ready-made structured hierarchies of terms defined specifically to describe broad ranges of biological features. They combine standardized definitions of biological entities acceptable to a broad range of biologists with the careful articulation of definitional boundaries and the regular, well-defined relations that make them amenable to automated handling by software systems.
The PICA strategy is a simple one: find ontologies for those things you want to describe and borrow them wholesale. To provide a shallow learning curve for biologist annotators and tool builders, define a minimal annotation standard that minimizes the range of the ontology an annotator or tool builder must understand. To tune the data model to the needs at hand, bring in a minimal number of grammatical or relational constructs to get things done. The goal is not to restrict the ontology, but merely to articulate a clear minimum burden to be imposed on biologists and tool writers for effective annotation.
Not everything comes directly from an ontology. In particular PICA permits citation of genes and compounds for reasons that will be explained below. Also, the basic relations agreed upon for open biological ontologies are borrowed to define relations among objects described in grammatical expressions in the language framework. While not purely ontological, the goal is to ground every expression in a framework that precisely describes every cited object in publicly recognized terms and clearly defined relationships.
Objects and Relations in Part Declarations
The primary focus of the proposed standard is to provide elementary declarations of part composition and function, i.e. elementary existence queries/declarations of components and basic relational information about influences between terminals, limited primarily to the existence of relations and only the broadest qualitative characterizations. Most of the rest is to be handled in time with language extensions or outside of the framework.
The data objects that PICA focuses on are:
- Abstract sequence features, from the Sequence Ontology
- Chemical species, from compound databases like PUBCHEM and metabolic pathway databases like KEGG and BIOCYC
- Genes, from gene and genetic sequence databases like GENBANK, EMBL and REFSEQ
With these basic objects, the annotation standard focuses on collecting the following information:
- Simple binary assertions regarding what features are contained in a part (the 'composition' slot)
- Identifying the chemical species by which the part receives regulatory signals and reactant species, as well as those produced as products or responses (the 'terminals' slot)
- Identifying the response or influence relations among those species and a rough qualitative characterization of the nature of the relationship (the 'reg-relations' and 'events' slots)
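As an illustration only, the slots listed above might be gathered into a single record. The sketch below is not part of the PICA specification: the class name `PartDeclaration`, its field names, the consistency check, and the placeholder terminal names `A`, `B` and `C` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# One relation: one or two terminal names plus an assertion term.
Relation = Tuple[Tuple[str, ...], str]


@dataclass
class PartDeclaration:
    """Hypothetical container for the slots of a part declaration."""
    name: str
    composition: Set[str] = field(default_factory=set)   # sequence feature terms found in the part
    terminals: Set[str] = field(default_factory=set)     # species by which the part interacts
    reg_relations: List[Relation] = field(default_factory=list)
    events: List[Relation] = field(default_factory=list)

    def undeclared_terminals(self) -> Set[str]:
        """Terminals cited in a relation but absent from the terminals slot."""
        cited = {t for terms, _ in self.reg_relations + self.events for t in terms}
        return cited - self.terminals


# Placeholder names; a real declaration would cite database identifiers.
example = PartDeclaration(
    name="example_part",
    composition={"promoter", "CDS"},
    terminals={"A", "B"},
    reg_relations=[(("A", "B"), "transcriptionally_enhanced")],
)
```

The `undeclared_terminals` check reflects one reading of the text, namely that relations are stated among the species already named in the terminals slot; a tool could reasonably choose a looser rule.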
The composition slot merely requires that each type of sequence feature found in the part be declared, offering an elementary means of summarizing the functional features of a part. Technically any hyponym of SO:sequence_feature is acceptable here, but only a handful of features are specifically mandated by the minimal annotation standard. The choice of merely asserting a feature's existence, as opposed to declaring anything further, was informed by several considerations:
- It is the simplest possible inquiry to make
- By using sequence features to describe internal components it allows the precise composition of the part to remain undefined
- It sidesteps a number of issues, including sequence coordinates and the identification of sequence features, where the best representation may vary among different applications and communities and which are best handled as a higher-level extension or in a domain-specific manner.
- The Sequence Ontology feature set is at once clearly defined and also completely agnostic to any notion of part composition or containment, and may be used consistently across a variety of part hierarchy definitions and packaging format standards.
The choice of the minimal declaration set is simple and informed by the shared needs of biologists as well as system and tool builders. As much of synthetic biology involves building synthetic gene regulation networks, among the most basic questions to ask once one knows what kinds of features are in a part is how the part fits into the overall structure of a gene or operon. While many features are optional, declarations of whether the basic features of a gene or operon structure are present are required, so that this information is always collected from the annotator and so that it is always clear in broad terms how to use a part, even if one must look further into the details of the part later to understand its precise use.
The next slot asks the annotator to describe the 'terminals' of the biological part. Basically this includes any chemical species involved in the function of that part. The name 'terminals' is a quite deliberate appropriation of the term as it is used in electronics, since it serves the same purpose. Just as the tiny wires and prongs of an integrated circuit convey signals and responses to the innards of integrated circuits, biomolecules do exactly the same thing in synthetic biological systems.
It is worth pointing out that this standard departs from a common practice in synthetic biology, which is to talk about interactions among parts by borrowing from the terminology of electrical engineering and information theory. This is most readily understood if one considers the term "PoPS" (polymerases per second) as it relates to systems biology. The PICA standard instead uses biomolecules and chemical species as the fundamental reference objects for describing part function. The goal is for extensions to describe the biochemical functions and interactions of parts in more detail, but to have them do so while ultimately relating all such information to the basic biomolecules identified at the level of the terminal declarations.
The reasons for this decision are several, a combination of theoretical factors and pragmatic ones. One of the most fundamental reasons is an abstract mathematical quality which has an intimate connection to the overall goal of the Biobricks effort. An important feature of Biobricks is the ability to compose bricks into larger systems. This capability has been clearly demonstrated in the laboratory. For a data model to support this it is simply a matter of making the model what is described mathematically as "closed under function composition".
It is true that when one views a biological system as a signal system, functional composition is not an issue: all parts deal in signals, and composability is well defined. However, the signal abstraction works to a fault in that it takes such a high-level view of biological processes that it becomes difficult for biologists to relate this view of a system to things they are already familiar with. More importantly, it is so abstract that it can become unclear how to relate a signal-based view of a biological system to other representations of biological systems. The power of the signal abstraction--its flexibility--becomes a liability when trying to draw inferences about how a particular signal behavior is realized in a set of physical reactions. One could derive a signal from chemical concentration levels, rates of transcription, phosphorylation events, and any number of other phenomena.
So the solution PICA adopts is to use the language of chemical reactions as the basis for describing part function. PICA declarations are built upon expressions of part functions as interactions among different molecular species whose dynamic behavior is left unstated. The logic behind this manner of expressing function is mostly pragmatic: this view is already well established in biology, and citing chemical species allows one to leverage a variety of biological resources, including metabolic pathway and compound databases. The intent is not to usurp the signal view but rather to provide a platform for it. The hope is that assertions about signal behavior can be stated within the PICA framework as an extension, building upon basic declarations between chemical species. It is much easier, given a chemical description of a system, to derive a series of transfer functions for that system than to perform the reverse derivation.
Given the range of scenarios, the following options for defining terminals are adopted for PICA:
- Cite a compound ID when one is available
- When a compound ID isn't available, cite the ID for a gene
- In the case of a regulatory feature citing a coding region that it modulates, use the SO term 'CDS'
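The options above can be read as a fallback order, and the following Python sketch illustrates that reading. The function name `choose_terminal_id`, its parameters, the `SO:CDS` spelling, and the placeholder identifiers are hypothetical; treating the three options as a strict priority list is an assumption rather than something the standard states.

```python
from typing import Optional


def choose_terminal_id(compound_id: Optional[str] = None,
                       gene_id: Optional[str] = None,
                       modulates_external_cds: bool = False) -> str:
    """Pick the identifier used to cite a terminal (illustrative only)."""
    if compound_id:                 # option 1: a compound ID is available
        return compound_id
    if gene_id:                     # option 2: fall back to a gene ID
        return gene_id
    if modulates_external_cds:      # option 3: regulatory feature citing its coding region
        return "SO:CDS"
    raise ValueError("no compound ID, gene ID, or modulated coding region given")


# Placeholder identifiers, not real database accessions.
print(choose_terminal_id(compound_id="compound:EXAMPLE", gene_id="gene:EXAMPLE"))
print(choose_terminal_id(gene_id="gene:EXAMPLE"))
print(choose_terminal_id(modulates_external_cds=True))
```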
Regulatory Relations and Events
Regulatory relations and events are two highly parallel forms. Both are relations defined over single terminals or ordered pairs of terminals. The primary difference is that each one refers to a different kind of relation. As the name suggests, 'reg-relations' refer specifically to those signal-response relationships which are mediated by a gene-regulating element. Events refer to relationships which are not mediated by gene regulation. Both appear in the form:
((A B) assertion)
Where an ordered pair is given, it refers to a preceded_by relationship intended to represent some sort of signal-response or cause-effect relationship. These declarations are only intended to refer to those aspects of behavior involving the stated terminals in the given relationship. On the basis of a given declaration, therefore, one cannot infer anything about behavior involving other terminals or even the same terminals in a different relation. A case where two terminals mutually influence one another would thus be represented as two assertions:
((A B) assertion) ((B A) assertion)
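To make the notation concrete, here is a minimal Python sketch that reads declarations written in this form into tuples. The function names `parse_relations` and `is_declared` are hypothetical, and the regular expression handles only the flat one- or two-terminal shape shown in this document, not a general S-expression grammar.

```python
import re
from typing import List, Tuple

Relation = Tuple[Tuple[str, ...], str]

# Matches "((A B) assertion)" or "((A) assertion)"; nothing more elaborate.
_DECLARATION = re.compile(r"\(\s*\(([^()]*)\)\s*([^\s()]+)\s*\)")


def parse_relations(text: str) -> List[Relation]:
    """Read flat relation declarations into (terminals, assertion) tuples."""
    relations = []
    for terms, assertion in _DECLARATION.findall(text):
        terminals = tuple(terms.split())
        if len(terminals) not in (1, 2):
            raise ValueError(f"expected one or two terminals, got {terminals!r}")
        relations.append((terminals, assertion))
    return relations


def is_declared(relations: List[Relation], source: str, target: str) -> bool:
    """True only if the ordered pair (source, target) was explicitly declared."""
    return any(terminals == (source, target) for terminals, _ in relations)


one_way = parse_relations("((A B) assertion)")
both_ways = parse_relations("((A B) assertion) ((B A) assertion)")
```

Because the pair is ordered, `is_declared(one_way, "B", "A")` is false: the reverse influence is simply not stated, which matches the rule that nothing may be inferred beyond the given declaration.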
In the case of regulatory relations, assertions are expected to be hyponyms of 'SO:transcriptionally_regulated'. Note that categories are not mutually exclusive. In particular, it is permitted to combine the term SO:transcriptionally_constitutive with other declarations to describe leaky promoters. Thus, the following would be acceptable in a set of declarations where B is a leaky promoter actuated by transcription factor A:
((A B) transcriptionally_enhanced) ((B) transcriptionally_constitutive)
In the case of events, two kinds of assertions are allowed: declarations to describe reaction chains (using the term SBO:participant) and declarations to describe control (using the term SBO:control).
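A tool might check that each slot uses only the assertion terms permitted for it. The sketch below is a hedged illustration: the function `check_slot` and the two term sets are assumptions, and the regulatory set contains only the terms that appear in this document's own examples. A real tool would presumably derive the list of hyponyms from the ontology itself.

```python
from typing import List, Tuple

Relation = Tuple[Tuple[str, ...], str]

# Illustrative vocabularies only, limited to terms named in this document.
REG_RELATION_TERMS = {"transcriptionally_constitutive", "transcriptionally_enhanced"}
EVENT_TERMS = {"participant", "control"}
ALLOWED = {"reg-relations": REG_RELATION_TERMS, "events": EVENT_TERMS}


def check_slot(slot: str, relations: List[Relation]) -> List[str]:
    """Return a list of problems found in one slot's declarations."""
    allowed = ALLOWED[slot]
    problems = []
    for terminals, assertion in relations:
        if not 1 <= len(terminals) <= 2:
            problems.append(f"{terminals!r}: expected one or two terminals")
        if assertion not in allowed:
            problems.append(f"{assertion!r} is not permitted in the {slot} slot")
    return problems


# The leaky promoter example: B is enhanced by A and also constitutive.
leaky = [(("A", "B"), "transcriptionally_enhanced"),
         (("B",), "transcriptionally_constitutive")]
```

Note that the check accepts both assertions about B side by side, since the categories are not mutually exclusive.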
Appendix: More on the Terminal Vocabulary
Consider the part BBa_F2620. The input is described as 3OC6HSL and the output is described as "PoPS". At first glance this description isn't even closed under composition, as the input is described as a molecule and the output is described as a rate of transcription operations. However there are several perspectives from which the disparity is resolved.
One perspective is to regard a biological system purely as a signaling system. This perspective renders the underlying chemistry and biology almost completely abstract, where the only significant parameters are the physical and chemical parameters which can be derived from things in that environment, reducing them to mathematically abstract information channels transforming input signals to outputs. This discipline has clearly demonstrated its validity, and successful applications are easily found in electronics, to name just one of many examples.
It is plain to see how one can apply system design tools of electronics and control theory to systems biology. Indeed this has already been done (Ron Weiss, et al.). That said, for such tools to be applied to Biobricks, a system must regularize and support such tools by providing at least some information about a part's dynamic behavior without having to rethink how this definition applies to each and every part, much less every assemblage that can be made from any arbitrary collection of parts.
Resolving this issue involves discerning an underlying unity from the gallimaufry of system representations employed in biology in general and systems biology in particular. When one does, there are several standouts. Chief among them are gene regulatory networks and metabolic pathway networks. Gene regulatory networks are clearly essential, as much of synthetic biology involves controlling genes in synthetic systems. The other standout is the biological pathway networks, which describe the elaborate collaborations of enzymes and substrates that comprise various metabolic processes or the transduction processes describing how cells and organisms respond to various internal or environmental signals.
One interesting quality in these networks is a shared general topological structure (they are close siblings of mathematical graphs--Euler and the bridges of Königsberg would be proud). They also share a quality which is useful for the needs at hand, which is that they establish publicly recognized definitions for objects relevant to biological parts. However, the picture is a somewhat fractured one. Metabolic pathway networks, particularly in databases like KEGG and BIOCYC, cite molecules and enzymes to describe the details of the pathways, using the language of chemical reactions to describe the interactions between them.
Gene regulation networks are a somewhat more complicated affair as they are newer, less mature, and come in a variety of representations owing to the range and complexity of interactions they mediate. In all cases the primary focus of description is the genes involved, the biochemical signals that control their transcription, and how transcribed signals in turn influence other genes. That said, they may be found in a variety of forms and there isn't yet a single common representation to describe all of them. Some gene regulation networks describe control relationships in boolean terms and thus may be described as classic graphs. This is often the case for large regulation networks which are systemic or organismal in scope. However, particularly when the regulation of a single discrete process is closely studied, a different representation may be chosen to more precisely describe the behavior of individual interactions. Where in large regulatory networks all that might be represented is a simple boolean asserting the broad existence or nonexistence of a relationship, more refined models which account for stochastic behavior or reaction kinetics might take the form of a system of coupled ODEs, or a reaction network to be analyzed with the Gillespie algorithm, or a neural net-like representation qualitatively describing the aggregate relationship of a number of factors upon a single process.
So what to do in the face of all this diversity? Given that PICA is intended to be a platform data model, the strategy of PICA is not to attack these problems directly but provide a simple set of relationships and common denominator terms. Rather than try to capture all of this diversity directly, the intent is to assert a set of elementary relations in a way that might be used to establish broad congruential relationships among these areas. That way others can develop more sophisticated and domain-specific representations to describe these phenomena more directly and more precisely, and the elementary relation defined in PICA may be cited to establish a defined relationship between the domain-specific model and the rest of the PICA framework or even other domain-specific models that cite elementary PICA relations in similar fashion.
So how does PICA select and define its terms? The choice is shaped largely by the task at hand. To provide an abstraction for biological parts, one must have a generalizable functional representation relating part inputs to part outputs. Since biological parts are composable, the functional representation must be composable as well, implying that the functional representation must be closed under functional composition. Functions must cite entities from the same common overall set for both inputs and outputs.
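The closure requirement can be illustrated with a small Python sketch. It assumes, purely for illustration, that a part's function is reduced to a set of input terminals and a set of output terminals drawn from one shared vocabulary; the names `PartFunction` and `compose`, the rule for combining the sets, and the placeholder terminals are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PartFunction:
    """A part viewed only through its input and output terminals."""
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]


def compose(upstream: PartFunction, downstream: PartFunction) -> PartFunction:
    """Join two parts that share at least one terminal.

    The result is again a PartFunction over the same terminal vocabulary,
    which is what being closed under composition means in this sketch.
    """
    if not upstream.outputs & downstream.inputs:
        raise ValueError("no upstream output matches a downstream input")
    return PartFunction(
        name=f"{upstream.name}+{downstream.name}",
        inputs=upstream.inputs | (downstream.inputs - upstream.outputs),
        outputs=downstream.outputs | (upstream.outputs - downstream.inputs),
    )


# Placeholder terminals: a signal molecule S and two proteins X and Y.
sensor = PartFunction("sensor", frozenset({"S"}), frozenset({"X"}))
reporter = PartFunction("reporter", frozenset({"X"}), frozenset({"Y"}))
device = compose(sensor, reporter)
```

If the output of `sensor` were instead expressed as a transcription rate while `reporter` expected a molecule, the two sets would share no member and `compose` would fail, which is the mismatch the shared vocabulary is meant to avoid.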
All of this brings us back to BBa_F2620. To describe this part, our desired input/output vocabulary must include compounds. As there are widely recognized compound registries, it is easy enough to cite these compounds in a standardized fashion. That makes the molecules of input signals part of the input vocabulary. The question becomes whether outputs can be described in similar fashion and how to define the logical basis for such a unification.
This constraint is often applicable both to inputs and to outputs. In describing reactions, biologists will frequently unify a compound (particularly in the case of a protein catalyst) with the gene that encodes it. Such an inference is reasonable given the Central Dogma of Molecular Biology. This logical equivalence is ubiquitous: diagrammatic descriptions of many synthetic biological systems depict the product of a transcribed gene as a protein without any mention of an intervening process.
It is interesting to consider that this often unmentioned intervening process is at once too obvious to mention and very loosely defined. What was once assumed to be a relatively straightforward two-step process of transcription into mRNA and translation into protein now appears to be a highly varied and often quite complex set of operations involving not only transcription and translation but also a number of other transformations, including reactions which will edit both the mRNA and the protein on the way to its final form. Fortunately one does not actually have to understand the intricacies of this process, much less navigate them. Numerous pathway and enzyme databases will directly cite cross-references between enzymes and the genes that encode them. However, it is worth keeping in mind that the relationship between genes and enzymes is often not a one-to-one relationship, and that a reaction cited by a part may be actuated by other means by other agents in the environment.
Thus one can find a molecule to cite as definitive for an input or output in most cases. This is straightforward for input signals, and even for parts where a complete transcript is encoded within the part itself. It is simply a matter of citing the appropriate database identifier. The only complicated case arises where an operator sequence modulates transcription outside of the part (this also applies to transcripts which start within the part but continue outside of it). If the transcript were inside the part the transcribed protein would be among the data available to the part, but when it resides outside the part it highlights the need for a part to cite entities outside of itself.
In this particular situation PICA gets a little creative. For PICA to serve its purpose it must be able to describe the parts of operons in relation to each other, and when that biological part happens to be a promoter or repressor sequence one must reference not only that part but also whatever coding sequence is modulated by that part. However, the two aforementioned conventions for identifying part products (citing molecules or genes) don't really work in the case of the coding sequence for a promoter. Citing compound and gene identifiers works where the product is known, but a promoter needs to be able to cite its coding sequence even when all that is known about the coding sequence is its relationship to the promoter. PICA resolves the situation by falling back on the sequence feature vocabulary provided by the Sequence Ontology: it simply cites the coding sequence as just that, by invoking the appropriate SO term.
At this point the reader might notice that with such a usage there are two ways to interpret CDS, depending upon context: In a composition declaration, it refers to a feature wholly contained within the part. When cited as a terminal, it refers to a feature in a larger construct where the part citing the terminal is itself merely a component. This goes back to the agnosticism of Sequence Ontology features with respect to part hierarchies.