The MIT SBWG has been discussing the barcoding of engineered biological systems for a while. An initial attempt (described at Barcodes) was made to implement barcodes on many BioBrick coding regions. Here is a quick (and likely incomplete) overview of some of the issues surrounding barcodes.
As Drew initially suggested, there are three basic purposes for barcoding synthetic systems.
- Detection: to enable detection of standard biological parts in arbitrary DNA samples. For instance, users wish to detect cases of misuse of parts.
- Identification: to enable identification of biological parts, devices and systems. Such identification may also involve determining the original designer. For instance, users wish to quickly identify the vector in which their part resides in a typical sequencing reaction.
- Authentication: to enable verification of the integrity of a DNA sequence. Such a barcode would allow users to check for naturally occurring or human-induced mutations.
A couple of barcoding schemes have been implemented on a trial basis or proposed.
- The original barcodes scheme enabled quick detection of BioBricks via PCR methods. It is not clear that detection-based barcodes are necessary given the ease with which sequencing can be done.
- The barcode schemes described below are primarily for the purpose of identifying BioBricks. Since the scheme proposed by Austin is able to encode arbitrary bit strings, such a barcode would permit inline documentation of anything, including:
- BioBrick part number
- URL or doi number
- inline comments
- There is no available proposal to accomplish the third goal of authentication.
A universal barcoding scheme is difficult to implement for several reasons.
- Barcodes should be biologically innocuous. Ideally, a barcode should not encode any biological function. In particular (but not exclusively), it should not:
- initiate transcription (be a promoter)
- contain coding sequences
- initiate translation
- have secondary structure
- have restriction enzyme cut sites
- have too many repetitive elements
- have too many strings of a single nucleotide or strings of purines/pyrimidines
- Barcodes should not be sequences that interfere with system function. There must be a mechanism to insert "escape sequences" to interrupt system-specific meaningful sequences.
- Barcodes should not be too long. Long sequences can add to fabrication costs and may impact system function.
- Barcodes should be sequenceable.
- Barcodes should have some mechanism for detecting and/or correcting errors due to mutation. Error detection and correction require additional bases, which lengthens the barcode.
- Austin 21:27, 18 April 2006 (EDT): ECC for normal mutations is relatively easy. Frameshifts however are difficult. I contacted Robert Gallager about this problem and this is what he responded with (I haven't had the time or energy to figure any more of this out):
- You have picked a difficult problem for yourself. Almost all of coding theory is done using the assumption of perfect timing. I worked on this problem back in 1961 and wrote a short technical note on it. You can find a very poor copy of it on my website (http://web.mit.edu/gallager/www/). It is the 3rd entry under Internal memoranda in my publication list. You will have to find out something about convolutional codes and sequential decoding to make sense of it, so it might not be worth your effort. I don't think that having an alphabet size of 4 rather than 2 is a major difference. I haven't seen anything else in the literature dealing with this problem, although I haven't been looking. It is sensible to restrict yourself to single anomalies, but the decoder has no block structure and thus no way to define a single anomaly. That is why I looked at convolutional codes back in 61, since the block structure problem was avoided. Gallager, R. G., "Sequential Decoding for Binary Channels with Noise and Synchronization Errors", Lincoln Group Report, 2502, Summer 1961.
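To make the error-detection requirement above concrete, here is a minimal sketch of a single check base appended to a barcode. It is not a scheme proposed on this page and it is not an error-correcting code: the function names and the sum-modulo-4 rule are assumptions chosen only for illustration. It detects any single substitution, but it can miss insertions and deletions, which is the frameshift difficulty Austin describes.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3 (an arbitrary, assumed numbering)


def checksum_base(dna: str) -> str:
    """Return the base whose value is the sum of all base values modulo 4."""
    return BASES[sum(BASES.index(b) for b in dna) % 4]


def add_checksum(dna: str) -> str:
    """Append one check base to the barcode payload."""
    return dna + checksum_base(dna)


def is_intact(tagged: str) -> bool:
    """True if the last base matches the checksum of everything before it."""
    return len(tagged) >= 1 and checksum_base(tagged[:-1]) == tagged[-1]


tagged = add_checksum("GATTACA")      # "GATTACAC"
print(is_intact(tagged))              # True
print(is_intact("GCTTACAC"))          # False: one substitution is detected
print(is_intact("GTTACAC"))           # True: a deleted A goes unnoticed
```

A single substitution always changes the sum by a nonzero amount modulo 4, so it is always caught. The deletion in the last line removes a base whose value is 0, so the sum is unchanged and the damage is missed.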
Austin's barcode scheme to encode bit strings below represents an attempt to address many of these issues.
Scheme to encode bit strings
Enable encoding of arbitrary bit strings into DNA without introducing "biologically bad" sequences. (Tom and Austin).
Text to bit string converter (ASCII not Unicode)
Compression algorithms for DNA sequences: X. Chen, S. Kwong, M. Li, Genome Informatics (GIW'99), Tokyo, Japan, pp.51-61, 1999.
I've put up a test page for playing with encoding binary into DNA at http://synbio.mit.edu/tools/encoder.cgi
Check out the world's first Illegal DNA sequence.
The general encoding/decoding method: Each byte of 8 bits is split into 4 pairs of 2 bits. The pair of bits at each of the 4 positions is mapped to a nucleotide. For example, 00 at position 0 could be mapped to A, 01 at position 0 to T, 00 at position 1 to T, etc. To be decodable, there must be a one-to-one mapping at each position from the 4 possible bit pairs to the 4 nucleotides. This still leaves 24^4 = 331,776 different ways to do this type of encoding (4! = 24 mappings at each of the 4 positions). Currently, I pick a particular encoding that, for some common things that Reshma would like to encode (plasmid names, parts, etc. in ASCII), has the following properties:
- %GC close to 50%
- %GT as high as possible (biased nucleotide use).
Biasing the nucleotides makes it less likely for restriction sites or secondary structures to appear. In addition, it allows for an easier choice of an escape sequence. The idea is to have a family of escape sequences, for example ACNNNAC, which can be inserted anywhere you want. This allows you to modify the %GC content if desired, break up bad sequences, or whatever else by inserting arbitrary non-coding sequence. If the escape occurs in the real sequence, it gets escaped itself. Escapes have not been chosen or implemented yet.
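The per-position mapping described above can be sketched as follows. The four mapping strings in `EXAMPLE_MAPPINGS` are illustrative assumptions, not the mapping used by the encoder at synbio.mit.edu, and escape sequences are left out because none have been chosen.

```python
# One mapping per 2-bit position within a byte. Each string lists the
# nucleotides assigned to the bit pairs 00, 01, 10, 11, in that order.
# These particular strings are assumptions made for illustration only.
EXAMPLE_MAPPINGS = ("ATGC", "TGCA", "GTAC", "TGAC")


def encode(data: bytes, mappings=EXAMPLE_MAPPINGS) -> str:
    """Encode each byte as 4 nucleotides, most significant bit pair first."""
    out = []
    for byte in data:
        for pos in range(4):
            pair = (byte >> (6 - 2 * pos)) & 0b11
            out.append(mappings[pos][pair])
    return "".join(out)


def decode(dna: str, mappings=EXAMPLE_MAPPINGS) -> bytes:
    """Invert encode(); this works because each mapping is one-to-one."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for pos in range(4):
            byte = (byte << 2) | mappings[pos].index(dna[i + pos])
        out.append(byte)
    return bytes(out)


def fraction(dna: str, bases: str) -> float:
    """Fraction of the sequence made up of the given bases, e.g. 'GC' or 'GT'."""
    return sum(dna.count(b) for b in bases) / len(dna)


dna = encode(b"A")                 # ASCII 0x41 = 01 00 00 01
print(dna)                         # TTGG
print(decode(dna))                 # b'A'
print(fraction(dna, "GC"))         # 0.5
```

With `fraction(dna, "GC")` and `fraction(dna, "GT")` one can compare candidate mappings against the two properties listed above.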
Another idea I had was to try all possible encodings and pick the one that gives the best properties for the data being encoded. The chosen encoding would then be written at the beginning of the barcode (as the start code), which would take a fixed 12 nt.
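A rough sketch of that exhaustive search is below. Two things in it are assumptions rather than anything stated on this page: the scoring rule (GC content as close to 50% as possible, then as much G+T as possible) and the layout of the 12 nt header (for each of the 4 positions, the nucleotides for 00, 01 and 10, with the fourth implied). Because GC and GT counts add up across positions, the search only needs per-position counts and not a full re-encoding for each candidate.

```python
from itertools import permutations, product

PERMS = ["".join(p) for p in permutations("ACGT")]  # 24 mappings per position


def pair_counts(data: bytes):
    """counts[pos][pair] = how often bit pair `pair` occurs at position `pos`."""
    counts = [[0] * 4 for _ in range(4)]
    for byte in data:
        for pos in range(4):
            counts[pos][(byte >> (6 - 2 * pos)) & 0b11] += 1
    return counts


def best_mappings(data: bytes):
    """Try all 24**4 encodings and return the best under the assumed score."""
    counts = pair_counts(data)
    total = 4 * len(data)
    stats = []  # per position: (gc_count, gt_count, mapping) for all 24 mappings
    for pos in range(4):
        row = []
        for perm in PERMS:
            gc = sum(n for base, n in zip(perm, counts[pos]) if base in "GC")
            gt = sum(n for base, n in zip(perm, counts[pos]) if base in "GT")
            row.append((gc, gt, perm))
        stats.append(row)
    best_key, best = None, None
    for combo in product(*stats):  # 331,776 combinations
        gc = sum(c[0] for c in combo)
        gt = sum(c[1] for c in combo)
        key = (abs(2 * gc - total), -gt)  # GC nearest 50%, then highest GT
        if best_key is None or key < best_key:
            best_key, best = key, tuple(c[2] for c in combo)
    return best


def header(mappings) -> str:
    """Assumed 12 nt start code: first three nucleotides of each mapping."""
    return "".join(m[:3] for m in mappings)


def parse_header(head: str):
    """Recover the four mappings from a 12 nt header."""
    mappings = []
    for i in range(0, 12, 3):
        first = head[i:i + 3]
        missing = set("ACGT") - set(first)
        if len(missing) != 1:
            raise ValueError("header triplet must hold three distinct bases")
        mappings.append(first + missing.pop())
    return tuple(mappings)


chosen = best_mappings(b"BBa_B0034")
print(header(chosen), parse_header(header(chosen)) == chosen)
```

The header costs 12 nt regardless of message length, so for very short barcodes it may outweigh whatever the search gains.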
Randy suggested some form of compression. Not sure how much space we save or how much more complex it would make the algorithm.
This algorithm provides the ability to encode anything such as Unicode, pictures, or anything else under the sun. Is the complexity and increase in size worth this capability?
Scheme to encode text only
Case-sensitive codon tables
Each codon represents an alphanumeric character (case-sensitive). For convenience, those letters of the alphabet which represent a single letter amino acid code are coded by one of the amino acid's codons (aiming for near 50% GC content).
(Note this table was done by hand so please correct errors!)
|GCA||A||codon for Ala||GCT||a||codon for Ala|
|GCC||B||(near alanine)||GCG||b||(near alanine)|
|TGC||C||codon for Cys||TGT||c||codon for Cys|
|GAC||D||codon for Asp||GAT||d||codon for Asp|
|GAA||E||codon for Glu||GAG||e||codon for Glu|
|TTC||F||codon for Phe||TTT||f||codon for Phe|
|GGA||G||codon for Gly||GGC||g||codon for Gly|
|CAC||H||codon for His||CAT||h||codon for His|
|ATC||I||codon for Ile||ATA||i||codon for Ile|
|GGT||J||(no reason)||GGG||j||(no reason)|
|AAG||K||codon for Lys||AAA||k||codon for Lys|
|CTA||L||codon for Leu||CTC||l||codon for Leu|
|ATG||M||codon for Met||CTG||m||sometimes codes for Met|
|AAC||N||codon for Asn||AAT||n||codon for Asn|
|CCC||O||(near proline)||CCT||o||(near proline)|
|CCG||P||codon for Pro||CCA||p||codon for Pro|
|CAA||Q||codon for Gln||CAG||q||codon for Gln|
|AGA||R||codon for Arg||AGG||r||codon for Arg|
|AGC||S||codon for Ser||AGT||s||codon for Ser|
|ACA||T||codon for Thr||ACT||t||codon for Thr|
|GTC||U||(near valine)||GTG||u||(near valine)|
|GTA||V||codon for Val||GTT||v||codon for Val|
|TGG||W||codon for Trp||TGA||w||(no reason)|
|TAG||X||resembles a stop codon||TAA||x||resembles a stop codon|
|TAC||Y||codon for Tyr||TAT||y||codon for Tyr|
|TTG||Z||(no reason)||TTA||z||(no reason)|
|ATT||0||zero seems to go with stop codon|
|CTT||1||(looks like an l)|
|ACC||2||two starts with a T|
|ACG||3||three starts with a T|
|CGA||4||has an R in it|
|TCC||6||six starts with an S|
|TCG||7||seven starts with an S|
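As a sketch of how the table above could be used, the following transcribes it into a lookup and converts text to DNA and back. The function names are hypothetical. The codon for `o` is written in the DNA alphabet as CCT, and characters the table does not list (the digits 5, 8 and 9, spaces, punctuation) are rejected instead of being guessed.

```python
import string

# Codons for A..Z and a..z, transcribed in alphabetical order from the table.
UPPER = ("GCA GCC TGC GAC GAA TTC GGA CAC ATC GGT AAG CTA ATG "
         "AAC CCC CCG CAA AGA AGC ACA GTC GTA TGG TAG TAC TTG").split()
LOWER = ("GCT GCG TGT GAT GAG TTT GGC CAT ATA GGG AAA CTC CTG "
         "AAT CCT CCA CAG AGG AGT ACT GTG GTT TGA TAA TAT TTA").split()
DIGITS = {"0": "ATT", "1": "CTT", "2": "ACC", "3": "ACG",
          "4": "CGA", "6": "TCC", "7": "TCG"}

CODON_FOR = dict(zip(string.ascii_uppercase, UPPER))
CODON_FOR.update(zip(string.ascii_lowercase, LOWER))
CODON_FOR.update(DIGITS)
CHAR_FOR = {codon: char for char, codon in CODON_FOR.items()}


def encode_text(text: str) -> str:
    """Translate text into DNA, one codon per character."""
    try:
        return "".join(CODON_FOR[ch] for ch in text)
    except KeyError as err:
        raise ValueError(f"no codon assigned to {err.args[0]!r}") from None


def decode_text(dna: str) -> str:
    """Translate DNA back into text, reading three nucleotides at a time."""
    if len(dna) % 3:
        raise ValueError("DNA length must be a multiple of 3")
    return "".join(CHAR_FOR[dna[i:i + 3]] for i in range(0, len(dna), 3))


print(encode_text("pSB1A3"))              # CCAAGCGCCCTTGCAACG
print(decode_text("CCAAGCGCCCTTGCAACG"))  # pSB1A3
```

The lookup holds 59 distinct codons, which leaves 5 of the 64 unassigned.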
Start and stop sequences
What is a good start and stop sequence for the plasmid barcode?
- We could possibly use the same sequence that is used for the CDS barcodes (i.e. C TGA TAG TGC TAG TGT AGA T C) without the variable nucleotide. Or would this just confuse any diagnostics people try to run on constructs?
- Another possibility is to flank both sides with the translational stop sequence.
- Maybe a start and stop sequence isn't necessary?
- One problem with this codon table is that it becomes possible to accidentally encode BioBricks sites in the barcode. A case-insensitive code might reduce the likelihood of that happening? Any possible fixes to this problem? Use one of the codons that doesn't encode an alphanumeric character as a "spacer" in this eventuality (i.e. CGC or CGG)?
- I didn't bother to try avoiding certain codons like start codons.
- These codons may not be optimally spaced from one another? Tom doesn't think this matters.
- Tom pointed out that the barcode should probably be as GC-content neutral as possible (i.e. try to avoid all-AT or all-GC codons).
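The spacer idea raised in the list above can be sketched as follows. It assumes that "BioBricks sites" means the four standard assembly sites (EcoRI, XbaI, SpeI, PstI) and that CGC is the spacer; both are assumptions, and the function names are hypothetical. All four sites are palindromic, so scanning one strand is enough.

```python
# Assumed set of sites to avoid; the page does not list them explicitly.
BIOBRICK_SITES = {"EcoRI": "GAATTC", "XbaI": "TCTAGA",
                  "SpeI": "ACTAGT", "PstI": "CTGCAG"}
SPACER = "CGC"  # one of the unassigned codons suggested in the text


def find_site(dna: str):
    """Return (start, end) of the leftmost BioBrick site, or None."""
    hits = [(dna.find(site), site)
            for site in BIOBRICK_SITES.values() if site in dna]
    if not hits:
        return None
    start, site = min(hits)
    return start, start + len(site)


def break_sites(codons, max_rounds=100):
    """Insert spacer codons until no BioBrick site remains in the barcode."""
    codons = list(codons)
    for _ in range(max_rounds):
        hit = find_site("".join(codons))
        if hit is None:
            return codons
        start, _end = hit
        # The first codon boundary after `start` always falls inside a 6 nt
        # site, so a spacer placed there interrupts it.
        codons.insert(start // 3 + 1, SPACER)
    raise RuntimeError("could not remove all sites")


def strip_spacers(codons):
    """Drop spacer codons again before decoding."""
    return [c for c in codons if c != SPACER]


# "EF" is GAA + TTC in the codon table, which spells an EcoRI site.
print(break_sites(["GAA", "TTC"]))         # ['GAA', 'CGC', 'TTC']
# "IXK" is ATC + TAG + AAG, which hides an XbaI site across codon frames.
print(break_sites(["ATC", "TAG", "AAG"]))  # ['ATC', 'CGC', 'TAG', 'AAG']
```

Because the spacer codon does not stand for any character, a decoder can discard it without ambiguity, at the cost of 3 nt per interrupted site.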