Synthetic Biology:Vectors/Barcode

Barcode scheme to encode bit strings

Enable encoding of arbitrary bit strings into DNA without introducing "biologically bad" sequences. (Tom and Austin).

Text to bit string converter (ASCII not Unicode)

Compression algorithms for DNA sequences: X. Chen, S. Kwong, M. Li, Genome Informatics (GIW'99), Tokyo, Japan, pp.51-61, 1999.

I've put up a test page for playing with encoding binary into DNA at http://synbio.mit.edu/tools/encoder.cgi

Check out the world's first Illegal DNA sequence.

The general encoding/decoding method: Each byte of 8 bits is split into 4 pairs of 2 bits. Each pair of bits is mapped to a nucleotide, and the mapping depends on the pair's position within the byte. For example, 00 at position 0 could be mapped to A, 01 at position 0 to T, 00 at position 1 to T, etc. To be decodable, the mapping at each position must be one-to-one between the four 2-bit values and the four nucleotides. Each position therefore allows 4! = 24 mappings, which leaves 24^4 = 331,776 different ways to do this type of encoding. Currently, I pick one particular encoding which, for some common things that Reshma would like to encode (plasmid names, parts, etc. in ASCII), has the following properties:

%GC close to 50%
%GT as high as possible (biased nucleotide use).
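
The per-position mapping described above can be sketched in a few lines of Python. This is only an illustration: the `MAPPING` table below is a made-up example, not the encoding used by the encoder page, and the function names are hypothetical.

```python
# Hypothetical example mapping (NOT the one used by the encoder page).
# MAPPING[p][v] is the nucleotide for 2-bit value v at position p in a byte.
MAPPING = ("GTAC", "TGCA", "GTCA", "TGAC")


def text_to_bits(text):
    """ASCII text -> bit string; non-ASCII input raises UnicodeEncodeError."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def encode(bits, mapping=MAPPING):
    """Encode a bit string (whole bytes only) as DNA, 2 bits per nucleotide."""
    if len(bits) % 8 != 0 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of 0/1 made of whole bytes")
    return "".join(
        mapping[(i // 2) % 4][int(bits[i:i + 2], 2)]
        for i in range(0, len(bits), 2)
    )


def decode(dna, mapping=MAPPING):
    """Invert encode(); works because each position's mapping is one-to-one."""
    return "".join(f"{mapping[i % 4].index(nt):02b}" for i, nt in enumerate(dna))


def fraction(dna, letters):
    """Fraction of nucleotides in dna that belong to letters, e.g. 'GC'."""
    return sum(nt in letters for nt in dna) / len(dna)


bits = text_to_bits("pSB1A3")
dna = encode(bits)
assert decode(dna) == bits
print(dna, fraction(dna, "GC"), fraction(dna, "GT"))
```

Each ASCII character becomes 4 nt, so a 6-character plasmid name gives a 24 nt barcode. The two `fraction` calls report the %GC and %GT properties listed above for whichever mapping is chosen.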

Biasing the nucleotides makes it less likely for restriction sites or other secondary structures to appear. In addition, it allows for an easier choice of an escape sequence. The idea is to have a family of escape sequences, for example ACNNNAC, which can be inserted anywhere you want. Inserting arbitrary non-coding sequence in this way lets you adjust the %GC content if desired, break up bad sequences, and so on. If the escape occurs in the real sequence, it gets escaped itself. Escapes have not been chosen or implemented yet.

Another idea I had was to try all possible encodings and pick the one that gives the best properties for the data at hand. The chosen encoding would then be written at the beginning of the barcode (as the start code), which would take a fixed 12 nt.
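
A rough sketch of that exhaustive search follows. The scoring rule (closest to 50% GC first, then highest GT count) and the header layout (the first three nucleotides of each position's mapping, with the fourth implied, giving 4 x 3 = 12 nt) are assumptions made for illustration; the page does not specify either.

```python
from itertools import permutations, product

# All 4! = 24 one-to-one mappings from 2-bit values (00, 01, 10, 11) to nucleotides.
PERMS = ["".join(p) for p in permutations("ACGT")]


def best_encoding(data):
    """Try all 24**4 per-position mappings for data (bytes) and return the best."""
    if not data:
        raise ValueError("need at least one byte")
    # counts[p][v] = how often 2-bit value v occurs at position p of a byte
    counts = [[0] * 4 for _ in range(4)]
    for byte in data:
        for p in range(4):
            counts[p][(byte >> (6 - 2 * p)) & 3] += 1
    total = 4 * len(data)
    # For each position and mapping: (number of G/C, number of G/T) produced.
    stats = [
        [
            (
                sum(c for c, nt in zip(counts[p], perm) if nt in "GC"),
                sum(c for c, nt in zip(counts[p], perm) if nt in "GT"),
            )
            for perm in PERMS
        ]
        for p in range(4)
    ]
    best, best_score = None, None
    for choice in product(range(len(PERMS)), repeat=4):
        gc = sum(stats[p][k][0] for p, k in enumerate(choice))
        gt = sum(stats[p][k][1] for p, k in enumerate(choice))
        # Assumed scoring: GC closest to 50% wins, ties broken by more G/T.
        score = (abs(gc / total - 0.5), -gt)
        if best_score is None or score < best_score:
            best, best_score = choice, score
    return tuple(PERMS[k] for k in best)


def header(mapping):
    """Assumed 12 nt start code: first 3 nucleotides of each position's mapping."""
    return "".join(m[:3] for m in mapping)


def parse_header(start_code):
    """Recover the four mappings; the 4th nucleotide is the one not listed."""
    mapping = []
    for i in range(0, 12, 3):
        first3 = start_code[i:i + 3]
        mapping.append(first3 + (set("ACGT") - set(first3)).pop())
    return tuple(mapping)


chosen = best_encoding(b"pSB1A3")
assert parse_header(header(chosen)) == chosen
```

The search is cheap because %GC and %GT depend only on how often each 2-bit value occurs at each byte position, so the data is scanned once and the 331,776 candidates are scored from the counts.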

Randy suggested some form of compression. It is unclear how much space this would save or how much complexity it would add to the algorithm.

This algorithm provides the ability to encode anything such as Unicode, pictures, or anything else under the sun. Is the complexity and increase in size worth this capability?

Barcode scheme to encode text only

Case-sensitive codon tables

Each codon represents one alphanumeric character (case-sensitive: upper- and lower-case letters get different codons). For convenience, each letter that is also a single-letter amino acid code is encoded by one of that amino acid's codons (aiming for near 50% GC content).

(Note this table was done by hand so please correct errors!)

Encoding table

Codon	Character	Rationale	Codon	Character	Rationale
GCA	A	codon for Ala	GCT	a	codon for Ala
GCC	B	(near alanine)	GCG	b	(near alanine)
TGC	C	codon for Cys	TGT	c	codon for Cys
GAC	D	codon for Asp	GAT	d	codon for Asp
GAA	E	codon for Glu	GAG	e	codon for Glu
TTC	F	codon for Phe	TTT	f	codon for Phe
GGA	G	codon for Gly	GGC	g	codon for Gly
CAC	H	codon for His	CAT	h	codon for His
ATC	I	codon for Ile	ATA	i	codon for Ile
GGT	J	(no reason)	GGG	j	(no reason)
AAG	K	codon for Lys	AAA	k	codon for Lys
CTA	L	codon for Leu	CTC	l	codon for Leu
ATG	M	codon for Met	CTG	m	sometimes codes for Met
AAC	N	codon for Asn	AAT	n	codon for Asn
CCC	O	(near proline)	CCT	o	(near proline)
CCG	P	codon for Pro	CCA	p	codon for Pro
CAA	Q	codon for Gln	CAG	q	codon for Gln
AGA	R	codon for Arg	AGG	r	codon for Arg
AGC	S	codon for Ser	AGT	s	codon for Ser
ACA	T	codon for Thr	ACT	t	codon for Thr
GTC	U	(near valine)	GTG	u	(near valine)
GTA	V	codon for Val	GTT	v	codon for Val
TGG	W	codon for Trp	TGA	w	(no reason)
TAG	X	resembles a stop codon	TAA	x	resembles a stop codon
TAC	Y	codon for Tyr	TAT	y	codon for Tyr
TTG	Z	(no reason)	TTA	z	(no reason)
ATT	0	zero seems to go with stop codon
CTT	1	(looks like an l)
ACC	2	two starts with a T
ACG	3	three starts with a T
CGA	4	has an R in it
TCT	5	(no reason)
TCC	6	six starts with an S
TCG	7	seven starts with an S
TCA	8	(no reason)
CGT	9	(no reason)

Lookup table

	T	C	A	G
T	f	5	y	c	T
	F	6	Y	C	C
	z	8	x	w	A
	Z	7	X	W	G
C	1	o	h	9	T
	l	O	H	spacer	C
	L	p	Q	4	A
	m	P	q	spacer	G
A	0	t	n	s	T
	I	2	N	S	C
	i	T	k	R	A
	M	3	K	r	G
G	v	a	d	J	T
	U	B	D	g	C
	V	A	E	G	A
	u	b	e	j	G
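
As a minimal sketch, the encoding table can be turned into a small Python encoder and decoder. The function names are hypothetical, the codon for "o" is written with the DNA letter T (CCT, as in the lookup table), and the two unassigned codons CGC and CGG are treated as spacers that the decoder skips.

```python
# Character -> codon, transcribed from the encoding table above.
CODON_FOR = {
    "A": "GCA", "a": "GCT", "B": "GCC", "b": "GCG", "C": "TGC", "c": "TGT",
    "D": "GAC", "d": "GAT", "E": "GAA", "e": "GAG", "F": "TTC", "f": "TTT",
    "G": "GGA", "g": "GGC", "H": "CAC", "h": "CAT", "I": "ATC", "i": "ATA",
    "J": "GGT", "j": "GGG", "K": "AAG", "k": "AAA", "L": "CTA", "l": "CTC",
    "M": "ATG", "m": "CTG", "N": "AAC", "n": "AAT", "O": "CCC", "o": "CCT",
    "P": "CCG", "p": "CCA", "Q": "CAA", "q": "CAG", "R": "AGA", "r": "AGG",
    "S": "AGC", "s": "AGT", "T": "ACA", "t": "ACT", "U": "GTC", "u": "GTG",
    "V": "GTA", "v": "GTT", "W": "TGG", "w": "TGA", "X": "TAG", "x": "TAA",
    "Y": "TAC", "y": "TAT", "Z": "TTG", "z": "TTA",
    "0": "ATT", "1": "CTT", "2": "ACC", "3": "ACG", "4": "CGA",
    "5": "TCT", "6": "TCC", "7": "TCG", "8": "TCA", "9": "CGT",
}
CHAR_FOR = {codon: char for char, codon in CODON_FOR.items()}
SPACERS = {"CGC", "CGG"}  # the two codons with no character assigned


def encode_text(text):
    """Alphanumeric text -> DNA; raises KeyError for unsupported characters."""
    return "".join(CODON_FOR[char] for char in text)


def decode_dna(dna):
    """DNA -> text, reading codon by codon and skipping spacer codons."""
    if len(dna) % 3 != 0:
        raise ValueError("barcode length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "".join(CHAR_FOR[c] for c in codons if c not in SPACERS)


barcode = encode_text("pSB1A3")
print(barcode)  # CCAAGCGCCCTTGCAACG
assert decode_dna(barcode) == "pSB1A3"
```

At 3 nt per character this scheme is shorter than the 4 nt per character of the bit-string scheme, at the cost of supporting only the 62 alphanumeric characters.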

Start and stop sequences

What is a good start and stop sequence for the plasmid barcode?

We could possibly use the same sequence that is used for the CDS barcodes (i.e. C TGA TAG TGC TAG TGT AGA T C) without the variable nucleotide. Or would this just confuse any diagnostics people try to run on constructs?
Another possibility is to flank both sides with the translational stop sequence.
Maybe a start and stop sequence isn't necessary?
One problem with this codon table is that it becomes possible to accidentally encode BioBricks sites in the barcode. A case-insensitive code might reduce the likelihood of that happening? Any possible fixes to this problem? Use one of the codons that doesn't encode an alphanumeric character as a "spacer" in this eventuality (i.e. CGC or CGG)?
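
The spacer idea can be sketched as follows. For example, the text "EF" encodes as GAA TTC, which is an EcoRI site. The code below assumes that "BioBricks sites" means the four standard assembly enzymes (EcoRI, XbaI, SpeI, PstI); the function names and the retry limit are hypothetical. All four sites are palindromic, so scanning one strand is enough.

```python
# Assumed set of forbidden sites: the four standard BioBrick assembly enzymes.
BIOBRICK_SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
}
SPACER = "CGC"  # one of the two codons with no character assigned


def find_sites(dna):
    """Return sorted (position, enzyme) pairs for every site found in dna."""
    hits = []
    for name, site in BIOBRICK_SITES.items():
        start = dna.find(site)
        while start != -1:
            hits.append((start, name))
            start = dna.find(site, start + 1)
    return sorted(hits)


def break_sites(dna, spacer=SPACER, max_rounds=20):
    """Insert spacer codons at codon boundaries until no site remains."""
    for _ in range(max_rounds):
        hits = find_sites(dna)
        if not hits:
            return dna
        start = hits[0][0]
        # A 6 nt site always spans a codon boundary; cut at the first one
        # that lies strictly inside the site so the reading frame is kept.
        cut = (start // 3 + 1) * 3
        dna = dna[:cut] + spacer + dna[cut:]
    raise RuntimeError("could not remove all sites; try the other spacer")


print(break_sites("GAATTC"))     # GAACGCTTC  (E, spacer, F)
print(break_sites("GGAATTCTT"))  # GGACGCATTCTT  (site was out of frame)
```

Because a spacer can in principle create a new site at its own edges, the function re-scans after every insertion instead of assuming one pass is enough. A decoder only needs to skip CGC and CGG codons to recover the original text.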

Notes

I didn't bother to try avoiding certain codons like start codons.
These codons may not be optimally spaced from one another? Tom doesn't think this matters.
Tom pointed out that the barcode should probably be GC-content neutral (i.e. try to avoid all-AT or all-GC codons).
