BBF RFC-13: Rethinking the boundaries and composition of coding regions
Tom Knight
19 November 2008

Related RFCs: 9, 11, 12
Keywords: protein fusions, domains, protein tags, assembly

Purpose: With the advent of several assembly standards fostering in-frame protein domain fusions, it is important to rethink our categorization of parts to allow the documentation and distribution of parts containing only a portion of a protein coding region. This RFC attempts to document initial thoughts on the naming and documentation of such sub-coding region parts.

Introduction

Proteins typically consist of one or more domains: sequences of amino acids which fold relatively independently and which are evolutionarily shuffled as a unit among different protein coding regions. The DNA sequence of such a domain must maintain in-frame translation, and thus its length is a multiple of three bases. In our older assembly technology, the assembly scar was 8 bases long, and so failed to maintain the coding region frame. Several proposals for new assembly techniques, including the Ira Phillips proposal, Bam/Bgl, BB-2 (see RFCs 11, 12, 14), and blunt scarless assembly, allow in-frame composition of protein domains.

The N-terminal domain of a protein coding region is special in a number of ways. First, it always contains a start codon, spaced at an appropriate distance from a ribosomal binding site. Second, many coding regions have special features at the N-terminus, such as protein export tags and lipoprotein cleavage and attachment tags. These domains cannot function when internal to a coding region, and are therefore termed Head domains.

Similarly, the C-terminal domain of a protein is special, containing at least a stop codon. Other special features, such as degradation tags, are also required to be at the extreme C-terminus. Again, these domains cannot function when internal to a coding region, and are termed Tail domains.
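
The frame argument above is simple arithmetic: a scar inserted between two coding parts preserves the reading frame only if its length is a multiple of three. A minimal Python sketch of that check follows. The function name preserves_frame is hypothetical, and the 6-base scar is an illustrative value chosen for contrast, not one taken from this RFC.

```python
def preserves_frame(scar_length: int) -> bool:
    """Return True if a scar of this length keeps downstream codons in frame.

    Translation reads three bases at a time, so any inserted sequence
    whose length is not a multiple of three shifts the reading frame.
    """
    return scar_length % 3 == 0


# The 8-base scar of the older assembly technology shifts the frame.
old_scar_ok = preserves_frame(8)

# A 6-base scar (illustrative value only) would keep the frame.
six_base_ok = preserves_frame(6)
```

Note that this checks the frame only. An in-frame scar still adds codons, and therefore amino acids, to the fused protein.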
Proposal:

Each coding region will consist logically of at least three domains: a Head Domain, one or more Internal Domains, and a Tail Domain. A part in the registry may (similar to any composite part) consist of a composition of domains. In particular, existing coding regions consist of a particularly simple Head Domain (the start codon), a single Internal Domain, and a simple Tail Domain (the stop codon).

(1) Head Domain: The Head Domain consists of the start codon followed immediately by zero or more triplets specifying an N-terminal tag, such as a protein export tag or lipoprotein binding tag.

(2) Internal Domains: Internal Domains consist of a series of codon triplets coding for an amino acid sequence without a start codon or stop codon. Multiple Internal Domains can be fused.

(3) Special Internal Domains: Short Internal Domains with specific function may be separately categorized, but obey the same composition rules as normal Internal Domains. Special Internal Domains include tags, linkers, cleavage sites, and intein sites.

(4) Tail Domain: The Tail Domain consists of zero or more triplet codons, followed by a pair of TAA stop codons. In the simplest case, the Tail Domain consists only of the stop codons, which terminate the protein. More complex Tail Domains may include degradation tags appropriate to the organism (for example, with different degradation rates).

Note that different assembly techniques will, in general, result in different amino acid sequences for coding regions composed out of the same Head, Tail, and Internal Domains. We anticipate that users will use care in thinking about the effects of such differences on their experiments, but also feel confident that many such differences will be minor when the composition uses structures such as export tags, degradation tails, and purification tags.

RFC 14 describes the use of these concepts in combination with the BB-2 assembly standard.
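
The composition rules for Head, Internal, and Tail Domains can be sketched as a small Python validator. This is an illustration under stated assumptions, not part of the proposal. All function names are hypothetical. ATG is assumed as the start codon. The "no start codon" rule for Internal Domains is read here as "does not begin with ATG", since methionine inside a domain is also encoded by ATG. Domains are joined by direct concatenation, which corresponds to scarless assembly only; other assembly techniques would insert scar codons between domains.

```python
START = "ATG"                  # assumed start codon
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons


def split_codons(seq):
    """Split a DNA string into triplets, or return None if it is out of frame."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return None
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def is_head(seq):
    """Start codon followed by zero or more tag triplets, with no stop codon."""
    c = split_codons(seq)
    return bool(c) and c[0] == START and STOPS.isdisjoint(c)


def is_internal(seq):
    """One or more triplets, not beginning with ATG, with no stop codon."""
    c = split_codons(seq)
    return bool(c) and c[0] != START and STOPS.isdisjoint(c)


def is_tail(seq):
    """Zero or more triplets followed by exactly one pair of TAA stop codons."""
    c = split_codons(seq)
    return (c is not None and len(c) >= 2
            and c[-2:] == ["TAA", "TAA"]
            and STOPS.isdisjoint(c[:-2]))


def compose(head, internals, tail):
    """Concatenate Head + Internal(s) + Tail after checking each domain."""
    if not is_head(head):
        raise ValueError("invalid Head Domain")
    if not internals or not all(is_internal(d) for d in internals):
        raise ValueError("need one or more valid Internal Domains")
    if not is_tail(tail):
        raise ValueError("invalid Tail Domain")
    return head.upper() + "".join(d.upper() for d in internals) + tail.upper()


# Simplest case: bare start codon, one Internal Domain, bare stop pair.
coding_region = compose("ATG", ["GCTAAAGGT"], "TAATAA")
```

Because each domain is validated separately, the same checks apply to a Special Internal Domain such as a linker, which obeys the same rules as any other Internal Domain.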