Computational biology of gene expression
We study mechanisms of gene expression and regulation using computational and experimental approaches. We also develop algorithms for the identification of genes in genomic sequences and other applications in genomics. A unifying goal of our research is to understand the rules of RNA splicing specificity: how the precise locations of introns and splice sites are identified in primary transcripts. A major current effort is to develop computational methods to identify splicing enhancer and repressor motifs and to test the function of these motifs using in vivo splicing assays. We continue to develop improved methods for identifying genes in eukaryotic genomes, and we have started to work on computational methods for identifying microRNAs and predicting their functions. We are also using a combination of computational and experimental methods to study alternative splicing, a common mechanism of gene regulation in vertebrates.
RNA splicing specificity
Most eukaryotic genes contain one or more introns, which must be removed from the primary transcript by the RNA splicing machinery to create the mRNA sequence that directs protein synthesis. This process must be highly accurate to ensure production of adequate amounts of correctly processed mRNA. The problem of RNA splicing specificity is to describe the set of "rules" that govern the choice of intron and splice site locations in primary transcripts by the nuclear splicing machinery, and to understand the molecular basis for these rules. This problem is analogous to the one biochemists faced in the early 1960s: identifying the rules governing translation of mRNAs into specific peptide sequences by the ribosome, the solution of which was the genetic code. The rules governing splicing are likely to be more complicated than those for translation, and they are not exactly the same in all organisms. On the other hand, progress in large-scale sequencing efforts is providing a wealth of data related to this problem in the form of thousands of gene sequences of known exon-intron structure.
A typical human primary transcript is 30 kilobases long and contains about ten exons separated by much larger and more variably sized introns. This discrepancy between human exon and intron lengths led to the "exon definition" model of splicing, in which splice sites are first paired across exons, with subsequent spliceosome assembly proceeding through pairing of exon units. In the alternative "intron definition" model, splice sites are initially paired across introns rather than exons. Intron definition is thought to be the predominant mode of splicing in transcripts containing short introns and long exons.
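As a rough illustration of the length contrast behind the two models, the Python sketch below computes exon and intron lengths from a list of exon coordinates and reports which unit is shorter. The function names, the half-open coordinate convention, the median comparison, and the example coordinates are all assumptions made for illustration. This is a toy heuristic, not a description of how the splicing machinery selects sites and not a method used in the work described here.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical representation: each exon is a half-open interval [start, end)
# in primary-transcript coordinates, listed in 5' to 3' order.
Exon = Tuple[int, int]


def exon_lengths(exons: List[Exon]) -> List[int]:
    """Length of each exon."""
    return [end - start for start, end in exons]


def intron_lengths(exons: List[Exon]) -> List[int]:
    """Length of each intron, i.e. the gap between consecutive exons."""
    return [nxt[0] - cur[1] for cur, nxt in zip(exons, exons[1:])]


def shorter_unit(exons: List[Exon]) -> str:
    """Toy heuristic (an assumption, not an established rule):
    name the model whose pairing unit has the smaller median length."""
    if len(exons) < 2:
        raise ValueError("need at least two exons to define an intron")
    if median(exon_lengths(exons)) < median(intron_lengths(exons)):
        return "exon definition"
    return "intron definition"


# Made-up coordinates: short exons separated by long introns.
long_intron_gene = [(0, 150), (3000, 3140), (9000, 9120)]
# Made-up coordinates: long exons separated by short introns.
short_intron_gene = [(0, 1200), (1280, 2500), (2570, 4000)]

print(shorter_unit(long_intron_gene))
print(shorter_unit(short_intron_gene))
```

With the made-up coordinates above, the first transcript has exons far shorter than its introns and the second has the reverse, so the heuristic labels them with the exon definition and intron definition models respectively.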