AlexLabNotebook/ChrIIIRebuild/9/19/05-9/23/05

= 9/19/05 - 9/23/05 =

Goal: make a list of all features on chromosome III, and start to figure out the 5' and 3' ends required to specify a "functional element".

Next week: 9/24/05-9/30/05

ChrIII specific

 * Get list of types of features -- done.
 * Determine position of ARS, centromere -- done.
 * Make list of genes with introns -- done.
 * Determine whether there are any polycistronic transcripts -- done
 * Make list of overlapping functional elements -- done.
 * For each feature, determine the 5' and 3' flanking sequences that are needed [e.g. promoters and 3' UTRs for genes]; the feature plus these flanks is a functional element.
 * Determine transcription start and end sites for Pol I, II and III genes
 * Figure out what cis-acting sequences are required for functions of all other types of elements
 * Determine whether orientation matters for HML, HMR and MAT loci.
 * Make list of essential genes.
 * Find genes controlled by same TF that are divergently transcribed.
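
A minimal sketch of the first half of the last item: finding adjacent gene pairs that are divergently transcribed (minus-strand gene on the left, plus-strand gene on the right). The `Gene` record, the `max_gap` cutoff and the example coordinates are all made up for illustration; checking whether a pair shares a TF would need binding-site data that is not listed here.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" (Watson) or "-" (Crick)

def divergent_pairs(genes, max_gap=1000):
    """Return (left, right, gap) for adjacent genes transcribed away from
    each other, with an intergenic gap of at most max_gap bp."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1
        # Divergent: left gene on "-" and right gene on "+" share the
        # intergenic region as their (potential) promoter region.
        if left.strand == "-" and right.strand == "+" and 0 <= gap <= max_gap:
            pairs.append((left.name, right.name, gap))
    return pairs

# Hypothetical coordinates, not real chromosome III annotations.
genes = [
    Gene("geneA", 100, 900, "-"),
    Gene("geneB", 1300, 2500, "+"),
    Gene("geneC", 2700, 3600, "+"),
    Gene("geneD", 3800, 4400, "-"),
]
print(divergent_pairs(genes))
```

Only neighbouring genes are compared, so a pair separated by another annotated gene is never reported.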

General

 * If 2 genes controlled by the same TF are divergently transcribed, doesn't separating them while preserving the promoter sequence mean that you now need 2x the concentration of TF as before, because you've duplicated the TF's binding sites? Alternatively, given the large number of alternative binding sites that already exist, maybe adding one more is really just a very minor change in the concentration of TF seen by the "real" binding site.
 * Is there any known correlation between whether a gene is on the leading/lagging strand and its mutation rate, expression level, etc.? In other words, is there a model for what could/should happen if a gene is moved from the plus strand to the minus strand, or vice versa?
 * Consider taking out pseudogenes, introns [have to be careful not to eliminate snoRNAs that are made from introns].
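
A back-of-the-envelope sketch for the TF question above. It assumes a deliberately crude model that is not from any of the sources here: a fixed number of TF molecules per cell, N identical and independent binding sites, and one dissociation constant expressed in the same units as the TF count. All numbers are hypothetical; the point is only to compare site occupancy with N versus N+1 sites.

```python
import math

def free_tf(total_tf, n_sites, k_d):
    """Free TF when n_sites identical, independent sites compete for it.

    Solves the mass balance  total = free + n_sites * free / (k_d + free),
    which rearranges to  free^2 + (k_d + n_sites - total) * free - k_d * total = 0.
    """
    b = k_d + n_sites - total_tf
    return (-b + math.sqrt(b * b + 4.0 * k_d * total_tf)) / 2.0

def occupancy(total_tf, n_sites, k_d):
    """Fraction of time any one site is bound by the TF."""
    f = free_tf(total_tf, n_sites, k_d)
    return f / (k_d + f)

# Hypothetical values: 500 TF molecules, 200 existing sites, k_d of 50.
before = occupancy(500, 200, 50)
after = occupancy(500, 201, 50)   # one duplicated site
print(before, after, before - after)
```

Under these assumptions the drop in occupancy from one extra site is small when many sites already exist, which is the second scenario in the note; with very few sites or very little TF the change would be larger.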

Features

 * Chr III features:
   * repeat_family (6): Terminal bits of telomeres, X-element combinatorial repeats and X-element core sequence. Occur on both right and left ends of chromosome.
   * telomere (2): Full left and right telomeres.
   * nucleotide_match (17): ARS and X-element core sequences.
   * binding_site (2): X-elements which contain Tbp1 binding site.
   * ARS (19): Full ARS sequence, not just consensus sequence.
   * repeat_region (19): LTR of Ty 1, 2, 4 and 5 elements, like YCLWomega1.
   * transposable_element (2): Full-length Ty5 and Ty2 elements, like YCLWTy5-1.
   * gene (182): Full sequence for gene, spanning all introns and exons.
   * CDS (195): Coding sequences for "real" genes, pseudogenes and transposable element genes; each exon is considered a separate coding sequence.
   * pseudogene (2): YCL075W and YCL074W encode fragments of Ty Pol protein.
   * region (7): HMRa, HMLalpha, MATalpha and consensus elements CDEI, CDEII and CDEIII of centromere.
   * tRNA (10): Full tRNAs; some have introns, eg tRNA-Leu.
   * ncRNA (16): tRNAs and snoRNAs, like snR65.
   * transposable_element_gene (2): Genes encoded by Ty elements, like YCL020W.
   * intron (10): Introns of 8 genes and 2 tRNAs.
   * snoRNA (4): snoRNAs.
   * centromere (1): Full centromeric sequence for CEN3.
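
A sketch of how per-type counts like the ones above could be tallied from a GFF-style feature file (tab-separated, sequence name in column 1, feature type in column 3). The sequence name `chrIII` and the example lines are assumptions for illustration, not the actual annotation file used for this list.

```python
from collections import Counter

def count_feature_types(gff_lines, seqid):
    """Count features of each type (GFF column 3) on one sequence (column 1)."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        if fields[0] == seqid:
            counts[fields[2]] += 1
    return counts

# Hypothetical feature lines, not real chromosome III coordinates.
example = [
    "##gff-version 3",
    "chrIII\tSGD\tgene\t100\t900\t.\t+\t.\tID=geneA",
    "chrIII\tSGD\tCDS\t100\t900\t.\t+\t0\tParent=geneA",
    "chrIII\tSGD\tgene\t1300\t2500\t.\t-\t.\tID=geneB",
    "chrIV\tSGD\tgene\t50\t700\t.\t+\t.\tID=geneC",
]
for feature_type, n in sorted(count_feature_types(example, "chrIII").items()):
    print(f"{feature_type} ({n})")
```

To run it on a real file, pass an open file handle instead of the `example` list.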


 * Elements that contain introns:
   * tL(CAA)C
   * tS(CGA)C
   * YCL012C
   * YCL005W-A
   * YCL002C
   * YCR028C-A
   * YCR031C
   * YCR097W aka HMRa


 * Some snoRNAs are polycistronic; see, for example, Chanfreau et al.

3' and 5' ends

 * Pol I is only required for rRNA gene transcription; YCR003W is the only rRNA gene on chromosome III.
 * [Textbook] Pol II promoters are ~40 bp long, extend upstream and downstream of the transcription start site, and consist of the following (all positions relative to the transcription start site):
   * TFIIB responsive element [BRE], (G/C)(G/C)(G/A)CGCCC, -37 to -32
   * TATA box, TATA(A/T)(A/T), -31 to -26
   * initiator, (C/T)(C/T)AN(T/A)(C/T)(C/T), -2 to +4
   * downstream promoter elements [DPE], (A/G)G(A/T)CGTG, +28 to +32
 * [Textbook] Transcription of Pol II genes is terminated by polyA signal sequences that signal cleavage and addition of the polyA tail.
 * [Textbook] Pol III promoters tend to be downstream of transcription start sites. They have Box A and Box B motifs.
 * UTRdb is a database of published 3' and 5' UTRs.
 * The Zhang and Dietrich paper describes experimental mapping of transcription start sites. Raw data available here.
 * 3' end formation for Pol II genes:
   * 3' UTR prediction website, has downloadable data on predicted 3' UTRs for yeast genes.
     * The downloadable data doesn't have UTRs for all genes, because some genes didn't have a predicted 3' UTR that scored above the cutoff. In particular, 98 genes on chromosome III don't have predicted 3' UTRs. For these, either run the prediction manually and pick the most probable site, or just choose a fixed UTR length, such as the maximum UTR length found so far.
     * The downloadable data also has some strange UTR lengths, e.g. genes YBR264C, YFR026C etc. supposedly have a 3' UTR of length 1.
   * Review of mRNA 3' end formation.
   * Guo & Sherman review of cis elements directing 3' end formation; contains consensus sequences.
   * Graber et al. bioinformatic analysis of 3'-end-processing signals; contains consensus sequences, ranked by strength.
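
A sketch of how the textbook consensus strings above, written in the `(A/T)` and `N` notation used in this list, could be turned into regular expressions and searched for in a sequence. The function names and the test sequence are made up for illustration; the scan only covers the strand it is given, and it says nothing about whether a match is functional.

```python
import re

def consensus_to_regex(consensus):
    """Convert e.g. 'TATA(A/T)(A/T)' to 'TATA[AT][AT]'; N matches any base."""
    pattern = re.sub(r"\(([ACGT])/([ACGT])\)", r"[\1\2]", consensus)
    return pattern.replace("N", "[ACGT]")

def find_motif(sequence, consensus):
    """0-based start positions of all matches, including overlapping ones."""
    # A lookahead keeps the match zero-width so overlapping hits are found.
    regex = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]

TATA_BOX = "TATA(A/T)(A/T)"
INITIATOR = "(C/T)(C/T)AN(T/A)(C/T)(C/T)"

# Hypothetical sequence, not taken from chromosome III.
seq = "ggTATAAAggaaCCAGTCCaa"
print(consensus_to_regex(TATA_BOX))
print(find_motif(seq, TATA_BOX))
print(find_motif(seq, INITIATOR))
```

Short degenerate motifs like these match by chance very often, so hits would still need to be filtered by their position relative to a mapped transcription start site.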

Overlapping elements

 * Top-level feature types:
   * Telomere [includes repeat_family, binding_site, some of the nucleotide_match elements]
   * ARS [includes rest of nucleotide_match elements]
   * Transposable_element [includes some repeat_regions, pseudogene elements]
   * repeat_region
   * Gene [includes CDS elements, some region elements]
   * tRNA [includes some ncRNA elements]
   * snoRNA [includes rest of ncRNA elements]
   * Centromere [includes rest of region elements]
 * Assertion: Interesting overlaps to remove are between the top-level feature types above.
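
A minimal sketch of how overlaps between top-level features could be listed once each feature has been assigned to one of the types above. The `Feature` record and the coordinates are hypothetical; coordinates are treated as 1-based and inclusive.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    name: str
    kind: str   # one of the top-level feature types
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def overlapping_pairs(features):
    """Return (name_a, name_b, overlap_bp) for every overlapping pair."""
    ordered = sorted(features, key=lambda f: (f.start, f.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # sorted by start, so no later feature can overlap a
            overlap = min(a.end, b.end) - b.start + 1
            pairs.append((a.name, b.name, overlap))
    return pairs

# Made-up features for illustration only.
features = [
    Feature("ars1", "ARS", 100, 350),
    Feature("geneA", "Gene", 300, 1200),
    Feature("geneB", "Gene", 1100, 1400),
    Feature("trna1", "tRNA", 1500, 1580),
]
for pair in overlapping_pairs(features):
    print(pair)
```

Pairs of the same type (such as two genes) are reported too; they can be filtered on `kind` if only cross-type overlaps matter.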
