User:Michael Barton/Notebook/Biosynthetic cost in protein evolution/Datasets/Genes

Protein coding genes


The genes database table has two columns: the SGD name of each sequence and its coding DNA. The following holds for every entry:


  • No table column is empty
  • Each sequence begins with a start codon, ends with a stop codon, and contains only the characters A, T, G, and C
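The two conditions above could be checked with something like the following minimal sketch in plain Ruby. The helper name `valid_gene?` and the constants are hypothetical and not taken from the notebook's code. The sketch also assumes the standard genetic code (start codon ATG; stop codons TAA, TAG, TGA), which the original does not spell out.

```ruby
# Hypothetical constants: assumes the standard genetic code.
START_CODON = "ATG"
STOP_CODONS = %w[TAA TAG TGA].freeze

# Hypothetical helper: returns true when a gene record meets the listed conditions.
def valid_gene?(name, sequence)
  # Neither column may be empty.
  return false if name.nil? || name.strip.empty?
  return false if sequence.nil? || sequence.empty?

  seq = sequence.upcase
  # Only A, T, G, C are allowed; the sequence must start with the start
  # codon and its last three nucleotides must form a stop codon.
  seq.match?(/\A[ATGC]+\z/) &&
    seq.start_with?(START_CODON) &&
    STOP_CODONS.include?(seq[-3, 3])
end
```

A record such as `valid_gene?("GENE1", "ATGAAATAA")` passes, while a sequence containing an ambiguous base such as N, or one lacking a terminal stop codon, is rejected.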


Each gene is loaded into the database from the SGD FASTA file of protein-coding ORFs, which is available on the SGD FTP server. The data are loaded using the BioRuby library: for each FASTA entry, the first word of the header is stored as the gene name and the FASTA sequence is stored as the gene sequence.
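The header-splitting rule can be illustrated with a short sketch. The notebook uses BioRuby for parsing. The code below is a plain-Ruby stand-in, so that it runs without the gem, and is not the notebook's actual loader. The name `parse_fasta` and the hash layout are hypothetical.

```ruby
# Hypothetical stand-in for the BioRuby-based loader: parses FASTA text
# into an array of { name:, sequence: } hashes.
def parse_fasta(text)
  genes = []
  text.each_line do |line|
    line = line.strip
    next if line.empty?

    if line.start_with?(">")
      # The first word of the header becomes the gene name.
      name = line[1..-1].split.first
      genes << { name: name, sequence: String.new }
    elsif genes.any?
      # Sequence lines are concatenated onto the most recent entry.
      genes.last[:sequence] << line.upcase
    end
  end
  genes
end
```

Each resulting hash corresponds to one row of the genes table. Writing the rows to the database is left out, since the notebook does not describe the database layer.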

After loading, there are 5883 protein-coding genes in the database. The shortest gene is 51 nucleotides long and the longest is 14733 nucleotides long, both including the stop codon. The mean gene length is 1489.83528811831 nucleotides with a standard deviation of 1147.7776369172 nucleotides, again including stop codons.
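Summary figures of this kind could be computed roughly as follows. This is a sketch only: `length_summary` is a hypothetical helper, and it assumes the sample standard deviation (dividing by n - 1). The notebook does not say which estimator was used.

```ruby
# Hypothetical helper: summarises gene lengths (stop codon included,
# since it is part of each stored sequence).
def length_summary(sequences)
  lengths = sequences.map(&:length)
  n = lengths.size
  raise ArgumentError, "need at least two sequences" if n < 2

  mean = lengths.sum.to_f / n
  # Assumption: sample variance with an n - 1 denominator.
  variance = lengths.sum { |len| (len - mean)**2 } / (n - 1)

  { count: n, min: lengths.min, max: lengths.max,
    mean: mean, sd: Math.sqrt(variance) }
end
```

For three toy sequences of lengths 3, 6, and 9, this gives a mean of 6.0 and a standard deviation of 3.0.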

