I had an idea for how one could quantify the occurrence of tetramers (for example GATC) in coding sequences relative to their expected occurrence given the amino acid sequence.

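A tetramer is longer than one codon, so GATC always spans two adjacent codons, in one of three frames: GAT|CNN, NGA|TCN, or NNG|ATC. Codon usage probabilities relevant to GATC for the different aa combinations: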

- probability of GAT for D: 0.63
- probability of CNN for L: 0.727
- probability of CNN for P: 1
- probability of CNN for H: 1
- probability of CNN for Q: 1
- probability of CNN for R: 0.919
- probability of ATC for I: 0.414
- probability of NNG for S: 0.153
- probability of NNG for W: 1
- probability of NNG for M: 1
- probability of NNG for T: 0.269
- probability of NNG for K: 0.239
- probability of NNG for R: 0.14
- probability of NNG for V: 0.366
- probability of NNG for A: 0.352
- probability of NNG for E: 0.312
- probability of NNG for G: 0.15
- probability of NNG for P: 0.511
- probability of NNG for L: 0.616
- probability of NNG for Q: 0.651
- probability of TCN for S: 0.575
- probability of NGA for G: 0.114
- probability of NGA for R: 0.116
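
To make the idea concrete, here is a rough Python sketch of how the numbers above could be combined into an expected GATC count for a protein and compared with the count observed in its coding sequence. It rests on two assumptions of mine: neighbouring codons are chosen independently, and the long third group of entries means codons ending in G, which is what pairs with ATC to give G|ATC. The function names are made up for illustration.

```python
# Codon usage probabilities copied from the list above.
P_GAT = {"D": 0.63}                                   # first codon is GAT
P_CNN = {"L": 0.727, "P": 1.0, "H": 1.0, "Q": 1.0, "R": 0.919}  # second codon starts with C
P_NGA = {"G": 0.114, "R": 0.116}                      # first codon ends in GA
P_TCN = {"S": 0.575}                                  # second codon starts with TC
# Assumption: these entries are read as codons ending in G (NNG).
P_NNG = {"S": 0.153, "W": 1.0, "M": 1.0, "T": 0.269, "K": 0.239, "R": 0.14,
         "V": 0.366, "A": 0.352, "E": 0.312, "G": 0.15, "P": 0.511,
         "L": 0.616, "Q": 0.651}
P_ATC = {"I": 0.414}                                  # second codon is ATC


def pair_probability(aa1, aa2):
    """Probability that GATC spans the codons of two adjacent amino acids.

    The three frames require different first codons (ending in T, A, or G),
    so they are mutually exclusive and their probabilities can be summed.
    Assumes the two codons are chosen independently.
    """
    return (P_GAT.get(aa1, 0.0) * P_CNN.get(aa2, 0.0)
            + P_NGA.get(aa1, 0.0) * P_TCN.get(aa2, 0.0)
            + P_NNG.get(aa1, 0.0) * P_ATC.get(aa2, 0.0))


def expected_gatc(protein):
    """Expected number of GATC sites, summed over all adjacent residue pairs."""
    return sum(pair_probability(a, b) for a, b in zip(protein, protein[1:]))


def observed_gatc(cds):
    """Number of GATC occurrences in a coding sequence, in any frame."""
    cds = cds.upper()
    return sum(1 for i in range(len(cds) - 3) if cds[i:i + 4] == "GATC")


if __name__ == "__main__":
    cds = "GATCCGTGGATC"   # codons GAT CCG TGG ATC
    protein = "DPWI"       # translation of the codons above
    print(observed_gatc(cds), expected_gatc(protein))
```

The ratio of observed to expected counts, summed over many genes, would then show whether GATC is over- or under-represented relative to what the amino acid sequence and codon usage alone predict.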

Codon usage database: http://bioinformatics.forsyth.org/mgcud/index.php