Vocabulary analysis

This is a technique that searches for word-like strings in sequences. It performs very well on English texts (as one would hope), and also reveals a surprising variety of word-like strings in protein sequences, although these are somewhat sparser than they are in English texts: texts are made of words, whereas proteins are definitely not, though they do have words embedded in them. This has now been published: here

What the words are, or rather what they represent, is still a matter for discussion. Some of them are hyper-conserved stretches in rather distantly related proteins. A few others appear in apparently unrelated sequences but, where structural information is available, seem to be of similar structure. Some appear to have nothing in common at all; I call these homonyms (words of identical spelling but different meaning). Different proteomes also have different predominant classes of words, e.g. some species have repetitive low-complexity words, while in others the words are conserved runs from large, often immense, gene families.

The technique in the paper is based on (and is a considerable improvement on) a classic method from the early 80s that was applied to DNA only. It might be worth trying the new technique on DNA too. I think there is still a bit of mileage in this, but have no idea when I'll get around to doing it.
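The published method itself is not described on this page. Purely as a rough illustration of what "searching for word-like strings" can mean, the sketch below counts every substring of a fixed length k and ranks it by how much more often it occurs than an independent-letter background would predict. This is a simple stand-in, not the technique from the paper or the classic method it builds on; the function name `overrepresented_kmers` and the parameters `k` and `min_count` are made up for the example.

```python
from collections import Counter
from math import prod


def overrepresented_kmers(sequences, k=3, min_count=2):
    """Rank k-mers by observed/expected count (illustrative stand-in only)."""
    # Letter frequencies over all sequences give a crude background model
    # in which every position is drawn independently.
    letters = Counter(ch for seq in sequences for ch in seq)
    total_letters = sum(letters.values())
    freq = {ch: n / total_letters for ch, n in letters.items()}

    # Count every overlapping substring of length k within each sequence.
    observed = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    positions = sum(observed.values())

    scored = []
    for kmer, count in observed.items():
        if count < min_count:
            continue  # skip strings too rare to say anything about
        # Expected count if letters were independent of their neighbours.
        expected = positions * prod(freq[ch] for ch in kmer)
        scored.append((kmer, count, count / expected))

    # Highest observed/expected ratio first.
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

Calling `overrepresented_kmers(["thecatthedog", "xthey"], k=3)` on text with the spaces removed puts "the" at the top of the list. The same call works unchanged on protein or DNA strings, although a fixed k and an independent-letter background are much cruder than anything one would want to publish.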