Tregwiki:Why is this a difficult problem

The problem of finding binding sites is difficult at various levels:


 * In the lab: Binding sites are difficult to prove experimentally. There is no high-throughput experiment that shows whether a given sequence really has an effect on an organism. Of course, you can delete the sequence and check whether the organism still develops normally. But this is impossible in humans and takes a long time in mice and plants. That is why most sites are tested in cultured cells or in fast-growing animals such as flies, worms, or fish.
 * But even with a fast-growing animal: Say you have cloned a certain fragment 5' of a basal promoter and GFP into a plasmid... and you don't see any expression. Is it due to the basal promoter? Due to the spacing between your fragment and the promoter? Are you looking at the right stage of development? If your organism is not transparent, did you section it at every angle imaginable? Personally, I doubt that there will be a complete description of all cis-regulatory elements of the human genome in our lifetime.


 * Some transcription factors bind several completely different sites, and some sites are still unknown.


 * On the computer: Even when a site of a transcription factor is known to be valid from experiments (at the moment, more than 13000 sites are known), you cannot simply take a genome and search for all occurrences of the site. Binding sites are simply too short and too degenerate for this approach: A factor that recognizes AATAATCC might also recognize GGTAATCC or GGTAATAA. Therefore, we usually wait until many sites of the same transcription factor have been elucidated and then align them. The result is four probabilities for every position of the binding site, for example (A=0.3, C=0.2, G=0.3, T=0.2) for the first position of our example factor. That would mean that it preferentially binds A and G at the first nucleotide. These distributions can be visualized with a sequence logo, which draws the more important nucleotides taller than the unimportant ones.


 * Even when we know many binding sites for a transcription factor, the matrix is so degenerate that when you search a given promoter for all known matrices on some websites, you get more than one hit per base pair. So you have to know roughly what you are looking for before you search a sequence. Even then, if your sequence is longer than 400 bp, you will always find what you are looking for, even in a completely random sequence. If you ever use gene-regulation.com or genomatix.com or a similar site to scan your promoter sequences, please generate a random sequence of the same length (type "random dna sequence" into Google) and compare the results. You will be surprised how many hits you get.


 * Therefore, many computer scientists are trying to improve both the way searching is performed and the representation of the matrices themselves. Biologists are trying to speed up and automate the process of site validation. This wiki is dedicated to these approaches and gives some rudimentary comments on them.
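
The matrix idea and the random-sequence check described in the list above can be tried out in a few lines of Python. The following is only a minimal sketch, not the method used by any of the websites mentioned: the function names (build_matrix, log_odds, scan, random_dna), the pseudocount of 0.5, the uniform background of 0.25 per base, and the score threshold of 4.0 are all assumptions chosen for illustration. It builds per-position probabilities from the three example sites, scores every window of a sequence with a log-odds score, and counts the hits in 400 bp of random DNA.

```python
import math
import random

BASES = "ACGT"


def build_matrix(sites, pseudocount=0.5):
    """Per-position base probabilities from aligned sites of equal length.

    The pseudocount (an assumption of this sketch) keeps bases that were
    never observed at a position from getting a probability of zero.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return matrix


def log_odds(matrix, window, background=0.25):
    """Sum of log2(probability / background) over the positions of a window."""
    return sum(math.log2(col[base] / background) for col, base in zip(matrix, window))


def scan(matrix, sequence, threshold):
    """Return (start, score) for every window scoring at least the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_odds(matrix, sequence[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits


def random_dna(length, rng):
    """Random sequence with all four bases equally likely."""
    return "".join(rng.choice(BASES) for _ in range(length))


# The three example sites from the text, already aligned.
example_sites = ["AATAATCC", "GGTAATCC", "GGTAATAA"]
example_matrix = build_matrix(example_sites)

# Scan 400 bp of random DNA; the threshold is an arbitrary illustrative value.
random_sequence = random_dna(400, random.Random(0))
random_hits = scan(example_matrix, random_sequence, threshold=4.0)
print(f"{len(random_hits)} hits in {len(random_sequence)} bp of random sequence")
```

Running the same scan on your real promoter and on a few random sequences of the same length, and comparing the hit counts, gives a rough feeling for how many of the hits could be expected by chance alone. Lowering the threshold or using a matrix built from only a few sites will usually raise the number of random hits.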