Hebin:TFBS asymmetry

Question Are enhancers enriched for binding sites than random sequences?Are binding sites evolving any slower than XXX? This involves explaining the excessive conservation in the spacer sequences!How many sites are affected by substitutions between mel and sim?Confirm the asymmetry! and show if conservation occurs at higher level--Pocc (compensation within homeotypic sites) or even between different factors'sitesOne direction: simulations to investigate properties of compensatory mutations how easy is it to create a new site de-novowhere do compensatory mutations tend to happen?confirm that overlapping sites tend to be more conserved. also show by simulation under a simple compensatory scheme. </li></ul> </ul> Cloning the Linear Templates were cloned into <a class="external-link" href="http://catalog.takara-bio.co.jp/en/product/basic_info.asp?unitid=U100005842">pUC118</a> cat 3322 bcd his-tag out of frame the 3 prime end of the 2-step pcr product. I cloned them into the vectors and picked two clones, both show this wrong 3' end that put his tag out of frame. GCAGTTTGCCTACTGCTTTAAT-ATCATCACCATCACCACTAATAA C  C Now I'm not sure if this is a problem with pcr or the primer. I'm going to do the 2-step pcr again and do bulk sequencing from the 3' end and also sequence the other three clones I picked. Might need to reorder the primer. General Method Approaches Compare evolutionary constraint in TFBS, spacers and other regions 117/704 sites contain fixed substitutions between mel and sim, around 1/7 most desirably: compare to the nearby gene's neutral rate.

Calculate π total: 0.01242 tfbs : 0.00729

nonsyn 0.0026 syn   0.0352 intron 0.0166 5'UTR 0.0108 3'UTR 0.0113 inter 0.0204 Divide sites by their information content Method <ul>First add pseudo count P[b] * K (k=1) to every cell of the matrix </li>ic = sum( f[b,p] * ln ( f[b,p] / P[b] ) )</li>Note that this is different from the other method which define ic = 2 + sum( f[b,p] * ln( f[b,p] ) )</li>ic range: 0.089 - 1.552. divide into (0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6) bins, count the total number of sites in each bin as well as number of sites div/poly</li></ul> Results For all 30 factors &lt;0.2   &lt;0.4    &lt;0.6    &lt;0.8    &lt;1.0    &lt;1.2    &lt;1.4    &lt;1.6 653    1233    1235    852     837     1069    32      207 21      34      31      25      12      26      0       3 7       18      16      14      5       6       0       6

&lt;0.8	&gt;0.8 2.79%	=1.91% <img class="image-inline" src="result/Div-IC.jpg/image_preview" alt="Divergence~Information Content" /> Determine Ancestral Binding Site Ｅnergy Rationale When the inferred ancestral sequence for a melanogaster binding site is not predicted to be a binding site, the simplest interpretation is that one or more melanogaster-specific substitutions create a novel site. However due to the degeneracy and other properties of the TFBS motif, it is conceivable that there might be a weak site overlapping with the current position of the mel binding site in the ancestral sequence. Maybe what the mel-specific mutations does is to make the binding site stronger while also sliding it a little bit. anc AATCCCCACTT &gt;&gt;&gt;&gt;&gt;&gt; mel ...T.......  Code <a title="Count_Variation" class="internal-link" href="code/Count-variations.pl">Count_Variation</a> How many new gains? select if(anc_score &gt;=0, 'p',if(anc_score=-1, '-1','-2')) anc, count(*) from tfbs where tfbs_score &gt;= 0 group by if(anc_score &gt;=0, 'p',if(anc_score = -1, '-1', '-2')); +-+--+ +-+--+ +-+--+ 2% are newly gained How many polymorphism in binding sites available for analysis? select count(*), allele_number from tfbs_poly where tf_name is not NULL and tf_name != 'spacer' and polytype = 'nt' and poly_Or_not = 'poly' group by allele_number order by allele_number; +--+---+ +--+---+ +--+---+ Determine The Effect of Fixed/Polymorphic Nucleotide Substitutions Rationale ID	unique for each change type	mel / sim / poly / amb fpid	fpid of the binding site coord	genomic coordinate factor	transcription factor effect	change in PWM score or pvalue? freq	allele frequency. only for polymorphic sites For each fixed nucleotide substitution in melanogaster, if the ancestral inference is available, I will calculate the difference in PWM score by this substitution alone on the background of the ancestral sequence. If the ancestral inference is not available, I will mark it as ambiguous. The same procedure will be done for simulans specific fixed changes with the consideration of whether the ancestral sequence is a good binding site. In the mean time I would like to produce a list of candidate targets for MITOMI. Details: I'll go in to get a b.s. for factor X. First I'll determine if this site is gapped or not. If gapped, this site would be discarded. If not, I'll then go through each substitution. For fixed sites, I'll determine the direction of the change and then calculate the Polymorphic sites are treated the same way as fixed substitutions, except that the frequency of the polymorphism is also collected. How many sites are un-gapped? Total	793 sites with PWM score Gapped	89 sites Nngap	704 sites How many sites are gained in mel and how many lost in sim within the footprints? Within 704 ungapped sites (anc-mel-sim), using 80% recovery rate cutoff for each factor, the percentage is divided by (535+21+2 = 558) cutoff 80%	cutoff 0 1-1-1	535	95.9%	601	96.9% 1-1-0	21	 3.8%	19	 3.1% 1-0-1	2	 	0	0-1-0	14	 2.5%	12	 1.9% 0-0-1	2		0 0-0-0	130		72 117/704 sites contain fixed substitutions between mel and sim, around 1/7 21/704 sites are present in the ancestor but lost in simulans Complexity in evaluating the effects of changes Here is an example: anc    .........?......? yak    .....T...GA.....a mel     tgtaaAATAATTattat sim    .........T....--- .........T....--- The A-&gt;T change actually makes the complementary sequence a better binding site as shown below a      \AATAATT\ a position=      1  score=   4.38  ln(p-value)=   -5.73  sequence= AATAATT a position=      1C score=   5.03  ln(p-value)=   -6.44  sequence= AATTATT a      \AATATTT\ a position=      1  score=   2.47  ln(p-value)=   -4.30  sequence= AATATTT a position=      1C score=   1.66  ln(p-value)=   -3.73  sequence= AAATATT I noticed that the consensus for en is AATTAAA. Like this case, when the consensus has some kind of symmetry so that the sequence can be read from both strands, then a mutation might make the other strand a better site. The complexity lies in the following: 1. The orientation and position of the binding site is not fixed. A change may actually make the best match to PWM shift by several base pair or appear in the complementary strand. 2. Because of 1, even if accepting the additivity and site independence assumption in PWM, changes in different positions may not be independent just because the binding site itself may have moved. In such conditions, inference of the order of the mutations become necessary and even critical. Luckily this type of situation is not very common. First of all there are not many binding sites where there are more than 1 change. See below for the distribution of number of fixed/poly changes. Besides, this file records all cases where a change seems to have changed the orientation of the best match. <a title="Complex cases where a change will change the orientation of a site" class="internal-link" href="result/complex-cases-where-a-change-will-change-the">Complex cases where a change will change the orientation of a site</a> Another example--potential compensatory mutation 6314   4.52 anc     ................. yak    ..g..-.G..A...c.. mel    tatacTATTATTcggtg sim    .....A.....G..... T-&gt;A : 0.51-&gt;1.99 ...........G..... T-&gt;G : 4.52-&gt;0.51 ...........G.....       .....A.....G.....  What's the distribution of # fixed/poly changes per un-gapped binding site? no     div     poly 0      587     644 1       90      53 2       21      3 3       4       3 4       2       1 5       0       0 sum	704	704 Note: here I didn't distinguish which lineage the change is on. In cases of more than one change in a single bs, do the second change tend to be of complementary? I checked all cases with more than one changes in a single binding site on one lineage (for mel, &gt;1 fixed diff; for sim, &gt;1 fixed/poly changes) Expectation The hypothesis is that an initial change that decreases binding affinity would pose stronger constraint on the fixation of subsequent mutations (assuming a non-linear fitness-energy function because the first mutation might push the energy to some critical window where a further decrease would be very deleterious. The prediction under this hypothesis is that if there are two changes in a single site, they tend to be of opposing effects.  This hypothesis will not hold if the complementary change happens but more frequently in another location of the enhancer, in which case, the original site that get damaged is now deprived of its functional importance and thus become prone to further changes.  Observation  Focusing on 2-3 changes in a single site on either lineage, in mel they tend to be both/all increasing or the net effect is obviously increasing.  For sim, of the 12 or so cases I checked, three of them that have two changes show opposing effects. Two out of these three cases the net effect is quite close to zero. However while the estimates of the effects are from PWM, it need to be taken with caution. I tried to solve some of the ambiguous rooted base pairs. It turned out that VISTA browser and UCSC use different algorithms for alignment. While the VISTA one which I used in this case shows a deletion in yak, the sequence is present in UCSC alignment. After solving 3-4 cases where this would potentially influence my conclusion, I still hold the impression that the 3/13 cases show evidence of complementary changes. Plus these 3 cases have a pretty small predicted effects. This makes sense in that if the initial change only decreases the binding energy by a little, then there are a lot of ways to compensate that. If the initial change strongly influence the binding affinity, then only a very small number of specific changes can help revive the site, in which case creating another one in a new place may be easier! MK test and Frequency Spectrum
 * anc | count(*) |
 * -1 |       14 |
 * -2 |        5 |
 * p  |      695 |
 * count(*) | allele_number |
 * 3 |            2 |
 * 6 |            3 |
 * 22 |            4 |
 * 40 |            5 |
 * 17 |            6 |
 * 1) Table structure of the result:
 * 1) Notes

Code setwd("/Volumes/hebin-1/Documents/tfbs/Result") data &lt;- read.table("effect_changes.txt", head=T)


 * 1) inc/dec
 * 1) inc/dec

t &lt;- 1; mk &lt;- c(0,0,0,0,0,0); dim(mk) &lt;- c(3,2) rownames(mk) &lt;- c('mel','sim', 'poly'); colnames(mk) &lt;- c('inc','dec') mk[1,1] &lt;- length(data$effects[data$type == 'mel' &amp; data $ effects &gt; t &amp; data$score &gt; 0]) mk[1,2] &lt;- length(data$effects[data$type == 'mel' &amp; data $ effects &lt; -t &amp; data$score &gt; 0]) mk[2,1] &lt;- length(data$effects[data$type == 'sim' &amp; data $ effects &gt; t &amp; data$score &gt; 0]) mk[2,2] &lt;- length(data$effects[data$type == 'sim' &amp; data $ effects &lt; -t &amp; data$score &gt; 0]) mk[3,1] &lt;- length(data$effects[data$type == 'poly' &amp; data $ effects &gt; t &amp; data$score &gt; 0]) mk[3,2] &lt;- length(data$effects[data$type == 'poly' &amp; data $ effects &lt; -t &amp; data$score &gt; 0]) mk setwd("/Volumes/hebin-1/Documents/tfbs/Result") data &lt;- read.table("effect_changes.txt", head=T)

t2 &lt;- 4; cond &lt;- (data$effects &gt; t2 | data$effects &lt; -t2) mk2 &lt;- c(0,0,0,0,0,0); dim(mk2) &lt;- c(3,2) rownames(mk2) &lt;- c('mel','sim', 'poly'); colnames(mk2) &lt;- c('large','small') mk2[1,1] &lt;- length(data$effects[data$type == 'mel' &amp; cond &amp; data$score &gt; 0]) mk2[1,2] &lt;- length(data$effects[data$type == 'mel' &amp; !cond &amp; data$score &gt; 0]) mk2[2,1] &lt;- length(data$effects[data$type == 'sim' &amp; cond &amp; data$score &gt; 0]) mk2[2,2] &lt;- length(data$effects[data$type == 'sim' &amp; !cond &amp; data$score &gt; 0]) mk2[3,1] &lt;- length(data$effects[data$type == 'poly' &amp; cond &amp; data$score &gt; 0]) mk2[3,2] &lt;- length(data$effects[data$type == 'poly' &amp; !cond &amp; data$score &gt; 0]) mk2
 * 1) large/small
 * 1) large/small

t3 &lt;- 1 comb &lt;- summary(data$freq[data$type == 'poly' &amp; data$score &gt; 0 &amp; data$alleles == 5])[1:5] inc &lt;- summary(data$freq[data$type == 'poly' &amp; data$score &gt; 0 &amp; data$alleles == 5 &amp; data$effects &gt; t3])[1:5] dec &lt;- summary(data$freq[data$type == 'poly' &amp; data$score &gt; 0 &amp; data$alleles == 5 &amp; data$effects &lt; -t3])[1:5] a &lt;- c(1, 1/2, 1/3, 1/4, 1/5) matrix &lt;- rbind(comb/sum(comb)*sum(a), inc/sum(inc)*sum(a), dec/sum(dec)*sum(a), a) rownames(matrix) &lt;- c('combined', 'increase', 'decrease', 'neutral') barplot(matrix, beside = T, legend.text = rownames(matrix), col=rainbow(10)[c(3,5,7,8)]) title(main = 'frequency spectrum of 5 alleles of different classes')
 * 1) frequency spectrum
 * 1) frequency spectrum

SELECT poly_Or_not, count(*) FROM tfbs_poly WHERE tf_name = 'spacer' and polytype = 'nt' and poly_Or_not in ('poly', 'div') GROUP BY poly_Or_not;
 * 1)  To count the number of changes
 * 2)  in spacers
 * 3)  May 8th 08
 * 4)  hebin
 * 1)  hebin

+-+--+ +-+--+ +-+--+
 * poly_Or_not | count(*) |
 * div        |     2796 |
 * poly       |     1834 |

SELECT if(tfbs_poly.anc = mel, 'sim', if(anc = '?', 'amb', 'mel')) type, count(*) FROM tfbs_poly WHERE tf_name = 'spacer' and polytype = 'nt' and poly_Or_not = 'div' group by type;

+--+--+ +--+--+ +--+--+ Solving ambiguous ancestry sites There are 39 changes that cannot be rooted using my original VISTA mel-yak alignment. I extended the alignment to multi-species up to virilis, data retrieved from UCSC genome browser <a title="Solving ambiguous ancestry using UCSC multi-species alignment" class="internal-link" href="result/ambiguous-sites.txt">Solving ambiguous ancestry using UCSC multi-species alignment</a> All together I solved 15 / 39 cases with reasonable confidence (based on parsimony principle). 4/5 mel specific changes are predicted to increase binding affinity while 5/6 sim specific changes are expected to decrease binding affinity. So the asymmetry is made even more extreme. Results
 * type | count(*) |
 * amb |      646 |
 * mel |     1223 |
 * sim |      927 |

inc: &gt;0; dec: &lt; 0 inc dec mel  34  16 sim   7  42 poly 17  50

excluding mel_specific sites inc dec mel  15  15 sim   4  31 poly 14  42

inc: &gt;1; dec: &lt; -1 inc dec spacer mel  26   8   1223 sim   1  35    927 poly 12  42   1834

excluding the mel_specific sites inc dec mel  11   7 sim   0  25 poly 10  35

using Pi instead of PWM score inc: &gt;0.01; dec: &lt; -0.01 inc dec mel  22   7 sim   4  26 poly 13  33

inc: &gt;0.005; dec: &lt; -0.005 inc dec mel  23   9 sim   4  30 poly 14  40

individual regions pattern the same CE1049 27 changes (including ambiguous sites) inc dec spacer mel   4   2   26 sim   0   5   22 poly  1   5   58

CE1004 11 changes inc dec spacer mel   0   0   3 sim   0   2   1 poly  1   2   3

CE1068 24 changes inc dec spacer mel   2   2    4 sim   0   7    7 poly  0   1   10

large : small cutoff = 3 large small spacer mel    16    34    1223 sim    23    26     927 poly   28    39    1834

large : small cutoff = 4 large small mel    13    37 sim    16    33 poly   16    51 Problems <ol>because the number is small, the ambiguous sites can have a large effect on the result.</li>pwm is derived from melanogaster footprints. They are adapted to whatever equilibrium frequency is in melanogaster. Does this a priori predict more gains than loss in mel?</li>because the footprints are defined in melanogaster, imagine the ancestral pool of sites--the sampling is biased towards changes that increase binding affinity because those are the ones that will be detected in the footprint assays</li></ol> Further questions: 0. what's the expected number of large/small effects? 1. what's the number for non-binding sites? 2. frequency spectrum? 3. if really biased by the sampling, maybe expand to PWM predictions? where are the functional binding site? Further directions: <ol>verify with selex PWM? </li>unbiased prediction of binding sites in both species. But now how could you distinguish functional sites from site like sequences, which presumably has very different evolutionary pattern.</li>if inc/dec dimension is orthogonal to large/small effect dimension, then the later should not be biased. </li></ol> <img class="image-inline" src="result/Frequency%20Spectrum.pdf/image_preview" alt="Frequency_Spectrum" height="420" width="420" /> Validate the results with Pollard's PWM Methods <ol>Start from effect_changes.txt</li>for each documented change, retrieve the footprint sequence, make two copies with the only difference in this change, one being ancestral and one being evolved</li>use patser with Pollard's PWM to find the score difference</li></ol> Results inc &gt; 1; dec &lt; -1; inc dec mel  15   8 sim   8  16 poly  8  21

inc &gt; 0; dec &lt; 0 inc dec mel  22  14 sim  13  22 poly 17  33

Validate the results with SELEX PWM Impression on the comparison between SELEX PWM and footprint PWM

Casey's PWM are learned from the footprints. Typically they are less information-dense (containing lots of low information site). They recover more footprints, as they are optimized for. But they also predict a lot more additional sites when used to scan the whole enhancer Methods <ol>extract matrix from xml file and format as consensus. <a title="XML_PWM2.pl" class="internal-link" href="code/XML-PWM2.pl">XML_PWM2.pl</a></li>use patser to transform the matrix into scoring matrix</li><li>for each change for the factor, use the selex matrix to reexamine the effect and record. <a title="compare_selex.pl" class="internal-link" href="code/compare-selex.pl">compare_selex.pl</a></li></ol> Results footprint PWM inc dec mel   8   2 sim   1  13 poly  4  12

SELEX PWM inc dec mel   8   3 sim   2  11 poly  3  11 <img class="image-inline" src="result/compselex.jpg/image_large" alt="compselex.jpg" height="597" width="675" /> Unbiased prediction How to choose good PWMs and proper cutoffs Case study: bcd23(SELEX) vs. bcd There are 5 footprint sites. Use the lowest footprint score 3.11 as cutoff, Casey's PWM bcd finds 16 hits while the same cutoff on bcd23 (selex) misses one footprint site but finds only 8 hits. However there are 3 sites that are consistently recovered but not found in footprints for unknown reasons. eve position=    431  score=   4.29  ln(p-value)=   -6.66  sequence= TCCAATCC eve position=    455  score=   5.51  ln(p-value)=   -7.57  sequence= CCCAATCC eve position=    461  score=   5.51  ln(p-value)=   -7.57  sequence= CCCAATCC Protocol <ol><li>../patser/patser -A a:t 0.3 c:g 0.2 -m ../OptimalPatserMatrices/bcd23 -b 1 -u2 -s -c -ls 0 -f bcd_test.fasta <img class="image-inline" src="result/hb-cutoff.jpg/image_preview" alt="hb_cutoff" /> Method <ol><li> Extract the whole enhancer sequence alignment for mel and w501 sim, anc as well. Consider filling up the w501 gaps with other sim sequences (optional). </li><li>Do predictions in both mel and sim sequences with the chosen pwm and cutoff. </li><li>Transform the coordinates (- is not counted in patser) and get rid of overlapping sites</li><li>Store the binding site map in array of arrays. (MySQL and R have more powerful matrix manipulation tools, but they are not good at string processing and looping structure.) </li><li>First examine sites predicted in mel sequences--see if the asymmetry holds.</li> <ol><li>To transform the coordinate, I used a hash $coordinate -- $coordinate{$n} stores the absolute position for relative pos $n. This value is used for extracting sequences using substr.</li><li>To deal with overlapping site, I recorded the previous site information and judge if the next site is overlapping. Throw away the lower score one.</li><li>Loop through each of the remaining binding sites. Extract the alignment block, throw away gapped sites. Evaluate the effects of changes. Record "region factor position type effect" </li></ol> <li>For each predicted binding site, judge the ancestral sequence if possible. Assign the phylogenetic type of the site (available types listed below)Then examine all changes in those binding sites including mel and sim predicted ones.</li><li>Look at compensatory changes </li><li> Use Ah-Ram's algorithm to plot </li></ol> ~/Documents/tfbs/patser/patser -A a:t 0.3 c:g 0.2 -m ~/Documents/tfbs/OptimalPatserMatrices/bcd -b 1 -ls 3.11 -s -c S2E sequence CAATATAACCCAATAATTTGAAGTAACTGGCAGGAGCGAGGTATCCTTCCTGGTTACCCGGTACTGCATAACAATGGAACCCGAACCGTAACTGGGACAGATCGAAAAGCTGGCCT GGTTTCTCGCTGTGTGTGCCGTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGCAGTTGCAGCGTTTCGCTTTCGTCCTCGTTTCACTTTCGAGTTAGACTTTATTGC AGCATCTTGAACAATCGTCGCAGTTTGGTAACACGCTGTGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTATAATCGCACAACGAGACCGGGTTGCGAAGTCAGGGC ATTCCGCCGATCTAGCCATCGCCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAGCCAAGGGCTTGACTTGGAATCCAATCCCGATCCCTAGCCCGATCC CAATCCCAATCCCAATCCCTTGTCCTTTTCATTAGAAAGTCATAAAAACACATAATAATGATGTCGAAGGGATTAGGGGCGCGCAGGTCCAGGCAACGCAATTAACGGACTAGCGA ACTGGGTTATTTTTTTGCGCCGACTTAGCCCTGATCCGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAGTTGTGGGTGG---ACCCCACGATTTTTTTG
 * grep "position" &gt; ./temp.txt</li><li>Convert the output to a table format <a title="convert.pl" class="internal-link" href="../code/temp3.pl">convert.pl</a></li><li>Validate that the patser predicted position are consistent with the genome coordinate <a title="Validate the relative position" class="internal-link" href="../code/Unbiased-prediction.pl">Validate the relative position</a></li><li>Given a cutoff, evaluate the percentage of footprint recovered and the ratio of total hits to hits in footprints. <a title="PWM and cutoff evaluation" class="internal-link" href="../code/pwm-and-cutoff-evaluation">PWM and cutoff evaluation</a></li></ol>

select fpid, tfbs_start-5489522 pos, tfbs_score, tfbs_pvalue, strand from tfbs where region = 'CE1056' and factor = 'bcd'; Result <ul><li>Examine mel-predicted sites only, for hb, cutoff = 4, t=1

In footprint sites: inc dec mel   4   1 sim   0   6 Total: 97, Gapped: 16, 12 ungapped sites contain fixed differences In mel-predicted sites: inc dec mel   6   2 sim   0  10 Total: 206 binding sites, Gapped: 29 of the 10 changes that cannot be rooted, 9/10 shows a stronger binding energy in mel and 1/10 is the other way. So resolving the ancestry for them would only enhancer the asymmetry </li><li> Examine sim-predicted sites only, for hb, cutoff = 4, t=1

w501 only inc dec mel   0   4 sim   3   2 Total: 151 binding sites, Gapped: 19

mixed sim inc dec mel   0   8 sim   7   3 Total: 191 binding sites, Gapped: 23 </li></ul> Interpretation <ol><li>Extreme: if there is no turnover at all, then one would expect equal number of increase and decrease to balance out on each lineage</li><li>the observed extreme asymmetry indicates either evolution of total binding or high turnover plus biased sample of footprint sites from the focal species, likely a mixture of the two.</li></ol> <ul><li>Turnover rate estimated from the prediction</li></ul> Things to get to: net change in the number of sites per enhancer, in total the number of gains and losses in each species, number of sliding cases Region	Overlap	mel	sim	Sliding CE1030	0	1	0	0 1	1	0 CE1013	10	2	0	0 14	3	2 CE1056	3	1	0	0 4	1	1 CE1149	4	3	1	1 6	3	1 CE1089 	16	3	2	1 20	6	3	#has poor alignment! CE1012	14	3	1	1 21	6	7 CE1004	14	3	2	2 16	5	3 CE1035	18	2	0	0 23	3	3 CE1162	38	14	9	5 20	23	6	#has poor alignment! CE1034	7	0	0	0 7	2	2 CE1068	10	12	11	5 14	13	10?	#10&lt;11, due to the mixed sim allele in the previous case CE1181	12	1	3	0 11	3	4 CE1029	12	3	4	3 13	4	5 total	158	48	33	18 footpr	66	23

no evidence that the mel-specific ones tend to be non-footprints. this is sort of consistent with Justin's finding that semi-conserved sites do not tend to have lower affinity.

Estimate from hb alone: mel: 206 sites in total, 25% mel specific sim: 191 sites in total, 17% sim specific

inc  dec mel	6    9 sim	7   16 Total: 239 binding sites, Gapped: 45

Among the ambiguous sites, 2/9 (inc/dec)

using hb39 SELEX matrix CE1030 3       0       0       0 CE1013 8       0       0       0 CE1056 2       1       0       0 CE1149 8       3       0       0 CE1089 15      0       0       0 CE1012 11      0       0       0 CE1004 12      2       1       0 CE1035 16      1       0       0 CE1162 36      13      9       0 CE1034 7       0       0       0 CE1068 13      10      13      0 CE1181 22      2       4       0 CE1029 14      4       3       0 total	167	36	30

mel Polymorphism Materials <ol><li>eve-stripe 2 enhancer element from Ludwig and Kreitman 1995</li> |	dec	inc	spacer +- mel	div	|5	1	0	4 poly	|4	1	1	2 |	4/6	1/6	3/5, 1/5 +- sim	div	|7	2	0	5 poly	|3	0	0	3 |			2/6, 2*1/6 <li>from other literatures

name	Alleles	Chr_arm	Coordinate		Region	Start		End		ref Ubx	22+1	3R	12,526,539-12,527,269	CE1034	12526644	12527144	-Phinchongsakuldit, MacArthur &amp; Brookfield, MBE, 2004 Ubx	22+1	3R	12,599,005-12,600,040	CE1181	12598627	12600114	-same as above ftz	13	3R	2,689,483-2,689,995	CE1121	2689374 	2690045 	Dermitzakis, Bergman &amp; Clark, MBE, 2003 en	12	2R	7,043,788-7,044,631	CE1149	7044399 	7044957 	same as above hb	12	3R	4,517,317-4,520,669	CE1162	4520300 	4524677 	Tautz &amp; Nigro, MBE 1998 50	2L	12,537,331-15,942,868	CE1061	13904731	13905265	50 genomes project CE1138	14614893	14616302 CE1078	15474198	15476199 50	X	1,911,660-5,005,056	CE1086	1980273 	1980606 CE1085	1981449 	1981745 CE1084	1996366 	1996806 CE1172	2033396 	2033534 CE1192	2304109 	2304228 CE1188	2651290 	2651409 CE1187	2652700 	2654032 CE1099	3105441 	3106264 CE1126	4129700 	4134947 CE1038	4231703 	4233191 CE1146	4908617 	4909698 there are altogether 144 footprint sites </li></ol> hb promoter CE1162 overlaps with fpid 6413-6421, 8 bcd sites and 1 hb site disappointingly, there are no polymorphism or fixed substitutions in these 8 sites. <ol><li> </li></ol> Transcription factors divergence

Methods <ol><li>get the TF protein sequence from UCSC</li><li>blast for AA-&gt;AA on flybase, save in XML</li><li>run TF_align_XML.pl to get alignment</li><li>run emma on EMBOSS with alignment, save in MSF</li><li>run pretty plot with MSF</li><li>on UCSC go to uniprot to get structure information</li><li>annotate in preview for DNA binding domain </li></ol> Calculate Pocc and its application Methods <ul><li>patser score = sigma{ ln f(b,j)/p(b) } </li><li>deltaG = RT sigma{ ln f(b,j)/p(b) }</li><li>Pi = [X] / (exp{-deltaG / RT} + [X])</li><li>let [X] = exp{deltaG[consensus]}</li><li>Pi = 1 / (exp{ (deltaG[consensus] - deltaG) / RT} + 1) = 1 / (exp{patserS[consensus] - patserS} + 1) </li><li>Consensus sequence has the highest score -&gt; highest deltaG -&gt; lowest Kd -&gt; more difficult to dissociate (because the more negative deltaG is, the more self-initiating the process is.)</li><li>I've verified that my hand-written way of calculating score is the same as patser calculation, while the later is much slower.</li><li> </li></ul> Validation of the method I manually check some of the pair Pocc in the output. Sometimes there seem to be a lot of changes on sim lineage that decreases binding energy score by quite a bit. For example, for Region CE1057, factor Trl, here is the list of binding sites energy scores for the two species fpid	mel	sim 4152	0.13	-0.19 4154	4.93	-0.63 4157	7.36	7.29 4165	9.10	10.59

Pocc_mel = 0.066709956 Pocc_sim = 0.188498675

The reason is better explained below in properties of Pocc. Basically, changes in binding score doesn't correlate linearly with Pocc. Thus change in the strong site 4165 contribute disproportionally to Pocc. Properties of Pocc Pocc is the probability that AT LEAST one TF is bound to the enhancer. As expected, abolishing one binding site in an enhancer consisting a handful of strong sites will not affect Pocc to any substantial amount. However, why some enhancers are built with a single site while some contain a dozen similarly strongly sites? Several levels of considering binding sites profile divergence: <ol><li>count the kind and number of binding sites</li><li>calculate Pocc, which integrates over all the binding sites of the same kind</li><li>model the output of the enhancer thus integrating binding sites position and combination information and so on</li><li>Because Pocc is not responsive when there is redundancy. (it is enough to have one or several strong sites to ensure that at least one TF is bound), this comparison will only reveal those cases with single site in the enhancer that is altered in one species or those that have many sites altered at the same time. </li></ol> One thing to note is what Pocc value reflects. When there are many strong sites in a sequence, abolishing one of them seem to have very small changes on Pocc. Necessary to read Pocc papers. 1. permute the sequence (preserve dinucleotide frequency) For bcd using 1. selex PWM 2. PWM from Down et al. code <a title="Pocc pair calculation and expected" class="internal-link" href="code/Pocc-pair.pl">Pocc expected</a> <a title="Pocc pair calculation and expected" class="internal-link" href="code/Pocc-pair.pl">Pocc pair calculation</a> Result <ol><li>General pattern:

</li></ol> <a title="Pocc expected" class="internal-link" href="code/Pocc-pairexp.pl"></a> For hb Pocc = 0.0490626779367482 	p-value = 0.951		CE1030 Pocc = 0.857583818550552       p-value = 0.009 Pocc = 0.644644214013364       p-value = 0.066 Pocc = 0.726122781523663       p-value = 0.164 Pocc = 0.99205125002337 	p-value = 0 Pocc = 0.98936891052637 	p-value = 0 Pocc = 0.994712693935612       p-value = 0 Pocc = 0.996523508773069       p-value = 0 Pocc = 0.999922656447524       p-value = 0.005 Pocc = 0.953240878140783       p-value = 0 Pocc = 0.985433379804871       p-value = 0.035 Pocc = 0.955171520728597       p-value = 0.08 Pocc = 0.972211787235409       p-value = 0.006 2. same sequence, scramble the pwm (preserve the total information content) hb 1000 iterations CE1030       Pocc = 0.0485909445226903       p-value = 0.793 CE1013       Pocc = 0.855046878947466        p-value = 0.008 CE1056       Pocc = 0.641396463335304        p-value = 0.162 CE1149       Pocc = 0.723027286433472        p-value = 0.087 CE1089       Pocc = 0.991711235632589        p-value = 0 CE1012       Pocc = 0.988940266852853        p-value = 0 CE1004       Pocc = 0.994466512897498        p-value = 0.002 CE1035       Pocc = 0.996344086058097        p-value = 0.003 CE1162       Pocc = 0.999915746792497        p-value = 0.178 CE1034       Pocc = 0.952078221580422        p-value = 0.001 CE1068       Pocc = 0.984885043860919        p-value = 0.031 CE1181       Pocc = 0.953952078454468        p-value = 0.019 CE1029       Pocc = 0.971370913035186        p-value = 0.016

bcd24 1000 iterations CE1030 Pocc = 0.161788727877099        p-value = 0.13 CE1056 Pocc = 0.842029413353345        p-value = 0.002 CE1089 Pocc = 0.0167758142211993       p-value = 0.733 CE1012 Pocc = 0.0902420614763332       p-value = 0.334 CE1004 Pocc = 0.0484866852062439       p-value = 0.375 CE1035 Pocc = 0.419809501350481        p-value = 0.101 CE1028 Pocc = 0.0449268578762426       p-value = 0.133 CE1074 Pocc = 0.12098154880125 	p-value = 0.144 CE1162 Pocc = 0.739505636509969        p-value = 0.109 Pocc pairs of the two species' enhancer <img class="image-inline" src="result/Pocc-pair.jpg/image_large" alt="Pocc_pair" height="657" width="628" /> Obviously there are more decreases than increases. Something interesting I discover Here I can divide all cases into three categories: 1. Affinity in footprints decreases a lot in simulans but Pocc are similar -- functional compensation 2. Affinity in footprints decreases a lot in simulans and Pocc drops dramatically as well -- functional evolution or hetero-factor compensation? 3. Affinity in footprints doesn't change a lot but Pocc changes -- either sim evolves a stronger binding or there are hidden sites somewhere!

Category I

Changes in footprints: CE1004 3616    Kr  7.69    8640866 poly    0.38    1       4 CE1004 3618    Kr  7.62    8640890  sim   -5.81    4       4 CE1004 3620    Kr -1.00    8640926  sim   -1.74    4       4 CE1004 3623    Kr -1.00    8641007 poly    0.00    1       5

Pocc values Region	factor	mel_l	sim_l	mel_Pocc		sim_Pocc CE1004 Kr      548     542     0.391655915895923       0.364895827027112

For this case, actually there doesn't need to be any complementary changes because there are already a lot of strong sites in this region: +--++-+++--+-++--++-+---++++ +--++-+++--+-++--++-+---++++ +--++-+++--+-++--++-+---++++ Simply abolishing one site won't change Pocc a lot. However, what does Pocc means?
 * fpid | region | chr_arm | factor | target | fp_start | fp_end | tfbs_start | tfbs_end | tfbs_score | tfbs_pvalue | anc_score | anc_pvalue | strand | gapped |
 * 3612 | CE1004 | 3L     | Kr     | h      |  8640828 | 8640838 |    8640827 |  8640836 |       7.56 |       -9.26 |      7.56 |      -9.26 | -      | N      |
 * 3616 | CE1004 | 3L     | Kr     | h      |  8640856 | 8640866 |    8640857 |  8640866 |       7.69 |       -9.39 |      7.69 |      -9.39 | +      | N      |
 * 3618 | CE1004 | 3L     | Kr     | h      |  8640881 | 8640891 |    8640882 |  8640891 |       7.62 |       -9.33 |      7.62 |      -9.33 | +      | N      |
 * 3620 | CE1004 | 3L     | Kr     | h      |  8640919 | 8640929 |    8640918 |  8640927 |      -6.59 |           0 |        -1 |          0 | -      | N      |
 * 3623 | CE1004 | 3L     | Kr     | h      |  8641003 | 8641011 |    8641003 |  8641012 |      -2.72 |           0 |        -1 |          0 | -      | N      |
 * 3624 | CE1004 | 3L     | Kr     | h      |  8641119 | 8641129 |    8641118 |  8641127 |       8.07 |       -9.78 |      8.07 |      -9.78 | -      | N      |
 * 3627 | CE1004 | 3L     | Kr     | h      |  8641145 | 8641155 |    8641140 |  8641149 |      -3.88 |           0 |        -1 |          0 | +      | N      |
 * 3633 | CE1004 | 3L     | Kr     | h      |  8641252 | 8641262 |    8641251 |  8641260 |       9.42 |      -11.46 |      9.42 |     -11.46 | -      | Y      |

Category II

Changes in footprints: 196 CE1029 260     Kr   3.5   12636476  mel    5.81    5       5 anc    ......G............. yak    ......G............. mel    tgcgtGAAAGGGTGAagcta	9.3 sim    nnnn..G............. 3.5       ......G.............        ......G.............        ......G.............        ......G.............

Pocc values: CE1029 Kr      1746    1727    0.215347001553784       0.0280790836253705

Interesting literature for CE1029 Shimell, M. J., Peterson, A. J., Burr, J., Simon, J. A., O'Connor, M. B., February 2000. Functional analysis of repressor binding sites in the iab-2 regulatory region of the abdominal-a homeotic gene. Developmental Biology 218 (1), 38-52. URL <a href="http://dx.doi.org/10.1006/dbio.1999.9576">http://dx.doi.org/10.1006/dbio.1999.9576</a> [abstract] Abdominal-A (ABD-A) homeotic protein. Analysis of a 1.7-kb enhancer element [iab-2(1.7)] from the iab-2 regulatory region shows that in contrast to Ubx enhancer elements, both HB and Kru¨ ppel (KR) are required to set the ABD-A anterior boundary in parasegment 7. DNase I footprinting and site-directed mutagenesis show that HB and KR are direct regulators of this iab-2 enhancer. The single KR site can be moved to a new location 100 bp away and still maintain repressive activity, whereas relocation by 300 bp abolishes activity. These results suggest that KR repression occurs through a local quenching mechanism. I tried to search flanking sequence 300 bp on each side for compensatory Kr sites for sim but didn't find any. I then calculated Pocc for sim sequence as well as a p-value from 1000 dinucleotide permutation of the sequence. The results indicate that sim sequence just don't contain a good Kr site Pocc_mel = 0.215347001553784      p-value (for dinucleotide shuffling) = 0.109 Pocc_sim = 0.050892903419477      p-value (for dinucleotide shuffling) = 0.804 Calculate Potential and its application Method The calculation is very similar to Pocc and have interesting relationship with it. <ul><li>score all the sliding windows of pwm_length and rank the scores</li><li>Pick the highest rank score</li><li>Pick the second highest rank among those that do not overlap with the previous picked ones until no site can be picked because of overlaps</li><li>Sum up the Pi 's</li><li>At equilibrium, this sum value can be interpreted as the "average number of proteins on the enhancer"</li><li>This value will be very close to Pocc while Pocc is small (0.2 for example)</li><li>This value doesn't have the saturation problem with Pocc. So it is better for enhancers containing lots of strong sites and to show the effect of abolishing one of them on the potential of the enhancer to be bound by the factor.</li><li>Its calculation doesn't impose any arbitrary cutoff.</li></ul> Code <a title="Potential_pair.pl" class="internal-link" href="code/Potential-pair.pl">Potential_pair.pl</a> Result select e.type, e.region, e.fpid, e.factor, e.coordinate, e.effects, e.Pi, p.melpt, p.simpt, p.simpt_exp from  effect_changes e, potential p where e.type LIKE "%sim" and e.Pi &lt; -0.01 and e.region = p.region and e.factor = p.factor order by e.type, e.factor, e.region

+--++--+++--++---+---+---+ +--++--+++--++---+---+---+ +--++--+++--++---+---+---+
 * type | region | fpid | factor | coordinate | effects | Pi         | melpt     | simpt     | simpt_exp |
 * asim | CE1068 | 6240 | hb    |   12590526 |    -3.94 | -0.0143405 |   2.91906 |    2.4386 |   2.54402 |
 * asim | CE1109 | 6321 | Kr    |   21022230 |    -4.26 | -0.0693678 |  0.121555 | 0.0532071 | 0.0461818 |
 * asim | CE1109 | 6321 | Kr    |   21022232 |    -6.22 |  -0.070286 |  0.121555 | 0.0532071 | 0.0461818 |
 * asim | CE1146 | 1599 | ovo   |    4909561 |    -2.21 |  -0.110128 |  0.554871 |  0.370851 |  0.444743 |
 * asim | CE1131 | 6347 | Ubx   |   11455015 |    -3.44 |  -0.117362 |  0.377224 |  0.231532 |  0.335409 |
 * asim | CE1068 | 6236 | z     |   12590413 |    -4.76 | -0.0758525 |   3.14204 |   3.08593 |   3.14403 |
 * sim | CE1049 | 6089 | abd-A  |    2759019 |    -3.23 | -0.0409354 |   17.6296 |   18.7606 |   17.3765 |
 * sim | CE1200 | 6489 | ap     |   22997767 |    -5.61 |   -0.43012 |   3.47621 |   1.64334 |   2.66249 |
 * sim | CE1099 | 4491 | br-Z3  |    3105855 |    -0.85 | -0.0167293 |   3.27923 |   1.67408 |   3.02324 |
 * sim | CE1016 | 981  | Deaf1  |    2613240 |    -2.26 | -0.0356046 | 0.0638928 | 0.0379833 | 0.0283291 |
 * sim | CE1076 | 6272 | dl     |    2581108 |    -1.32 | -0.0135067 |  0.495107 |  0.611945 |  0.714109 |
 * sim | CE1085 | 6314 | en     |    1981579 |    -4.01 |  -0.194292 |   1.31849 |   1.87045 |   1.17026 |
 * sim | CE1152 | 5108 | en     |    2692110 |     -1.9 |  -0.369892 |    2.5783 |   2.64407 |   2.43348 |
 * sim | CE1179 | 5366 | en     |    5827316 |   -0.411 |  -0.043412 |  0.838666 |   1.31887 |  0.795254 |
 * sim | CE1030 | 6017 | hb     |   11456134 | -0.87005 | -0.0169816 | 0.0411556 | 0.0238455 | 0.0241783 |
 * sim | CE1035 | 6032 | hb     |   20630455 |  -0.4217 | -0.0649021 |   3.04001 |   2.78995 |   2.90002 |
 * sim | CE1035 | 6035 | hb     |   20630490 |    -5.82 | -0.0524852 |   3.04001 |   2.78995 |   2.90002 |
 * sim | CE1056 | 4137 | hb     |    5490187 |    -5.82 |  -0.119851 |  0.728987 |   0.61201 |  0.609136 |
 * sim | CE1068 | 6238 | hb     |   12590469 |    -5.82 |  -0.234132 |   2.91906 |    2.4386 |   2.54402 |
 * sim | CE1181 | 5387 | hb     |   12599392 |    -5.82 |  -0.127427 |   2.37144 |   2.74126 |   2.25778 |
 * sim | CE1004 | 3634 | kni    |    8641330 |    -1.02 |  -0.201558 |   6.03252 |   4.83989 |   5.83096 |
 * sim | CE1181 | 5390 | kni    |   12599840 |   -4.586 | -0.0191775 |   7.76955 |   8.90855 |   7.76955 |
 * sim | CE1004 | 3618 | Kr     |    8640890 |    -5.81 | -0.0421574 |  0.463782 |  0.421662 |  0.421625 |
 * sim | CE1035 | 6039 | Kr     |   20630567 |    -0.38 |  -0.085451 |   2.07909 |   1.99488 |   1.99364 |
 * sim | CE1200 | 6471 | pan    |   22997071 |    -5.79 |  -0.348139 |   4.44358 |   3.00821 |   3.81605 |
 * sim | CE1199 | 5541 | slbo   |    9898793 |     -5.2 |  -0.455406 |  0.717394 |  0.203589 |  0.262045 |
 * sim | CE1035 | 6034 | tll    |   20630490 |    -2.12 | -0.0317674 |   2.14552 |   2.18389 |   2.15739 |
 * sim | CE1064 | 6172 | Trl    |   14722644 |    -1.63 |   -0.33617 |  0.901073 |  0.540058 |  0.612796 |
 * sim | CE1068 | 6214 | Trl    |   12589882 |    -1.49 |  -0.156298 |  0.955081 |  0.673408 |  0.716796 |
 * sim | CE1032 | 300  | twi    |   17203868 |     -6.2 |  -0.268195 |  0.809029 |  0.592913 |  0.564329 |
 * sim | CE1049 | 6103 | Ubx    |    2759403 |    -2.54 |  -0.310968 |   1.93739 |   1.47474 |   1.33022 |
 * sim | CE1049 | 6129 | Ubx    |    2760478 |    -3.18 |  -0.151917 |   1.93739 |   1.47474 |   1.33022 |

All except 2 cases (CE1035/tll, CE1076/dl), all the others are consistent with sim_exp &lt; mel_pt. In these two cases, CE1035 has one mel sub that decreases; CE1076 has one amb sitethat makes sim a stronger one. Compare sim_pt and sim_pt_exp: inc/dec 14/14! cannot invoke compensation, at least not shown through this Potential calculation. Look for cases of dramatic Potential changes See figure in Keynote file for the "Change in Potential, rank ordered". 1. take simPt-melPt &gt; 0.3 or simPt-melPt &lt; -0.7 mysql&gt; select * from Potential where (simpt-melpt) &gt; 0.3 or (simpt-melpt) &lt; -0.7; +++--+--+--+-+---+ +++--+--+--+-+---+ +++--+--+--+-+---+
 * region | factor | mell | siml | melpt   | simpt   | simpt_exp |
 * CE1016 | Dfd   | 2658 | 2767 |  23.7409 | 24.5887 |   23.7409 | x possibly non-specific PWM
 * CE1079 | abd-A |  891 |  863 |  3.04907 | 3.65384 |   3.04907 |
 * CE1049 | abd-A | 2301 | 2378 |  17.6296 | 18.7606 |   17.3765 |
 * CE1200 | ap    |  812 |  824 |  3.47621 | 1.64334 |   2.66249 | *
 * CE1004 | bcd   |  548 |  542 |  1.16911 | 1.54582 |   1.17638 |
 * CE1099 | br-Z1 |  823 |  723 |  3.25048 | 2.26729 |   3.22339 |
 * CE1164 | br-Z1 | 3563 | 3450 |  7.51766 | 5.43528 |   7.36482 |
 * CE1099 | br-Z2 |  823 |  723 |  5.31897 | 3.67533 |   5.31964 |
 * CE1099 | br-Z3 |  823 |  723 |  3.27923 | 1.67408 |   3.02324 | *
 * CE1086 | en    |  333 |  252 |  2.62556 | 0.89412 |   2.59706 | *
 * CE1085 | en    |  296 |  288 |  1.31849 | 1.87045 |   1.17026 | ? strong compensation
 * CE1084 | en    |  440 |  434 |  2.62796 | 1.48458 |   2.49766 |
 * CE1179 | en    |  114 |  140 | 0.838666 | 1.31887 |  0.795254 |
 * CE1134 | eve   |  547 |  541 |  4.66232 | 3.91912 |   4.66232 |
 * CE1181 | hb    | 1487 | 1490 |  2.37144 | 2.74126 |   2.25778 |
 * CE1012 | kni   |  933 |  919 |  7.79598 | 6.82782 |    7.7697 |
 * CE1004 | kni   |  548 |  542 |  6.03252 | 4.83989 |   5.83096 |
 * CE1181 | kni   | 1487 | 1490 |  7.76955 | 8.90855 |   7.76955 |
 * CE1055 | pan   | 1578 | 1541 |  8.73959 | 7.96172 |   8.72432 |
 * CE1200 | pan   |  812 |  824 |  4.44358 | 3.00821 |   3.81605 |
 * CE1187 | z     | 1332 | 1250 |  5.32862 | 4.40007 |   5.06548 |

Specifically focus on enhancers that have few numbers of binding site instances. Disruption of sites in these cases might have more profound effects.

mysql&gt; select * from Potential order by (simpt-melpt)/simpt limit 30; +++--+--+---+---+---+ +++--+--+---+---+---+ +++--+--+---+---+---+ From this calculation I cannot really see evidence for compensation. Next step to do: Identify candidate regions for evolution of regulation experiment to verify some of them? look for some existing expression dataset? Do the unbiased prediction for several different factors and summarize the turnover pattern! see what to do with MK and selection find out a model to explain the asymmetry--what's the expected turnover rate to explain 1/7 affected and almost all are decreasing? what's the expectation? chances for compensation easier allopatric? Recalculate MK table
 * region | factor | mell | siml | melpt    | simpt     | simpt_exp |
 * CE1029 | Kr    | 1746 | 1727 |  0.218233 | 0.0212329 | 0.0275388 |
 * CE1144 | ovo   | 1212 | 1214 |  0.320677 |  0.069831 |  0.123308 |
 * CE1199 | slbo  |  327 |  330 |  0.717394 |  0.203589 |  0.262045 |
 * CE1086 | en    |  333 |  252 |   2.62556 |   0.89412 |   2.59706 |
 * CE1112 | br-Z1 |  348 |  309 |   1.01843 |   0.38619 |   1.01843 |
 * CE1042 | Antp  |  477 |  447 |   1.02567 |  0.409653 |  0.992764 |
 * CE1169 | en    |  220 |  220 |  0.191158 | 0.0802844 |  0.185892 |
 * CE1190 | z     |  119 |  119 |  0.574322 |  0.249853 |  0.760966 |
 * CE1109 | Kr    |  981 | 1026 |  0.121555 | 0.0532071 | 0.0461818 |
 * CE1042 | Ubx   |  477 |  447 |   1.19577 |  0.565161 |   1.19577 |
 * CE1200 | ap    |  812 |  824 |   3.47621 |   1.64334 |   2.66249 |
 * CE1159 | Trl   |  368 |  366 |  0.190474 | 0.0967834 |  0.190474 |
 * CE1099 | br-Z3 |  823 |  723 |   3.27923 |   1.67408 |   3.02324 |
 * CE1134 | zen   |  547 |  541 |  0.528344 |   0.28739 |  0.528344 |
 * CE1084 | en    |  440 |  434 |   2.62796 |   1.48458 |   2.49766 |
 * CE1030 | hb    |  517 |  520 | 0.0411556 | 0.0238455 | 0.0241783 |
 * CE1016 | Deaf1 | 2658 | 2767 | 0.0638928 | 0.0379833 | 0.0283291 |
 * CE1064 | Trl   | 1438 | 1431 |  0.901073 |  0.540058 |  0.612796 |
 * CE1112 | br-Z2 |  348 |  309 |   1.52899 |  0.925284 |   1.46139 |
 * CE1131 | Ubx   | 1093 | 1076 |  0.377224 |  0.231532 |  0.335409 |
 * CE1069 | Mad   |  352 |  352 | 0.0798586 | 0.0490269 | 0.0848505 |
 * CE1036 | pan   |  888 |  868 |   1.32913 |  0.854274 |   1.32913 |
 * CE1021 | eve   |  505 |  476 |  0.195583 |  0.129641 |  0.195583 |
 * CE1146 | ovo   | 1081 | 1092 |  0.554871 |  0.370851 |  0.444743 |
 * CE1200 | pan   |  812 |  824 |   4.44358 |   3.00821 |   3.81605 |
 * CE1099 | br-Z2 |  823 |  723 |   5.31897 |   3.67533 |   5.31964 |

inc dec mel  22   7 sim   4  26 poly 13  33

Pairs of Overlapping binding sites select a.region, a.fpid, a.factor, b.fpid, b.factor from tfbs a, tfbs b where a.region = b.region and a.fpid != b.fpid and a.tfbs_start between b.tfbs_start and b.tfbs_end;

++--++--++ ++--++--++ ++--++--++ 154 rows in set (9.60 sec)
 * region | fpid | factor | fpid | factor |
 * CE1099 | 4472 | br-Z3 | 4471 | br-Z2  |
 * CE1099 | 4474 | br-Z2 | 4475 | br-Z1  |
 * CE1099 | 4483 | br-Z2 | 4482 | br-Z3  |
 * CE1099 | 4486 | br-Z2 | 4485 | br-Z3  |
 * CE1099 | 4488 | br-Z1 | 4489 | br-Z3  |
 * CE1010 | 3637 | abd-A | 3638 | Ubx    |
 * CE1010 | 3639 | Ubx   | 3640 | abd-A  |
 * CE1010 | 3643 | abd-A | 3644 | Ubx    |
 * CE1010 | 3643 | abd-A | 3646 | Ubx    |
 * CE1010 | 3646 | Ubx   | 3644 | Ubx    |
 * CE1010 | 3650 | abd-A | 3649 | Ubx    |
 * CE1010 | 3653 | Ubx   | 3654 | abd-A  |
 * CE1010 | 3656 | abd-A | 3655 | Ubx    |
 * CE1010 | 3659 | Ubx   | 3658 | abd-A  |
 * CE1010 | 3661 | abd-A | 3660 | Ubx    |
 * CE1010 | 3660 | Ubx   | 3659 | Ubx    |
 * CE1010 | 3663 | abd-A | 3662 | Ubx    |
 * CE1030 | 6003 | bcd   | 6002 | Kr     |
 * CE1030 | 6004 | cad   | 6002 | Kr     |
 * CE1030 | 6004 | cad   | 6003 | bcd    |
 * CE1164 | 5446 | br-Z1 | 5448 | br-Z2  |
 * CE1164 | 5448 | br-Z2 | 5446 | br-Z1  |
 * CE1164 | 5449 | br-Z2 | 5450 | br-Z3  |
 * CE1164 | 5449 | br-Z2 | 5451 | br-Z1  |
 * CE1164 | 5451 | br-Z1 | 5450 | br-Z3  |
 * CE1164 | 5457 | br-Z1 | 5459 | br-Z3  |
 * CE1164 | 5458 | br-Z2 | 5459 | br-Z3  |
 * CE1164 | 5458 | br-Z2 | 5457 | br-Z1  |
 * CE1164 | 5461 | br-Z3 | 5460 | br-Z2  |
 * CE1013 | 2269 | kni   | 2270 | hb     |
 * CE1013 | 2277 | kni   | 2278 | hb     |
 * CE1013 | 2278 | hb    | 2277 | kni    |
 * CE1056 | 4118 | Kr    | 4117 | bcd    |
 * CE1056 | 4124 | Kr    | 4123 | Kr     |
 * CE1056 | 4133 | bcd   | 4132 | Kr     |
 * CE1036 | 2466 | pan   | 2467 | tin    |
 * CE1036 | 2467 | tin   | 2466 | pan    |
 * CE1036 | 2468 | pan   | 2469 | tin    |
 * CE1149 | 6375 | eve   | 6376 | en     |
 * CE1149 | 6375 | eve   | 6374 | zen    |
 * CE1149 | 6374 | zen   | 6376 | en     |
 * CE1149 | 6374 | zen   | 6375 | eve    |
 * CE1149 | 6380 | en    | 6379 | zen    |
 * CE1149 | 6380 | en    | 6378 | eve    |
 * CE1149 | 6378 | eve   | 6379 | zen    |
 * CE1149 | 6381 | Kr    | 6380 | en     |
 * CE1149 | 6381 | Kr    | 6379 | zen    |
 * CE1149 | 6381 | Kr    | 6378 | eve    |
 * CE1149 | 6384 | eve   | 6385 | en     |
 * CE1149 | 6384 | eve   | 6383 | zen    |
 * CE1149 | 6383 | zen   | 6385 | en     |
 * CE1149 | 6383 | zen   | 6384 | eve    |
 * CE1134 | 6354 | eve   | 6353 | zen    |
 * CE1050 | 3453 | prd   | 3454 | eve    |
 * CE1089 | 5961 | kni   | 5960 | hb     |
 * CE1089 | 5912 | kni   | 5914 | kni    |
 * CE1089 | 5918 | tll   | 5919 | bcd    |
 * CE1089 | 5919 | bcd   | 5918 | tll    |
 * CE1089 | 5920 | kni   | 5918 | tll    |
 * CE1012 | 5925 | kni   | 5924 | tll    |
 * CE1012 | 5925 | kni   | 5926 | bcd    |
 * CE1012 | 5926 | bcd   | 5924 | tll    |
 * CE1012 | 5941 | bcd   | 5940 | kni    |
 * CE1012 | 5951 | kni   | 5950 | hb     |
 * CE1004 | 3610 | kni   | 3611 | hb     |
 * CE1004 | 3611 | hb    | 3610 | kni    |
 * CE1004 | 3613 | kni   | 3614 | hb     |
 * CE1004 | 3614 | hb    | 3613 | kni    |
 * CE1004 | 3615 | bcd   | 3613 | kni    |
 * CE1004 | 3615 | bcd   | 3614 | hb     |
 * CE1004 | 3616 | Kr    | 3613 | kni    |
 * CE1004 | 3616 | Kr    | 3614 | hb     |
 * CE1004 | 3616 | Kr    | 3615 | bcd    |
 * CE1004 | 3617 | kni   | 3616 | Kr     |
 * CE1004 | 3620 | Kr    | 3619 | tll    |
 * CE1004 | 3625 | bcd   | 3624 | Kr     |
 * CE1004 | 3626 | hb    | 3624 | Kr     |
 * CE1004 | 3626 | hb    | 3625 | bcd    |
 * CE1004 | 3630 | bcd   | 3629 | tll    |
 * CE1004 | 3633 | Kr    | 3632 | hb     |
 * CE1112 | 1201 | br-Z2 | 1199 | br-Z1  |
 * CE1039 | 2493 | slbo  | 2492 | vvl    |
 * CE1000 | 5968 | z     | 5967 | Trl    |
 * CE1035 | 6029 | hb    | 6028 | Kr     |
 * CE1035 | 6030 | tll   | 6028 | Kr     |
 * CE1035 | 6030 | tll   | 6029 | hb     |
 * CE1035 | 6032 | hb    | 6031 | tll    |
 * CE1035 | 6032 | hb    | 6033 | Kr     |
 * CE1035 | 6033 | Kr    | 6031 | tll    |
 * CE1035 | 6034 | tll   | 6035 | hb     |
 * CE1076 | 6260 | Mad   | 6261 | brk    |
 * CE1076 | 6261 | brk   | 6260 | Mad    |
 * CE1076 | 6262 | brk   | 6263 | Mad    |
 * CE1076 | 6263 | Mad   | 6262 | brk    |
 * CE1076 | 6264 | brk   | 6265 | Mad    |
 * CE1076 | 6266 | brk   | 6267 | Mad    |
 * CE1076 | 6282 | brk   | 6281 | Mad    |
 * CE1016 | 987 | Dfd    |  986 | Deaf1  |
 * CE1049 | 6081 | abd-A | 6082 | Ubx    |
 * CE1049 | 6083 | abd-A | 6084 | Ubx    |
 * CE1049 | 6086 | Ubx   | 6085 | abd-A  |
 * CE1049 | 6087 | Ubx   | 6088 | abd-A  |
 * CE1049 | 6089 | abd-A | 6090 | Ubx    |
 * CE1049 | 6093 | abd-A | 6092 | Ubx    |
 * CE1049 | 6094 | abd-A | 6095 | Ubx    |
 * CE1049 | 6097 | abd-A | 6096 | Ubx    |
 * CE1049 | 6099 | Ubx   | 6098 | abd-A  |
 * CE1049 | 6100 | Ubx   | 6101 | abd-A  |
 * CE1049 | 6102 | abd-A | 6103 | Ubx    |
 * CE1049 | 6106 | abd-A | 6107 | Ubx    |
 * CE1049 | 6108 | Ubx   | 6109 | abd-A  |
 * CE1049 | 6110 | abd-A | 6111 | Ubx    |
 * CE1049 | 6113 | Ubx   | 6111 | Ubx    |
 * CE1049 | 6113 | Ubx   | 6110 | abd-A  |
 * CE1049 | 6112 | abd-A | 6113 | Ubx    |
 * CE1049 | 6116 | abd-A | 6117 | Ubx    |
 * CE1049 | 6119 | abd-A | 6118 | Ubx    |
 * CE1049 | 6120 | abd-A | 6121 | Ubx    |
 * CE1049 | 6122 | Ubx   | 6123 | abd-A  |
 * CE1049 | 6125 | Ubx   | 6124 | abd-A  |
 * CE1049 | 6124 | abd-A | 6125 | Ubx    |
 * CE1049 | 6126 | Ubx   | 6127 | abd-A  |
 * CE1049 | 6128 | abd-A | 6129 | Ubx    |
 * CE1049 | 6131 | abd-A | 6130 | Ubx    |
 * CE1049 | 6133 | Ubx   | 6132 | abd-A  |
 * CE1049 | 6135 | Ubx   | 6134 | abd-A  |
 * CE1042 | 4111 | Ubx   | 4112 | Antp   |
 * CE1162 | 6417 | bcd   | 6416 | hb     |
 * CE1028 | 3680 | tll   | 3679 | bcd    |
 * CE1034 | 2448 | hb    | 2449 | tll    |
 * CE1034 | 2454 | en    | 2453 | twi    |
 * CE1058 | 6146 | Ubx   | 6147 | eve    |
 * CE1058 | 6152 | zen   | 6150 | eve    |
 * CE1058 | 6152 | zen   | 6153 | Ubx    |
 * CE1058 | 6153 | Ubx   | 6150 | eve    |
 * CE1058 | 6164 | z     | 6163 | Trl    |
 * CE1068 | 6202 | hb    | 6201 | Trl    |
 * CE1068 | 6206 | z     | 6207 | Trl    |
 * CE1068 | 6210 | z     | 6211 | Trl    |
 * CE1068 | 6214 | Trl   | 6215 | z      |
 * CE1068 | 6219 | z     | 6218 | Trl    |
 * CE1068 | 6221 | hb    | 6220 | Trl    |
 * CE1068 | 6221 | hb    | 6222 | z      |
 * CE1068 | 6222 | z     | 6220 | Trl    |
 * CE1068 | 6228 | z     | 6226 | hb     |
 * CE1068 | 6235 | hb    | 6234 | z      |
 * CE1068 | 6245 | Trl   | 6246 | z      |
 * CE1181 | 5372 | twi   | 5371 | hb     |
 * CE1181 | 5375 | twi   | 5374 | tll    |
 * CE1181 | 5378 | en    | 5377 | hb     |
 * CE1181 | 5377 | hb    | 5378 | en     |
 * CE1029 | 265 | hb     |  266 | eve    |
 * CE1069 | 753 | tin    |  751 | Mad    |
 * CE1200 | 6487 | ap    | 6486 | pan    |