Hebin:TFBS asymmetry

From OpenWetWare
Jump to: navigation, search

Contents

Question

  • Are enhancers enriched for binding sites than random sequences?
  • Are binding sites evolving any slower than XXX?  This involves explaining the excessive conservation in the spacer sequences!
  • How many sites are affected by substitutions between mel and sim?
  • Confirm the asymmetry! and show if conservation occurs at higher level--Pocc (compensation within homeotypic sites) or even between different factors'sites
  • One direction: simulations to investigate properties of compensatory mutations
    • how easy is it to create a new site de-novo
    • where do compensatory mutations tend to happen?
    • confirm that overlapping sites tend to be more conserved.  also show by simulation under a simple compensatory scheme.

 

Cloning

the Linear Templates were cloned into <a class="external-link" href="http://catalog.takara-bio.co.jp/en/product/basic_info.asp?unitid=U100005842">pUC118</a>

cat 3322

bcd his-tag out of frame

the 3 prime end of the 2-step pcr product.  I cloned them into the vectors and picked two clones, both show this wrong 3' end that put his tag out of frame.

 

GCAGTTTGCCTACTGCTTTAAT-ATCATCACCATCACCACTAATAA
                  C   C

Now I'm not sure if this is a problem with pcr or the primer.  I'm going to do the 2-step pcr again and do bulk sequencing from the 3' end and also sequence the other three clones I picked.  Might need to reorder the primer.

 

General Method

Approaches

Compare evolutionary constraint in TFBS, spacers and other regions

117/704 sites contain fixed substitutions between mel and sim, around 1/7
most desirably: compare to the nearby gene's neutral rate.

Calculate π
total: 0.01242
tfbs : 0.00729

nonsyn 0.0026
syn    0.0352
intron 0.0166
5'UTR  0.0108
3'UTR  0.0113
inter  0.0204

 

Divide sites by their information content

Method

  • First add pseudo count P[b] * K (k=1) to every cell of the matrix
  • ic = sum( f[b,p] * ln ( f[b,p] / P[b] ) )
  • Note that this is different from the other method which define ic = 2 + sum( f[b,p] * ln( f[b,p] ) )
  • ic range: 0.089 - 1.552.  divide into (0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6) bins, count the total number of sites in each bin as well as number of sites div/poly

 

Results

For all 30 factors
<0.2    <0.4    <0.6    <0.8    <1.0    <1.2    <1.4    <1.6
653     1233    1235    852     837     1069    32      207
21      34      31      25      12      26      0       3
7       18      16      14      5       6       0       6

<0.8	>0.8
2.79%	=1.91%

<img class="image-inline" src="result/Div-IC.jpg/image_preview" alt="Divergence~Information Content" />

 

Determine Ancestral Binding Site Energy

Rationale

When the inferred ancestral sequence for a melanogaster binding site is not predicted to be a binding site, the simplest interpretation is that one or more melanogaster-specific substitutions create a novel site.  However due to the degeneracy and other properties of the TFBS motif, it is conceivable that there might be a weak site overlapping with the current position of the mel binding site in the ancestral sequence.  Maybe what the mel-specific mutations does is to make the binding site stronger while also sliding it a little bit.

<p>anc  AATCCCCACTT
       >>>>>>
mel  ...T.......</p>

Code

 <a title="Count_Variation" class="internal-link" href="code/Count-variations.pl">Count_Variation</a>

How many new gains?

select if(anc_score >=0, 'p',if(anc_score=-1, '-1','-2')) anc, count(*) from tfbs where tfbs_score >= 0 group by if(anc_score >=0, 'p',if(anc_score = -1, '-1', '-2'));

 

+-----+----------+
| anc | count(*) |
+-----+----------+
| -1  |       14 | 
| -2  |        5 | 
| p   |      695 | 
+-----+----------+

2% are newly gained

How many polymorphism in binding sites available for analysis?

select count(*), allele_number from tfbs_poly where tf_name is not NULL and tf_name != 'spacer' and polytype = 'nt' and poly_Or_not = 'poly' group by allele_number order by allele_number;

+----------+---------------+
| count(*) | allele_number |
+----------+---------------+
|        3 |             2 | 
|        6 |             3 | 
|       22 |             4 | 
|       40 |             5 | 
|       17 |             6 | 
+----------+---------------+

 

Determine The Effect of Fixed/Polymorphic Nucleotide Substitutions

Rationale

#Table structure of the result:

ID	unique for each change
type	mel / sim / poly / amb
fpid	fpid of the binding site
coord	genomic coordinate
factor	transcription factor
effect	change in PWM score or pvalue?
freq	allele frequency. only for polymorphic sites

#Notes

For each fixed nucleotide substitution in melanogaster, if the ancestral inference is available, I will calculate the difference in PWM score by this substitution alone on the background of the ancestral sequence.  If the ancestral inference is not available, I will mark it as ambiguous.  The same procedure will be done for simulans specific fixed changes with the consideration of whether the ancestral sequence is a good binding site.  In the mean time I would like to produce a list of candidate targets for MITOMI.

Details: I'll go in to get a b.s. for factor X.  First I'll determine if this site is gapped or not.  If gapped, this site would be discarded.  If not, I'll then go through each substitution.  For fixed sites, I'll determine the direction of the change and then calculate the Polymorphic sites are treated the same way as fixed substitutions, except that the frequency of the polymorphism is also collected.

How many sites are un-gapped?

Total	793 sites with PWM score
Gapped	89 sites
Nngap	704 sites

How many sites are gained in mel and how many lost in sim within the footprints?

Within 704 ungapped sites
(anc-mel-sim), using 80% recovery rate cutoff for each factor, the percentage is divided by (535+21+2 = 558)
	cutoff 80%	cutoff 0
1-1-1	535	95.9%	601	96.9%
1-1-0	21	 3.8%	19	 3.1%
1-0-1	2	 	0	
0-1-0	14	 2.5%	12	 1.9%
0-0-1	2		0
0-0-0	130		72
117/704 sites contain fixed substitutions between mel and sim, around 1/7
 21/704 sites are present in the ancestor but lost in simulans

 

Complexity in evaluating the effects of changes

Here is an example:

anc     .........?......?
yak     .....T...GA.....a
mel     tgtaaAATAATTattat
sim     .........T....---
        .........T....---

The A->T change actually makes the complementary sequence a better binding site as shown below

a       \AATAATT\    
                   a  position=      1  score=   4.38  ln(p-value)=   -5.73  sequence= AATAATT
                   a  position=      1C score=   5.03  ln(p-value)=   -6.44  sequence= AATTATT
a       \AATATTT\
                   a  position=      1  score=   2.47  ln(p-value)=   -4.30  sequence= AATATTT
                   a  position=      1C score=   1.66  ln(p-value)=   -3.73  sequence= AAATATT

I noticed that the consensus for en is AATTAAA.  Like this case, when the consensus has some kind of symmetry so that the sequence can be read from both strands, then a mutation might make the other strand a better site.

The complexity lies in the following:

1. The orientation and position of the binding site is not fixed.  A change may actually make the best match to PWM shift by several base pair or appear in the complementary strand.

2. Because of 1, even if accepting the additivity and site independence assumption in PWM, changes in different positions may not be independent just because the binding site itself may have moved.  In such conditions, inference of the order of the mutations become necessary and even critical.

Luckily this type of situation is not very common.  First of all there are not many binding sites where there are more than  1 change.  See below for the distribution of number of fixed/poly changes.  Besides, this file records all cases where a change seems to have changed the orientation of the best match.

<a title="Complex cases where a change will change the orientation of a site" class="internal-link" href="result/complex-cases-where-a-change-will-change-the">Complex cases where a change will change the orientation of a site</a>

Another example--potential compensatory mutation

6314    4.52
anc     .................
yak     ..g..-.G..A...c..
mel     tatacTATTATTcggtg
sim     .....A.....G.....	T->A : 0.51->1.99
        ...........G.....	T->G : 4.52->0.51
        ...........G.....
        .....A.....G.....

 

What's the distribution of # fixed/poly changes per un-gapped binding site?

no      div     poly
0       587     644
1       90      53
2       21      3
3       4       3
4       2       1
5       0       0
sum	704	704

 Note: here I didn't distinguish which lineage the change is on.

 

In cases of more than one change in a single bs,  do the second change tend to be of complementary?

I checked all cases with more than one changes in a single binding site on one lineage (for mel, >1 fixed diff;  for sim, >1 fixed/poly changes)

Expectation

The hypothesis is that an initial change that decreases binding affinity would pose stronger constraint on the fixation of subsequent mutations (assuming a non-linear fitness-energy function because the first mutation might push the energy to some critical window where a further decrease would be very deleterious.

 

The prediction under this hypothesis is that if there are two changes in a single site, they tend to be of opposing effects.  This hypothesis will not hold if the complementary change happens but more frequently in another location of the enhancer, in which case, the original site that get damaged is now deprived of its functional importance and thus become prone to further changes.

 

Observation

Focusing on 2-3 changes in a single site on either lineage, in mel they tend to be both/all increasing or the net effect is obviously increasing.  For sim, of the 12 or so cases I checked, three of them that have two changes show opposing effects.  Two out of these three cases the net effect is quite close to zero.  However while the estimates of the effects are from PWM, it need to be taken with caution.

I tried to solve some of the ambiguous rooted base pairs.  It turned out that VISTA browser and UCSC use different algorithms for alignment.  While the VISTA one which I used in this case shows a deletion in yak, the sequence is present in UCSC alignment.

After solving 3-4 cases where this would potentially influence my conclusion, I still hold the impression that the 3/13 cases show evidence of complementary changes.  Plus these 3 cases have a pretty small predicted effects.  This makes sense in that if the initial change only decreases the binding energy by a little, then there are a lot of ways to compensate that.  If the initial change strongly influence the binding affinity, then only a very small number of specific changes can help revive the site, in which case creating another one in a new place may be easier!

 

MK test and Frequency Spectrum

Code

setwd("/Volumes/hebin-1/Documents/tfbs/Result")
data <- read.table("effect_changes.txt", head=T)

###########
# inc/dec
###########

t <- 1;
mk <- c(0,0,0,0,0,0); dim(mk) <- c(3,2)
rownames(mk) <- c('mel','sim', 'poly'); colnames(mk) <- c('inc','dec')
mk[1,1] <- length(data$effects[data$type == 'mel' & data $ effects > t &  data$score > 0])
mk[1,2] <- length(data$effects[data$type == 'mel' & data $ effects < -t &  data$score > 0])
mk[2,1] <- length(data$effects[data$type == 'sim' & data $ effects > t &  data$score > 0])
mk[2,2] <- length(data$effects[data$type == 'sim' & data $ effects < -t &  data$score > 0])
mk[3,1] <- length(data$effects[data$type == 'poly' & data $ effects > t &  data$score > 0])
mk[3,2] <- length(data$effects[data$type == 'poly' & data $ effects < -t &  data$score > 0])
mk
setwd("/Volumes/hebin-1/Documents/tfbs/Result")
data <- read.table("effect_changes.txt", head=T)

##############
# large/small
##############
t2 <- 4;
cond <- (data$effects > t2 | data$effects < -t2)
mk2 <- c(0,0,0,0,0,0); dim(mk2) <- c(3,2)
rownames(mk2) <- c('mel','sim', 'poly'); colnames(mk2) <- c('large','small')
mk2[1,1] <- length(data$effects[data$type == 'mel' & cond & data$score > 0])
mk2[1,2] <- length(data$effects[data$type == 'mel' & !cond &  data$score > 0])
mk2[2,1] <- length(data$effects[data$type == 'sim' & cond &  data$score > 0])
mk2[2,2] <- length(data$effects[data$type == 'sim' & !cond &  data$score > 0])
mk2[3,1] <- length(data$effects[data$type == 'poly' & cond &  data$score > 0])
mk2[3,2] <- length(data$effects[data$type == 'poly' & !cond &  data$score > 0])
mk2

######################
# frequency spectrum
######################
t3 <- 1
comb <- summary(data$freq[data$type == 'poly' & data$score > 0 & data$alleles == 5])[1:5]
inc <- summary(data$freq[data$type == 'poly' & data$score > 0 & data$alleles == 5 & data$effects > t3])[1:5]
dec <- summary(data$freq[data$type == 'poly' & data$score > 0 & data$alleles == 5 & data$effects < -t3])[1:5]
a <- c(1, 1/2, 1/3, 1/4, 1/5)
matrix <- rbind(comb/sum(comb)*sum(a), inc/sum(inc)*sum(a), dec/sum(dec)*sum(a), a)
rownames(matrix) <- c('combined', 'increase', 'decrease', 'neutral')
barplot(matrix, beside = T, legend.text = rownames(matrix), col=rainbow(10)[c(3,5,7,8)])
title(main = 'frequency spectrum of 5 alleles of different classes')

###################################
#  To count the number of changes
#  in spacers
#  May 8th 08
#  hebin
###################################
SELECT poly_Or_not, count(*) 
FROM tfbs_poly 
WHERE tf_name = 'spacer' and polytype = 'nt' and poly_Or_not in ('poly', 'div') 
GROUP BY poly_Or_not;

+-------------+----------+
| poly_Or_not | count(*) |
+-------------+----------+
| div         |     2796 | 
| poly        |     1834 | 
+-------------+----------+

SELECT if(tfbs_poly.anc = mel, 'sim', if(anc = '?', 'amb', 'mel')) type, count(*) 
FROM tfbs_poly 
WHERE tf_name = 'spacer' and polytype = 'nt' and poly_Or_not = 'div' group by type;

+------+----------+
| type | count(*) |
+------+----------+
| amb  |      646 | 
| mel  |     1223 | 
| sim  |      927 | 
+------+----------+

Solving ambiguous ancestry sites

There are 39 changes that cannot be rooted using my original VISTA mel-yak alignment.  I extended the alignment to multi-species up to virilis, data retrieved from UCSC genome browser

 

<a title="Solving ambiguous ancestry using UCSC multi-species alignment" class="internal-link" href="result/ambiguous-sites.txt">Solving ambiguous ancestry using UCSC multi-species alignment</a>

 

 

All together I solved 15 / 39 cases with reasonable confidence (based on parsimony principle).  4/5 mel specific changes are predicted to increase binding affinity while 5/6 sim specific changes are expected to decrease binding affinity.  So the asymmetry is made even more extreme.

 

Results

inc: >0; dec: < 0
     inc dec
mel   34  16
sim    7  42
poly  17  50

excluding mel_specific sites
     inc dec
mel   15  15
sim    4  31
poly  14  42

inc: >1; dec: < -1
     inc dec spacer
mel   26   8   1223
sim    1  35    927
poly  12  42   1834

excluding the mel_specific sites
     inc dec
mel   11   7
sim    0  25
poly  10  35

using Pi instead of PWM score
inc: >0.01; dec: < -0.01
     inc dec
mel   22   7
sim    4  26
poly  13  33

inc: >0.005; dec: < -0.005
     inc dec
mel   23   9
sim    4  30
poly  14  40


individual regions pattern the same
CE1049 27 changes (including ambiguous sites)
     inc dec  spacer
mel    4   2   26
sim    0   5   22
poly   1   5   58

CE1004 11 changes
     inc dec  spacer
mel    0   0   3
sim    0   2   1
poly   1   2   3

CE1068 24 changes
     inc dec  spacer
mel    2   2    4
sim    0   7    7
poly   0   1   10

large : small  cutoff = 3
     large small  spacer
mel     16    34    1223
sim     23    26     927
poly    28    39    1834

large : small  cutoff = 4
     large small
mel     13    37
sim     16    33
poly    16    51

Problems

  1. because the number is small, the ambiguous sites can have a large effect on the result.
  2. pwm is derived from melanogaster footprints.  They are adapted to whatever equilibrium frequency is in melanogaster.  Does this a priori predict more gains than loss in mel?
  3. because the footprints are defined in melanogaster, imagine the

    ancestral pool of sites--the sampling is biased towards changes that increase binding affinity because those are the ones that will be

    detected in the footprint assays

Further questions: 0. what's the expected number of large/small effects? 1. what's the number for non-binding sites?  2. frequency spectrum? 3. if really biased by the sampling, maybe expand to PWM predictions?  where are the functional binding site?

Further directions:

  1. verify with selex PWM?
  2. unbiased prediction of binding sites in both species.  But now how could you distinguish functional sites from site like sequences, which presumably has very different evolutionary pattern.
  3. if inc/dec dimension is orthogonal to large/small effect dimension, then the later should not be biased.

 

<img class="image-inline" src="result/Frequency%20Spectrum.pdf/image_preview" alt="Frequency_Spectrum" height="420" width="420" />

 

Validate the results with Pollard's PWM

Methods

  1. Start from effect_changes.txt
  2. for each documented change, retrieve the footprint sequence, make two copies with the only difference in this change, one being ancestral and one being evolved
  3. use patser with Pollard's PWM to find the score difference

Results

inc > 1; dec < -1;
     inc dec
mel   15   8
sim    8  16
poly   8  21

inc > 0; dec < 0
     inc dec
mel   22  14
sim   13  22
poly  17  33

Validate the results with SELEX PWM

 

Impression on the comparison between SELEX PWM and footprint PWM

Casey's PWM are learned from the footprints.  Typically they are less information-dense (containing lots of low information site).  They recover more footprints, as they are optimized for.  But they also predict a lot more additional sites when used to scan the whole enhancer

Methods

  1. extract matrix from xml file and format as consensus.  <a title="XML_PWM2.pl" class="internal-link" href="code/XML-PWM2.pl">XML_PWM2.pl</a>
  2. use patser to transform the matrix into scoring matrix
  3. for each change for the factor, use the selex matrix to reexamine the effect and record.  <a title="compare_selex.pl" class="internal-link" href="code/compare-selex.pl">compare_selex.pl</a>

 

Results

footprint PWM
     inc dec
mel    8   2
sim    1  13
poly   4  12

SELEX PWM
     inc dec
mel    8   3
sim    2  11
poly   3  11

<img class="image-inline" src="result/compselex.jpg/image_large" alt="compselex.jpg" height="597" width="675" />

 

Unbiased prediction

How to choose good PWMs and proper cutoffs

Case study: bcd23(SELEX) vs. bcd

There are 5 footprint sites.  Use the lowest footprint score 3.11 as cutoff, Casey's PWM bcd finds 16 hits while the same cutoff on bcd23 (selex) misses one footprint site but finds only 8 hits.  However there are 3 sites that are consistently recovered but not found in footprints for unknown reasons.

eve  position=    431  score=   4.29  ln(p-value)=   -6.66  sequence= TCCAATCC
eve  position=    455  score=   5.51  ln(p-value)=   -7.57  sequence= CCCAATCC
eve  position=    461  score=   5.51  ln(p-value)=   -7.57  sequence= CCCAATCC

 Protocol

  1. ../patser/patser -A a:t 0.3 c:g 0.2 -m ../OptimalPatserMatrices/bcd23 -b 1 -u2 -s -c -ls 0 -f bcd_test.fasta
    |  grep "position" > ./temp.txt
  2. Convert the output to a table format <a title="convert.pl" class="internal-link" href="../code/temp3.pl">convert.pl</a>
  3. Validate that the patser predicted position are consistent with the genome coordinate <a title="Validate the relative position" class="internal-link" href="../code/Unbiased-prediction.pl">Validate the relative position</a>
  4. Given a cutoff, evaluate the percentage of footprint recovered and the ratio of total hits to hits in footprints.  <a title="PWM and cutoff evaluation" class="internal-link" href="../code/pwm-and-cutoff-evaluation">PWM and cutoff evaluation</a>

 

<img class="image-inline" src="result/hb-cutoff.jpg/image_preview" alt="hb_cutoff" />

Method

<tbody> </tbody>
type
mel
sim anc
1   √   √   √
2   √     √
3
  √   √
4   √    
5
  √  
6
  √
  ?
7

  √   ?
  1. Extract the whole enhancer sequence alignment for mel and w501 sim, anc as well.  Consider filling up the w501 gaps with other sim sequences (optional).

  2. Do predictions in both mel and sim sequences with the chosen pwm and cutoff.
  3. Transform the coordinates (- is not counted in patser) and get rid of overlapping sites
  4. Store the binding site map in array of arrays. (MySQL and R have more powerful matrix manipulation tools, but they are not good at string processing and looping structure.)
  5. First examine sites predicted in mel sequences--see if the asymmetry holds.
    1. To transform the coordinate, I used a hash $coordinate -- $coordinate{$n} stores the absolute position for relative pos $n.  This value is used for extracting sequences using substr.
    2. To deal with overlapping site, I recorded the previous site information and judge if the next site is overlapping.  Throw away the lower score one.
    3. Loop through each of the remaining binding sites.  Extract the alignment block, throw away gapped sites.  Evaluate the effects of changes.  Record "region factor position type effect"
  6. For each predicted binding site, judge the ancestral sequence if possible.  Assign the phylogenetic type of the site (available types listed below)Then examine all changes in those binding sites including mel and sim predicted ones.
  7. Look at compensatory changes
  8. Use Ah-Ram's algorithm to plot

 

~/Documents/tfbs/patser/patser -A a:t 0.3 c:g 0.2 -m ~/Documents/tfbs/OptimalPatserMatrices/bcd -b 1 -ls 3.11 -s -c
S2E sequence
CAATATAACCCAATAATTTGAAGTAACTGGCAGGAGCGAGGTATCCTTCCTGGTTACCCGGTACTGCATAACAATGGAACCCGAACCGTAACTGGGACAGATCGAAAAGCTGGCCT
GGTTTCTCGCTGTGTGTGCCGTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGCAGTTGCAGCGTTTCGCTTTCGTCCTCGTTTCACTTTCGAGTTAGACTTTATTGC
AGCATCTTGAACAATCGTCGCAGTTTGGTAACACGCTGTGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTATAATCGCACAACGAGACCGGGTTGCGAAGTCAGGGC
ATTCCGCCGATCTAGCCATC--------GCCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAGCCAAGGGCTTGACTTGGAATCCAATCCCGATCCCTAGCCCGATCC
CAATCCCAATCCCAATCCCTTGTCCTTTTCATTAGAAAGTCATAAAAACACATAATAATGATGTCGAAGGGATTAGGGGCGCGCAGGTCCAGGCAACGCAATTAACGGACTAGCGA
ACTGGGTTATTTTTTTGCGCCGACTTAGCCCTGATCCGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAGTTGTGGGTGG---ACCCCACGATTTTTTTG

select fpid, tfbs_start-5489522 pos, tfbs_score, tfbs_pvalue, strand from tfbs where region = 'CE1056' and factor = 'bcd';

Result

  • Examine mel-predicted sites only, for hb, cutoff = 4, t=1
    In footprint sites:
         inc dec
    mel    4   1
    sim    0   6
    Total: 97, Gapped: 16, 12 ungapped sites contain fixed differences
    In mel-predicted sites:
         inc dec
    mel    6   2
    sim    0  10
    Total: 206 binding sites, Gapped: 29
    of the 10 changes that cannot be rooted, 9/10 shows a stronger binding energy in mel and 1/10 is the other way.  So resolving the ancestry for them would only enhancer the asymmetry
    
  • Examine sim-predicted sites only, for hb, cutoff = 4, t=1

    w501 only
         inc dec
    mel    0   4
    sim    3   2
    Total: 151 binding sites, Gapped: 19
    
    mixed sim
         inc dec
    mel    0   8
    sim    7   3
    Total: 191 binding sites, Gapped: 23

 Interpretation

  1. Extreme: if there is no turnover at all, then one would expect equal number of increase and decrease to balance out on each lineage
  2. the observed extreme asymmetry indicates either evolution of total binding or high turnover plus biased sample of footprint sites from the focal species, likely a mixture of the two.

 

  • Turnover rate estimated from the prediction

Things to get to:

net change in the number of sites per enhancer, in total the number of gains and losses in each species, number of sliding cases

Region	Overlap	mel	sim	Sliding	
CE1030	0	1	0	0
	1	1	0
CE1013	10	2	0	0
	14	3	2
CE1056	3	1	0	0
	4	1	1
CE1149	4	3	1	1
	6	3	1
CE1089 	16	3	2	1
	20	6	3	#has poor alignment!
CE1012	14	3	1	1
	21	6	7
CE1004	14	3	2	2
	16	5	3
CE1035	18	2	0	0
	23	3	3
CE1162	38	14	9	5
	20	23	6	#has poor alignment!
CE1034	7	0	0	0
	7	2	2
CE1068	10	12	11	5
	14	13	10?	#10<11, due to the mixed sim allele in the previous case
CE1181	12	1	3	0
	11	3	4
CE1029	12	3	4	3
	13	4	5
total	158	48	33	18
footpr	66	23

no evidence that the mel-specific ones tend to be non-footprints.  this is sort of consistent with Justin's
finding that semi-conserved sites do not tend to have lower affinity.

Estimate from hb alone:
mel: 206 sites in total, 25% mel specific
sim: 191 sites in total, 17% sim specific

      inc   dec
mel	6     9
sim	7    16
Total: 239 binding sites, Gapped: 45

Among the ambiguous sites,
2/9 (inc/dec)

using hb39 SELEX matrix
CE1030  3       0       0       0
CE1013  8       0       0       0
CE1056  2       1       0       0
CE1149  8       3       0       0
CE1089  15      0       0       0
CE1012  11      0       0       0
CE1004  12      2       1       0
CE1035  16      1       0       0
CE1162  36      13      9       0
CE1034  7       0       0       0
CE1068  13      10      13      0
CE1181  22      2       4       0
CE1029  14      4       3       0
total	167	36	30

 

mel Polymorphism

Materials

  1. eve-stripe 2 enhancer element from Ludwig and Kreitman 1995
  2. 		|	dec	inc	spacer
    ----------------+---------------------------------
    mel	div	|5	1	0	4
    	poly	|4	1	1	2
    		|	4/6	1/6	3/5, 1/5
    ----------------+---------------------------------
    sim	div	|7	2	0	5
    	poly	|3	0	0	3
    		|			2/6, 2*1/6
  3. from other literatures
    name	Alleles	Chr_arm	Coordinate		Region	Start		End		ref
    Ubx	22+1	3R	12,526,539-12,527,269	CE1034	12526644	12527144	-Phinchongsakuldit, MacArthur & Brookfield, MBE, 2004
    Ubx	22+1	3R	12,599,005-12,600,040	CE1181	12598627	12600114	-same as above
    ftz	13	3R	2,689,483-2,689,995	CE1121	2689374 	2690045 	Dermitzakis, Bergman & Clark, MBE, 2003
    en	12	2R	7,043,788-7,044,631	CE1149	7044399 	7044957 	same as above
    hb	12	3R	4,517,317-4,520,669	CE1162	4520300 	4524677 	Tautz & Nigro, MBE 1998
    	50	2L	12,537,331-15,942,868	CE1061	13904731	13905265	50 genomes project
    						CE1138	14614893	14616302
    						CE1078	15474198	15476199
    	50	X	1,911,660-5,005,056	CE1086	1980273 	1980606
    						CE1085	1981449 	1981745
    						CE1084	1996366 	1996806
    						CE1172	2033396 	2033534
    						CE1192	2304109 	2304228
    						CE1188	2651290 	2651409
    						CE1187	2652700 	2654032
    						CE1099	3105441 	3106264
    						CE1126	4129700 	4134947
    						CE1038	4231703 	4233191
    						CE1146	4908617 	4909698
    there are altogether 144 footprint sites
    
hb promoter CE1162
overlaps with fpid 6413-6421, 8 bcd sites and 1 hb site
disappointingly, there are no polymorphism or fixed substitutions in these 8 sites.

Transcription factors divergence

Methods

  1. get the TF protein sequence from UCSC
  2. blast for AA->AA on flybase, save in XML
  3. run TF_align_XML.pl to get alignment
  4. run emma on EMBOSS with alignment, save in MSF
  5. run pretty plot with MSF
  6. on UCSC go to uniprot to get structure information
  7. annotate in preview for DNA binding domain

 

Calculate Pocc and its application

Methods

  • patser score = sigma{ ln f(b,j)/p(b) }
  • deltaG = RT sigma{ ln f(b,j)/p(b) }
  • Pi = [X] / (exp{-deltaG / RT} + [X])
  • let [X] = exp{deltaG[consensus]}
  • Pi = 1 / (exp{ (deltaG[consensus] - deltaG) / RT} + 1) = 1 / (exp{patserS[consensus] - patserS} + 1)
  • Consensus sequence has the highest score -> highest deltaG -> lowest Kd -> more difficult to dissociate (because the more negative deltaG is, the more self-initiating the process is.)
  • I've verified that my hand-written way of calculating score is the same as patser calculation, while the later is much slower.

Validation of the method

I manually check some of the pair Pocc in the output.  Sometimes there seem to be a lot of changes on sim lineage that decreases binding energy score by quite a bit.  For example, for Region CE1057, factor Trl, here is the list of binding sites energy scores for the two species

fpid	mel	sim
4152	0.13	-0.19
4154	4.93	-0.63
4157	7.36	7.29
4165	9.10	10.59

Pocc_mel = 0.066709956
Pocc_sim = 0.188498675

The reason is better explained below in properties of Pocc.  Basically, changes in binding score doesn't correlate linearly with Pocc.
Thus change in the strong site 4165 contribute disproportionally to Pocc.

 

Properties of Pocc

Pocc is the probability that AT LEAST one TF is bound to the enhancer.  As expected, abolishing one binding site in an enhancer consisting a handful of strong sites will not affect Pocc to any substantial amount.  However, why some enhancers are built with a single site while some contain a dozen similarly strongly sites?

 

Several levels of considering binding sites profile divergence:

  1. count the kind and number of binding sites
  2. calculate Pocc, which integrates over all the binding sites of the same kind
  3. model the output of the enhancer thus integrating binding sites position and combination information and so on
  4. Because Pocc is not responsive when there is redundancy. (it is enough to have one or several strong sites to ensure that at least one TF is bound), this comparison will only reveal those cases with single site in the enhancer that is altered in one species or those that have many sites altered at the same time.

One thing to note is what Pocc value reflects.  When there are many strong sites in a sequence, abolishing one of them seem to have very small changes on Pocc.  Necessary to read Pocc papers.

 

1. permute the sequence (preserve dinucleotide frequency)

For bcd using 1. selex PWM 2. PWM from Down et al.

<tbody> </tbody>
Region
Pocc pvalue
CE1030 0.1604 0.232

0.2641
0.618
CE1056
0.8398
0

0.9209
0.001
CE1089
0.0166
0.935

0.0881
0.989
CE1012
0.0894
0.372

0.4948
0.428
CE1004
0.0480
0.389

0.6110
0.116
CE1035
0.4169
0.125

0.8910
0.029
CE1028
0.0445
0.135

0.0753
0.385
CE1074
0.1199
0.315

0.8253
0.03
CE1162
0.7363
0.165

0.9691
0.667

 

code

<a title="Pocc pair calculation and expected" class="internal-link" href="code/Pocc-pair.pl">Pocc expected</a>

<a title="Pocc pair calculation and expected" class="internal-link" href="code/Pocc-pair.pl">Pocc pair calculation</a>

Result

  1. General pattern:

 

<a title="Pocc expected" class="internal-link" href="code/Pocc-pairexp.pl"></a>

For hb
Pocc = 0.0490626779367482  	p-value = 0.951		CE1030
Pocc = 0.857583818550552        p-value = 0.009
Pocc = 0.644644214013364        p-value = 0.066
Pocc = 0.726122781523663        p-value = 0.164
Pocc = 0.99205125002337 	p-value = 0
Pocc = 0.98936891052637 	p-value = 0
Pocc = 0.994712693935612        p-value = 0
Pocc = 0.996523508773069        p-value = 0
Pocc = 0.999922656447524        p-value = 0.005
Pocc = 0.953240878140783        p-value = 0
Pocc = 0.985433379804871        p-value = 0.035
Pocc = 0.955171520728597        p-value = 0.08
Pocc = 0.972211787235409        p-value = 0.006

2. same sequence, scramble the pwm (preserve the total information content)

hb  1000 iterations
CE1030        Pocc = 0.0485909445226903       p-value = 0.793
CE1013        Pocc = 0.855046878947466        p-value = 0.008
CE1056        Pocc = 0.641396463335304        p-value = 0.162
CE1149        Pocc = 0.723027286433472        p-value = 0.087
CE1089        Pocc = 0.991711235632589        p-value = 0
CE1012        Pocc = 0.988940266852853        p-value = 0
CE1004        Pocc = 0.994466512897498        p-value = 0.002
CE1035        Pocc = 0.996344086058097        p-value = 0.003
CE1162        Pocc = 0.999915746792497        p-value = 0.178
CE1034        Pocc = 0.952078221580422        p-value = 0.001
CE1068        Pocc = 0.984885043860919        p-value = 0.031
CE1181        Pocc = 0.953952078454468        p-value = 0.019
CE1029        Pocc = 0.971370913035186        p-value = 0.016

bcd24 1000 iterations
CE1030  Pocc = 0.161788727877099        p-value = 0.13
CE1056  Pocc = 0.842029413353345        p-value = 0.002
CE1089  Pocc = 0.0167758142211993       p-value = 0.733
CE1012  Pocc = 0.0902420614763332       p-value = 0.334
CE1004  Pocc = 0.0484866852062439       p-value = 0.375
CE1035  Pocc = 0.419809501350481        p-value = 0.101
CE1028  Pocc = 0.0449268578762426       p-value = 0.133
CE1074  Pocc = 0.12098154880125 	p-value = 0.144
CE1162  Pocc = 0.739505636509969        p-value = 0.109

Pocc pairs of the two species' enhancer

 

<img class="image-inline" src="result/Pocc-pair.jpg/image_large" alt="Pocc_pair" height="657" width="628" />

 

 

Obviously there are more decreases than increases.

 

Something interesting I discover

Here I can divide all cases into three categories:
1. Affinity in footprints decreases a lot in simulans but Pocc are similar -- functional compensation
2. Affinity in footprints decreases a lot in simulans and Pocc drops dramatically as well -- functional evolution or hetero-factor compensation?
3. Affinity in footprints doesn't change a lot but Pocc changes -- either sim evolves a stronger binding or there are hidden sites somewhere!

Category I

Changes in footprints:
CE1004 3616     Kr  7.69    8640866 poly    0.38    1       4
CE1004 3618     Kr  7.62    8640890  sim   -5.81    4       4
CE1004 3620     Kr -1.00    8640926  sim   -1.74    4       4
CE1004 3623     Kr -1.00    8641007 poly    0.00    1       5

Pocc values
Region	factor	mel_l	sim_l	mel_Pocc		sim_Pocc
CE1004  Kr      548     542     0.391655915895923       0.364895827027112

For this case, actually there doesn't need to be any complementary changes because there are already a lot of strong sites in this region:
+------+--------+---------+--------+--------+----------+---------+------------+----------+------------+-------------+-----------+------------+--------+--------+
| fpid | region | chr_arm | factor | target | fp_start | fp_end  | tfbs_start | tfbs_end | tfbs_score | tfbs_pvalue | anc_score | anc_pvalue | strand | gapped |
+------+--------+---------+--------+--------+----------+---------+------------+----------+------------+-------------+-----------+------------+--------+--------+
| 3612 | CE1004 | 3L      | Kr     | h      |  8640828 | 8640838 |    8640827 |  8640836 |       7.56 |       -9.26 |      7.56 |      -9.26 | -      | N      | 
| 3616 | CE1004 | 3L      | Kr     | h      |  8640856 | 8640866 |    8640857 |  8640866 |       7.69 |       -9.39 |      7.69 |      -9.39 | +      | N      | 
| 3618 | CE1004 | 3L      | Kr     | h      |  8640881 | 8640891 |    8640882 |  8640891 |       7.62 |       -9.33 |      7.62 |      -9.33 | +      | N      | 
| 3620 | CE1004 | 3L      | Kr     | h      |  8640919 | 8640929 |    8640918 |  8640927 |      -6.59 |           0 |        -1 |          0 | -      | N      | 
| 3623 | CE1004 | 3L      | Kr     | h      |  8641003 | 8641011 |    8641003 |  8641012 |      -2.72 |           0 |        -1 |          0 | -      | N      | 
| 3624 | CE1004 | 3L      | Kr     | h      |  8641119 | 8641129 |    8641118 |  8641127 |       8.07 |       -9.78 |      8.07 |      -9.78 | -      | N      | 
| 3627 | CE1004 | 3L      | Kr     | h      |  8641145 | 8641155 |    8641140 |  8641149 |      -3.88 |           0 |        -1 |          0 | +      | N      | 
| 3633 | CE1004 | 3L      | Kr     | h      |  8641252 | 8641262 |    8641251 |  8641260 |       9.42 |      -11.46 |      9.42 |     -11.46 | -      | Y      | 
+------+--------+---------+--------+--------+----------+---------+------------+----------+------------+-------------+-----------+------------+--------+--------+
Simply abolishing one site won't change Pocc a lot.  However, what does Pocc means?

Category II

Changes in footprints:
196 CE1029  260     Kr   3.5   12636476  mel    5.81    5       5
anc     ......G.............
yak     ......G.............
mel     tgcgtGAAAGGGTGAagcta	9.3
sim     nnnn..G.............	3.5
        ......G.............
        ......G.............
        ......G.............
        ......G.............

Pocc values:
CE1029  Kr      1746    1727    0.215347001553784       0.0280790836253705


Interesting literature for CE1029

Shimell, M. J., Peterson, A. J., Burr, J., Simon, J. A., O'Connor, M. B.,

 February 2000. Functional analysis of repressor binding sites in the iab-2
 regulatory region of the abdominal-a homeotic gene. Developmental Biology
 218 (1), 38-52.

URL <a href="http://dx.doi.org/10.1006/dbio.1999.9576">http://dx.doi.org/10.1006/dbio.1999.9576</a>

 

[abstract]
Abdominal-A (ABD-A) homeotic protein. Analysis of a 1.7-kb 
enhancer element [iab-2(1.7)] from the iab-2 regulatory region shows that in contrast to Ubx enhancer elements, both HB 
and Kru¨ ppel (KR) are required to set the ABD-A anterior boundary in parasegment 7. DNase I footprinting and site-directed 
mutagenesis show that HB and KR are direct regulators of this iab-2 enhancer. The single KR site can be moved to a new 
location 100 bp away and still maintain repressive activity, whereas relocation by 300 bp abolishes activity. These results 
suggest that KR repression occurs through a local quenching mechanism.

 I tried to search flanking sequence 300 bp on each side for compensatory Kr sites for sim but didn't find any.  I then calculated Pocc for sim sequence as well as a p-value from 1000 dinucleotide permutation of the sequence.  The results indicate that sim sequence just don't contain a good Kr site

Pocc_mel = 0.215347001553784       p-value (for dinucleotide shuffling) = 0.109
Pocc_sim = 0.050892903419477       p-value (for dinucleotide shuffling) = 0.804

Calculate Potential and its application

Method

The calculation is very similar to Pocc and have interesting relationship with it.

  • score all the sliding windows of pwm_length and rank the scores
  • Pick the highest rank score
  • Pick the second highest rank among those that do not overlap with the previous picked ones until no site can be picked because of overlaps
  • Sum up the Pi 's
  • At equilibrium, this sum value can be interpreted as the "average number of proteins on the enhancer"
  • This value will be very close to Pocc while Pocc is small (0.2 for example)
  • This value doesn't have the saturation problem with Pocc.  So it is better for enhancers containing lots of strong sites and to show the effect of abolishing one of them on the potential of the enhancer to be bound by the factor.
  • Its calculation doesn't impose any arbitrary cutoff.

 

Code

<a title="Potential_pair.pl" class="internal-link" href="code/Potential-pair.pl">Potential_pair.pl</a> 

Result

select e.type, e.region, e.fpid, e.factor, e.coordinate, e.effects, e.Pi, p.melpt, p.simpt, p.simpt_exp 
from   effect_changes e, potential p 
where e.type LIKE "%sim" and e.Pi < -0.01 and e.region = p.region and e.factor = p.factor order by e.type, e.factor, e.region

+------+--------+------+--------+------------+----------+------------+-----------+-----------+-----------+
| type | region | fpid | factor | coordinate | effects  | Pi         | melpt     | simpt     | simpt_exp |
+------+--------+------+--------+------------+----------+------------+-----------+-----------+-----------+
| asim | CE1068 | 6240 | hb     |   12590526 |    -3.94 | -0.0143405 |   2.91906 |    2.4386 |   2.54402 | 
| asim | CE1109 | 6321 | Kr     |   21022230 |    -4.26 | -0.0693678 |  0.121555 | 0.0532071 | 0.0461818 | 
| asim | CE1109 | 6321 | Kr     |   21022232 |    -6.22 |  -0.070286 |  0.121555 | 0.0532071 | 0.0461818 | 
| asim | CE1146 | 1599 | ovo    |    4909561 |    -2.21 |  -0.110128 |  0.554871 |  0.370851 |  0.444743 | 
| asim | CE1131 | 6347 | Ubx    |   11455015 |    -3.44 |  -0.117362 |  0.377224 |  0.231532 |  0.335409 | 
| asim | CE1068 | 6236 | z      |   12590413 |    -4.76 | -0.0758525 |   3.14204 |   3.08593 |   3.14403 | 
| sim  | CE1049 | 6089 | abd-A  |    2759019 |    -3.23 | -0.0409354 |   17.6296 |   18.7606 |   17.3765 | 
| sim  | CE1200 | 6489 | ap     |   22997767 |    -5.61 |   -0.43012 |   3.47621 |   1.64334 |   2.66249 | 
| sim  | CE1099 | 4491 | br-Z3  |    3105855 |    -0.85 | -0.0167293 |   3.27923 |   1.67408 |   3.02324 | 
| sim  | CE1016 | 981  | Deaf1  |    2613240 |    -2.26 | -0.0356046 | 0.0638928 | 0.0379833 | 0.0283291 | 
| sim  | CE1076 | 6272 | dl     |    2581108 |    -1.32 | -0.0135067 |  0.495107 |  0.611945 |  0.714109 | 
| sim  | CE1085 | 6314 | en     |    1981579 |    -4.01 |  -0.194292 |   1.31849 |   1.87045 |   1.17026 | 
| sim  | CE1152 | 5108 | en     |    2692110 |     -1.9 |  -0.369892 |    2.5783 |   2.64407 |   2.43348 | 
| sim  | CE1179 | 5366 | en     |    5827316 |   -0.411 |  -0.043412 |  0.838666 |   1.31887 |  0.795254 | 
| sim  | CE1030 | 6017 | hb     |   11456134 | -0.87005 | -0.0169816 | 0.0411556 | 0.0238455 | 0.0241783 | 
| sim  | CE1035 | 6032 | hb     |   20630455 |  -0.4217 | -0.0649021 |   3.04001 |   2.78995 |   2.90002 | 
| sim  | CE1035 | 6035 | hb     |   20630490 |    -5.82 | -0.0524852 |   3.04001 |   2.78995 |   2.90002 | 
| sim  | CE1056 | 4137 | hb     |    5490187 |    -5.82 |  -0.119851 |  0.728987 |   0.61201 |  0.609136 | 
| sim  | CE1068 | 6238 | hb     |   12590469 |    -5.82 |  -0.234132 |   2.91906 |    2.4386 |   2.54402 | 
| sim  | CE1181 | 5387 | hb     |   12599392 |    -5.82 |  -0.127427 |   2.37144 |   2.74126 |   2.25778 | 
| sim  | CE1004 | 3634 | kni    |    8641330 |    -1.02 |  -0.201558 |   6.03252 |   4.83989 |   5.83096 | 
| sim  | CE1181 | 5390 | kni    |   12599840 |   -4.586 | -0.0191775 |   7.76955 |   8.90855 |   7.76955 | 
| sim  | CE1004 | 3618 | Kr     |    8640890 |    -5.81 | -0.0421574 |  0.463782 |  0.421662 |  0.421625 | 
| sim  | CE1035 | 6039 | Kr     |   20630567 |    -0.38 |  -0.085451 |   2.07909 |   1.99488 |   1.99364 | 
| sim  | CE1200 | 6471 | pan    |   22997071 |    -5.79 |  -0.348139 |   4.44358 |   3.00821 |   3.81605 | 
| sim  | CE1199 | 5541 | slbo   |    9898793 |     -5.2 |  -0.455406 |  0.717394 |  0.203589 |  0.262045 | 
| sim  | CE1035 | 6034 | tll    |   20630490 |    -2.12 | -0.0317674 |   2.14552 |   2.18389 |   2.15739 | 
| sim  | CE1064 | 6172 | Trl    |   14722644 |    -1.63 |   -0.33617 |  0.901073 |  0.540058 |  0.612796 | 
| sim  | CE1068 | 6214 | Trl    |   12589882 |    -1.49 |  -0.156298 |  0.955081 |  0.673408 |  0.716796 | 
| sim  | CE1032 | 300  | twi    |   17203868 |     -6.2 |  -0.268195 |  0.809029 |  0.592913 |  0.564329 | 
| sim  | CE1049 | 6103 | Ubx    |    2759403 |    -2.54 |  -0.310968 |   1.93739 |   1.47474 |   1.33022 | 
| sim  | CE1049 | 6129 | Ubx    |    2760478 |    -3.18 |  -0.151917 |   1.93739 |   1.47474 |   1.33022 | 
+------+--------+------+--------+------------+----------+------------+-----------+-----------+-----------+

All except 2 cases (CE1035/tll, CE1076/dl), all the others are consistent with sim_exp < mel_pt.  
In these two cases, CE1035 has one mel sub that decreases; CE1076 has one amb sitethat makes sim a stronger one.
Compare sim_pt and sim_pt_exp:
inc/dec 14/14!
cannot invoke compensation, at least not shown through this Potential calculation.

Look for cases of dramatic Potential changes

See figure in Keynote file for the "Change in Potential, rank ordered".

1. take simPt-melPt > 0.3 or simPt-melPt < -0.7

mysql> select * from Potential where (simpt-melpt) > 0.3 or (simpt-melpt) < -0.7;                                      
+--------+--------+------+------+----------+---------+-----------+
| region | factor | mell | siml | melpt    | simpt   | simpt_exp |
+--------+--------+------+------+----------+---------+-----------+
| CE1016 | Dfd    | 2658 | 2767 |  23.7409 | 24.5887 |   23.7409 | x possibly non-specific PWM
| CE1079 | abd-A  |  891 |  863 |  3.04907 | 3.65384 |   3.04907 | 
| CE1049 | abd-A  | 2301 | 2378 |  17.6296 | 18.7606 |   17.3765 | 
| CE1200 | ap     |  812 |  824 |  3.47621 | 1.64334 |   2.66249 | *
| CE1004 | bcd    |  548 |  542 |  1.16911 | 1.54582 |   1.17638 | 
| CE1099 | br-Z1  |  823 |  723 |  3.25048 | 2.26729 |   3.22339 | 
| CE1164 | br-Z1  | 3563 | 3450 |  7.51766 | 5.43528 |   7.36482 | 
| CE1099 | br-Z2  |  823 |  723 |  5.31897 | 3.67533 |   5.31964 | 
| CE1099 | br-Z3  |  823 |  723 |  3.27923 | 1.67408 |   3.02324 | *
| CE1086 | en     |  333 |  252 |  2.62556 | 0.89412 |   2.59706 | *
| CE1085 | en     |  296 |  288 |  1.31849 | 1.87045 |   1.17026 | ? strong compensation
| CE1084 | en     |  440 |  434 |  2.62796 | 1.48458 |   2.49766 | 
| CE1179 | en     |  114 |  140 | 0.838666 | 1.31887 |  0.795254 | 
| CE1134 | eve    |  547 |  541 |  4.66232 | 3.91912 |   4.66232 | 
| CE1181 | hb     | 1487 | 1490 |  2.37144 | 2.74126 |   2.25778 | 
| CE1012 | kni    |  933 |  919 |  7.79598 | 6.82782 |    7.7697 | 
| CE1004 | kni    |  548 |  542 |  6.03252 | 4.83989 |   5.83096 | 
| CE1181 | kni    | 1487 | 1490 |  7.76955 | 8.90855 |   7.76955 | 
| CE1055 | pan    | 1578 | 1541 |  8.73959 | 7.96172 |   8.72432 | 
| CE1200 | pan    |  812 |  824 |  4.44358 | 3.00821 |   3.81605 | 
| CE1187 | z      | 1332 | 1250 |  5.32862 | 4.40007 |   5.06548 | 
+--------+--------+------+------+----------+---------+-----------+ 

Specifically focus on enhancers that have few numbers of binding site instances.
Disruption of sites in these cases might have more profound effects.

mysql> select * from Potential order by (simpt-melpt)/simpt limit 30;
+--------+--------+------+------+-----------+-----------+-----------+
| region | factor | mell | siml | melpt     | simpt     | simpt_exp |
+--------+--------+------+------+-----------+-----------+-----------+
| CE1029 | Kr     | 1746 | 1727 |  0.218233 | 0.0212329 | 0.0275388 | 
| CE1144 | ovo    | 1212 | 1214 |  0.320677 |  0.069831 |  0.123308 | 
| CE1199 | slbo   |  327 |  330 |  0.717394 |  0.203589 |  0.262045 | 
| CE1086 | en     |  333 |  252 |   2.62556 |   0.89412 |   2.59706 | 
| CE1112 | br-Z1  |  348 |  309 |   1.01843 |   0.38619 |   1.01843 | 
| CE1042 | Antp   |  477 |  447 |   1.02567 |  0.409653 |  0.992764 | 
| CE1169 | en     |  220 |  220 |  0.191158 | 0.0802844 |  0.185892 | 
| CE1190 | z      |  119 |  119 |  0.574322 |  0.249853 |  0.760966 | 
| CE1109 | Kr     |  981 | 1026 |  0.121555 | 0.0532071 | 0.0461818 | 
| CE1042 | Ubx    |  477 |  447 |   1.19577 |  0.565161 |   1.19577 | 
| CE1200 | ap     |  812 |  824 |   3.47621 |   1.64334 |   2.66249 | 
| CE1159 | Trl    |  368 |  366 |  0.190474 | 0.0967834 |  0.190474 | 
| CE1099 | br-Z3  |  823 |  723 |   3.27923 |   1.67408 |   3.02324 | 
| CE1134 | zen    |  547 |  541 |  0.528344 |   0.28739 |  0.528344 | 
| CE1084 | en     |  440 |  434 |   2.62796 |   1.48458 |   2.49766 | 
| CE1030 | hb     |  517 |  520 | 0.0411556 | 0.0238455 | 0.0241783 | 
| CE1016 | Deaf1  | 2658 | 2767 | 0.0638928 | 0.0379833 | 0.0283291 | 
| CE1064 | Trl    | 1438 | 1431 |  0.901073 |  0.540058 |  0.612796 | 
| CE1112 | br-Z2  |  348 |  309 |   1.52899 |  0.925284 |   1.46139 | 
| CE1131 | Ubx    | 1093 | 1076 |  0.377224 |  0.231532 |  0.335409 | 
| CE1069 | Mad    |  352 |  352 | 0.0798586 | 0.0490269 | 0.0848505 | 
| CE1036 | pan    |  888 |  868 |   1.32913 |  0.854274 |   1.32913 | 
| CE1021 | eve    |  505 |  476 |  0.195583 |  0.129641 |  0.195583 | 
| CE1146 | ovo    | 1081 | 1092 |  0.554871 |  0.370851 |  0.444743 | 
| CE1200 | pan    |  812 |  824 |   4.44358 |   3.00821 |   3.81605 | 
| CE1099 | br-Z2  |  823 |  723 |   5.31897 |   3.67533 |   5.31964 | 
+--------+--------+------+------+-----------+-----------+-----------+

From this calculation I cannot really see evidence for compensation.

Next step to do:

Identify candidate regions for evolution of regulation

experiment to verify some of them?

look for some existing expression dataset?

Do the unbiased prediction for several different factors and summarize the turnover pattern!

see what to do with MK and selection

find out a model to explain the asymmetry--what's the expected turnover rate to explain 1/7 affected and almost all are decreasing?  what's the expectation?  chances for compensation easier allopatric?

Recalculate MK table

      inc dec
mel   22   7
sim    4  26
poly  13  33

Pairs of Overlapping binding sites

select a.region, a.fpid, a.factor, b.fpid, b.factor
from tfbs a, tfbs b
where a.region = b.region and a.fpid != b.fpid and a.tfbs_start between b.tfbs_start and b.tfbs_end;

+--------+------+--------+------+--------+
| region | fpid | factor | fpid | factor |
+--------+------+--------+------+--------+
| CE1099 | 4472 | br-Z3  | 4471 | br-Z2  | 
| CE1099 | 4474 | br-Z2  | 4475 | br-Z1  | 
| CE1099 | 4483 | br-Z2  | 4482 | br-Z3  | 
| CE1099 | 4486 | br-Z2  | 4485 | br-Z3  | 
| CE1099 | 4488 | br-Z1  | 4489 | br-Z3  | 
| CE1010 | 3637 | abd-A  | 3638 | Ubx    | 
| CE1010 | 3639 | Ubx    | 3640 | abd-A  | 
| CE1010 | 3643 | abd-A  | 3644 | Ubx    | 
| CE1010 | 3643 | abd-A  | 3646 | Ubx    | 
| CE1010 | 3646 | Ubx    | 3644 | Ubx    | 
| CE1010 | 3650 | abd-A  | 3649 | Ubx    | 
| CE1010 | 3653 | Ubx    | 3654 | abd-A  | 
| CE1010 | 3656 | abd-A  | 3655 | Ubx    | 
| CE1010 | 3659 | Ubx    | 3658 | abd-A  | 
| CE1010 | 3661 | abd-A  | 3660 | Ubx    | 
| CE1010 | 3660 | Ubx    | 3659 | Ubx    | 
| CE1010 | 3663 | abd-A  | 3662 | Ubx    | 
| CE1030 | 6003 | bcd    | 6002 | Kr     | 
| CE1030 | 6004 | cad    | 6002 | Kr     | 
| CE1030 | 6004 | cad    | 6003 | bcd    | 
| CE1164 | 5446 | br-Z1  | 5448 | br-Z2  | 
| CE1164 | 5448 | br-Z2  | 5446 | br-Z1  | 
| CE1164 | 5449 | br-Z2  | 5450 | br-Z3  | 
| CE1164 | 5449 | br-Z2  | 5451 | br-Z1  | 
| CE1164 | 5451 | br-Z1  | 5450 | br-Z3  | 
| CE1164 | 5457 | br-Z1  | 5459 | br-Z3  | 
| CE1164 | 5458 | br-Z2  | 5459 | br-Z3  | 
| CE1164 | 5458 | br-Z2  | 5457 | br-Z1  | 
| CE1164 | 5461 | br-Z3  | 5460 | br-Z2  | 
| CE1013 | 2269 | kni    | 2270 | hb     | 
| CE1013 | 2277 | kni    | 2278 | hb     | 
| CE1013 | 2278 | hb     | 2277 | kni    | 
| CE1056 | 4118 | Kr     | 4117 | bcd    | 
| CE1056 | 4124 | Kr     | 4123 | Kr     | 
| CE1056 | 4133 | bcd    | 4132 | Kr     | 
| CE1036 | 2466 | pan    | 2467 | tin    | 
| CE1036 | 2467 | tin    | 2466 | pan    | 
| CE1036 | 2468 | pan    | 2469 | tin    | 
| CE1149 | 6375 | eve    | 6376 | en     | 
| CE1149 | 6375 | eve    | 6374 | zen    | 
| CE1149 | 6374 | zen    | 6376 | en     | 
| CE1149 | 6374 | zen    | 6375 | eve    | 
| CE1149 | 6380 | en     | 6379 | zen    | 
| CE1149 | 6380 | en     | 6378 | eve    | 
| CE1149 | 6378 | eve    | 6379 | zen    | 
| CE1149 | 6381 | Kr     | 6380 | en     | 
| CE1149 | 6381 | Kr     | 6379 | zen    | 
| CE1149 | 6381 | Kr     | 6378 | eve    | 
| CE1149 | 6384 | eve    | 6385 | en     | 
| CE1149 | 6384 | eve    | 6383 | zen    | 
| CE1149 | 6383 | zen    | 6385 | en     | 
| CE1149 | 6383 | zen    | 6384 | eve    | 
| CE1134 | 6354 | eve    | 6353 | zen    | 
| CE1050 | 3453 | prd    | 3454 | eve    | 
| CE1089 | 5961 | kni    | 5960 | hb     | 
| CE1089 | 5912 | kni    | 5914 | kni    | 
| CE1089 | 5918 | tll    | 5919 | bcd    | 
| CE1089 | 5919 | bcd    | 5918 | tll    | 
| CE1089 | 5920 | kni    | 5918 | tll    | 
| CE1012 | 5925 | kni    | 5924 | tll    | 
| CE1012 | 5925 | kni    | 5926 | bcd    | 
| CE1012 | 5926 | bcd    | 5924 | tll    | 
| CE1012 | 5941 | bcd    | 5940 | kni    | 
| CE1012 | 5951 | kni    | 5950 | hb     | 
| CE1004 | 3610 | kni    | 3611 | hb     | 
| CE1004 | 3611 | hb     | 3610 | kni    | 
| CE1004 | 3613 | kni    | 3614 | hb     | 
| CE1004 | 3614 | hb     | 3613 | kni    | 
| CE1004 | 3615 | bcd    | 3613 | kni    | 
| CE1004 | 3615 | bcd    | 3614 | hb     | 
| CE1004 | 3616 | Kr     | 3613 | kni    | 
| CE1004 | 3616 | Kr     | 3614 | hb     | 
| CE1004 | 3616 | Kr     | 3615 | bcd    | 
| CE1004 | 3617 | kni    | 3616 | Kr     | 
| CE1004 | 3620 | Kr     | 3619 | tll    | 
| CE1004 | 3625 | bcd    | 3624 | Kr     | 
| CE1004 | 3626 | hb     | 3624 | Kr     | 
| CE1004 | 3626 | hb     | 3625 | bcd    | 
| CE1004 | 3630 | bcd    | 3629 | tll    | 
| CE1004 | 3633 | Kr     | 3632 | hb     | 
| CE1112 | 1201 | br-Z2  | 1199 | br-Z1  | 
| CE1039 | 2493 | slbo   | 2492 | vvl    | 
| CE1000 | 5968 | z      | 5967 | Trl    | 
| CE1035 | 6029 | hb     | 6028 | Kr     | 
| CE1035 | 6030 | tll    | 6028 | Kr     | 
| CE1035 | 6030 | tll    | 6029 | hb     | 
| CE1035 | 6032 | hb     | 6031 | tll    | 
| CE1035 | 6032 | hb     | 6033 | Kr     | 
| CE1035 | 6033 | Kr     | 6031 | tll    | 
| CE1035 | 6034 | tll    | 6035 | hb     | 
| CE1076 | 6260 | Mad    | 6261 | brk    | 
| CE1076 | 6261 | brk    | 6260 | Mad    | 
| CE1076 | 6262 | brk    | 6263 | Mad    | 
| CE1076 | 6263 | Mad    | 6262 | brk    | 
| CE1076 | 6264 | brk    | 6265 | Mad    | 
| CE1076 | 6266 | brk    | 6267 | Mad    | 
| CE1076 | 6282 | brk    | 6281 | Mad    | 
| CE1016 |  987 | Dfd    |  986 | Deaf1  | 
| CE1049 | 6081 | abd-A  | 6082 | Ubx    | 
| CE1049 | 6083 | abd-A  | 6084 | Ubx    | 
| CE1049 | 6086 | Ubx    | 6085 | abd-A  | 
| CE1049 | 6087 | Ubx    | 6088 | abd-A  | 
| CE1049 | 6089 | abd-A  | 6090 | Ubx    | 
| CE1049 | 6093 | abd-A  | 6092 | Ubx    | 
| CE1049 | 6094 | abd-A  | 6095 | Ubx    | 
| CE1049 | 6097 | abd-A  | 6096 | Ubx    | 
| CE1049 | 6099 | Ubx    | 6098 | abd-A  | 
| CE1049 | 6100 | Ubx    | 6101 | abd-A  | 
| CE1049 | 6102 | abd-A  | 6103 | Ubx    | 
| CE1049 | 6106 | abd-A  | 6107 | Ubx    | 
| CE1049 | 6108 | Ubx    | 6109 | abd-A  | 
| CE1049 | 6110 | abd-A  | 6111 | Ubx    | 
| CE1049 | 6113 | Ubx    | 6111 | Ubx    | 
| CE1049 | 6113 | Ubx    | 6110 | abd-A  | 
| CE1049 | 6112 | abd-A  | 6113 | Ubx    | 
| CE1049 | 6116 | abd-A  | 6117 | Ubx    | 
| CE1049 | 6119 | abd-A  | 6118 | Ubx    | 
| CE1049 | 6120 | abd-A  | 6121 | Ubx    | 
| CE1049 | 6122 | Ubx    | 6123 | abd-A  | 
| CE1049 | 6125 | Ubx    | 6124 | abd-A  | 
| CE1049 | 6124 | abd-A  | 6125 | Ubx    | 
| CE1049 | 6126 | Ubx    | 6127 | abd-A  | 
| CE1049 | 6128 | abd-A  | 6129 | Ubx    | 
| CE1049 | 6131 | abd-A  | 6130 | Ubx    | 
| CE1049 | 6133 | Ubx    | 6132 | abd-A  | 
| CE1049 | 6135 | Ubx    | 6134 | abd-A  | 
| CE1042 | 4111 | Ubx    | 4112 | Antp   | 
| CE1162 | 6417 | bcd    | 6416 | hb     | 
| CE1028 | 3680 | tll    | 3679 | bcd    | 
| CE1034 | 2448 | hb     | 2449 | tll    | 
| CE1034 | 2454 | en     | 2453 | twi    | 
| CE1058 | 6146 | Ubx    | 6147 | eve    | 
| CE1058 | 6152 | zen    | 6150 | eve    | 
| CE1058 | 6152 | zen    | 6153 | Ubx    | 
| CE1058 | 6153 | Ubx    | 6150 | eve    | 
| CE1058 | 6164 | z      | 6163 | Trl    | 
| CE1068 | 6202 | hb     | 6201 | Trl    | 
| CE1068 | 6206 | z      | 6207 | Trl    | 
| CE1068 | 6210 | z      | 6211 | Trl    | 
| CE1068 | 6214 | Trl    | 6215 | z      | 
| CE1068 | 6219 | z      | 6218 | Trl    | 
| CE1068 | 6221 | hb     | 6220 | Trl    | 
| CE1068 | 6221 | hb     | 6222 | z      | 
| CE1068 | 6222 | z      | 6220 | Trl    | 
| CE1068 | 6228 | z      | 6226 | hb     | 
| CE1068 | 6235 | hb     | 6234 | z      | 
| CE1068 | 6245 | Trl    | 6246 | z      | 
| CE1181 | 5372 | twi    | 5371 | hb     | 
| CE1181 | 5375 | twi    | 5374 | tll    | 
| CE1181 | 5378 | en     | 5377 | hb     | 
| CE1181 | 5377 | hb     | 5378 | en     | 
| CE1029 |  265 | hb     |  266 | eve    | 
| CE1069 |  753 | tin    |  751 | Mad    | 
| CE1200 | 6487 | ap     | 6486 | pan    | 
+--------+------+--------+------+--------+
154 rows in set (9.60 sec)