Koch Lab:Publications/Proof of principle for shotgun DNA mapping by unzipping

Revision as of 15:51, 27 January 2009

January 19th rough draft

Comments

Richard C. Yeh 20:44, 20 January 2009 (EST) First, congratulations on your new paper draft!

Steve Koch 17:10, 21 January 2009 (EST): Thank you! I'm going to respond to some of your comments below, and then I expect eventually they will be moved to sub-pages, since they are really good ideas warranting discussion and further data.

My understanding of your paper is that you have shown that you can identify a naked dsDNA sequence by comparing its unzipping force-index curve against a library of simulated curves. Very cool, and I want to thank your team for writing the paper in a way that I can understand. Here are my comments:

1

1. I think the major contribution of the paper is that you have an accurate simulation and matching method, which can easily account for elasticity of the tethering construct, viscosity, etc. I'd like to see more details (or at least know that you have the answers in your back pocket) on the information-recall side of the problem. See terms at the bottom of < http://en.wikipedia.org/wiki/Sensitivity_and_specificity >. I'm no expert here, but here are my thoughts (and most of the answers don't necessarily belong in the paper):

Steve Koch 17:17, 21 January 2009 (EST): Good Wikipedia page; I need to spend time understanding it. I think that kind of rigorous language would help us.

You matched experimental F-j curve vs simulated F-j curve. Why not interpret your experimental F-j curve into a sequence (with wild-card characters) and then do the matching in ASCII space? Is there a rough map between your match score and a BLAST similarity score? (But, since I'm no BLAST expert, I would need to do lots more research before talking about BLAST in the paper.)

Steve Koch 17:17, 21 January 2009 (EST): Interesting idea and similar to one Anthony had earlier...I shot him down saying that sequencing by unzipping is practically impossible. That's interesting in terms of converting it to a wildcard sequence and then trying out the existing sequence homology searching stuff, such as BLAST, which I've also not used...(more comments later)
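The "matching in ASCII space" idea above could be sketched roughly as follows. This is only an illustration, not anything from the paper: the force thresholds, the symbol alphabet (W/S/N for weak, strong, ambiguous), and the function names are all invented here, and a real version would derive the thresholds from the simulation's force distributions for AT-rich vs GC-rich regions.

```python
import re

# Hypothetical force thresholds (pN) --- placeholders, not the paper's values.
AT_MAX = 13.0   # below this, call the region AT-rich ('W' = weak)
GC_MIN = 16.0   # above this, call the region GC-rich ('S' = strong)

def forces_to_symbols(forces):
    """Quantize a smoothed force-index curve into a wildcard string:
    'W' (weak/AT), 'S' (strong/GC), or 'N' (ambiguous)."""
    out = []
    for f in forces:
        if f < AT_MAX:
            out.append("W")
        elif f > GC_MIN:
            out.append("S")
        else:
            out.append("N")
    return "".join(out)

def wildcard_to_regex(symbols):
    """'N' matches any symbol; 'W'/'S' must match exactly."""
    return re.compile(symbols.replace("N", "."))

# Toy usage: match a quantized trace against a small symbol library.
trace = forces_to_symbols([12.0, 12.5, 17.0, 14.5, 17.5])   # "WWSNS"
pattern = wildcard_to_regex(trace)
library = {"frag1": "WWSWS", "frag2": "SSWWW"}
hits = [name for name, sym in library.items() if pattern.fullmatch(sym)]
```

Once the curve is in symbol space, existing string-similarity machinery (edit distance, or BLAST-style local alignment) could in principle be applied directly.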

In the "Future Improvements" section, you discuss the general sequencing-by-unzipping problem. Imagine simulating all possible sequences from j=1200 to j=1700. In this space of 4^500 sequences, what are the characteristics of those close to your experimental sequence, and what is the minimum similarity that can be resolved? From the other side, in your matching problem, how much easier has the problem been made by considering only the 2700 restriction fragments instead of the 4^500 general genome? What if you don't restrict yourself to those restriction fragments, but allow any contiguous 500-bp sequence from yeast --- then how many false positives do you get? (In Figure 4, what is the outline that you would get by simulating random fragments from the 4^500 or yeast-sequence universe?) A related question is characterizing the sequences in the overlap of the Gaussians.

Steve Koch 08:59, 25 January 2009 (EST): These are really great points. I don't know whether we have the efficiency of computation and / or the computing power to try out every contiguous yeast sequence. Because unzipping requires covalent attachment to both the top and bottom DNA strand, using site-specific restriction endonucleases is a natural mechanism for attaching genomic fragments to the "unzipping anchor" (the "unzipping anchor" has the digoxigenin and biotin label, plus a nick to allow unzipping). I'd say a good next computational step would be to try a 4-cutter, or the human genome, both of which would have a larger library. Experimentally, we're going to try a "shotgun cloning" strategy to show how well the method works with actual XhoI-digest yeast fragments. (This is described in the grant application we just submitted.)
Steve Koch 08:59, 25 January 2009 (EST): The shape is interesting if you sort the mismatch scores, but I don't know what to do with that. This reminds me that we haven't posted our data and software yet, but Larry's going to do that soon. Sorry it's not up yet!
Steve Koch 08:59, 25 January 2009 (EST): As for the 4^500...well, currently we don't include nearest neighbors, so it's really 2^500 since A is same as T. Plus, we should be able to reduce the window size, something Larry is currently studying. Maybe instead of covering all the sequences, we could just generate 100,000's of random sequences?
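The random-sequence idea in the reply above (score the data against, say, hundreds of thousands of random fragments instead of all 4^500) could be prototyped like this. Everything here is a stand-in: the "profile" is just windowed GC fraction (since unzipping force tracks local GC content to first order), and the score is an RMS difference, not the paper's kT/C formula.

```python
import random
import statistics

random.seed(0)

def gc_profile(seq, window=50):
    """Coarse stand-in for a simulated force curve: windowed GC fraction."""
    return [sum(b in "GC" for b in seq[i:i + window]) / window
            for i in range(0, len(seq) - window + 1, window)]

def match_score(profile_a, profile_b):
    """Stand-in score: stdev of the profile difference (smaller = better)."""
    diffs = [a - b for a, b in zip(profile_a, profile_b)]
    return statistics.pstdev(diffs)

def random_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

# "Experimental" fragment vs a null library of random 500-mers.
target = random_seq(500)
target_profile = gc_profile(target)
null_scores = [match_score(target_profile, gc_profile(random_seq(500)))
               for _ in range(1000)]
self_score = match_score(target_profile, target_profile)
```

The histogram of `null_scores` against `self_score` would be the random-universe analogue of the blue/red peaks in figure 4, and would show how much easier the 2700-fragment restriction library makes the problem.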

Suppose when you unzip chromatin, you get only a naked-DNA signal over the linker DNA regions, and an un-simulated signal over the non-linker regions. Would you speculate on how that (change in signal) will affect your recall?

Steve Koch 09:04, 25 January 2009 (EST): We're going to use the kick-ass versatile feedback control software that you wrote! Thus, it will be very easy for us to write an algorithm to do this:
  • Unzip approx 500 base pairs (or whatever we decide)...stop in a naked DNA region (based on force)
  • reverse direction and rezip DNA
  • Unzip again (this will generate the identification tag from the newly created naked DNA)
  • Continue unzipping however much of the remaining fragment possible. (Someday (10 years?) with MEMS or other devices, the length of chromatin possible will not be limited.)
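The bullet-point algorithm above could be sketched as a small decision rule, with the hardware I/O stubbed out. The threshold, the 500-bp window, and the names (`next_action`, phase labels) are invented for illustration; the real version would live inside the feedback-control software.

```python
# Placeholder values --- not measured numbers.
NAKED_DNA_FORCE_MAX = 17.0   # pN; forces above this suggest a bound protein
TAG_LENGTH_BP = 500          # length of the identification tag to unzip

def next_action(phase, index_bp, force_pN):
    """One controller step: given the current phase, unzipping index, and
    measured force, return (action, next_phase)."""
    if phase == "first_pass":
        if index_bp >= TAG_LENGTH_BP and force_pN <= NAKED_DNA_FORCE_MAX:
            # Stopped in a naked-DNA region (based on force): go back.
            return "reverse", "rezip"
        return "unzip", "first_pass"
    if phase == "rezip":
        if index_bp <= 0:
            # Fully rezipped: re-unzip the newly created naked-DNA tag.
            return "unzip", "second_pass"
        return "rezip", "rezip"
    # second_pass: continue unzipping the remaining fragment.
    return "unzip", "second_pass"
```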

Suppose you simulate all 2700 restriction fragments. What if you make another version of figure 4 by scoring all simulated sequences versus each other (sim vs sim instead of expt vs sim)? (Also, by symmetry, what is the spectrum of scores obtained for individual OT runs --- both match and mismatch --- against each other (expt vs expt, instead of expt vs sim)?) Is the difference in distance between your blue and red peaks roughly equal to the difference in distance between correct and incorrect peaks for the 2700 x 2700 test? (If not, then you will have to explain what's special about your 32 sequences.) Do the 2700 fragments include the reverse-complement sequences?

Steve Koch 09:08, 25 January 2009 (EST): Reverse-complement...lemme think... The complementation is not important (as I mentioned above, A is same as T currently). I am thinking that reversing wouldn't happen in the actual implementation (due to construction methods). But it may be a great way of assessing how much orientation matters.
Steve Koch 09:08, 25 January 2009 (EST): We actually have done the other things you suggested. Larry computed the whole match scores for sims against sim (generating a 2700x2700 matrix). I don't think he generated a histogram from it, though, yet, do you think it belongs in the paper? We also did data versus data, does that belong? For sure we will post this data at least online, and maybe in supplementary?

2

2. In the excerpt of the match score formula, I see kT / C / stdev(force difference). Here are the implications that I read from the formula:
a. for a given match score, a 1-unit force difference on 4 indices (neighboring or separate) is worth a 2-unit force difference on 1 index; and
b. for a given match score, the standard deviation of the force difference scales with the temperature of the environment.

On (a), I recognize that this is a draft and see that you may not want or need to refine the formula.

On (b), are you making a statistical-mechanical statement (and I see no support for it elsewhere in the paper), or are you just dressing up the standard deviation of force in statistical-mechanical clothing, by adding a kT and making it unitless? For the purposes of your experiment, kT/C is constant, so that portion of the relationship cannot be tested. My thought is, if you're not making a statistical-mechanical statement, then your match score should be as simple as possible:

m2 = stdev(force difference/scale force)

where the scale force is kT/2/C. Do you actually get a better Gaussian when you take exp(-1/m2)?

My own biases would reverse your scale, suggesting numbers close to 1 as a good match, and numbers close to zero as a poor match --- something that can be interpreted as a probability, like exp(-m2) / sum_{all sequences} exp(-m2). On second thought, perhaps it is good that the score does not change as we consider a different universe of sequences.

Steve Koch 09:15, 25 January 2009 (EST): I don't understand your last comment, but all the others above are valid. We would really like to have a more fundamentally grounded match score. What we're looking at is energy differences per base pair. For the "correct match," these differences have to do with experimental noise / drift, errors in the calibration, and errors in the simulation. So, I'd think the "correct" match scores, in terms of sum of squares, would have a chi-squared distribution? The distribution of sum-of-squares differences for mismatches would be dominated by sequence "errors", I think. What distribution would that have?
Steve Koch 09:15, 25 January 2009 (EST): (More to come later!)
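Richard's simplified score (m2) and probability-style normalization from point 2 could be sketched as follows. The values of kT and C here are placeholders (kT in pN·nm at room temperature; C is whatever the paper's constant actually is), and the force traces are made-up numbers, so this only shows the shape of the calculation.

```python
import math
import statistics

kT = 4.1          # pN·nm at room temperature
C = 0.1           # placeholder for the paper's constant, chosen so kT/2/C is a force
SCALE_FORCE = kT / 2.0 / C

def m2(force_data, force_sim):
    """Richard's simplified score: stdev of the scaled force difference
    (smaller = better match)."""
    diffs = [(fd - fs) / SCALE_FORCE for fd, fs in zip(force_data, force_sim)]
    return statistics.pstdev(diffs)

def match_probabilities(scores):
    """exp(-m2) normalized over the candidate library, so each candidate
    gets a value in (0, 1), summing to 1 --- closer to 1 = better match."""
    weights = [math.exp(-s) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy library: candidate 0 tracks the data closely, candidate 1 does not.
data = [15.0, 16.0, 14.5, 15.5]
sims = [[15.1, 15.9, 14.4, 15.6], [12.0, 18.0, 11.0, 19.0]]
scores = [m2(data, s) for s in sims]
probs = match_probabilities(scores)
```

As Richard notes, the normalized form has the property (good or bad) that each candidate's probability changes whenever the universe of candidate sequences changes, whereas the raw m2 score does not.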

3

3. The beginning of the paper reads to me like "peeing on the tree". In particular, I think mentioning Pol II is overreaching.

Steve Koch 16:04, 25 January 2009 (EST): I see what you mean, and we definitely felt like the introduction was too long about background stuff. But we didn't know how to shorten it, although clearly it needs to be. As for marking our territory, it's I guess related to our point. Now that we're transitioning to open science, it's not clear to me how not to mark territory. We do want people to know what we're working on, and we don't want to be scooped from out of the blue! I think one strategy we could pursue in this introduction is to remove discussion of our research goals and just mention the many possible applications and say what we were specifically motivated by. It's tough, though, because as physicists, we don't yet understand the applications well enough to succinctly describe them.

4

4. It really bothers me in figure 2 that the colors are swapped between A and B. Also, next to figure 4, in the last paragraph before "Future Improvements", you use the word "fell" instead of "felt". Also, I have not worked on the language very much, but the last line on page 8 is a little strange.

Steve Koch 16:04, 25 January 2009 (EST): We'll fix this (we used pre-made figures during our writing session)

5

5. I'd like to be able to estimate what the 30-point smoothing window represents (in nm or bp) when multiplied with your stretch rate.
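The conversion being asked for is straightforward arithmetic once the acquisition parameters are known. The stretch rate, sampling rate, and nm-per-bp figure below are placeholders, not the paper's actual values; the point is only the shape of the calculation (window extent = points / sample rate × stretch rate).

```python
N_POINTS = 30
STRETCH_RATE_NM_PER_S = 1000.0   # placeholder stage velocity
SAMPLE_RATE_HZ = 10000.0         # placeholder data-acquisition rate
NM_PER_BP_UNZIPPED = 1.0         # rough: ~2 ssDNA nt released per bp unzipped

window_s = N_POINTS / SAMPLE_RATE_HZ
window_nm = window_s * STRETCH_RATE_NM_PER_S
window_bp = window_nm / NM_PER_BP_UNZIPPED
print(f"{window_s * 1e3:.1f} ms -> {window_nm:.0f} nm -> ~{window_bp:.0f} bp")
```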

6

6. You mention viscosity a few times and say that it's significant, but it seems to be absent from the "Creation of Simulation Library" description. However, later, you say that the viscosity is not important because it would just impose the same shift on the scores of both the matches and the mismatches (by affecting the OT data but not the simulated data). It's not clear to me how the GC content affects your estimation --- is it that the higher force needed to unzip improves the signal compared to the viscosity offset? The nature of standard deviation is such that if you had a mismatched sequence modeled with the correct force scale, it might have a better score than a matched sequence with the incorrect force scale. Alternatively, if you colored the points in figure 3 by GC content, would all those with the highest GC content have the lowest (most similar) match scores?

Initially, I felt that there was a contradiction here, but now I think that the argument needs to be revised: you depend on the scale of the force to be correct so that you can use the standard-deviation-of-difference in your match score. But if your simulated force doesn't include viscosity, then we cannot trust the energy calculation in your simulation.

Steve Koch 16:09, 25 January 2009 (EST): The viscous drag is an unfortunate feature of the experimental data due to the very fast stretch rate. We didn't do anything with it in the simulation (it would merely produce a vertical shift in the data). We don't have another group of naked-DNA data like this one --- dozens of data sets all taken under the same conditions --- at least not that I could find. When we take new data this year, we will optimize the data acquisition for this specific purpose, so this problem will go away.

Richard C. Yeh 20:44, 20 January 2009 (EST)