DataONE:Notebook/Reuse of repository data/2010/06/11

{| width="800"
 * style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]] Reuse of Repository Data
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|

Notes for June 11, 2010

 * Much of today was spent reorganizing and adding information in the spreadsheet, adding and editing information on OWW pages, and discussing the project with Heather (see the transcript below).
 * More time will be spent refining search strategies in this coming week.
 * See also: TreeBASE Citation Spreadsheet
 * For citations of resources found, please refer to this CiteULike page.

Resources searched with search terms and hit count

 * 1) Resource: Nature Search term(s): "study accession" treebase Results: 3 (without initial 2008, English only limit)
 * 2) Resource: Nature Search term(s): "study accession" treebase Limit: Published between two dates: 2008-2010 Results: 1 (with initial 2008, English only limit)
 * 3) Resource: Systematic Biology Search term(s): treebase, "study accession" all words in text/abstract/title  Limit: Jan 2008-May 2010 Results: 12
 * 4) Resource: Systematic Biology Search term(s): treebase, "study accession" NOT "deposited" all words in text/abstract/title  Limit: Jan 2008-May 2010 Results: 6
 * 5) Resource: Systematic Biology Search term(s): treebase, "study accession," NOT (deposit*) all words in text/abstract/title  Limit: Jan 2008-May 2010 Results: 8
 * 6) Resource: Systematic Biology Search term(s): treebase, "study accession," NOT (deposit*), NOT (submit*) all words in text/abstract/title  Limit: Jan 2008-May 2010 Results: 4
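The six searches above could also be kept as structured records so that hit counts are easy to compare as the queries change. This is only a minimal sketch: the `SearchRecord` class, its field names, and the `hits_by_resource` helper are hypothetical and not part of the project spreadsheet; the values are copied from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRecord:
    """One logged search (hypothetical structure, not the project's schema)."""
    resource: str
    terms: str
    limit: str
    hits: int


# Values copied from the six searches listed above.
SEARCH_LOG = [
    SearchRecord("Nature", '"study accession" treebase', "none", 3),
    SearchRecord("Nature", '"study accession" treebase',
                 "2008-2010, English only", 1),
    SearchRecord("Systematic Biology", 'treebase, "study accession"',
                 "Jan 2008-May 2010", 12),
    SearchRecord("Systematic Biology",
                 'treebase, "study accession" NOT "deposited"',
                 "Jan 2008-May 2010", 6),
    SearchRecord("Systematic Biology",
                 'treebase, "study accession," NOT (deposit*)',
                 "Jan 2008-May 2010", 8),
    SearchRecord("Systematic Biology",
                 'treebase, "study accession," NOT (deposit*), NOT (submit*)',
                 "Jan 2008-May 2010", 4),
]


def hits_by_resource(log):
    """Group hit counts by resource, preserving the order searches were run."""
    grouped = {}
    for record in log:
        grouped.setdefault(record.resource, []).append(record.hits)
    return grouped


for resource, hits in hits_by_resource(SEARCH_LOG).items():
    print(resource, hits)
```

Keeping the limit as its own field makes it harder to forget which date or language restriction a given hit count depends on.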

Observations

 * Valerie Enriquez 12:59, 11 June 2010 (EDT): An idea has occurred to me. In my searches of Web of Science, Scirus and Nature, I've only found articles stating that the data from their project is hosted at TreeBASE. However, either on the journal article pages or through an additional search in WoS's citation finder, I can find the number of articles that have cited the original article in which the data hosted at TreeBASE was mentioned. I could go through these secondary articles to find out whether the information cited is from the TreeBASE data.


 * Valerie Enriquez 14:14, 11 June 2010 (EDT): (as copy/pasted from email to Heather) I've been finding mostly articles stating that their data is hosted on TreeBASE, as opposed to articles specifically citing TreeBASE. I still need to tweak my searches, which is interesting since Nature doesn't seem to allow Boolean search. I am also taking note of the number of articles that cite the original article from which the TreeBASE data originated. Even though that is not an example of direct reuse, it could stand as an example of why there need to be standards for direct reuse of data.


 * Valerie Enriquez 20:45, 11 June 2010 (EDT): Despite using the "NOT" Boolean operator with the word "deposited" in the Systematic Biology search function, the first result I found contained the word "deposited." After revising my search as per the SysBio Boolean Logic help page, I received 8 hits, with the first one using the word "submitted" instead of "deposited": "The aligned data matrices and resulting phylogenetic trees were submitted to TreeBASE (www.treebase.org; study accession number SN3797)." (from doi:10.1093/sysbio/syp010). Another search, which used the wildcard character "*" (as per SysBio's Wildcard help page) in conjunction with NOT Boolean logic, still returned four articles with some form of "submit*" or "deposit*" within the text.
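Since the journal's own NOT and wildcard handling did not behave as expected, one way to double-check hits is to re-apply the exclusion locally to sentences copied from the articles. The following is a minimal sketch under that assumption; the function names and the `CREATION_STEMS` tuple are hypothetical, and the regular expression only approximates a trailing wildcard such as deposit*. It says nothing about how the SysBio search engine actually works.

```python
import re

# Word stems that usually signal data creation rather than data reuse
# (hypothetical list, taken from the NOT terms tried above).
CREATION_STEMS = ("deposit", "submit")


def mentions_stem(text, stem):
    """True if any word in text starts with stem, like the wildcard stem*."""
    pattern = r"\b" + re.escape(stem) + r"\w*"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def passes_not_filter(text, stems=CREATION_STEMS):
    """True if text contains none of the stems (a local NOT stem* filter)."""
    return not any(mentions_stem(text, stem) for stem in stems)


# Sentence quoted above from doi:10.1093/sysbio/syp010.
sentence = (
    "The aligned data matrices and resulting phylogenetic trees were "
    "submitted to TreeBASE (www.treebase.org; study accession number SN3797)."
)

# Excluding only deposit* lets the sentence through; adding submit* removes it.
print(passes_not_filter(sentence, stems=("deposit",)))
print(passes_not_filter(sentence))
```

A local check like this only works on text that has already been retrieved, so it complements the site search rather than replacing it.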

Valerie's Conversation with Heather in Google Docs around 2:00 pm June 11, 2010
Note: Heather was both Anonymous User 166 and Anonymous User 183

Anonymous user 166: Hi Valerie! I'm looking at your google spreadsheet. Nice start!

me: thanks

Anonymous user 166: I have a few suggestions. When would be a good time?

me: I'm good now. I just finished adding the new page you told me about.

Anonymous user 166: cool. great.

me: http://openwetware.org/wiki/DataONE:Notebook/Reuse_of_repository_data oh wait, am I talking to Hannah or someone else?

Anonymous user 166: perfecto. You are talking to Heather. I think maybe sometimes you have given me the nickname Hannah by mistake :) I like the name Hannah, as it turns out LOL

me: whoops, sorry

Anonymous user 166: That's ok, no problem at all.

me: I think it's because you're listed under my former TA Hannah on my gchat

Anonymous user 166: ah hah! That'd do it. that notebook page looks great.

me: thanks, I more or less took the summary from the questions page

Anonymous user 166: Add to the top to help navigation, and add a link back to the Research Questions page, and then keep fleshing out content. I haven't used the calendar parts of Notebooks yet, but I think it may be handy

me: I've been copy/pasting my thoughts from emails into the dated entries

Anonymous user 166: Do you have questions for me before I suggest a few additional columns? Valerie, perfect! Yes I think putting the correspondence there makes sense. We're clearly figuring it out still, so let us all know if that organization ends up working for you

me: ok I like being able to see the dates and timestamps on things

Anonymous user 166: yeah, agreed. it looks like you are getting a bit of traction on finding things

me: should I log everything?

Anonymous user 166: is the overall project making sense at this point? do you have questions now that you've had a chance to think about the aims/context/etc a bit?

me: I've only been logging things that looked like data reuse, then making a note of it if they're not after running a search. I just want to make sure I'm not doing things wrong :)

Anonymous user 166: good question, I don't know if you should log everything. I think log at the level of detail that you feel is right for you... err on being open, if you are otherwise on the fence

me: the thing is, sometimes the search results end up in the hundreds with maybe only about 20 somewhat relevant. ok

Anonymous user 166: oh, I see what you mean by everything. ok, well here is one idea

me: or should I just mention the number of search results for each source

Anonymous user 166: start a new wiki page for today (or click on today in your Notebook calendar, and add a new "search results" section under a "correspondence" section or something) And then keep a running commentary

me: that makes much more sense

Anonymous user 166: so say "first I ran this search" and then paste in the url

me: because otherwise the spreadsheet would end up having too much information

Anonymous user 166: and then say "it returned 23423 hits" and then maybe paste in the return-url if there is one (or what makes sense)

me: ok, that does make a lot more sense I really like OWW so far

Anonymous user 166: and then you could comment and say "but I did a quick look at them and they were almost all data CREATION links, so I then refined the search to this" etc. oh good, I'm glad

me: ok, I figured I should keep detailed information about my search methodology anyway

Anonymous user 166: yes.

me: just in case someone has better ideas

Anonymous user 166: yes :) and also, to keep track of what doesn't work

me: ok, that's definitely important too

Anonymous user 166: one of the founders/initial instigators of the open notebook science concept, Jean-Claude Bradley, feels that that is one of the main benefits of ONS the fact that you record what doesn't work

me: sweet. I kept a leadership journal like that once in college.

Anonymous user 166: great. then you are ahead of me. I've never been very systematic about it... but I think there are huge benefits in it so I'm hoping to start

me: I'm not sure if anyone referenced it, but I left it in the organization's library for future reference.

Anonymous user 166: wow, cool!

me: well, admittedly, a lot of the journal was written after the fact, but in some cases, hindsight is 20/20. it'll be interesting to keep a journal through the whole process though.

Anonymous user 166: yeah, agreed. ok, so for your search results I imagine you will play with the queries a bit, recording your decisions and tweaks, then get to something that looks good enough to dig into in more detail (you may already be there) I think there is some benefit in recording some detail about the links that aren't reuse...but it doesn't have to be at the same level of detail.

me: or the fact that sometimes even the original data creators don't cite their data properly (like the link that goes to harvard for some reason)

Anonymous user 166: right!

me: and about what you said earlier, it might not be that helpful for me to go through the articles citing the original research article

Anonymous user 166: I think it is worthwhile to keep some level of detail about how people talk about data creation links to the extent that you find them when you are looking for data reuse links

me: because if the other researchers got the information from the article and not TreeBASE, it's kind of a moot point. ok

Anonymous user 166: because that information would help you refine your query. e.g. said "we deposited data in": 5 times; said "we have uploaded data into": 10 times; etc.

me: ok, that makes sense

Anonymous user 166: not exactly that, modify it so that it is easy to update

me: I was going to try boolean searching next with NOT and "uploaded" or NOT "deposited"

Anonymous user 166: that may be too much detail... but capture some detail about what you are capturing that you don't want, to help inform the NOTs. yes, exactly.

me: ok

Anonymous user 166: now I'd also keep track of the fact that you are doing that. obviously, but also keep track of it in another section. for example, on your main new project page you might want to start a section called "limitations" and add to it "if I end up using a query that has NOT uploaded in it, this will mean that I will not find papers that created AND reused data"

me: ah, this is true

Anonymous user 166: this would be fine, but it would be worth remembering that it is a limitation of the approach that affects generalizability. I don't think it affects it too much, and may definitely be worth it in an 80/20 study like this one. (personally I think it will be a big help and you should add the NOTs)

me: ok. I'll also make a note of searches that for whatever reason don't allow boolean searching

Anonymous user 166: just brainstorming ways we can use OWW to keep track of potential thoughts and implications of our research as we are doing them. yup, great

me: so in other words, all else fails: make an entry in OWW

Anonymous user 166: so you could have a "lessons learned" section (or something) that says "the Nature website is really hard to query for purposes like this because it doesn't support ABC" or whatever. yup :)

me: ok, I can definitely do that

Anonymous user 166: great. I personally thank you, because it will help me later :)

me: no problem

Anonymous user 166: for what it is worth, I expect you will start data gathering about three times in the next three weeks, as you learn more.

me: I figure showing one's work is part of the whole scientific process anyway. ok

Anonymous user 166: so expect to have to start over again and don't consider it failing, it is all part of the learning. especially if documented ;) yeah.

me: ok, because at first I was really worried if I didn't get any solid results

Anonymous user 166: so here are a few thoughts on additional columns for your spreadsheet. no, don't be worried about that.

me: ok, I remember Sarah mentioning something about how the large blocks of text would be hard to import or count in a database

Anonymous user 166: first of all, I think it will take you a week (?) of trying before you start to get a feel of what works and what doesn't. this isn't an easy query task, so it will take a bit of exploring, both in terms of full-text, and in terms of looking up DOI prefixes in reference lists, etc. hrm, I wonder what Sarah meant by that, I'll have to go reread.

me: ok, I'm not sure if she sent everyone the message

Anonymous user 166: ok. I do think that small blocks of text could be really helpful though. so for example, it would be helpful if you added another column

me: this is what she said: "Generally speaking, I think it's good that we collect the original text (i.e. in your policy sheet which has large chunks of text), but I think each of those fields should be accompanied by a coded categorical or quantitative field, otherwise they aren't useful for statistics. There are different pros and cons to coding the data during or post data collection...we should discuss this on Monday."

Anonymous user 166: into which you paste the actual text of the sentence that makes the citation reference so we can see what words are used

me: ah, that is a good idea

Anonymous user 166: yes, agreed! So do copy the text, but we also need to break out the important parts.

Anonymous user 183: so I think the fact that you have a separate breakout column for the url or the citation itself is useful, in addition to the full sentence. mistake

me: wait, where should that go?

Anonymous user 183: trying undo! Hmm, not quite sure what happened there.

me: it's ok, I need to go back through and look at most of these again anyway

Anonymous user 183: Maybe you changed the F19 cell? Or did I? I can't see the revision history, maybe because I'm Anonymous for some reason....

me: yeah, I'm not sure why everyone's showing up anonymous

Anonymous user 183: hmmm. whoops.

me: oh, I think it's because I'm sharing with everyone so they're not logging who's in the document at any given time

Anonymous user 183: well, I'll just let it go. gotcha. I think I just lost our chat history though. Not sure if it is helpful to save... if so, could you save it please?

me: ok, I'll try to copy/paste it, although it'll probably say anonymous for us

Anonymous user 183: anyway, a few more quick thoughts.... did I explain well enough what I mean by the reuse-sentence column? that's ok

me: the sentence within the body of the article that cites TreeBASE?

Anonymous user 183: yes, that's right.

me: like The matrices in figure 1 (study accession number S### in TreeBASE) or something

Anonymous user 183: yes, exactly

me: ok, I can do that, especially since I can actually get fulltext Nature articles when I'm on a Simmons computer

Anonymous user 183: or "We used the results from three Treebase studies [34-37]" great

me: most of what I've found so far has been "We deposited our data in TreeBASE" (but I'm working on a way to exclude those results)

Anonymous user 183: it would also be helpful to have a column that contains a link to the articles themselves

me: like the DOI?

Anonymous user 183: though I know that link depends on a user's proxy settings so it may not be reasonable

me: a lot of the articles have DOIs

Anonymous user 183: yeah, a doi might be a better idea. add a doi column

me: at the very least, links to the abstracts, right? ok

Anonymous user 183: right. two cols for doi and link to abstract would be great

me: ok. ok, just to be clear, who am I talking to now? Heather was "Anonymous user 166" earlier

Anonymous user 183: yup, I think that would be a really good start. Yes, sorry about that! Heather both times

me: oh ok, for some reason I thought there were multiple people talking

Anonymous user 183: one more thing

me: ok

Anonymous user 183: if it turns out you are getting too many hits, you can decide to limit the journals or issues you are looking in. the number of hits in Nature might not be too great, but if you start searching in Scirus or something you might want to limit it to 2009 publications, perhaps. something like that

me: ok, that makes sense. I should probably explicitly state a date limit and language limit (works in English), right?

Anonymous user 183: yes

me: (unfortunately the only languages I can read in are dead) ok, I'll make a note of that

Anonymous user 183: one more idea is to add the journal Sys Bio to your short list: http://sysbio.oxfordjournals.org/ they do lots of data sharing, and may also do lots of data reuse

me: oh neat

Anonymous user 183: I think that Sarah will be looking at it thoroughly for a few years, but you could do a repository-depth sort of search. ??? anyway, it is a thought

me: I can look at that as well

Anonymous user 183: Though as I said it, I think the idea of doing a look across many journals, limited to 2009 for example, is a better fit since Sarah is focussing on specific journals

me: many journals limited to 2009 in English

Anonymous user 183: you can limit Google Scholar etc that way. yup, that would be my current thought.

me: yeah, there's tons of stuff in Google Scholar

Anonymous user 183: (ok, one different idea to explore.... use 2008 instead. The reason is that there may be better full text coverage in Pubmed Central for 2008, since the NIH requirements would have started hitting. ???? I'm not sure in this domain though since it isn't NIH.... so probably not as relevant.) Your call, 2008 or 2009.

me: ok, some articles I found were in Pubmed and Biomed so I think I might start with 2008 and if that's too many I'll go to 2009

Anonymous user 183: ok! great. Send me more questions if/when you have them, maybe through your OWW pages... I'll go make sure I have them "watched"

me: all right. I'll make sure to do that either OWW or email or gchat

Anonymous user 183: great! looking forward to seeing those citation sentences... I'm curious to know how much they vary!

me: thanks again for all your help on this (since I'm rather new)

Anonymous user 183: oh, one more thing. You probably saw that I added a few comments to the "research questions" talk page

me: oh, yes

Anonymous user 183: if you could streamline the project research questions and plan there a bit more, so that it would be a good summary for people coming to check out the projects, that would be great. feel free to gut what is there and make it your own

me: ok, I wasn't sure how to answer all of the questions so I'll answer what I can and then get rid of the rest

Anonymous user 183: that's ok. You don't have to answer all the questions. you could just say "Pilot data collection is underway to help answer these questions" and then list the subset of the questions that you want to answer soon and if you decide that some of the questions aren't relevant, that's ok. Mostly I just wanted to give you a flavour of what you could be considering. yup, great

me: excellent

Anonymous user 183: cool! Talk more sometime soon then. bye!

me: later!
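
Heather's suggestion in the conversation above, to keep a running tally of how often each data-creation phrase appears so that the tallies can inform the NOT terms, could be sketched as follows. This is a rough illustration only: the `CREATION_PHRASES` list and the `tally_phrases` function are hypothetical, and the sample sentences are the ones quoted on this page rather than a real result set.

```python
from collections import Counter

# Hypothetical phrases that signal data creation rather than reuse.
CREATION_PHRASES = ["we deposited", "were submitted to", "have uploaded"]


def tally_phrases(sentences, phrases=CREATION_PHRASES):
    """Count how many sentences contain each phrase (case-insensitive)."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for phrase in phrases:
            if phrase in lowered:
                counts[phrase] += 1
    return counts


# Example sentences quoted elsewhere on this page.
sentences = [
    "We deposited our data in TreeBASE",
    "The aligned data matrices and resulting phylogenetic trees were "
    "submitted to TreeBASE (www.treebase.org; study accession number SN3797).",
    "We used the results from three Treebase studies [34-37]",
]

for phrase, count in tally_phrases(sentences).most_common():
    print(f'said "{phrase}": {count} times')
```

Because the tally is recomputed from the list of sentences each time, it stays easy to update as new citation sentences are added to the spreadsheet.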


 * }