DataONE:meeting notes:10 June 2010 Chat

From OpenWetWare

10:00 am June 10, 2010 chat between Heather Piwowar and Valerie Enriquez

[10:06] Valerie Enriquez: Hello

[10:06] Hi Valerie!

[10:06] Valerie Enriquez: Sorry about that, I forgot you have to send a request first

[10:06] No, no problem, it worked just fine.

[10:06] Is now still a good time for you?

[10:07] Valerie Enriquez: yes

[10:08] Cool.

[10:08] Then shall we start with a quick intro so we both know why we are here, and can direct the projects accordingly?

[10:08] Valerie Enriquez: sure

[10:09] I got some messages saying you didn't receive my chat, just before. Maybe we have a funky connection.

[10:09] anyway, seems to be working now. Let's reconnect in email as a backup plan.

[10:09] Valerie Enriquez: it said for a second that you were disconnected

[10:09] Valerie Enriquez: ok

[10:10] Do you want to go first on the intros? My intro is nice and related to the project, so it would make a good segue into detailed plans :)

[10:10] Valerie Enriquez: ok

[10:10] What made you want to do this summer program, what skills do you have, and what skills do you want to learn?

[10:10] Valerie Enriquez: For the most part, I've just been adding articles and other websites to the resource list for the literature review.

[10:12] Valerie Enriquez: As far as what skills I have, I have search skills from my reference class and knowledge of medical science databases from my medical librarianship class (small overlap with science databases like ISI Web of Science)

[10:12] great

[10:12] Valerie Enriquez: I'm still rather new at this, but I expect to grow in research capacity

[10:12] Valerie Enriquez: and learn more about available resources to the science community

[10:12] What made you want to apply to this summer program? How does it overlap with things you are interested in?

[10:13] Valerie Enriquez: I'm an Archives major with an interest in digital repositories.

[10:13] Valerie Enriquez: Since the medical and science fields are usually on top of technological developments, I figured this would be a good opportunity to practice the theoretical things I learned in class.

[10:14] great!

[10:14] yes, I think that sounds like it will be a good fit.

[10:14] Valerie Enriquez: thanks!

[10:14] And I have a biomed background in terms of info science, so I think we will speak the same language there

[10:15] Ok, a bit about me and this project

[10:15] I just graduated with my PhD in biomedical informatics from the U of Pittsburgh.

[10:15] Valerie Enriquez: neat

[10:16] My background is electrical engineering, so I don't have a lot of formal information science,

[10:16] but have been At One with PubMed and its associated Entrez databases for the last few years.

[10:16] My research has focused on measuring the extent to which scientists share their raw research data with one another.

[10:17] Interesting, but the REAL good stuff I think is in studying data reuse, not data sharing.

[10:17] The thing is that it is harder to study data reuse (probably not telling you anything you don't know)

[10:17] in large part because data reuse is documented in so many different ways

[10:17] Valerie Enriquez: ah

[10:18] so the projects that Sarah and you will be working on are both interesting in their own right

[10:18] as quantifications of current practice, and also great background info and pilot projects for thinking about how we can systematically track reuse and what the limitations might be of using automated methods

[10:19] Valerie Enriquez: like the web bibliographic tools?

[10:19] For example, if I wanted to see all of the reuse of data from TreeBASE, and I decided to look for citations in reference lists to TreeBASE datasets, how many instances of reuse would I be missing?

[10:19] Valerie Enriquez: ah

[10:20] How often is TreeBASE data used, but cited as an accession number (does TreeBASE have those?) or as an acknowledgement to an author instead?

[10:20] Another way to look at how we can use this data is in imagining IF we were able to standardize how reuse was attributed,

[10:21] for example, if we were able to convince all TreeBASE data reusers to make citations to datasets rather than acknowledging the authors, how many more dataset citations would there be?

[10:21] That sort of data could be useful in motivating policy discussions, etc.
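The "how many instances of reuse would I be missing?" question can be framed as simple set arithmetic once some list of known reuse articles exists. A minimal sketch in Python, assuming article identifiers are plain strings; the function name `strategy_coverage` and the identifiers in the usage note are made up for illustration and do not come from the chat.

```python
def strategy_coverage(found_by_strategy, all_known_reuses):
    """Return (share of known reuses found, sorted list of reuses missed).

    found_by_strategy: article identifiers returned by one search strategy,
        for example a citation search of reference lists.
    all_known_reuses: article identifiers known, by any means, to reuse
        data from the repository.
    """
    known = set(all_known_reuses)
    found = known & set(found_by_strategy)
    missed = sorted(known - found)
    # Guard against an empty list of known reuses.
    share = len(found) / len(known) if known else 0.0
    return share, missed
```

For instance, if four reuse articles are known and a reference-list search finds two of them, `strategy_coverage({"A1", "A3"}, {"A1", "A2", "A3", "A4"})` returns a share of 0.5 plus the two missed identifiers.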

[10:22] (Note, I'm actually very unfamiliar with TreeBASE at this point, so I don't know how reasonable my examples are. Most of my research has been done with the NCBI's GEO database,

[10:22] Valerie Enriquez: I'm not familiar with TreeBASE yet either

[10:22] and so if you sub out TreeBASE with GEO in the above paragraphs, the ideas do make sense, I know that part ;) )

[10:22] Valerie Enriquez: ok

[10:23] GEO isn't very directly related to DataONE interests though, so probably not a database we want to focus on.

[10:23] Valerie Enriquez: would it also hold researchers more accountable if information has been falsified or misinterpreted?

[10:23] you mean if the data citations were more easily trackable?

[10:23] Valerie Enriquez: yes

[10:24] yes, I think so. If it were found that primary data was wrong in some way, it would certainly help alert everyone who might have reused it

[10:24] if we had more standardized ways of identifying all the reuses.

[10:25] It would also help make data sources more transparent, and thereby potentially make it harder to "reuse" data that you've actually modified, I agree.

[10:26] Valerie Enriquez: so it can only help scientific integrity in the long run

[10:26] Valerie Enriquez: (I just remember that news scandal with the climate change emails)

[10:27] yes. Though I think it isn't as directly related to scientific integrity as, for example, doing open science would be.

[10:27] Valerie Enriquez: ah

[10:27] Valerie Enriquez: I was wondering if Open Access was related to this

[10:27] yeah, good question.

[10:27] Open Access usually refers to the publication model of published papers

[10:28] Open Science is similar, but more broad

[10:28] Open Science (well, that is actually an umbrella term for lots of things, so let's say Open Notebook Science) is kind of like doing open access for your science all the way from the beginning to the end

[10:29] Valerie Enriquez: ah

[10:29] Open Science I think would have helped the climate data people :)

[10:29] Open Data is another idea in Open Science.

[10:29] And it can mean lots of things.

[10:30] From making data available when it is initially collected, publicly available, openly on the internet....

[10:30] Valerie Enriquez: I remember an article about the Cochrane Reviews and how the protocols have either been absent at the beginning or changed during the review so that it matched the review's conclusion.

[10:30] To doing more what we are looking at... when you publish a research report, making at least the raw datapoints that support the research conclusions you draw available on the internet

[10:31] yes! right, so that is open trial registration data. And yeah, yikes.

[10:31] Though since they did make their stuff available at the beginning, in theory the transparency is there that people can then see it had been modified.

[10:32] Valerie Enriquez: which made the study of the Cochrane Reviews possible

[10:32] right

[10:32] so mostly we aren't looking at that :)

[10:32] we are looking at data that is usually shared at the time of publication

[10:33] and then trying to quantify all the ways it is cited, and which ways are most common

[10:33] I'll give you a few examples, just to ground it:

[10:33] GEO has accession numbers, so often in the full text of a paper it will say:

[10:34] "We used datasets GDS34556 and GDS98098 from NCBI

[10:34] 's Gene Expression Omnibus for validation of our machine learning algorithms"

[10:34] or instead it might say "We used data from Smith et al [45]"

[10:35] where Smith et al is the paper that described the initial data collection

[10:35] or instead it might say "We used three datasets from TreeBASE [56, 57, 58]"

[10:36] where the references are publication-looking things that are actually citations to datasets in TreeBASE, complete with DOIs

[10:36] or it might say "We downloaded 35 datasets (see Table 3)"

[10:36] Valerie Enriquez: ah, I was wondering where the DOIs (or URIs) came in

[10:36] and in Table 3 they list the paper bibliographic details, but the paper bibliographic details don't appear in the references list per se so they wouldn't be found by ISI citation lookups

[10:37] And finally (well, this isn't the last incarnation, I'm sure, but my final example so that you can get a word in edgewise)

[10:38] there is a really tough case where they might say "We reused all datasets from GEO that were on mice and chemotherapy agents"

[10:39] basically just defining their reuse set by a search term. So, um, good luck to us for finding those reuse links automatically!

[10:39] Valerie Enriquez: wow

[10:39] Anyway, hopefully that gives you a flavour of all the possibilities and why it is a bit tough

[10:39] Valerie Enriquez: I can see why there's a need for standardization.

[10:39] Valerie Enriquez: because there's a recommended citation format for many sites

[10:40] Valerie Enriquez: but it's a matter of if researchers/article writers actually use it

[10:40] yes. And possibly many do!

[10:40] I think for some databases it is often standardized.

[10:40] But for others it isn't.... and I think it will help us to know the landscape.

[10:40] Valerie Enriquez: so that's why we're doing the survey

[10:40] yup.

[10:41] And Sarah is going to do it across all repositories, focusing her search on given journal issues.

[10:41] That's great, but it may not give us an in-depth look into the behaviour with respect to specific repositories.

[10:42] So your task, should you choose to accept it, is to look for all the instances of reuse of data from certain (to be determined) repositories.

[10:42] This will be hard, because the search terms are not well defined

[10:42] Valerie Enriquez: ok, I wanted to be sure as far as which repositories I would look at

[10:42] And sometimes you'll have to look in full text, and sometimes you'll have to use ISI citation databases

[10:42] Valerie Enriquez: all right, I do enjoy a challenge :D

[10:42] etc

[10:42] Good! You've got one :)

[10:43] For what it is worth, I actually have a similar project going on, looking at GEO data reuse

[10:43] so I'm guessing we'll be able to share notes.

[10:43] Valerie Enriquez: neat

[10:43] Though I think your problem is harder than mine :)

[10:43] I think that mostly because I'm familiar with GEO, and I get to use PubMed Central and all the great medical lit resources... I think the full text picture may be more difficult in the DataONE domain.

[10:44] But it'll be a great learning experience for all concerned.

[10:44] Valerie Enriquez: ok, just to be clear, am I looking for individual articles with example citations of the data repositories?

[10:45] So lots of the value from your project will come in scoping it so it isn't too daunting, and documenting all the things you try and all the things you don't try. It is too big a problem to do comprehensively in 6 weeks... but you could sure help us a lot by getting started, figuring out the issues, and getting some data on a subset of the area.

[10:45] Yes.

[10:45] You are looking for articles that cite data housed in certain repositories.

[10:46] So I could imagine you would pick three repositories to study, one at a time.

[10:46] Valerie Enriquez: ok, like Pangaea

[10:46] Yes.

[10:46] Then define 3 or 5 ways you are going to look for citations to Pangaea data

[10:47] Then manually do those searches and read the articles and cut and paste into some big spreadsheet the text that describes how they attribute the data reuse (kind of like the fake examples I gave above)

[10:47] then classify those varieties by a few attributes (like do they have a doi, do they use an accession number, are they a link to the primary paper) etc.
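The attributes listed here (a DOI, an accession number, a numbered reference to a paper) could be flagged automatically as a first pass before the manual read. A minimal sketch in Python, assuming GEO-style accession numbers like the GDS examples above, a generic DOI shape, and bracketed numeric references; the pattern names and the function `classify_reuse_sentence` are hypothetical, and other repositories would likely need their own patterns.

```python
import re

# Assumed identifier shapes, for illustration only.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")
GEO_ACCESSION_PATTERN = re.compile(r"\bG(?:DS|SE)\d+\b")
REFERENCE_PATTERN = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")


def classify_reuse_sentence(sentence):
    """Flag which attribution styles appear in one extracted reuse sentence."""
    return {
        "has_doi": bool(DOI_PATTERN.search(sentence)),
        "has_accession": bool(GEO_ACCESSION_PATTERN.search(sentence)),
        "has_reference": bool(REFERENCE_PATTERN.search(sentence)),
    }
```

Applied to the made-up sentence "We used three datasets from TreeBASE [56, 57, 58]", only `has_reference` is set. A bracketed reference cannot tell a dataset citation apart from a citation to the primary paper, so that distinction still needs a manual look at the reference list.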

[10:48] Valerie Enriquez: like comparing the DOI citation with something more vague like "volcano activity in Iceland"

[10:48] Valerie Enriquez: ah

[10:48] hmmm, not sure about your volcano example. Can you explain a bit more?

[10:48] Valerie Enriquez: I just randomly thought of it, I don't think that's a real dataset

[10:49] Valerie Enriquez: (just because the most recent geological thing in the news other than the BP disaster was that Iceland volcanic eruption)

[10:49] oh I see. yes, I think so.

[10:49] but to make sure I understand, can you ask your question again?

[10:50] Valerie Enriquez: your previous example was of "all studies of mice and chemotherapy agents" vs. a number starting with GDS####

[10:50] right

[10:51] Valerie Enriquez: (I'll have to dig a bit more in Pangaea before I find a better example)

[10:51] no I think I understand now

[10:51] so one possible outcome for your research might be to say

[10:52] that 34% of the papers that reuse data from Pangaea reference it with DOIs, whereas 89% of reuses from TreeBASE use DOIs; whereas

[10:52] 88% of Pangaea reuses also mention search terms, but only 3% of reuses in TreeBASE are defined in terms of search terms

[10:52] or something like that

[10:52] is that what you meant?

[10:53] Valerie Enriquez: yes

[10:53] cool.
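Percentages of this kind fall out of a simple tally once each reuse instance has been classified. A minimal sketch in Python, assuming each spreadsheet row is a dict with a `repository` key and boolean attribute keys such as `has_doi`; these column names and the function `attribution_rates` are hypothetical.

```python
from collections import defaultdict


def attribution_rates(rows, attribute):
    """Return {repository: percent of reuse instances where attribute is true}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["repository"]] += 1
        if row[attribute]:
            hits[row["repository"]] += 1
    # Every repository in totals has at least one row, so no division by zero.
    return {repo: 100.0 * hits[repo] / totals[repo] for repo in totals}
```

Calling `attribution_rates(rows, "has_doi")` gives one percentage per repository, which is the shape of the comparison described above. The 34% and 89% figures in the chat are invented for illustration and are not results.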

[10:53] So I've dumped lots of info on you. Ask me some more questions....

[10:53] Valerie Enriquez: will I also be looking for citations by author (where there might not always be a DOI)?

[10:54] yes. Well, that is a kind of reuse attribution. Frankly, that is going to be a pretty hard one for you to look for, though.

[10:54] Valerie Enriquez: ok

[10:54] So instead, I think an approach for your project might be to say "there are also these sorts of attributions. We aren't looking for them."

[10:55] Valerie Enriquez: since I only have a few weeks

[10:55] "For an estimate on how prevalent they may be, please refer to Sarah's work" :)

[10:55] Valerie Enriquez: neat

[10:55] Right. And I just don't know how you would systematically look for all of those citations and figure out if they were in the context of reuse, etc.

[10:55] Let's brainstorm what your 3-5 search strategies might be

[10:56] Then you can do a quick look and see if they are reasonable ones for the databases you are considering

[10:56] Valerie Enriquez: ok

[10:56] (Asking Sarah's opinion would be good too since I think she's started to read a few of the articles and may have a sense)

[10:57] Hrm, actually shall we start with which databases first? Because I think that informs the search strategies.

[10:57] Valerie Enriquez: yeah

[10:57] Specifically, I understand that Pangaea has DOIs that are tracked by ISI Web of Science citation tracking.

[10:57] Which is pretty awesome.

[10:57] Valerie Enriquez: You had mentioned Dryad, GenBank, and NCBI's Gene Expression Omnibus

[10:57] Valerie Enriquez: ok

[10:57] Valerie Enriquez: so definitely Pangaea

[10:58] So certainly including Pangaea would be a good idea,

[10:58] Yup. So while we are very interested in Dryad, it is really young

[10:58] and so there won't be many reuses of its data yet.

[10:58] Valerie Enriquez: and I think Nic might be looking into it as well

[10:58] so I think it doesn't make the short list.

[10:58] yup.

[10:58] Valerie Enriquez: ok

[10:58] Valerie Enriquez: TreeBASE?

[10:59] yes, TreeBASE

[10:59] Not GEO, because its datatype is more biomed than evolution.

[10:59] Valerie Enriquez: ok

[10:59] Let me clarify something that I think I may have confused....

[10:59] I envision you having (at least) two spreadsheets.

[11:00] The one that you already started tries to look at various databases/repositories and how they suggest that people cite their data

[11:00] (overlaps with Nic's work)

[11:01] Here I think it is valuable to include Dryad, GEO, and GenBank because they have data citation recommendations that are familiar to some of us.

[11:01] Mostly I think having this spreadsheet is valuable because it will inform ways that we look for the data.

[11:01] Valerie Enriquez: ok

[11:02] If the databases say "please cite the datasets in your citation lists like this" then we should definitely try to look for them like that, as one of our search techniques

[11:02] Valerie Enriquez: ok

[11:02] The main spreadsheet that you will be generating will hold all of your extracted data about the reuse instances

[11:02] Valerie Enriquez: what should the target number of articles be?

[11:02] Valerie Enriquez: sample-size

[11:02] and will only include references to the three databases you focus on

[11:02] does that make sense?

[11:03] Valerie Enriquez: ah

[11:03] yeah, good question.

[11:03] I don't know.

[11:03] Valerie Enriquez: ok, I guess it will depend on what I find

[11:03] Yes. And how hard it is to search full-text in these journals.

[11:03] So biomedicine has PubMed Central and Highwire Press for searching full text across many journals at the same time

[11:04] (in addition to Google Scholar and Scirus)

[11:04] Valerie Enriquez: earth science and bio, not so much?

[11:04] Valerie Enriquez: er eco

[11:04] I don't know if similar tools exist in earth science. ??? it would help me a lot if you figured that out :)

[11:04] Valerie Enriquez: ok

[11:04] Valerie Enriquez: that will be one question I consider

[11:04] Valerie Enriquez: although I do recall Sarah working across disciplines as well

[11:05] yes.

[11:05] So just to make it clear, one place to start could be to say.... I'm going to look at reuse of TreeBASE data in the journal Sys Bio.

[11:05] and I'm going to look for reuses in these three ways.

[11:06] 1. ISI Web of Science citations to the word TreeBASE (or to TreeBASE DOIs if that is possible???)

[11:06] 2. Mentions of the word TreeBASE in the full text of articles in that journal

[11:06] and

[11:06] 3. Hrmmmm not quite sure :)

[11:07] Valerie Enriquez: hm.

[11:07] Valerie Enriquez: so I'm looking at TreeBASE use in journals and databases

[11:07] Valerie Enriquez: (and I need to pick 3-5 journals/databases)

[11:08] Yes. Well, what you really want is TreeBASE use in articles, but to do that I think you will need to search ISI citation database and journal sites / Google Scholar / something

[11:08] Valerie Enriquez: ah

[11:08] And then for all the hits you get from those searches, manually read the papers and extract the "reuse sentences"

[11:08] and keep track of these "reuse sentences and citations" in your mongo spreadsheet
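One way to keep the extracted "reuse sentences" consistent is to fix the spreadsheet columns up front and record which search produced each hit. A minimal sketch in Python using the standard `csv` module; the column names in `COLUMNS` and both helper functions are assumptions for illustration, not a layout agreed in the chat.

```python
import csv
import io

# Hypothetical column layout for the reuse spreadsheet.
COLUMNS = ["repository", "journal", "article", "search_strategy", "reuse_sentence"]


def write_reuse_rows(rows, handle):
    """Write a header row plus one CSV row per extracted reuse instance."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_reuse_rows(handle):
    """Read the rows back as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))
```

Writing to an `io.StringIO` buffer, or to a file opened with `newline=""`, and reading the rows back should round-trip sentences that contain commas or quotation marks, because the `csv` module quotes such fields. Keeping a `search_strategy` column makes it possible to compare later which searches actually found the reuses.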

[11:08] Valerie Enriquez: oh

[11:09] Valerie Enriquez: so not just when the dataset is cited, but when the information is cited in papers following the article the data was cited in?

[11:10] hmmm. I'm not sure.

[11:10] ask again?

[11:10] I'm not 100% sure I understand

[11:10] Valerie Enriquez: ok, article A cites dataset from TreeBASE

[11:10] Valerie Enriquez: article B cites article A but not dataset from TreeBASE

[11:10] oh! No

[11:10] Valerie Enriquez: oh good

[11:10] Valerie Enriquez: because that would be super-hard

[11:10] No, we don't care about article B

[11:10] luckily :)

[11:11] But we do care about article A

[11:11] and we care about the phrasing it uses to specify that it has reused data from TreeBASE

[11:11] does that make sense?

[11:11] Valerie Enriquez: now it does

[11:11] And the trick is just to find all of the article As

[11:12] because really, the outcome of your research will be how people should look for article As. :)

[11:12] Valerie Enriquez: ah

[11:12] And for what it is worth, the 3 databases may be too optimistic.

[11:12] Maybe start with one.

[11:12] TreeBASE or Pangaea.

[11:13] Valerie Enriquez: ok

[11:13] And start with one journal that seems to reuse data from that database often

[11:13] And for which it is easy to search the journal's full text.

[11:13] Valerie Enriquez: ah

[11:13] Valerie Enriquez: (and one I have access to via my school library :D)

[11:14] yes, and that.

[11:14] you aren't in the UC system, are you? If so, and we are going to look at Nature journals, we'd better do it fast ;)

[11:14] Valerie Enriquez: no, I'm at Simmons

[11:15] Valerie Enriquez: I have access to ISI Web of Science and SciFinder

[11:15] great!

[11:15] SciFinder, I don't think I've ever used that.

[11:15] Have you used Scirus? It is nice.

[11:15] And do look at Highwire Press to see if any of the journals are covered there... they have a nice search interface.

[11:16] Valerie Enriquez: ok, cool

[11:16] Other questions?

[11:16] Valerie Enriquez: I think I found Nature on my library electronic journals

[11:17] Yeah. I didn't mean to throw you off there. Mostly I think that TreeBASE or Pangaea in SysBio would be a good start

[11:17] Valerie Enriquez: ok

[11:17] Valerie Enriquez: maybe I'll start with TreeBASE

[11:17] Valerie Enriquez: and look in ISI, Nature and Scirus to start

[11:17] Have a look there and see if you can find some reuses, using any techniques you want. Start a spreadsheet to record things, and

[11:18] maybe tomorrow we can touch base again to see if it is still making sense

[11:18] and if you have a sense of good ways to find the reuses.

[11:18] Valerie Enriquez: ok

[11:18] also talk to Sarah

[11:18] Valerie Enriquez: I'll poke around today and send Sarah a message.

[11:18] and zoom me questions at any point

[11:19] Valerie Enriquez: ok

[11:19] it is a bit of a fuzzy project, but it will be really helpful, so just do what you can to figure out boundaries that make it manageable

[11:19] Valerie Enriquez: all right, I can do that

[11:19] great! good luck!

[11:20] Valerie Enriquez: Thanks for clarifying what I need to do, you've helped a lot!

[11:20] Valerie Enriquez: thanks

[11:20] talk again soon......

[11:20] Valerie Enriquez: goodbye

[11:21] bye