DataONE:Notebook/Summer 2010/2010/07/13

From OpenWetWare

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.



Group chat agenda

Knoxville wrap-up

  • OWW page for Knoxville... agenda + slides + links to notebook summaries?
  • slides on plone
  • reimbursement
  • other


  • blurbs on spreadsheet README
  • ASIS&T poster?
  • Other deadlines, thoughts?


  • Google Fusion recap
  • Data validation status
  • R and stats status


  • spreadsheet walkthrough
  • article thoughts
  • DAAC followups?


  • data collection status
  • R and stats status

Group Meeting Transcript July 13, 2010

12:59 PM Heather: You've been invited to this chat room! 1:00 PM Sarah has joined

Heather: Hi all!

me: hello

Sarah: good morning!

Heather: I just posted a last-minute agenda here: Hi

Heather: Additions? 1:01 PM looks good

me: not really, those points look like they address my questions 1:02 PM Heather: ok... let's start.


Realized that although some people posted notes about the meeting, we don't have a face-to-face page on OWW

would probably be good, post agenda, link to discussions, slides, etc.

volunteer to make one? 1:03 PM me: I have some notes and could make a page about the meeting on OWW that others could add to

Heather: thanks valerie, perfect.

maybe link from the notebook date, too?

have you guys put your presentations up on plone yet?

me: you mean make a notebook with date entries with each of the notes?

no, I don't believe so 1:04 PM (the main DataONE site?)

Heather: I haven't, I need to do that. Can each of you do that too if you haven't?

yup, the main DataONE site... we are to put all of our "final" products there, and I think our presentations count as interim final products :) ok

me: ok, cool. I can do that 1:05 PM Sarah: i have my notes on my oww mean just link it to the main agenda?

Heather: Valerie, sorry, I meant go to and click on the first date of the face-to-face meeting and add a pointer from there

Sarah: i'm putting my ppt up now

me: oh, ok

Heather: Sarah, yup. And/or link from the f2f page that Valerie will make to your notes pages 1:06 PM My goal: A "summer 2010" page that summarizes, briefly, our face 2 face meeting

so links from there to sarah's notes, the agenda, our slides, etc.

does that make sense?

kind of like our "correspondence" chat transcripts, but fleshed out with a few more links..... 1:07 PM still confusing?

not confusing?



1:08 PM Sarah: i'm clear I think it's clear 1:09 PM Heather: Valerie, you good? if not, ping me offline....


me: I'm good

Heather: any other questions about the Knoxville wrap-up? reimbursement or anything?

(Not that I know much about that, but you can ask....) not here 1:10 PM Heather: ok.

you probably saw the attempts to provide more context to our OWW site

disclaimers, clarifications, etc.

me: yes

Heather: any suggestions? 1:11 PM if so, now or later, let us know.

or, for that matter, just make the changes yourself!

it is a wiki after all :) I thought it was comprehensive... with respect to the "readme" disclaimer... should we be doing that with each ss?

me: should we link to both our project pages and the main Summer 2010 page?

Heather: Valerie, from your SS you mean? 1:12 PM Nic, I think yes. ok

Heather: Valerie, from your spreadsheet README blurbs you can just link to one OWW page, I think, whichever one you would want to get to first if you were the person following the link....

me: yes 1:13 PM ok

Sarah: i just added them on the first sheet of each ss

Heather: great, thanks Sarah.

ok, any other thoughts on that stuff?

if so, chime in. 1:14 PM if not, just want to check in on the ASIS&T deadline and any other deadlines coming up....

Nic, you thinking a poster? Anyone else thinking a poster? or not? 1:15 PM Sarah: i don't think i would be able to make it out to the meetings, and i don't know if it is the best place for my research anyways

Heather: sarah, that works, agreed.

me: I don't think I can make the meetings either, and I'm not sure if my data is best presented as a poster/video/demonstration. 1:16 PM Heather: ok, good valerie. there are six tracks for this year's conf

Heather: Nic, what do you think? were you planning to go? I think mine fits well into the Track 6 – Information in Context: Economic, Social, and Policy Perspectives

and I'm already planning on going

Heather: yup... sounds like your stuff, eh? 1:17 PM so I'm hoping to get a draft done late tonight

Heather: ok, great. Circulate in email tomorrow so that mentors have a chance to weigh in? sure

Heather: and make sure the funding sentences are good and whatever else we need to make sure we get right for a "formal" DataONE release.

I mean... the mentors can then make sure.... 1:18 PM I was going to search for a presentation to see what others had done and then try to model that

Heather: great. yeah, don't stress over it. there might be something on the plone site?


Heather: just wanted to reiterate that having mentors see it before submission is a good idea, so you need to give them two or three days ideally 1:19 PM not sure, probably

feel free to bounce it off of me whenever, if you want faster feedback. otherwise, I look forward to seeing it when you send it out. 1:20 PM great

Heather: want to give us a quick summary of what you think of Google Fusion?

do you recommend it?

if so, for what? sure. it's really nice for making annotations


Heather: if not, ??? but it's very hard to make edits to individual cells 1:21 PM I think it would be nice to use in a situation where a group is trying to hash out the fields they need to gather

Sarah: I like it better than regular docs, but am having some bugs

me: ah, I noticed it only liked uploading one sheet at a time (as opposed to whole workbooks) it's nice that way

me: (unless I'm doing it wrong)

Heather: I noticed that too Valerie

Sarah: agreed, i like the commenting feature but it might not be as useful to us at this point

Heather: hrm... what does that mean about our README idea, btw? maybe I'm missing something, but I couldn't figure out a way to merge my discussions with new sheets

Sarah: yes, but you can put the readme with the description 1:22 PM Heather: true good idea sarah

Sarah: also, can you save figures (visualizations) that you like for others to see?

Heather: sarah, what kind of bugs?

Sarah: oh, just controlling the data

Heather: nic, in what ways is it difficult to make edits? I like that you can set up alerts for comments 1:23 PM well when I was trying to change cells in a column it didn't allow me to use keyboard nav

Heather: I don't know, Sarah, about saving vizes

arg that is a pain 1:24 PM so I was clicking between each cell and then moving the cursor with the mouse... small complaint, but if you're editing a big set it can get time consuming

Heather: yeah for sure

so what do you guys think?

are you going to keep experimenting? 1:25 PM or decide to skip it at this point if it doesn't solve any major problems you were having?

or ?

(btw feel free to type while I'm typing and interrupt me, I don't mind a bit.....) 1:26 PM I think I'll use it to share my sheets but not create them (meaning once I get them edited I'll begin uploading there instead of googledocs)

it would be a lot easier for a mentor to give feedback that way

Heather: yup. and you want to share them that way because then other people can easily comment by using the comments?

gotcha 1:27 PM so here is a different idea....

I think Google Spreadsheets supports RSS feeds. 1:28 PM You could have a "comments" column or two where you explicitly ask people to give comments, then monitor via RSS

Downside: the comments aren't cell-specific that could work

Heather: Upside: you could set it up such that people could actually edit the cells, which I think is a better approach for gathering input from the community

Sarah: does fusion have rss


Heather: less approval based 1:29 PM hrm, probably? i know in the discussion you can check an "alert me"

not rss for the entire sheet though I don't think 1:30 PM Heather: I guess I'm thinking that if it is a pain to enter data there, that is a pretty big knock against it unless there are strong advantages to using it.

(even just entering data as a modification activity, after doing most of the work creating the spreadsheet) 1:31 PM but I don't have strong opinions, just wanted to suggest alternatives

the ability to merge tables via Fusion does look pretty cool.... shrug, I dunno. If I can't get comfortable with it by the end of the day I'll probably take the "comments" approach back to google spreadsheets 1:32 PM Heather: ok. and Sarah or Valerie if you keep using it that is fine too... I think collecting info about what works and what doesn't for what usecases is valuable 1:33 PM me: ok, sure

Heather: ok. Nic, how are your tables going?

You are commenting now? 1:34 PM Let us know when you are at the point where you want people to dive in and help curate the ambiguous data? good, yesterday and today I spent time defining what columns I had tried to collect and then figuring out what I was missing

Heather: ok one sec I'm trying to get the links for my tables

Heather: I think that in parallel with this it will help to be thinking about stats 1:35 PM the reason I say that is that thinking about stats is often a fast and real way to figure out what data you really need, in what format, etc

so I'd say don't try to get your spreadsheets perfect and then think about stats ok

Heather: because it never works that way :) 1:36 PM so in thinking about stats... I started to play with the R commands that you gave me... but I think I need to spend more time validating so I know what is valuable to look for 1:37 PM Heather: ok, so you've got R installed and running and you can see plots etc? i could perform most of the commands 1:38 PM Heather: great. ok, then I suggest that maybe we have a dedicated chat to work through the next phase of R things I'm not real familiar with it but I'm anxious to play around

Heather: it might be lengthy, so don't necessarily want to do it here now ok

Heather: valerie and sarah you are welcome to participate, but not sure how valuable? your call

me: I haven't really poked around much with R yet. 1:39 PM (sorry)

Heather: That's ok

Valerie, I think maybe hold off on R for now because you have lots of cool article prep to do

To the extent we do R things, we can do them customized to your data later

me: ok, cool

Heather: Sarah, I don't think we are going to do anything you don't know 1:40 PM Nic, when do you want to have an R talk?

Sarah: yeah, i'm good in terms of r Maybe tomorrow ?

Heather: ok.

maybe 10am Pacific? sure 1:41 PM Heather: great

see you on google chat then. ok

Heather: anything else you want us to go over here now? I don't think so

Heather: ok. 1:42 PM Maybe we just skip to Sarah quickly.

Sarah, how's data collection going? Stats? Anything you want to cover here now? (I want to make sure we spend lots of time on Valerie's spreadsheets ;) ) 1:43 PM Sarah: sorry, i was on another window trying to get my ppt on the plone

(which isn't working, i get an error...has anyone else tried?)

anyways, 1:44 PM i'm finishing up data collection and anticipate being done by tomorrow

Heather: haven't tried. Maybe as a PDF?


Sarah: i'm shooting for at least 15 articles per journal per year... is that adequate for stats?

that's for the 2000/2010 comparison


and not all of them have a reuse 1:45 PM Heather: Hrm, it is low, 25 is probably better, but who knows. stats is a bit of an art when you don't know the magnitude of the effect you are expecting.

Sarah: my problem right now is that many of the journals barely have 50 articles per year, so i would be sampling a greater proportion from those journals 1:46 PM Heather: I don't think that is a big problem actually

It means that those journal-years would be weighted more heavily, indirectly

but in multivariate analysis that would mostly be taken care of

better any remaining bias from that, I think, than not enough samples 1:47 PM Sarah: ok, well then, push my data collection projection back a bit

also, the 2000 snapshot doesn't seem to be that informative

very few have any reuse and sharing instances 1:48 PM so that cuts the sample size from 15 to 1 or 2

Heather: not surprising, but informative nonetheless

Sarah: so, should i proceed in 2000?

Heather: yeah. that's ok though.

hrm, I think it depends on your time constraints. I don't have a good enough sense of how long it takes you per article

so what the cost is of you doing the extra 10 per year 1:49 PM me: Is that something that can be noted in the discussion or methods, an explanation of the sample size discrepancies?

Sarah: um ... 2hours for 10 articles

Heather: It makes it an easier story to tell if you have 25 for all years

Sarah: that's conservative, but ends up being realistic when dealing with difficult articles

Heather: Valerie, yes for sure. Methods if it is a reason for not doing something (ok, or discussion, fields vary)

and discussion for how the sample size may limit the generalizability of results 1:50 PM yup, I believe it.

That is still pretty fast, given all you are extracting.

So if you were to go for 25 across all journal-years, when would you push your data collection done date back to?

Sarah: friday probably

considering download times and such


Heather: Yeah, I 1:51 PM I'd say go for it and get a 25 across the board picture 1:52 PM That's my opinion. You are closer to it, so push back when/if your gut disagrees.

Sarah: so....25 in 2010, 25 in 2000, and then 25 per year in my two time series


Heather: definitely 25 in 2010, 25 in 2000

Sarah: like i told you before, I think the time series, even though not relevant for trends, is the most statistically usable dataset 1:53 PM Heather: so yes, I'd say 25 in the time series too, ideally

that said, since that dataset has more... um... what is the word that I'm looking for

Sarah: robust


better sampling of actual reuses? 1:54 PM Heather: the years are more similar to each other, so more overlap. yeah, robust... or hrm....

Sarah: i.e. though my sample size is 25, not all of those can be assessed for reuse/sharing practices

Heather: not duplication, but similar datapoints


Sarah: yeah, i get it even though we're lacking the word

Heather: that dataset has more similar datapoints, so having 15 per year (or whatever) but lots of years 1:55 PM wouldn't be as bad as 15 per year with 10 years between them

know what I mean?

Sarah: yeah

Heather: so if you have to cut back on collecting extra

you could probably manage without beefing up the time series 1:56 PM I hear you that not all can be assessed for sharing/reuse

Ideally, sure, we'd have 25 sharing and 25 reuse or something (or a sample size large enough to achieve that)

but that is clearly outside the possibility of this summer project

so we'll make do with what we have

and add a wish list in the discussion section 1:57 PM Sarah: ok, and just use the word "preliminary" a lot in the writeup

Heather: yeah exactly

Sarah: got it.

i'm good for now then if you want to cover valerie's stuff

Heather: great.


where do you want to start? 1:58 PM (maybe she's editing the OWW


me: probably the search spreadsheet

(sorry) ah

me: yeah, I was posting some links there 1:59 PM Heather: great. what do you think? want to start by giving us a tour of your spreadsheets? or talking about DAAC data, or ?

me: I think the overall search might be the data I work with the most.

ok, tour of the spreadsheets sounds like a good starter 2:00 PM I can re-link if needed

Heather: want to post links, or point us to the page that summarizes links, or?

me: ok yes

Heather: ohhhh I just thought of it, sarah. The word I was looking for was redundant. 2:01 PM anyway. valerie, carry on.

me: My first raw data spreadsheet


Heather: I love the red warning block.

me: My search comparison spreadsheet:


I figured it was eye-grabbing (if not eye-gouging) 2:02 PM Heather: It definitely makes me want to think twice before basing my next research grant on your current results :)

me: ha

The spreadsheet I made based on Sarah's Shared Fields template 2:03 PM I should probably start chronologically

Heather: great. so those three are your main ones?

me: yes

Heather: (sorry, btw, that I didn't keep better tabs on this. Portland conference and all that, but no excuse.)


me: it's ok, a lot of the information overlaps/is still pretty raw 2:04 PM I more or less wanted to go through to see if there was anything else I needed to capture

Heather: ok, so where would you start explaining these to someone?

me: well, the first link, the data_citations spreadsheet was when I was running random searches for the databases 2:05 PM Heather: yup

me: I captured information only for articles that had cited data, although a lot of times, I found it was cases of deposit

so I created the "phase II" or "edit" sections

Heather: so I'm a bit confused... ok, each row is a "hit" is that right?

me: I went from a generalized search to a specific search


yes 2:06 PM actually no

in phase 1, some of them were misses after I copied the sentences

and they were revealed to be data deposit and not data reuse

Heather: where is this phase II/edit section?

ah hah!

me: on the TreeBASE and Pangaea pages, they're down

Heather: I was on the DAAC tab and didn't see it

me: (I mean, if you scroll down) 2:07 PM yeah, I hadn't done that with the DAAC tab because those were from the spreadsheet Bob sent

Heather: gotcha! 2:08 PM ok, so if you found the same reference via multiple searches, it shows up on multiple rows?

me: I think there was one case where that happened, and I made a note of it in the same row

I probably should have made separate rows to avoid confusion


but I'm pretty sure that only happened once 2:09 PM Heather: only one case? I'm surprised

me: unless I have duplicates

which may be possible


Heather: I would have thought that doing similar searches in ISI or google scholar or whatever would have produced overlapping results

me: ah, this is something I should be noticing actively and taking note of then?

Heather: no, not necessarily 2:10 PM me: after awhile, the results sort of blur together

Heather: I'm just trying to make sure I understand 100%

me: so it is likely that there are overlaps

Heather: yeah, I hear you

me: I'm sorry I got confused

Heather: also, I think at this point you were doing very exploratory searches, right?

are the results from all of your "formal" 27 searches in here? 2:11 PM me: yes

Heather: ok

me: and the summary of those searches is the search_comparisons spreadsheet

there's actually 38 there

but I used multiple examples

Heather: gotcha

me: (for particular author names/datasets/etc.)

Heather: I'm not at all suggesting you go back and get this now.... 2:12 PM but if/when doing things like this again, another piece of information that would be helpful is the sentence that makes the citation

so not the citation itself (though I'm glad you have that too!)

but the sentence that makes the citation

to understand the context and words it uses to talk about its reuse

me: ok. I thought I had put that in column P 2:13 PM Heather: maybe you did, hold on

hmmm, so in row 53 of the treebase tab,


column P looks like this, right?

HIGDON JW Phylogeny and divergence of the pinnipeds (Carnivora : Mammalia) assessed using a multigene dataset BMC EVOLUTIONARY BIOLOGY 7 : ARTN 216 2007

me: oh

Heather: was that in the article bibliography? 2:14 PM me: yeah, I might have had that as a placeholder

Heather: ok, I can see that in some other rows above you do have the sentence

me: (Since I didn't have fulltext on some of them)

Heather: gotcha 2:15 PM hmmm, in that case it might just make sense to replace the placeholder with "unknown, no access to full text" or something like that?

me: ok, that makes more sense

Heather: I do see all the other sentences though in other rows. that's great

I'll turn it around now then and ask for the other thing too.... 2:16 PM me: oh wait, are you looking at the RAW or EDIT page for TreeBASE?

Heather: so in addition to the sentence, which it looks like you do mostly have....

me: I may have found the fulltext for all/most of the articles

Heather: it would be useful to also have the reference, to see if it did say anything at all about the data within the bibliometric citation

I was looking at TreeBASE RAW. Should I have been looking at EDIT? 2:17 PM me: oh yeah, sorry

Heather: no problem

me: I think I only went through and found the fulltext for the edited sheet

Heather: ah yes, column P is much more complete! awesome

me: I was worried for a second because I'm pretty sure I was able to get most of these through UNM 2:18 PM Heather: ok nice

me: all right, but yeah, it was a lot more useful to see the citation in context

like one example where it just mentions treebase in row 19

"The TreeBASE interface supports six query types: author, citation, study accession number, matrix accession number, taxon and structure. "

it's more of a mention than a citation 2:19 PM Heather: yeah. and not really data in or out, eh?

so I'm looking at the DAAC tab in that spreadsheet and I'm a bit confused.

me: yeah, I wasn't sure what to put for that

Heather: what is in it? 2:20 PM me: I put in 1s where it should probably be 0

since it didn't actually use data


Heather: right, yup, that might be best


so on the DAAC tab, it has 17 data rows. 2:21 PM you didn't pull all of the data from Bob's spreadsheet for this sheet, but you did pull some?

me: I think I pulled some

to construct the searches for this spreadsheet

Heather: ok, gotcha

me: like the cited author search

or the doi search 2:22 PM Heather: ok. I think I've got a handle on that spreadsheet now.

on to your searching ss?

me: should I have gone through all of the ones on Bob's spreadsheet?

ok, yes

as I mentioned, even though I came up with 27 types of searches, this sheet has 38 rows because I tried to use multiple examples of author name/doi 2:23 PM Heather: ok, makes sense

now your dropbox sheet?

me: ok

Heather: (Nic and Sarah feel free to jump in whenever if you have comments!) 2:24 PM me: I had used Sarah's formulas on the ISIraw_PasteFullRecordHere page

Heather: I'm a fan of your DAAC tab, I know that so far

me: and copied and pasted the ISI downloads into it


the one thing I ran into was for my non-ISI searches, I ended up just entering the DOI on the Reuse pages as opposed to trying to fill out the ISI full record page 2:25 PM Heather: ok, so is the "Article" tab part of what makes the formulas work? so it contains transient info?

me: I think so. I started filling it out, but as I got into articles not from ISI, I did most of my filling out of information on the Reuse pages 2:26 PM Heather: ok

so the DAAC tab contains one row for every row in Bob's spreadsheets, right?

me: I tried to copy/paste them all onto each page and place 0s where they didn't apply


plus the other ORNL DAAC articles I had found through other searches

Heather: what about the other reuse sheets? 2:27 PM they contain the reuse articles you found for that repository type?

me: now that I think about it, I should have made a way to integrate the search spreadsheet with this spreadsheet so I would know which articles came from what searches


Heather: do they also contain reuse articles that sarah found (I'm guessing not, just want to make sure)

me: the 0s are placeholders because I tried to list all of the articles on each page 2:28 PM I think I cleared Sarah's data before putting mine in just in case of overlap

Heather: ok

me: although she had sent me the template early in her data collection

Heather: so where there is orange zero blocks, that is because that article reused data from a repository other than the one the tab is named for, is that right?

me: yes

either the orange blocks or a 0

Heather: ok 2:29 PM me: although when I look through each page, I don't think that the rows all have the same numbers, so I may have accidentally pasted over some things

Heather: ok. and the share tabs don't really have any useful content at this point, right? 2:30 PM me: I hadn't looked into the sharing, since I was mostly looking for reuse only.

Heather: right.

just making sure I understand


and that there isn't secret text hiding behind the orange background or something


nice Tennessee orange btw

me: there might be. I'm not quite familiar with all of the formulas

ha, I didn't realize that

Heather: nah, I was joking.

Sarah: it's just conditional formatting 2:31 PM Heather: gotcha

me: ah

Sarah: i did so i could see empty cells that needed my attention

Heather: good idea

Sarah: so in theory, if the record is complete, nothing should be orange

though in mine, i haven't taken the time to enter all the "no" data yet 2:32 PM Heather: ok, valerie is there anything else you want to point to in this data deluge?

me: well, I was just wondering if I should try to combine either this sheet or my raw data sheets with the search comparison sheet

would it be clearer to explain? 2:33 PM Heather: hmmm, I'd wait

I'd figure out what your top goal is and be driven by that at this point

me: ok

Heather: You've done lots of bottom-up, which is super. Now time to flip around I think....... 2:34 PM me: is this where I go through the spreadsheets to do reverse searches?

Heather: Well, I'm not sure.

So I'm starting to get an idea in my mind about what the backbone of your article could be based on

Do you have ideas about that? 2:35 PM Thinking if we start there, it will inform what other searches to do, data to gather, spreadsheets to consolidate, etc

me: well, I'm figuring out that each search function is built to accommodate different methods

Heather: So maybe let's talk ideas

me: good plan. I wasn't sure if I was coming to the right conclusions 2:36 PM I had been making observations in OWW but nothing particularly in depth

Heather: Yup.

Want me to relay my DAAC idea, and you can see if it rings true for you, or inspires other things, or ?

me: sure


Heather: (or maybe you read it already and just want to cut straight to the commenting?) 2:37 PM here, I'll try saying it one more time has left has left

me: the searching by DOI idea?

Heather: because sometimes hearing things multiple ways/times can help make them clear.

Yeah. or I guess more generally, using the DAAC experience

as the backbone of the article 2:38 PM You paint a picture of a repository that wants to know how its data is reused 2:39 PM so that they can learn, give feedback to funders, provide links to data creators, etc

They ask a librarian to look for these reuses once a year

(hrm in case it isn't clear, you paint a picture of the DAAC repository itself, not a hypothetical one)

me: ok 2:40 PM Heather: Ask the DAAC librarian who does it how long it takes her

how she does it, etc

Report that in the article

I'm guessing it takes a while and is frustrating

me: and compare that with my own experience?

Heather: if her experience is anything like yours!

me: she's probably a million times better at it

Heather: I wasn't thinking compare, as much as complement.

me: ah, ok 2:41 PM Heather: Focus on the DAAC experience as a case study for the first half of the article, maybe

me: and then mention the others?

Heather: and towards the end of the DAAC part, you could say "and furthermore they offer DOIs!"

ok without the !

me: ah

Heather: and talk about how DOIs are supposed to make tracking article reuse easier

but darn it, no one is using them 2:42 PM me: ooh, that's good

Heather: and you can quantify the "no one"

by finishing up the great analysis you were doing on the DAAC tab of your dropbox spreadsheet

that will allow you to say something like "only 14% of all the reuses found by the DAAC librarians actually used DOIs"

or something like that 2:43 PM me: I wasn't really doing much analysis, I was just plugging away based on the awesome model Sarah set up.


Heather: right, but I think you were capturing the references, is that right?

or if you weren't, you could?

me: yeah

Heather: to see the patterns of citation?

me: the sentence and location 2:44 PM and the fact that almost none of them have it in the references section

Heather: the benefit of the DAAC dataset is that it is a librarian-derived set of repository based reuses

and so it provides a great baseline... something your study has been missing until now 2:45 PM you could say, because of this diversity in data citations, my initial attempts to find DAAC reuses met with little success.

only 23% mention the DAAC url, etc 2:46 PM then follow on to the DAAC first half

me: ah, ok

others just mention the data authors, etc.

Heather: by saying "most repositories don't have DOIs and so finding their reuses is even harder"


so the points of the article would be something like 2:47 PM a) finding instances of data reuse is hard (estimate of difficulty: estimate from DAAC librarian plus a bit of anecdotal colour from you) 2:48 PM b) there are plans to make it easier, but so far the uptake has been low (estimate from # DOIs in references within DAAC set)

c) hrm, I'm sure there is a third point in here somewhere :)


Sarah has left 2:49 PM Heather: And this is done using the motivation/dataset/discussion of the DAAC case to start with, and fleshed out towards the end with your experiences qualitatively and with different repositories

Thinking you have a conversation with Bob Cook and the DAAC librarian and maybe others to collect their thoughts and experiences.

Whatcha think?

me: this sounds good


definitely a way to open up a dialogue 2:50 PM Heather: Doesn't have to be like that exactly of course, make it your own

me: it's good to have a framework to work with

Heather: But I think the DAAC dataset and reuse-hunting-experiences provide a useful framework

yeah exactly

me: I've written arguments before, but not in a scientific/bibliometric capacity. 2:51 PM ok, neat

Heather: so not sure where you start

me: should I email the DAAC librarian?

Heather: yeah. maybe Bob Cook first?

me: ok 2:52 PM Heather: maybe before you do that, you might want to have a look at the DAAC data you have so far extracted

on the DAAC tab

me: come up with a list of interview questions?

oh ok

Heather: so that you could come into the conversation with a bit of knowledge 2:53 PM along the lines of "I looked at 54 of the 124 reuses in that spreadsheet

and it appears that only about 12 of them have DOIs"

"this was less than I would have guessed"

is that in line with your experience?


or whatever the real numbers are 2:54 PM me: yeah, that makes sense

I could quantify those

Heather: tally up a few other things too, maybe, so that when the librarian tells you how she looks for the reuses her answers make sense to you based on the data you have

like are there some where when you look at the full text you have no idea how the librarian would have found them? 2:55 PM if so, you can ask her explicitly about those.

you know what I mean.

Spend a bit of time with the DAAC citations you've extracted so you are familiar and can ask good questions and understand the answers :) 2:56 PM me: ok

Heather: one thought is before you do this, sleep on it

me: that was why I was wondering about combining the search comparisons with the other sheet

Heather: just to make sure you don't have some direction that you'd rather take it 2:57 PM another idea is to send an email to data_citations email list describing this direction for the article, seeing if they have any suggestions, etc before you email Bob and the librarian

anyway, follow your gut on that. just wanted to share a backbone idea so that you could start to frame your article and drill to what you needed

me: excellent idea 2:58 PM I'll write out a rudimentary outline

and some initial figures/question


and then email the datacitations list

Heather: yeah, ok, if you think combining the search spreadsheet with the DAAC one would help, I could definitely see that

me: ok 2:59 PM I think it could help

at least for keeping me from getting dizzy juggling all the sheets :D

Heather: I'm guessing you might also want to merge together your DAAC dropbox sheet

with some of the columns of Bob's original sheet 3:00 PM like the column that said what dataset number they were reusing

me: yeah

Heather: what DAAC project, etc

(basically all the columns, why not) 3:01 PM it would be interesting to know what % of the DAAC reuses included the Data_Set_ID number in their citations/papers, etc.

whatcha think?

me: as opposed to the article doi

good plan
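Note: a minimal sketch of the merge and tally discussed above, not the method actually used. It assumes both sheets have been exported as rows keyed by a shared article identifier. Only `Data_Set_ID` comes from the chat; `article_id`, `citation_text`, `project`, and every row below are made up for illustration. Matching is plain substring matching, which is crude (a data set number can also show up as a page number), so any real count would still need a manual check.

```python
# Hypothetical rows standing in for Bob's original sheet.
bob_sheet = [
    {"article_id": "A1", "Data_Set_ID": "301", "project": "RP"},
    {"article_id": "A2", "Data_Set_ID": "412", "project": "NPP"},
    {"article_id": "A3", "Data_Set_ID": "530", "project": "RP"},
]

# Hypothetical rows standing in for the DAAC dropbox sheet.
dropbox_sheet = [
    {"article_id": "A1", "citation_text": "River data. ORNL DAAC, Data Set 301."},
    {"article_id": "A2", "citation_text": "Data from Oak Ridge National Laboratory."},
    {"article_id": "A3", "citation_text": "Example et al., data set 530, ORNL DAAC."},
]


def merge_sheets(left, right, key):
    """Left-join the columns of `right` onto `left` using `key`.

    Rows of `left` with no match in `right` are kept unchanged.
    """
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]


def percent_with_dataset_id(rows):
    """Percent of rows whose citation text contains that row's Data_Set_ID."""
    if not rows:
        return 0.0
    hits = sum(
        1
        for row in rows
        if row.get("Data_Set_ID")
        and row["Data_Set_ID"] in row.get("citation_text", "")
    )
    return 100.0 * hits / len(rows)


merged = merge_sheets(dropbox_sheet, bob_sheet, "article_id")
print(f"{percent_with_dataset_id(merged):.1f}% of reuses mention the Data_Set_ID")
```

Keeping the join as a left join on the dropbox sheet means a missing row in Bob's sheet shows up as a missing `Data_Set_ID` rather than silently dropping the article.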

Heather: right. or in addition, or who knows? 3:02 PM shall we go through the DAAC dropbox sheet in detail, briefly?

me: sure

Heather: I know they are Sarah's columns, but want to make sure they are capturing everything you need to capture for your argument and story 3:03 PM Now in this case the orange 0s are because you didn't have full text, or you didn't get to those, or ?

me: the orange 0s were the ones carried over from the other reuse pages

except for one

where I didn't have full text

I meant to keep using my own color coding scheme that I started on the ISI raw data page 3:04 PM Heather: the DAAC spreadsheet from Bob had 116 rows of reuse articles I think, right?

I think you want a sheet that only has those 116 reuse articles on it

me: oh

hm. I wonder why I have another ORNL DAAC spreadsheet with even less than that 3:05 PM I think I mixed up my spreadsheets

Heather: so maybe make a copy of this sheet and cut out everything that comes from your searches instead

me: It looks like I didn't copy/paste all of bob's spreadsheet


Heather: just for the sake of the "first half" of the paper as I'm envisioning it.... 3:06 PM yeah. make sure you have exactly those papers, no more, no less, then you can calculate stats based on "what the DAAC librarian found"

me: ok

Heather: ok, column D in the dropbox sheet

is Y when they mention the DAAC somewhere in their reuse paper or citations? 3:07 PM me: yes

Heather: great

type of dataset

me: There's a key I made up that either abbreviated the name of the project or did something else 3:08 PM Heather: ok

me: like RP: River Productivity Data

Heather: so maybe make a standalone spreadsheet of this stuff

and in the README you could put those codes?

or some other way you think it would be easy to communicate

me: ok

ok 3:09 PM Heather: location of intext citation?

intro methods abstract etc, right

me: yes

Heather: now if it is in multiple places, how did you decide what sentence to cut and paste into column I? 3:10 PM me: I tried to get all of them

adding [...] between each sentence

Heather: ok, just concatenated together. gotcha.

me: it got lengthy

Heather: is the "R" in location for references?

me: yes 3:11 PM Heather: ok.

me: although in some cases I might have mistaken D for R

by saying Repository instead of Depository

Heather: it might be useful to make a second column beside I to hold the references cut and pastes

yeah, I hear you.

The R/not R distinction is important, since it determines whether the info can be looked up through full text or ISI

me: oh you mean like a 1 or 0 for if there is a relevant excerpt followed by the excerpt? 3:12 PM Heather: hmmm, not sure what you mean.

me: oh

Heather: I mean have two columns like your "I" column now

me: there is a "relevant bibliographic citation" in the next rows

Heather: where one of them has text that is in the body of the article

me: in K, I believe

Heather: ok, gotcha, so I'm jumping ahead of myself, eh? 3:13 PM me: Sarah was very thorough in her headers

Heather: yeah, ok. hrmmm.

ok, let me come back to this thought in a minute

me: ok

Heather: in the mean time, in column G, what is R? 3:14 PM references again?


me: I think that might have been R for repository

I meant to put D for Depository

Heather: ok

me: since those cite ORNL or Oak Ridge, etc 3:15 PM Heather: what is column H?

me: it seemed redundant

I took it as where did it come from

Heather: what does "NI" stand for, do you know?

me: either from the author or the repository

or Not Indicated 3:16 PM Heather: ah hah.


is having a Y in column J

the same thing as having an "R" in column F?

me: not necessarily 3:17 PM Heather: so then what does it mean to have an R in column F?

me: oh, I think I put it in R for references if that was the only mention 3:18 PM Heather: or there is an R in column F if the references themselves mention data

me: yes

Heather: as opposed to the references being the original data-collection paper?

me: or the repository name



as is the case for most citations by author name

Heather: there is an R in column F if the references themselves mention data or the repository name

is that right?

me: yes 3:19 PM Heather: gotcha

then the citation itself is in column K

me: yes

where either the repository name, author name, doi, etc. are mentioned

in the reference page

Heather: and what were the criteria you used for determining column L?

me: er reference section

Heather: yup, I hear you. that makes sense. 3:20 PM me: L was more or less if it was according to ORNL's data citation policy

where it includes the DOI

so I put N for most of them

Heather: ok.

me: I have a link to ORNL's data citation policy I can reference in the paper

Heather: you might want to go recode that a bit

one thing that would be useful to pull out explicitly, into its own column, is whether there is a DOI 3:21 PM it looks like, for example, rows 81 and 82 have a Y in that column but no DOI 3:22 PM me: oh, I think I counted it if the URL was included

Heather: ok.

me: like

Heather: yup, I'd break that into two columns

for your own information

me: so y/n DOI column

Heather: yup

me: ok cool

Heather: plus a y/n url column 3:23 PM plus anything else that you think would be helpful to break out

maybe a y/n "it mentions the name DAAC in the bibliographic reference" for example
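Note: a rough sketch of the recoding suggested above, splitting one "follows the citation policy" judgement into separate y/n columns. The column names `has_doi`, `has_url`, and `mentions_daac` are hypothetical, the example citations are invented, and the DOI pattern is a common heuristic rather than an official definition, so it may miss or over-match some strings.

```python
import re

# Heuristic patterns (assumptions, not an official DOI or URL grammar).
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
DAAC_RE = re.compile(r"\bDAAC\b")


def recode_citation(citation):
    """Return separate Y/N flags for one bibliographic citation string."""
    return {
        "has_doi": "Y" if DOI_RE.search(citation) else "N",
        "has_url": "Y" if URL_RE.search(citation) else "N",
        "mentions_daac": "Y" if DAAC_RE.search(citation) else "N",
    }


# Invented citations, for illustration only.
with_doi = "Author, A. (2001). Example data set. ORNL DAAC, Oak Ridge, Tennessee. doi:10.0000/EXAMPLE/123"
with_url = "Author, B. (2003). Example data set. Available at http://www.example.org/data from Oak Ridge National Laboratory."
plain = "Author, C. (1999). Some journal article. Journal of Examples 12: 34-56."

for text in (with_doi, with_url, plain):
    print(recode_citation(text))
```

Keeping the three flags separate makes it possible to count "URL but no DOI" rows, like the rows 81 and 82 case mentioned above, without rereading each citation.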

part of the point of data citations is that it would be WAY easier to track them if we could use bibliometric resources

me: ok, so you mean like the first spreadsheet I made?

Heather: like ISI or Scopus

like we do with articles

but we can't do that unless 3:24 PM a) people use bibliometric citations rather than just in-text mentions

and b) we know what to look for (and what field to look for it in) within the bibliometric citation

standard with articles

but as you found with ISI it frankly isn't clear what to look up where to find doi and citations in bibliographies! 3:25 PM me: yeah

Heather: whoops: but as you found with ISI it frankly isn't clear what to look up where to find doi and data citations in bibliographies!

me: there's not a field for it

Heather: right.

that is a very useful point to make in your article

I'm surprised ISI doesn't have an [all] search ability, to search in all facets of the citation

it is a pain not to have it!

me: yeah

no fulltext 3:26 PM every other search has fulltext

Heather: which reminds me... do you have scopus access?

me: I'm not sure

I haven't used it

Heather: yeah. I wouldn't put it at the top of your list, but I think it might be useful to use

me: ok

Heather: if you are going to redo any of your searches 3:27 PM it does have an [all] so you can search in all aspects of the citation

me: good to know

Heather: yeah... for what it is worth, I'd be careful using the word "full-text" with regard to citations

it is easily confusing, for me

because you mean the full string of the citation, right? but most people think of the full text of the article, the intro, results, etc. 3:28 PM me: oh

that was what I meant

full-text article searching

Heather: hmmm.

me: because I know Google and Scirus do that

but not ISI

Heather: yeah, so agreed, full-text article searching would be helpful and would solve the problem

me: although it does have its limits 3:29 PM Heather: right

like they don't have the coverage that ISI has

because ISI only has metadata and references, right?

me: yes

Heather: because that is what publishers are willing to share with them.

so while it would be nice if ISI had full-text searching, that isn't likely to happen any time soon 3:30 PM me: at the very least, a doi search

Heather: what I wish that ISI had, in a practical, I-don't-see-why-they-can't sense

is the ability to search for a word in any part of the citation

as opposed to just in the authors field or just the journals field or just the first-page field

admittedly there aren't many people who want to find all citations to papers by Dr Apple and published in the journal called Apple 3:31 PM me: yes

Heather: but I think it would be useful in our case because we frankly don't know where the doi is going to show up, or where they are going to slot the "Data from Oak Ridge" phrase

It doesn't look like ISI supports that to me, does it to you?

you can OR all the parts together, but that is cumbersome and still maybe lossy 3:32 PM me: ok
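Note: a small sketch of the "search any part of the citation" idea, assuming cited references have already been exported as records with separate fields. The field names below are illustrative only and are not ISI's actual schema; the records are invented.

```python
def matches_any_field(record, term):
    """True if `term` appears (case-insensitively) in any field of the record."""
    needle = term.lower()
    return any(needle in str(value).lower() for value in record.values())


# Invented cited-reference records; the phrase can land in different fields.
cited_references = [
    {"author": "Author A", "work": "Data from Oak Ridge", "year": "2001", "page": ""},
    {"author": "Oak Ridge National Laboratory", "work": "Example data set", "year": "2003", "page": ""},
    {"author": "Author C", "work": "Journal of Examples", "year": "1999", "page": "34"},
]

hits = [r for r in cited_references if matches_any_field(r, "oak ridge")]
print(len(hits), "of", len(cited_references), "records mention the phrase in some field")
```

This is the same thing as OR-ing a search over every field, just written once, which is why it is less cumbersome than repeating the term per field by hand.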

I was just thinking from the ANDS angle

where they were working with Thomson Reuters and Elsevier to improve search functions for data

Heather: yeah. agreed. 3:33 PM might be worth touching base with them when you have some of your article fleshed out, in case they want to add or give context to something

me: ok, neat

Heather: ok, so how are you feeling?

me: more up to speed

Heather: do you have stuff to go on? 3:34 PM me: I'll process these notes today and sleep on it

Heather: does it feel like you have a clear path? one that makes sense to you? one that you believe in?

that's the goal anyway


me: yes

Heather: ok, sounds good.

me: I'll go post this conversation. 3:35 PM Heather: ok, great.

let me know if you need a sounding board if some of it isn't making sense or doesn't sit right

me: sure

thanks a ton

Heather: you'll aim to have an email out to datacitations in the next day or two?

me: yes

once I get a solid outline of what I want to cover 3:36 PM and after I look through the 100+ articles

Heather: cool. maybe we have another chat towards the end of the week? we'll play it by ear.

me: definitely

Heather: if you have trouble getting full text, remember that Bob offered to post them or something.

me: ok

Heather: guessing that will take a little time, so if you expect to have problems,

probably best to ask for that earlier than later 3:37 PM I'm kicking myself that we didn't meet with Oak Ridge Bob when we were there.

me: I guess we just didn't have time.

Heather: Was he out of town? I don't know. Anyway, he has been very supportive

via email so we'll just soldier on remotely

me: yes, the spreadsheet helped guide my searches 3:38 PM Heather: Have a good rest of the day and talk soon!

me: you too!