DataONE:Notebook/Summer 2010/2010/07/20

From OpenWetWare

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.



Group chat Agenda, July 20 2010, 10am Pacific on google chat

  • High-level things
    • abstract mockups?
    • sharing stats code on OWW
    • other?
  • Intern updates + chance for questions...
    • Sarah (I have an appointment, so I need to leave at 10:30)
      • Data collection
        • Nature? (confounding: no corresponding depository, shorter, fewer citations, different format, majority "letters" not "articles")
      • Scoring
        • define exact focus
      • Stats
    • Valerie
      • DAAC analysis
      • DAAC interviews
      • other?
    • Nic
      • ASIST poster submission experience
      • Data cleaning
      • Stats

Group Chat Transcript, July 20, 2010

In the chat room: Heather Piwowar (,

12:57 PM Heather: You've been invited to this chat room! 12:58 PM Hi

Sarah has joined

Heather: Hi all. Todd is joining us too, just trying to add him in.... 12:59 PM Question for you guys. Do you all do this chat from within the gmail interface, or another interface? 1:00 PM i just use the pop-out feature

Heather: yup. but it starts from within gmail? me too. yes it starts within gmail

Sarah: pop-out as well 1:01 PM does anyone use a task management application that they'd like to recommend? 1:02 PM there are so many I try out, but I never really settle into a good one

Sarah: google tasks is ok, but clunky....i just use google calendar for everything (even tasks) so i get email reminders 1:03 PM Heather: yeah. often I just use an excel spreadsheet :) the email reminder is a good idea on the calendar 1:04 PM Sarah: yeah, i like it...keeps me on top of things without really having to remember every little detail

me: I've been using google calendar as well, with email reminders. 1:05 PM Heather: hmmm, not having much success adding Todd

I'll try one more thing, then get started because I know Sarah has to go 1:09 PM ok, well alas I think let's get started without him

High-level things abstract mockups? sharing stats code on OWW other?

Quick point germane to sharing stats code on OWW:

I know I talked about github a few weeks ago. 1:10 PM I think that is probably outside the scope for now

Heather: that said, pasting R into OWW pages, or uploading it as files, will probably become pretty unwieldy pretty fast

so here is a pointer for another idea: 1:11 PM you paste in code (or any text) and it gives you a URL that has revisions.

and it is linked into git so it is associated with lots of best practices

so when you get to R, let's try it.

sound ok?

in the spirit of giving sarah time, now let's zoom to her stuff.... 1:12 PM me: sure

Sarah: ok. I just have a few things (i think)

Todd has joined

Todd: hi guys

Heather: Hi Todd!

We just blew through some early things

now Sarah is going to dig into her details because she has to run in 15mins 1:13 PM ok go ahead

Sarah: mostly, I was going to have my mock abstract/results posted, but i'm having trouble focusing everything.

is the best direction still the "quality" of reuse/sharing?

and looking at the across journals

data type


Heather: do you have a candidate for better best direction? 1:14 PM and it may hinge on what you mean by quality

Sarah: no, just that the "snapshots" of a few journals would be conducive to information about which journals are and aren't having data reuse/sharing

i.e. percent of reuse is pretty terms of policy effectiveness 1:15 PM All 3 papers I think can be cast as looking at different aspects of how to improve best practices. For you it would be aimed at journal policy and author behavior

Sarah: and by quality, I'm meaning the criteria used in my knoxville presentation and/or defined by the journal/depository recommendations

Heather: yup, in some ways. though I think that your study is limited to just a few journals, so will be difficult to generalize from your study alone about what makes policies effective 1:16 PM yeah, so I think one of the main contributions of your paper is a systematic look at behaviour comparison to the journal/repository policies would be very interesting

Sarah: I'm seeing those as two different approaches data reuse happening and according to policies (stats = percentages of reuse, etc) 1:17 PM I can see how they would be different aspects of the same analysis

Sarah: quality data reuse/sharing happening period (stats="scoring" of a reuse vs. journal/dataset type)

Heather: yeah, I think they can be two components of the same paper.... 1:18 PM probably #2 logically first, then #1.

Sarah: that doesn't totally articulate it, but for the time constraints of this project, i think it should be more heavily weighed towards one or the other

Heather: was hard to write an abstract that way?

Sarah: yeah....just trying to figure out if we're looking at the quality of citations in and of itself, vs. current practices within each journal

Heather: yeah, so if you had to weight it, I'd put the emphasis on " quality data reuse/sharing happening period"

for now

+ HOW is data reuse/sharing happening

Sarah: to me (in terms of stats and writing) they've been coming out as two different things 1:19 PM what do you mean by How?

Heather: well, I think quality is a fuzzy word.

esp in this domain, not well defined

so rather than quality being yes/no, I see part of the results being

"here's how they did it"

and then a follow on part being 1:20 PM "and this is what we could consider a quality citation, this one less so...

... with those criteria, x% did it well, y% less well, "

Sarah: maybe saying x percent cited authors in the biblio, x percent mentioned the depository

Heather: yes And this is/isn't consistent with how the journal/repository instructs them to cite data

Sarah: and then x percent did all components

Heather: also one level higher up

Sarah: ok.

Heather: will take some thinking but I think that

there could be a lot of value in thinking about quality along several dimensions 1:21 PM like "this citation is resolvable"

Sarah: and then say, (in discussion), "from observing current practices, we recommend a, b, and c" for policies and authors

Heather: or resolvable by a machine, or in 5 minutes, or ???

and another aspect could be

Sarah: generally, I think that's how I'm defining quality

Heather: "this citation format gives attribution to the authors"

Sarah: as, could i retrace this?

Heather: (which may or may not be the same as it being easily resolvable in some ways) and could Valerie find this!

Heather: yeah :)

Sarah: yeah, and does it give credit to the original data authors

Heather: yeah

and another aspect could be 1:22 PM "is this distinct from the paper attribution"


Sarah: okay, that's making more sense. for some reason i was having trouble reconciling it

Heather: I don't know what all the dimensions of quality would be, but I bet you could think of some

me: that was how I tried to track in one of my spreadsheets, how many cited the data author name vs. other ways (doi, repository name, etc.)

Sarah: what do you mean by that last one?

Heather: and from those you could define a Quality Vector or something :)

Sarah: i.e. do they give accession and author? Maybe it would help to focus mostly on getting all the results included in the results section, and then we can help shape the framing/interpretation

Heather: um, did they just cite the original paper 1:23 PM vs citing a data doi or a data url

Sarah: yeah, I'm planning on that today, just needed a little guidance/brainstorming before i sat down to write

Heather: there is "quality" in a data citation, one could argue, in making a distinction between those two yes, paper vs data citation would be great to quantify

Heather: that help?

Sarah: yeah, i've distinguished those in data collection

yes. thanks.

my only other question is about the journal nature

Heather: yeah, I think you have all the raw guts to quantify the "quality" on dimensions 1:24 PM Sarah: it's been cumbersome to narrow down which ones should be used and then to download them

Heather: I don't think we've given much thought as to the dimensions, though, so to the extent you can spearhead this it would be a real help. (ok, on to nature....)
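A minimal sketch, in Python, of how the quality dimensions discussed above could be tallied so that the results read as "x% did this, y% did that, z% did all components". The dimension names (`resolvable`, `credits_data_authors`, `cites_data_not_paper`) and the sample records are hypothetical illustrations, not data or scoring criteria from the study.

```python
from dataclasses import dataclass, fields


@dataclass
class CitationScore:
    """One reuse instance scored on hypothetical yes/no quality dimensions."""
    resolvable: bool            # could a reader retrace the dataset?
    credits_data_authors: bool  # are the original data authors named?
    cites_data_not_paper: bool  # data DOI/URL/accession, not only the paper?


def summarize(scores):
    """Return the percent of citations meeting each dimension, and all of them."""
    n = len(scores)
    if n == 0:
        raise ValueError("no citations to summarize")
    names = [f.name for f in fields(CitationScore)]
    summary = {
        name: 100.0 * sum(getattr(s, name) for s in scores) / n
        for name in names
    }
    # A citation counts toward "all_dimensions" only if every dimension is met.
    summary["all_dimensions"] = (
        100.0 * sum(all(getattr(s, name) for name in names) for s in scores) / n
    )
    return summary


# Made-up records, for illustration only.
sample = [
    CitationScore(True, True, True),
    CitationScore(True, False, False),
    CitationScore(False, True, False),
    CitationScore(True, True, False),
]

for dimension, pct in summarize(sample).items():
    print(f"{dimension}: {pct:.0f}%")
```

Keeping each dimension as its own column, rather than collapsing to a single yes/no "quality" flag, leaves room to add or drop dimensions later without rescoring.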


what do you think, punt on nature?

Sarah: and in preliminary writeups i've been defining the selected journals as selected b/c they had a matching repository

Heather: I think given difficulties and timeframe, punt.

Sarah: yeah, so i think get rid of it for that and for confounding factors (shorter articles, very different format, shorter methods sections, etc) 1:25 PM Heather: for now anyway.... yes, agreed.

Sarah: but, any objections?


Heather: Todd? agree, but discuss limited disciplinary scope as a caveat in the discussion - may be more variation out there

Sarah: yeah, the few i looked at had relatively apparent reuse, but not always sharing simply because of limited methods sections

Heather: yeah, agree, good point.

Sarah: yeah, of course

Heather: and it would be sweet as future work, no doubt. 1:26 PM but out of scope for now

Sarah: ok. just checking

Heather: (out of scope for now, as it turns out, that is.....)

other ???s

Sarah: ok. i think i'm good then. i haven't played with stats but am not worried about it

Heather: or comments on the other projects, based on whatever email or OWW pages you've seen? 1:27 PM my advice is to think about displaying the data in tables as much as possible

Sarah: um, not for now. but is valerie totally reshaping her approach? and does nic have updated/coded spreadsheets (and where are they posted).

ok, i'll shoot for tables Sarah, I've got spread sheets to upload this afternoon 1:28 PM Sarah: ok, mostly i think i need funders and journals.

just wasn't sure if they were up to date

me: I sort of went narrower in my approach

mostly focusing on ORNL should be on the OWW-- I'll directly zip them if not

me: and when I get an email back from the ORNL librarian, I'll compare/contrast search methods 1:29 PM Heather: Nic, when you post them, esp if Sarah is going to use them please highlight which columns include data that is less finalized. valerie, I didn't understand the reasoning behind the narrowing - I think it detracts from the impact ok

Heather: + send an email around to data_citation (and on blog!?!) asking for help fleshing them out

Todd, it was my idea

me: ah, I think I did that because I didn't have other search results to compare with for TreeBASE or Pangaea

Sarah: ok. i'm good.....i'm going to pop out now. is anyone willing to post the transcript?

me: (and because Heather suggested it)

I can 1:30 PM Heather: ok, bye Sarah Heather!?!?!?

Heather: Todd!!!?!?! :)


me: uh bye sarah

Sarah: ok thanks. good luck!

Sarah has left

me: later

Heather: So the idea was not to hide the other repositories

but rather to focus on the DAAC usecase as a motivator

and an evaluation

and hence the focus

but, in the second half of the article, to broaden it by 1:31 PM drawing parallels using all of the other searches Valerie has been doing too Alternatively...

here are the data for all three

how do we know how much we are missing?

well in this one case we have external data to compare it to 1:32 PM Heather: right. so the advantage I thought to leading with DAAC

is that it also really provides the motivation. taking that at face value suggests there are X and Y data citations missing for DAAC,,,

Heather: they have a librarian who actually looks for these things every year

so we could say why, and how hard they find it, etc.

me: they also solicit on the citation policy page to email copies of articles citing DAAC data

Heather: then say it is difficult with these other repositories too It might make sense to also emphasize Altmann and DataCite more in the intro 1:33 PM Heather: esp when named after ancient continents etc wow, emailing articles is so not scalable

me: ok, as opposed to in the discussion section?

yeah... and I have yet to find out how many they actually get I like to see the discussion provide answers to questions raised in the introduction 1:34 PM Its an aesthetic symmetry thing :)

me: ah yes

Heather: hrm, so I'm a bit lost

me: I have a bad habit of raising more questions than I answer.

Heather: my assumption was that the article wouldn't be in intro/discussion format

me: I blame that on watching too many Christopher Nolan films.

Heather: but rather more of a long narrative. save the unanswerable questions for the end of the discussion as future work

Heather: also, was there an article emailed around?

me: ok 1:35 PM I sent it to data_citations

as a link to a google doc

for comments, etc. I think I set it to allow comments/editing, but I'm not sure Narrative would be ok with me, but that's not the current structure

So I was going w that

me: I have two other copies on dropbox and on my school file server 1:36 PM so I shouldn't do the standard "introduction, method, results, discussion" format?

Heather: oh I see... I thought it was just the version that I'd already seen... my bad I've been editing the google doc, so can we keep that the reference copy?

me: yes

Heather: Gotcha.

me: I sort of copy/pasted stuff from the previous draft and tried to add new things. 1:37 PM Heather: Yeah, sorry Valerie... hmmmm... I think there may be an advantage to making this more of a narrative.

What do you think?

Had you considered that and decided not to?

me: I think I can do that.

I'm sorry, I think I forgot to reframe it. 1:38 PM I kept thinking in the same article format as everything I've scanned through. Also, the finding friends on facebook by hair color idea seems to have gotten lost

Heather: Yes....

me: so back to writing more like my blog entries, but more polished 1:39 PM is it all right if I use the first person?

Heather: so this isn't exactly the tone,

me: it feels odd, but in some ways, so does writing in the third person

Heather: but something like this is what I was thinking of..... Use 'We' - I am imagining this will have multiple authors

me: ah, good

Heather: it lacks the actual analytics that yours will have 1:40 PM me: I was going to ask how I should issue authorship credit.


Heather: and is maybe a bit flip at the beginning, but gives the idea of how to write an article that isn't a research article and that might be just a little whimsical for our purposes

Heather: I'll look for more examples....

me: ah, yes. I was wondering how I'd do that

thank you.

I had read through the editorials I had linked on the web resources page, but I had a feeling that wasn't quite what you were looking for 1:41 PM for authorship, let's go with who actually contributes to the paper. Invite folks on the mailing list but only acknowledge them if they don't add text or provide substantive feedback 1:42 PM Heather: So here is another article that doesn't follow a strict scientific method approach: what journals did we discuss as targets at Knoxville? I don't recall

me: oh, I recall Dlib on the list

and I had a bunch of other interdisciplinary science journals

Heather: it doesn't have a lot of the "facebook by haircolour" flavour, but it does have a bit

me: although this might not be a "hard" enough piece for them 1:43 PM The PLoS article is a pretty good model in that it has some quantitative and qualitative data but is mostly an essay

Heather: right

me: so it's not quite an essay or an editorial

Heather: another one we talked about "Learned Publishing"

me: oh yes

Heather: and maybe "Library Trends" ? 1:44 PM me: I don't remember that, but I'll add it to my list

Heather: I think Learned Publishing has some essay-type pieces too..... I would rather see this read by the publisher and journal world than the librarian world

me: that makes sense 1:45 PM as for the haircolor emphasis, I would suggest putting it upfront, like so...

my use case is that I have a dataset

who has reused it?

how would I find that out?

Heather: "Adventures in Open Data" is an article that I recall might be a good model, published in Learned Publishing paper citations are too broad a net

can I find data citations?

Heather: will dig up a link, hard to find one now, drat subscription journals 1:46 PM or are people citing articles because of lack of infrastructure and bad habit

me: ok, I think I found a link on ingenta for adventures in open data, so I'll bookmark that for later as opposed to adventures in semantic publishing?

Heather: ok, so Todd you think it is worth taking a primary-research viewpoint, rather than a repository one? 1:47 PM it seemed to me that the DAAC story was strong

(and interesting) 1:48 PM I'm not sure I understand the distinction you are making.

The DAAC seems like a case study in how a repository is trying very hard to affect behavior

Heather: well, instead of upfront saying "my use case is that I have a dataset" But even then, if the authors and journals don't cooperate, then it doesn't work well 1:49 PM So I think the case study is a nice example of the larger point

In fact, it may be a 'box' within the paper, depending on the format of the journal

Heather: I'm suggesting we say upfront that people want to monitor reuse. case in point, repositories are currently trying to monitor reuse

here is how hard it is for them.

me: so a repository tracking reuse as opposed to a data author attempting to track their own reuse? 1:50 PM Heather: right

hold on, let me dig up our prev chat... I see... they don't seem that different to me

I can imagine both use cases being described in 1 pgph 1:51 PM is that enough direction for now, Valerie? Nic is still waiting...

Heather: ok, I see what you are saying.

yeah, sorry Todd, these meetings have been going longer than an hour.

me: oh yes, I just needed that reminder to make it less formal and more conversational, as well as the structural comments you've given 1:52 PM Heather: Before we leave Valerie's stuff,

me: but yes, I think I will do another rewrite

Heather: let me ask,


well, rather than ask let me just suggest

that Valerie yes, do another rewrite, but kind of a light one 1:53 PM or rather a total rearrangement of structure and tone etc

but with the idea that

the framing etc is still subject to ongoing consideration.....

as in, drafting early and often 1:54 PM because I think that Todd and I have different ideas still about where the paper is going

and so I want to make sure I don't lead you down the road too far one way


me: ok

I figure I'm keeping multiple versions up anyway agreed to include results for TreeBASE and Pangaea?

me: all else fails, I'll just combine elements of different drafts that you like

yes 1:55 PM (I needed feedback about if/how much, which you have given) the multiple version comment scares me a bit

me: what? why? how do we know that we are working off the same one?

me: oh yeah, good point.

well, the one I put up is article 2.0 1:56 PM the one I had in Knoxville was Article 1

the next one will probably be labeled 3.0 with the changes you suggested in the comments on 2.0 and based on the comments in this meeting 1:57 PM An alternative is for you to archive old versions, and we keep editing the same 'current' version on Google Docs

me: I just didn't want to run the risk of having only one document and have someone say "oh but I think the version 2.0 did a better job of discussing this"

oh, yeah. that's a much better idea 1:58 PM you can label it v 2.0 or 3.0 at the top and we can notate who has done editing it

me: ok, the comments have names assigned to them as well


Heather: ok. Nic? yes 'passing the ball' with manuscripts can sometimes get tricky, so it's good to be very clear about who is editing when, and who is not editing when. 1:59 PM sorry, I don't have much to contribute to Valerie's paper yet

me: it's ok. I wasn't very helpful in editing your poster. :( I think google docs does well for tracking initially

but once big sections start to get removed and tweaked, it gets a bit messy 2:00 PM Heather: yeah, or once you start to need to judge for length based on number of pages in some funky format.... yes 2:01 PM so yesterday after our chat Heather I ran the confidence intervals for the numbers we provided in our paper

Heather: ok. Nic, how goes your spreadsheet cleanup?

it's in the middle of the page

Heather: Yes, and I saw your thoughts about potential stats

Or rather potential analyses

Good stuff, starting to think it through in that way obviously with smaller sample sizes there were large gaps for repositories and funding agencies

I didn't really know where to go from there

Heather: right 2:02 PM so I went back to cleaning up the ss's
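A minimal sketch of why small samples give wide confidence intervals, assuming the reported numbers are proportions (for example, the share of articles from one repository that cite data a given way). The transcript does not say which interval method was actually run; the Wilson score interval is used here only as an illustration, and the counts are made up.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed proportion (30%), very different sample sizes (made-up counts).
for successes, n in [(3, 10), (300, 1000)]:
    low, high = wilson_interval(successes, n)
    print(f"{successes}/{n}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

The interval for the small group is several times wider than the one for the large group, which is the "large gaps" effect mentioned for repositories and funding agencies with few articles.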

Heather: so we'll get together later today and do some more regression stuff

one thing jumped out at me from your analyses brainstorm

and that was the direction of causation :)

so I think that some of your language was talking about would a and b affect impact factor 2:03 PM and of course a or b could impact impact factor

but it is also likely that impact factor is a correlate for something that is affecting a and b

(perhaps more likely)

me: or it could just be a correlation yeah

Heather: right. so a correlation often suggests causation

but not always

and it does little to illuminate what direction the causation goes 2:04 PM for that we need common sense and/or different analysis

so whatever stats we do, we're actually unlikely here to determine causation

(which will be a limitation of this work)

but it makes sense to think about which way we expect the causation to go

and talk about it that way 2:05 PM agree that stating hypotheses upfront would be a good way to go up front meaning before I run more stats or upfront as in a publication? 2:06 PM in the paper

Heather: and in OWW notebook because there are an unlimited number of statistical comparisons that you could make

Heather: as you have started to do the hypotheses will guide which ones you focus on

and how you interpret the results

it's very different to find a significant correlation when you've been on a fishing expedition for 100 possible correlations 2:07 PM and when you are testing just one 2:08 PM Heather: I think the analyses you mentioned on your OWW page were a good start.

and we can take it from there in figuring out stats
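A minimal sketch of the "fishing expedition" point above, assuming independent tests, a 0.05 significance level, and no real effects anywhere. It shows how the chance of at least one spurious "significant" correlation grows with the number of comparisons, and what a Bonferroni-corrected threshold would be. Bonferroni is named here only as one common correction; the chat does not choose a method.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


for m in (1, 10, 100):
    print(
        f"{m} tests: P(at least one false positive) = "
        f"{family_wise_error_rate(m):.3f}, "
        f"Bonferroni per-test threshold = {bonferroni_threshold(m):.4f}"
    )
```

Stating a handful of hypotheses up front keeps the number of tests small, so less correction is needed and each result is easier to interpret.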

ok. anything else to talk about here with that, Nic? 2:09 PM ok, but I should focus on a set of correlations that will make sense for the data that I've gathered and try to calculate those-- then present that work

Heather: either in stats, or in data cleanup and re-uploading, or in prelim abstract-writing?

hmmmm. not quite sure what you mean by the above as far as re-uploading, I'm going to put things back into google docs and link off my OWW page

Heather: great 2:10 PM I should be done with another hour of work

Heather: I wouldn't suggest going into R and doing correlations just yet

Let's talk more about doing those correlations first I think ok 2:11 PM Heather: I'm free for the next few hours, so we can find a good time. I'm free for the rest of the day, so whenever is convenient 2:12 PM should we be careful about how much effort goes into the data analysis before it is fully cleaned up

Heather: ok. I need to check on something, then let me ping you after this chat about exact timing.

before the data is fully cleaned up, you mean? yes

Heather: yeah, so mostly I think we are planning what analysis makes sense to run

and learning the R for it 2:13 PM right that's good

Heather: and sticking initially to the firm fields like impact factor

but I agree, before putting stock in stats that come from looser fields we should make sure the data is clean 2:14 PM and putting effort into recruiting folks to help clean it now while there is still time

Heather: right Ok, so as soon as I get them finished this afternoon I can post something on the blog and put them into my OWW

Heather: great. and hopefully at least our data_citation group can weigh in and say which fields or particular cells should be revised 2:15 PM 'them' being links to google spreadsheets, or how did you decide to do that collaboratively ? I think google spreadsheets will work best for editing

me: I'm inclined to agree 2:16 PM fusion tables has some nice functionality for commenting, but it's really hard to edit in that format and then ask people to send you what changes they made and why?

Heather: We experimented with fusion but found them hard to edit too bad

me: yeah, fusion is good for some things, but not this maybe for this, I could just upload, allow people to make comments and then redownload them and make changes as necessary 2:17 PM that might be better so you can see what needed cleaning that might make things easier - rather than expecting people to notify me about what changes they've made 2:18 PM by upload I mean upload to Fusion Tables fine ok

Heather: ok, then Nic make that your priority for today and we'll touch base about more stats 2:19 PM great

Heather: Any other questions?

<<consider this a plug for doing more blogging>> not from me

me: I think I'm good for now as well. questions that is

Heather: ok.

Let's have another chat later this week.

me: I'll read the articles you linked and start revising

Heather: Individually, obviously, but also maybe as a group

since we're getting close to the end 2:20 PM excellent progress guys

me: thanks for all the guidance

Heather: great, Valerie! Looking forward to seeing it. Todd you mentioned a survey or some form of feedback after Knoxville

Heather: bye for now, all. is that something we'll do at the end? I was expecting it to happen about now 2:21 PM Rebecca Koskela will be putting it on surveymonkey soon, I think

me: ah

sweet ok, just curious (wanted to make sure I wasn't missing something) all good with the checks? yup 2:22 PM me: I haven't gotten mine yet, but I got in contact with the office who cleared up the whole mistake about how I was in the system as an international employee.

so it'll probably be today or sometime this week ok, let me know if it doesn't

me: ok, thanks again ok, thanks for the meeting today ciao! bye

me: thanks


Heather: bye