This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.
Group chat Agenda, July 20 2010, 10am Pacific on google chat
- High-level things
- abstract mockups?
- sharing stats code on OWW
- Intern updates + chance for questions...
- Sarah (I have an appointment, so I need to leave at 10:30)
- Data collection
- Nature? (confounding: no corresponding depository, shorter, fewer citations, different format, majority "letters" not "articles")
- DAAC analysis
- DAAC interviews
- ASIST poster submission experience
- Data cleaning
Group Chat Transcript, July 20, 2010
In the chat room: Heather Piwowar (firstname.lastname@example.org), email@example.com
Heather: You've been invited to this chat room!
Sarah has joined
Heather: Hi all. Todd is joining us too, just trying to add him in....
Question for you guys. Do you all do this chat from within the gmail interface, or another interface?
nicholas.m.weber: i just use the pop-out feature
Heather: yup. but it starts from within gmail? me too.
nicholas.m.weber: yes it starts within gmail
Sarah: pop-out as well
nicholas.m.weber: does anyone use a task management application that they'd like to recommend?
there are so many I try out, but I never really settle into a good one
Sarah: google tasks is ok, but clunky....i just use google calendar for everything (even tasks) so i get email reminders
Heather: yeah. often I just use an excel spreadsheet :)
nicholas.m.weber: the email reminder is a good idea on the calendar
Sarah: yeah, i like it...keeps me on top of things without really having to remember every little detail
me: I've been using google calendar as well, with email reminders.
Heather: hmmm, not having much success adding Todd
I'll try one more thing, then get started because I know Sarah has to go
ok, well alas I think let's get started without him
sharing stats code on OWW
Quick point germane to sharing stats code on OWW:
I know I talked about github a few weeks ago.
I think that is probably outside the scope for now
firstname.lastname@example.org has left
Heather: that said, pasting R into OWW pages, or uploading it as files, will
email@example.com has joined
Heather: probably become pretty unwieldy pretty fast
so here is a pointer for another idea:
you paste in code (or any text) and it gives you a URL that has revisions.
and it is linked into git so it is associated with lots of best practices
so when you get to R, let's try it.
in the spirit of giving sarah time, now let's zoom to her stuff....
Sarah: ok. I just have a few things (i think)
firstname.lastname@example.org has joined
todd.vision: hi guys
Heather: Hi Todd!
We just blew through some early things
now Sarah is going to dig into her details because she has to run in 15mins
todd.vision: ok go ahead
Sarah: okay..so, mostly, I was going to have my mock abstract/results posted, but i'm having trouble focusing everything.
is the best direction still the "quality" of reuse/sharing?
and looking at that across journals
Heather: do you have a candidate for better best direction?
todd.vision: and it may hinge on what you mean by quality
Sarah: no, just that the "snapshots" of a few journals would be conducive to information about which journals are and aren't having data reuse/sharing
i.e. percent of reuse is pretty interesting....in terms of policy effectiveness
todd.vision: All 3 papers I think can be cast as looking at different aspects of how to improve best practices. For you it would be aimed at journal policy and author behavior
Sarah: and by quality, I'm meaning the criteria used in my knoxville presentation and/or defined by the journal/depository recommendations
Heather: yup, in some ways. though I think that your study is limited to just a few journals, so will be difficult to generalize from your study alone about what makes policies effective
yeah, so I think one of the main contributions of your paper is a systematic look at behaviour
todd.vision: comparison to the journal/repository policies would be very interesting
Sarah: ....so, I'm seeing those as two different approaches
one....is data reuse happening and according to policies (stats = percentages of reuse, etc)
todd.vision: I can see how they would be different aspects of the same analysis
Sarah: two....is quality data reuse/sharing happening period (stats="scoring" of a reuse vs. journal/dataset type)
Heather: yeah, I think they can be two components of the same paper....
probably #2 logically first, then #1.
Sarah: that doesn't totally articulate it, but for the time constraints of this project, i think it should be more heavily weighed towards one or the other
Heather: was hard to write an abstract that way?
Sarah: yeah....just trying to figure out if we're looking at the quality of citations in and of itself, vs. current practices within each journal
Heather: yeah, so if you had to weight it, I'd put the emphasis on "..is quality data reuse/sharing happening period"
+ HOW is data reuse/sharing happening
Sarah: to me (in terms of stats and writing) they've been coming out as two different things
what do you mean by How?
Heather: well, I think quality is a fuzzy word.
esp in this domain, not well defined
so rather than quality being yes/no, I see part of the results being
"here's how they did it"
and then a follow on part being
"and this is what we could consider a quality citation, this one less so...
... with those criteria, x% did it well, y% less well, "
Sarah: ok...so maybe saying x percent cited authors in the biblio, x percent mentioned the depository
todd.vision: And this is/isn't consistent with how the journal/repository instructs them to cite data
Sarah: and then x percent did all components
Heather: also one level higher up
Heather: will take some thinking but I think that
there could be a lot of value in thinking about quality along several dimensions
like "this citation is resolvable"
Sarah: and then say, (in discussion), "from observing current practices, we recommend a, b, and c" for policies and authors
Heather: or resolvable by a machine, or in 5 minutes, or ???
and another aspect could be
Sarah: generally, I think that's how I'm defining quality
Heather: "this citation format gives attribution to the authors"
Sarah: as, could i retrace this?
Heather: (which may or may not be the same as it being easily resolvable in some ways)
todd.vision: and could Valerie find this!
Heather: yeah :)
Sarah: yeah, and does it give credit to the original data authors
and another aspect could be
"is this distinct from the paper attribution"
Sarah: okay, that's making more sense. for some reason i was having trouble reconciling it
Heather: I don't know what all the dimensions of quality would be, but I bet you could think of some
me: that was how I tried to track in one of my spreadsheets, how many cited the data author name vs. other ways (doi, repository name, etc.)
Sarah: what do you mean by that last one?
Heather: and from those you could define a Quality Vector or something :)
Sarah: i.e. do they give accession and author?
todd.vision: Maybe it would help to focus mostly on getting all the results included in the results section, and then we can help shape the framing/interpretation
Heather: um, did they just cite the original paper
vs citing a data doi or a data url
Sarah: yeah, I'm planning on that today, just needed a little guidance/brainstorming before i sat down to write
Heather: there is "quality" in a data citation, one could argue, in making a distinction between those two
todd.vision: yes, paper vs data citation would be great to quantify
Heather: that help?
Sarah: yeah, i've distinguished those in data collection
my only other question is about the journal nature
Heather: yeah, I think you have all the raw guts to quantify the "quality" on dimensions
Sarah: it's been cumbersome to narrow down which ones should be used and then to download them
Heather: I don't think we've given much thought as to the dimensions, though, so to the extent you can spearhead this it would be a real help. (ok, on to nature....)
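The per-dimension tally discussed above ("x percent cited authors in the biblio, x percent mentioned the depository, x percent did all components") could be sketched as follows. This is only an illustration: the four dimension names and the example records are hypothetical and are not taken from Sarah's spreadsheets or from any journal or repository policy.

```python
from dataclasses import dataclass, fields


@dataclass
class CitationRecord:
    """One observed data reuse, scored yes/no on hypothetical quality dimensions."""
    cites_data_author: bool          # data authors credited in the bibliography
    names_repository: bool           # repository/depository mentioned
    gives_accession_or_doi: bool     # resolvable identifier provided
    cites_data_not_just_paper: bool  # data cited distinctly from the original paper


def dimension_percentages(records):
    """Return the percent of records meeting each dimension, plus all of them."""
    n = len(records)
    if n == 0:
        return {}
    names = [f.name for f in fields(CitationRecord)]
    result = {
        name: 100.0 * sum(getattr(r, name) for r in records) / n
        for name in names
    }
    # Share of records that satisfy every dimension at once.
    result["all_dimensions"] = (
        100.0 * sum(all(getattr(r, name) for name in names) for r in records) / n
    )
    return result


# Made-up example records, for illustration only.
example = [
    CitationRecord(True, True, True, True),
    CitationRecord(True, False, False, False),
    CitationRecord(False, True, True, False),
    CitationRecord(False, False, False, False),
]
summary = dimension_percentages(example)
for name, pct in summary.items():
    print(f"{name}: {pct:.0f}%")
```

Keeping each dimension as its own yes/no column means the definition of a "quality" citation can be decided later without recoding the data.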
what do you think, punt on nature?
Sarah: and in preliminary writeups i've been defining the selected journals as selected b/c they had a matching repository
Heather: I think given difficulties and timeframe, punt.
Sarah: yeah, so i think get rid of it for that and for confounding factors (shorter articles, very different format, shorter methods sections, etc)
Heather: for now anyway.... yes, agreed.
Sarah: but, any objections?
todd.vision: agree, but discuss limited disciplinary scope as a caveat in the discussion - may be more variation out there
Sarah: yeah, the few i looked at had relatively apparent reuse, but not always sharing simply because of limited methods sections
Heather: yeah, agree, good point.
Sarah: yeah, of course
Heather: and it would be sweet as future work, no doubt.
but out of scope for now
Sarah: ok. just checking
Heather: (out of scope for now, as it turns out, that is.....)
Sarah: ok. i think i'm good then. i haven't played with stats but am not worried about it
Heather: or comments on the other projects, based on whatever email or OWW pages you've seen?
todd.vision: my advice is to think about displaying the data in tables as much as possible
Sarah: um, not for now. but is valerie totally reshaping her approach? and does nic have updated/coded spreadsheets (and where are they posted).
ok, i'll shoot for tables
nicholas.m.weber: Sarah, I've got spread sheets to upload this afternoon
Sarah: ok, mostly i think i need funders and journals.
just wasn't sure if they were up to date
me: I sort of went narrower in my approach
mostly focusing on ORNL
nicholas.m.weber: should be on the OWW-- I'll directly zip them if not
me: and when I get an email back from the ORNL librarian, I'll compare/contrast search methods
Heather: Nic, when you post them, esp if Sarah is going to use them please highlight which columns include data that is less finalized.
todd.vision: valerie, I didn't understand the reasoning behind the narrowing - I think it detracts from the impact
Heather: + send an email around to data_citation (and on blog!?!) asking for help fleshing them out
Todd, it was my idea
me: ah, I think I did that because I didn't have other search results to compare with for TreeBASE or Pangaea
Sarah: ok. i'm good.....i'm going to pop out now. is anyone willing to post the transcript?
me: (and because Heather suggested it)
Heather: ok, bye Sarah
Heather: Todd!!!?!?! :)
todd.vision: bye sarah
Sarah: ok thanks. good luck!
Sarah has left
Heather: So the idea was not to hide the other repositories
but rather to focus on the DAAC usecase as a motivator
and an evaluation
and hence the focus
but, in the second half of the article, to broaden it by
drawing parallels using all of the other searches Valerie has been doing too
here are the data for all three
how do we know how much we are missing?
well in this one case we have external data to compare it to
Heather: right. so the advantage I thought to leading with DAAC
is that it also really provides the motivation.
todd.vision: taking that at face value suggests there are X and Y data citations missing for DAAC...
Heather: they have a librarian who actually looks for these things every year
so we could say why, and how hard they find it, etc.
me: they also solicit on the citation policy page to email copies of articles citing DAAC data
Heather: then say it is difficult with these other repositories too
todd.vision: It might make sense to also emphasize Altmann and DataCite more in the intro
Heather: esp when named after ancient continents etc
todd.vision: wow, emailing articles is so not scalable
me: ok, as opposed to in the discussion section?
yeah... and I have yet to find out how many they actually get
todd.vision: I like to see the discussion provide answers to questions raised in the introduction
It's an aesthetic symmetry thing :)
me: ah yes
Heather: hrm, so I'm a bit lost
me: I have a bad habit of raising more questions than I answer.
Heather: my assumption was that the article wouldn't be in intro/discussion format
me: I blame that on watching too many Christopher Nolan films.
Heather: but rather more of a long narrative.
todd.vision: save the unanswerable questions for the end of the discussion as future work
Heather: also, was there an article emailed around?
I sent it to data_citations
as a link to a google doc
for comments, etc. I think I set it to allow comments/editing, but I'm not sure
todd.vision: Narrative would be ok with me, but that's not the current structure
So I was going w that
me: I have two other copies on dropbox and on my school file server
so I shouldn't do the standard "introduction, method, results, discussion" format?
Heather: oh I see... I thought it was just the version that I'd already seen... my bad
todd.vision: I've been editing the google doc, so can we keep that the reference copy?
me: I sort of copy/pasted stuff from the previous draft and tried to add new things.
Heather: Yeah, sorry Valerie... hmmmm... I think there may be an advantage to making this more of a narrative.
What do you think?
Had you considered that and decided not to?
me: I think I can do that.
I'm sorry, I think I forgot to reframe it.
I kept thinking in the same article format as everything I've scanned through.
todd.vision: Also, the finding friends on facebook by hair color idea seems to have gotten lost
me: so back to writing more like my blog entries, but more polished
is it all right if I use the first person?
Heather: so this isn't exactly the tone,
me: it feels odd, but in some ways, so does writing in the third person
Heather: but something like this is what I was thinking of.....
todd.vision: Use 'We' - I am imagining this will have multiple authors
me: ah, good
Heather: it lacks the actual analytics that yours will have
me: I was going to ask how I should issue authorship credit.
Heather: and is maybe a bit flip at the beginning, but gives the idea of how to write an article that isn't a research article
todd.vision: and that might be just a little whimsical for our purposes
Heather: I'll look for more examples....
me: ah, yes. I was wondering how I'd do that
I had read through the editorials I had linked on the web resources page, but I had a feeling that wasn't quite what you were looking for
todd.vision: for authorship, let's go with who actually contributes to the paper. Invite folks on the mailing list but only acknowledge them if they don't add text or provide substantive feedback
Heather: So here is another article that doesn't follow a strict scientific method approach:
todd.vision: what journals did we discuss as targets at Knoxville? I don't recall
me: oh, I recall Dlib on the list
and I had a bunch of other interdisciplinary science journals
Heather: it doesn't have a lot of the "facebook by haircolour" flavour, but it does have a bit
me: although this might not be a "hard" enough piece for them
todd.vision: The PLoS article is a pretty good model in that it has some quantitative and qualitative data but is mostly an essay
me: so it's not quite an essay or an editorial
Heather: another one we talked about "Learned Publishing"
me: oh yes
Heather: and maybe "Library Trends" ?
me: I don't remember that, but I'll add it to my list
Heather: I think Learned Publishing has some essay-type pieces too.....
todd.vision: I would rather see this read by the publisher and journal world than the librarian world
me: that makes sense
todd.vision: as for the haircolor emphasis, I would suggest putting it upfront, like so...
my use case is that I have a dataset
who has reused it?
how would I find that out?
Heather: "Adventures in Open Data" is an article that I recall might be a good model, published in Learned Publishing
todd.vision: paper citations are too broad a net
can I find data citations?
Heather: will dig up a link, hard to find one now, drat subscription journals
todd.vision: or are people citing articles because of lack of infrastructure and bad habit
me: ok, I think I found a link on ingenta for adventures in open data, so I'll bookmark that for later
todd.vision: as opposed to adventures in semantic publishing?
Heather: ok, so Todd you think it is worth taking a primary-research viewpoint, rather than a repository one?
it seemed to me that the DAAC story was strong
todd.vision: I'm not sure I understand the distinction you are making.
The DAAC seems like a case study in how a repository is trying very hard to affect behavior
Heather: well, instead of upfront saying "my use case is that I have a dataset"
todd.vision: But even then, if the authors and journals don't cooperate, then it doesn't work well
So I think the case study is a nice example of the larger point
In fact, it may be a 'box' within the paper, depending on the format of the journal
Heather: I'm suggesting we say upfront that people want to monitor reuse. case in point, repositories are currently trying to monitor reuse
here is how hard it is for them.
me: so a repository tracking reuse as opposed to a data author attempting to track their own reuse?
hold on, let me dig up our prev chat...
todd.vision: I see... they don't seem that different to me
I can imagine both use cases being described in 1 pgph
is that enough direction for now, Valerie? Nic is still waiting...
Heather: ok, I see what you are saying.
yeah, sorry Todd, these meetings have been going longer than an hour.
me: oh yes, I just needed that reminder to make it less formal and more conversational, as well as the structural comments you've given
Heather: Before we leave Valerie's stuff,
me: but yes, I think I will do another rewrite
Heather: let me ask,
well, rather than ask let me just suggest
that Valerie yes, do another rewrite, but kind of a light one
or rather a total rearrangement of structure and tone etc
but with the idea that
the framing etc is still subject to ongoing consideration.....
as in, drafting early and often
because I think that Todd and I have different ideas still about where the paper is going
and so I want to make sure I don't lead you down the road too far one way
I figure I'm keeping multiple versions up anyway
todd.vision: agreed to include results for TreeBASE and Pangaea?
me: all else fails, I'll just combine elements of different drafts that you like
(I needed feedback about if/how much, which you have given)
todd.vision: the multiple version comment scares me a bit
me: what? why?
todd.vision: how do we know that we are working off the same one?
me: oh yeah, good point.
well, the one I put up is article 2.0
the one I had in Knoxville was Article 1
the next one will probably be labeled 3.0 with the changes you suggested in the comments on 2.0 and based on the comments in this meeting
todd.vision: An alternative is for you to archive old versions, and we keep editing the same 'current' version on Google Docs
me: I just didn't want to run the risk of having only one document and have someone say "oh but I think the version 2.0 did a better job of discussing this"
oh, yeah. that's a much better idea
todd.vision: you can label it v 2.0 or 3.0 at the top and we can notate who has done editing it
me: ok, the comments have names assigned to them as well
Heather: ok. Nic?
todd.vision: 'passing the ball' with manuscripts can sometimes get tricky, so it's good to be very clear about who is editing when, and who is not editing when.
nicholas.m.weber: sorry, I don't have much to contribute to Valerie's paper yet
me: it's ok. I wasn't very helpful in editing your poster. :(
nicholas.m.weber: I think google docs does well for tracking initially
but once big sections start to get removed and tweaked, it gets a bit messy
Heather: yeah, or once you start to need to judge for length based on number of pages in some funky format....
nicholas.m.weber: so yesterday after our chat Heather I ran the confidence intervals for the numbers we provided in our paper
Heather: ok. Nic, how goes your spreadsheet cleanup?
nicholas.m.weber: it's in the middle of the page
Heather: Yes, and I saw your thoughts about potential stats
Or rather potential analyses
Good stuff, starting to think it through in that way
nicholas.m.weber: obviously with smaller sample sizes there were large gaps for repositories and funding agencies
I didn't really know where to go from there
nicholas.m.weber: so I went back to cleaning up the ss's
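To illustrate why small samples give wide confidence intervals for a percentage, here is a minimal sketch using the Wilson score interval for a binomial proportion. The transcript does not say which interval method was actually run, so the choice of Wilson, the function name, and the counts below are all assumptions made for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: the same 50% rate observed in a small and a large sample.
small = wilson_interval(5, 10)
large = wilson_interval(500, 1000)
print("n=10:   %.3f to %.3f" % small)
print("n=1000: %.3f to %.3f" % large)
```

With the same observed rate, the interval for 10 items is far wider than the one for 1000, which is the pattern described above for the smaller repository and funding-agency groups.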
Heather: so we'll get together later today and do some more regression stuff
one thing jumped out at me from your analyses brainstorm
and that was the direction of causation :)
so I think that some of your language was talking about would a and b affect impact factor
and of course a or b could impact impact factor
but it is also likely that impact factor is a correlate for something that is affecting a and b
(perhaps more likely)
me: or it could just be a correlation
Heather: right. so a correlation often suggests causation
but not always
and it does little to illuminate what direction the causation goes
for that we need common sense and/or different analysis
so whatever stats we do, we're actually unlikely here to determine causation
(which will be a limitation of this work)
but it makes sense to think about which way we expect the causation to go
and talk about it that way
todd.vision: agree that stating hypotheses upfront would be a good way to go
nicholas.m.weber: up front meaning before I run more stats or upfront as in a publication?
todd.vision: in the paper
Heather: and in OWW notebook
todd.vision: because there are an unlimited number of statistical comparisons that you could make
Heather: as you have started to do
todd.vision: the hypotheses will guide which ones you focus on
and how you interpret the results
it's very different to find a significant correlation when you've been on a fishing expedition for 100 possible correlations
and when you are testing just one
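The "fishing expedition" point can be made concrete with a little arithmetic. The sketch below assumes the tests are independent and uses a conventional 0.05 significance level; both are simplifying assumptions for illustration, and the Bonferroni correction shown is one common adjustment, not something proposed in the chat.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / m


alpha = 0.05
for m in (1, 100):
    rate = familywise_error_rate(alpha, m)
    threshold = bonferroni_threshold(alpha, m)
    print(f"{m} test(s): P(at least one false positive) = {rate:.3f}, "
          f"Bonferroni per-test threshold = {threshold:.4f}")
```

Testing one pre-stated hypothesis keeps the false-positive chance at the nominal level, while trying 100 unplanned correlations makes at least one spurious "significant" result very likely.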
Heather: I think the analyses you mentioned on your OWW page were a good start.
and we can take it from there in figuring out stats
ok. anything else to talk about here with that, Nic?
nicholas.m.weber: ok, but I should focus on a set of correlations that will make sense for the data that I've gathered and try to calculate those-- then present that work
Heather: either in stats, or in data cleanup and re-uploading, or in prelim abstract-writing?
hmmmm. not quite sure what you mean by the above
nicholas.m.weber: as far as re-uploading, I'm going to put things back into google docs and link off my OWW page
nicholas.m.weber: I should be done with another hour of work
Heather: I wouldn't suggest going into R and doing correlations just yet
Let's talk more about doing those correlations first I think
Heather: I'm free for the next few hours, so we can find a good time.
nicholas.m.weber: I'm free for the rest of the day, so whenever is convenient
todd.vision: should we be careful about how much effort goes into the data analysis before it is fully cleaned up
Heather: ok. I need to check on something, then let me ping you after this chat about exact timing.
before the data is fully cleaned up, you mean?
Heather: yeah, so mostly I think we are planning what analysis makes sense to run
and learning the R for it
todd.vision: right that's good
Heather: and sticking initially to the firm fields like impact factor
but I agree, before putting stock in stats that come from looser fields we should make sure the data is clean
todd.vision: and putting effort into recruiting folks to help clean it now while there is still time
nicholas.m.weber: Ok, so as soon as I get them finished this afternoon I can post something on the blog and put them into my OWW
nicholas.m.weber: and hopefully at least our data_citation group can weigh in and say which fields or particular cells should be revised
todd.vision: 'them' being links to google spreadsheets, or how did you decide to do that collaboratively ?
nicholas.m.weber: I think google spreadsheets will work best for editing
me: I'm inclined to agree
nicholas.m.weber: fusion tables has some nice functionality for commenting, but it's really hard to edit in that format
todd.vision: and then ask people to send you what changes they made and why?
Heather: We experimented with fusion but found them hard to edit
todd.vision: too bad
me: yeah, fusion is good for some things, but not this
nicholas.m.weber: maybe for this, I could just upload, allow people to make comments and then redownload them and make changes as necessary
todd.vision: that might be better so you can see what needed cleaning
nicholas.m.weber: that might make things easier - rather than expecting people to notify me about what changes they've made
by upload I mean upload to Fusion Tables
Heather: ok, then Nic make that your priority for today and we'll touch base about more stats
Heather: Any other questions?
<<consider this a plug for doing more blogging>>
nicholas.m.weber: not from me
me: I think I'm good for now as well.
nicholas.m.weber: questions that is
Let's have another chat later this week.
me: I'll read the articles you linked and start revising
Heather: Individually, obviously, but also maybe as a group
since we're getting close to the end
todd.vision: excellent progress guys
me: thanks for all the guidance
Heather: great, Valerie! Looking forward to seeing it.
nicholas.m.weber: Todd you mentioned a survey or some form of feedback after Knoxville
Heather: bye for now, all.
nicholas.m.weber: is that something we'll do at the end?
todd.vision: I was expecting it to happen about now
Rebecca Koskela will be putting it on surveymonkey soon, I think
nicholas.m.weber: ok, just curious (wanted to make sure I wasn't missing something)
todd.vision: all good with the checks?
me: I haven't gotten mine yet, but I got in contact with the office who cleared up the whole mistake about how I was in the system as an international employee.
so it'll probably be today or sometime this week
todd.vision: ok, let me know if it doesn't
me: ok, thanks again
nicholas.m.weber: ok, thanks for the meeting today