DataONE:Notebook/Summer 2010/2010/07/20

{| width="800"
 * style="background-color: #EEE"|[[Image:owwnotebook_icon.png|128px]] DataONE summer internships 2010
 * style="background-color: #F2F2F2" align="center"|  |Main project page


 * colspan="2"|

{| width="800"
 * style="background-color: #cdde95;" align="center"|




 * colspan="2"|

Group chat Agenda, July 20 2010, 10am Pacific on google chat

 * High-level things
   * abstract mockups?
   * sharing stats code on OWW
   * other?
 * Intern updates + chance for questions...
   * Sarah (I have an appointment, so I need to leave at 10:30)
     * Data collection
       * Nature? (confounding: no corresponding depository, shorter, fewer citations, different format, majority "letters" not "articles")
     * Scoring
       * define exact focus
     * Stats
   * Valerie
     * DAAC analysis
     * DAAC interviews
     * other?
   * Nic
     * ASIST poster submission experience
     * Data cleaning
     * Stats

Group Chat Transcript, July 20, 2010
In the chat room: Heather Piwowar (hpiwowar@gmail.com), nicholas.m.weber@gmail.com

12:57 PM Heather: You've been invited to this chat room!
12:58 PM nicholas.m.weber: Hi
Sarah has joined
Heather: Hi all. Todd is joining us too, just trying to add him in....
12:59 PM Question for you guys. Do you all do this chat from within the gmail interface, or another interface?
1:00 PM nicholas.m.weber: i just use the pop-out feature
Heather: yup. but it starts from within gmail? me too.
nicholas.m.weber: yes it starts within gmail
Sarah: pop-out as well
1:01 PM nicholas.m.weber: does anyone use a task management application that they'd like to recommend?
1:02 PM there are so many I try out, but I never really settle into a good one
Sarah: google tasks is ok, but clunky....i just use google calendar for everything (even tasks) so i get email reminders
1:03 PM Heather: yeah. often I just use an excel spreadsheet :)
nicholas.m.weber: the email reminder is a good idea on the calendar
1:04 PM Sarah: yeah, i like it...keeps me on top of things without really having to remember every little detail
me: I've been using google calendar as well, with email reminders.
1:05 PM Heather: hmmm, not having much success adding Todd
I'll try one more thing, then get started because I know Sarah has to go
1:09 PM ok, well alas I think let's get started without him
High-level things abstract mockups? sharing stats code on OWW other?
Quick point germane to sharing stats code on OWW: I know I talked about github a few weeks ago.
1:10 PM I think that is probably outside the scope for now
nicholas.m.weber@gmail.com has left
Heather: that said, pasting R into OWW pages, or uploading it as files, will
nicholas.m.weber@gmail.com has joined
Heather: probably become pretty unwieldy pretty fast
so here is a pointer for another idea: http://gist.github.com/
1:11 PM you paste in code (or any text) and it gives you a URL that has revisions.
and it is linked into git so it is associated with lots of best practices
so when you get to R, let's try it. sound ok?
in the spirit of giving sarah time, now let's zoom to her stuff....
1:12 PM me: sure
Sarah: ok. I just have a few things (i think)
todd.vision@gmail.com has joined
todd.vision: hi guys
Heather: Hi Todd! We just blew through some early things
now Sarah is going to dig into her details because she has to run in 15mins
1:13 PM todd.vision: ok go ahead
Sarah: okay..so, mostly, I was going to have my mock abstract/results posted, but i'm having trouble focusing everything. is the best direction still the "quality" of reuse/sharing? and looking at the across journals data type etc
Heather: do you have a candidate for better best direction?
1:14 PM todd.vision: and it may hinge on what you mean by quality
Sarah: no, just that the "snapshots" of a few journals would be conducive to information about which journals are and aren't having data reuse/sharing
i.e. percent of reuse is pretty interesting....in terms of policy effectiveness
1:15 PM todd.vision: All 3 papers I think can be cast as looking at different aspects of how to improve best practices. For you it would be aimed at journal policy and author behavior
Sarah: and by quality, I'm meaning the criteria used in my knoxville presentation and/or defined by the journal/depository recommendations
Heather: yup, in some ways. though I think that your study is limited to just a few journals, so will be difficult to generalize from your study alone about what makes policies effective
1:16 PM yeah, so I think one of the main contributions of your paper is a systematic look at behaviour
todd.vision: comparison to the journal/repository policies would be very interesting
Sarah: ....so, I'm seeing those as two different approaches
one....is data reuse happening and according to policies (stats = percentages of reuse, etc)
1:17 PM todd.vision: I can see how they would be different aspects of the same analysis
Sarah: two....is quality data reuse/sharing happening period (stats="scoring" of a reuse vs. journal/dataset type)
Heather: yeah, I think they can be two components of the same paper....
1:18 PM probably #2 logically first, then #1.
Sarah: that doesn't totally articulate it, but for the time constraints of this project, i think it should be more heavily weighed towards one or the other
Heather: was hard to write an abstract that way?
Sarah: yeah....just trying to figure out if we're looking at the quality of citations in and of itself, vs. current practices within each journal
Heather: yeah, so if you had to weight it, I'd put the emphasis on "..is quality data reuse/sharing happening period" for now
+ HOW is data reuse/sharing happening
Sarah: to me (in terms of stats and writing) they've been coming out as two different things
1:19 PM what do you mean by How?
Heather: well, I think quality is a fuzzy word. esp in this domain, not well defined
so rather than quality being yes/no, I see part of the results being "here's how they did it"
and then a follow on part being
1:20 PM "and this is what we could consider a quality citation, this one less so... ... with those criteria, x% did it well, y% less well, "
Sarah: ok...so maybe saying x percent cited authors in the biblio, x percent mentioned the depository
Heather: yes
todd.vision: And this is/isn't consistent with how the journal/repository instructs them to cite data
Sarah: and then x percent did all components
Heather: also one level higher up
Sarah: ok.
Heather: will take some thinking but I think that there could be a lot of value in thinking about quality along several dimensions
1:21 PM like "this citation is resolvable"
Sarah: and then say, (in discussion), "from observing current practices, we recommend a, b, and c" for policies and authors
Heather: or resolvable by a machine, or in 5 minutes, or ???
and another aspect could be
Sarah: generally, I think that's how I'm defining quality
Heather: "this citation format gives attribution to the authors"
Sarah: as, could i retrace this?
Heather: (which may or may not be the same as it being easily resolvable in some ways)
todd.vision: and could Valerie find this!
Heather: yeah :)
Sarah: yeah, and does it give credit to the original data authors
Heather: yeah
and another aspect could be
1:22 PM "is this distinct from the paper attribution" etc
Sarah: okay, that's making more sense. for some reason i was having trouble reconciling it
Heather: I don't know what all the dimensions of quality would be, but I bet you could think of some
me: that was how I tried to track in one of my spreadsheets, how many cited the data author name vs. other ways (doi, repository name, etc.)
Sarah: what do you mean by that last one?
Heather: and from those you could define a Quality Vector or something :)
Sarah: i.e. do they give accession and author?
todd.vision: Maybe it would help to focus mostly on getting all the results included in the results section, and then we can help shape the framing/interpretation
Heather: um, did they just cite the original paper
1:23 PM vs citing a data doi or a data url
Sarah: yeah, I'm planning on that today, just needed a little guidance/brainstorming before i sat down to write
Heather: there is "quality" in a data citation, one could argue, in making a distinction between those two
todd.vision: yes, paper vs data citation would be great to quantify
Heather: that help?
Sarah: yeah, i've distinguished those in data collection
yes. thanks.
my only other question is about the journal nature
Heather: yeah, I think you have all the raw guts to quantify the "quality" on dimensions
1:24 PM Sarah: it's been cumbersome to narrow down which ones should be used and then to download them
Heather: I don't think we've given much thought as to the dimensions, though, so to the extent you can spearhead this it would be a real help.
(ok, on to nature....)
yeah. what do you think, punt on nature?
Sarah: and in preliminary writeups i've been defining the selected journals as selected b/c they had a matching repository
Heather: I think given difficulties and timeframe, punt.
Sarah: yeah, so i think get rid of it for that and for confounding factors (shorter articles, very different format, shorter methods sections, etc)
1:25 PM Heather: for now anyway.... yes, agreed.
Sarah: but, any objections? ok
Heather: Todd?
todd.vision: agree, but discuss limited disciplinary scope as a caveat in the discussion - may be more variation out there
Sarah: yeah, the few i looked at had relatively apparent reuse, but not always sharing simply because of limited methods sections
Heather: yeah, agree, good point.
Sarah: yeah, of course
Heather: and it would be sweet as future work, no doubt.
1:26 PM but out of scope for now
Sarah: ok. just checking
Heather: (out of scope for now, as it turns out, that is.....)
other ???s
Sarah: ok. i think i'm good then. i haven't played with stats but am not worried about it
Heather: or comments on the other projects, based on whatever email or OWW pages you've seen?
1:27 PM todd.vision: my advice is to think about displaying the data in tables as much as possible
Sarah: um, not for now. but is valerie totally reshaping her approach? and does nic have updated/coded spreadsheets (and where are they posted).
ok, i'll shoot for tables
nicholas.m.weber: Sarah, I've got spread sheets to upload this afternoon
1:28 PM Sarah: ok, mostly i think i need funders and journals. just wasn't sure if they were up to date
me: I sort of went narrower in my approach mostly focusing on ORNL
nicholas.m.weber: should be on the OWW-- I'll directly zip them if not
me: and when I get an email back from the ORNL librarian, I'll compare/contrast search methods
1:29 PM Heather: Nic, when you post them, esp if Sarah is going to use them please highlight which columns include data that is less finalized.
todd.vision: valerie, I didn't understand the reasoning behind the narrowing - I think it detracts from the impact
nicholas.m.weber: ok
Heather: + send an email around to data_citation (and on blog!?!) asking for help flushing them out
Todd, it was my idea
me: ah, I think I did that because I didn't have other search results to compare with for TreeBASE or Pangaea
Sarah: ok. i'm good.....i'm going to pop out now. is anyone willing to post the transcript?
me: (and because Heather suggested it)
I can
1:30 PM Heather: ok, bye Sarah
todd.vision: Heather!?!?!?
Heather: Todd!!!?!?! :) Yes.
me: uh
todd.vision: bye sarah
Sarah: ok thanks. good luck!
Sarah has left
me: later
Heather: So the idea was not to hide the other repositories
but rather to focus on the DAAC usecase as a motivator and an evaluation
and hence the focus
but, in the second half of the article, to broaden it by
1:31 PM drawing parallels using all of the other searches Valerie has been doing too
todd.vision: Alternatively... here are the data for all three
how do we know how much we are missing?
well in this one case we have external data to compare it to
1:32 PM Heather: right. so the advantage I thought to leading with DAAC is that it also really provides the motivation.
todd.vision: taking that at face value suggests there are X and Y data citations missing for DAAC,,,
Heather: they have a librarian who actually looks for these things every year
so we could say why, and how hard they find it, etc.
me: they also solicit on the citation policy page to email copies of articles citing DAAC data
Heather: then say it is difficult with these other repositories too
todd.vision: It might make sense to also emphasize Altmann and DataCite more in the intro
1:33 PM Heather: esp when named after ancient continents etc
todd.vision: wow, emailing articles is so not scalable
me: ok, as opposed to in the discussion section?
yeah...
and I have yet to find out how many they actually get
todd.vision: I like to see the discussion provide answers to questions raised in the introduction
1:34 PM Its an aesthetic symmetry thing :)
me: ah yes
Heather: hrm, so I'm a bit lost
me: I have a bad habit of raising more questions than I answer.
Heather: my assumption was that the article wouldn't be in intro/discussion format
me: I blame that on watching too many Christopher Nolan films.
Heather: but rather more of a long narrative.
todd.vision: save the unanswerable questions for the end of the discussion as future work
Heather: also, was there an article emailed around?
me: ok
1:35 PM I sent it to data_citations as a link to a google doc for comments, etc.
I think I set it to allow comments/editing, but I'm not sure
todd.vision: Narrative would be ok with me, but that's not the current structure
So I was going w that
me: I have two other copies on dropbox and on my school file server
1:36 PM so I shouldn't do the standard "introduction, method, results, discussion" format?
Heather: oh I see... I thought it was just the version that I'd already seen... my bad
todd.vision: I've been editing the google doc, so can we keep that the reference copy?
me: yes
Heather: Gotcha.
me: I sort of copy/pasted stuff from the previous draft and tried to add new things.
1:37 PM Heather: Yeah, sorry Valerie... hmmmm... I think there may be an advantage to making this more of a narrative. What do you think? Had you considered that and decided not to?
me: I think I can do that. I'm sorry, I think I forgot to reframe it.
1:38 PM I kept thinking in the same article format as everything I've scanned through.
todd.vision: Also, the finding friends on facebook by hair color idea seems to have gotten lost
Heather: Yes....
me: so back to writing more like my blog entries, but more polished
1:39 PM is it all right if I use the first person?
Heather: so this isn't exactly the tone,
me: it feels odd, but in some ways, so does writing in the third person
Heather: but something like this is what I was thinking of..... http://www.aclweb.org/anthology/J/J08/J08-3010.pdf
todd.vision: Use 'We' - I am imagining this will have multiple authors
me: ah, good
Heather: it lacks the actual analytics that yours will have
1:40 PM me: I was going to ask how I should issue authorship credit.
ok
Heather: and is maybe a bit flip at the beginning, but gives the idea of how to write an article that isn't a research article
todd.vision: and that might be just a little whimsical for our purposes
Heather: I'll look for more examples....
me: ah, yes. I was wondering how I'd do that
thank you. I had read through the editorials I had linked on the web resources page, but I had a feeling that wasn't quite what you were looking for
1:41 PM todd.vision: for authorship, let's go with who actually contributes to the paper. Invite folks on the mailing list but only acknowledge them if they don't add text or provide substantive feedback
1:42 PM Heather: So here is another article that doesn't follow a strict scientific method approach: http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.1000242
todd.vision: what journals did we discuss as targets at Knoxville? I don't recall
me: oh, I recall Dlib on the list and I had a bunch of other interdisciplinary science journals
Heather: it doesn't have a lot of the "facebook by haircolour" flavour, but it does have a bit
me: although this might not be a "hard" enough piece for them
1:43 PM todd.vision: The PLoS article is a pretty good model in that it has some quantitative and qualitative data but is mostly an essay
Heather: right
me: so it's not quite an essay or an editorial
Heather: another one we talked about "Learned Publishing"
me: oh yes
Heather: and maybe "Library Trends" ?
1:44 PM me: I don't remember that, but I'll add it to my list
Heather: I think Learned Publishing has some essay-type pieces too.....
todd.vision: I would rather see this read by the publisher and journal world than the librarian world
me: that makes sense
1:45 PM todd.vision: as for the haircolor emphasis, I would suggest putting it upfront, like so...
my use case is that I have a dataset
who has reused it?
how would I find that out?
Heather: "Adventures in Open Data" is an article that I recall might be a good model, published in Learned Publishing
todd.vision: paper citations are too broad a net
can I find data citations?
Heather: will dig up a link, hard to find one now, drat subscription journals
1:46 PM todd.vision: or are people citing articles because of lack of infrastructure and bad habit
me: ok, I think I found a link on ingenta for adventures in open data, so I'll bookmark that for later
todd.vision: as opposed to adventures in semantic publishing?
Heather: ok, so Todd you think it is worth taking a primary-research viewpoint, rather than a repository one?
1:47 PM it seemed to me that the DAAC story was strong (and interesting)
1:48 PM todd.vision: I'm not sure I understand the distinction you are making. The DAAC seems like a case study in how a repository is trying very hard to affect behavior
Heather: well, instead of upfront saying "my use case is that I have a dataset"
todd.vision: But even then, if the authors and journals don't cooperate, then it doesn't work well
1:49 PM So I think the case study is a nice example of the larger point
In fact, it may be a 'box' within the paper, depending on the format of the journal
Heather: I'm suggesting we say upfront that people want to monitor reuse.
case in point, repositories are currently trying to monitor reuse
here is how hard it is for them.
me: so a repository tracking reuse as opposed to a data author attempting to track their own reuse?
1:50 PM Heather: right
hold on, let me dig up our prev chat...
todd.vision: I see... they don't seem that different to me
I can imagine both use cases being described in 1 pgph
1:51 PM is that enough direction for now, Valerie? Nic is still waiting...
Heather: ok, I see what you are saying.
yeah, sorry Todd, these meetings have been going longer than an hour.
me: oh yes, I just needed that reminder to make it less formal and more conversational, as well as the structural comments you've given
1:52 PM Heather: Before we leave Valerie's stuff,
me: but yes, I think I will do another rewrite
Heather: let me ask, hrmmmmm
well, rather than ask let me just suggest that Valerie yes, do another rewrite, but kind of a light one
1:53 PM or rather a total rearrangement of structure and tone etc
but with the idea that the framing etc is still subject to ongoing consideration.....
as in, drafting early and often
1:54 PM because I think that Todd and I have different ideas still about where the paper is going
and so I want to make sure I don't lead you down the road too far one way
ok?
me: ok
I figure I'm keeping multiple versions up anyway
todd.vision: agreed to include results for TreeBASE and Pangaea?
me: all else fails, I'll just combine elements of different drafts that you like
yes
1:55 PM (I needed feedback about if/how much, which you have given)
todd.vision: the multiple version comment scares me a bit
me: what? why?
todd.vision: how do we know that we are working off the same one?
me: oh yeah, good point.
well, the one I put up is article 2.0
1:56 PM the one I had in Knoxville was Article 1
the next one will probably be labeled 3.0 with the changes you suggested in the comments on 2.0 and based on the comments in this meeting
1:57 PM todd.vision: An alternative is for you to archive old versions, and we keep editing the same 'current' version on Google Docs
me: I just didn't want to run the risk of having only one document and have someone say "oh but I think the version 2.0 did a better job of discussing this"
oh, yeah.
that's a much better idea
1:58 PM todd.vision: you can label it v 2.0 or 3.0 at the top and we can notate who has done editing it
me: ok, the comments have names assigned to them as well
thanks
Heather: ok. Nic?
nicholas.m.weber: yes
todd.vision: 'passing the ball' with manuscripts can sometimes get tricky, so it's good to be very clear about who is editing when, and who is not editing when.
1:59 PM nicholas.m.weber: sorry, I don't have much to contribute to Valeries paper yet
me: it's ok. I wasn't very helpful in editing your poster. :(
nicholas.m.weber: I think google docs does well for tracking initially but once big sections start to get removed and tweaked, it gets a bit messy
2:00 PM Heather: yeah, or once you start to need to judge for length based on number of pages in some funky format....
nicholas.m.weber: yes
2:01 PM so yesterday after our chat Heather I ran the confidence intervals for the numbers we provided in our paper
Heather: ok. Nic, how goes your spreadsheet cleanup?
nicholas.m.weber: http://openwetware.org/wiki/DataONE:Notebook/Data_Citation_and_Sharing_Policy/2010/07/19
its in the middle of the page
Heather: Yes, and I saw your thoughts about potential stats
Or rather potential analyses
Good stuff, starting to think it through in that way
nicholas.m.weber: obviously with smaller sample sizes there were large gaps for repositories and funding agencies
I didn't really know where to go from there
Heather: right
2:02 PM nicholas.m.weber: so I went back to cleaning up the ss's
Heather: so we'll get together later today and do some more regression stuff
one thing jumped out at me from your analyses brainstorm
and that was the direction of causation :)
so I think that some of your language was talking about would a and b affect impact factor
2:03 PM and of course a or b could impact impact factor
but it is also likely that impact factor is a correlate for something that is affecting a and b
(perhaps more likely)
me: or it could just be a correlation
nicholas.m.weber: yeah
Heather: right. so a correlation often suggests causation
but not always
and it does little to illuminate what direction the causation goes
2:04 PM for that we need common sense and/or different analysis
so whatever stats we do, we're actually unlikely here to determine causation
(which will be a limitation of this work)
but it makes sense to think about which way we expect the causation to go and talk about it that way
2:05 PM todd.vision: agree that stating hypotheses upfront would be a good way to go
nicholas.m.weber: up front meaning before I run more stats or upfront as in a publication?
2:06 PM todd.vision: in the paper
Heather: and in OWW notebook
todd.vision: because there are an unlimited number of statistical comparisons that you could make
Heather: as you have started to do
todd.vision: the hypotheses will guide which ones you focus on and how you interpret the results
its very different to find a significant correlation when youve been on a fishing expedition for 100 possible correlations
2:07 PM and when you are testing just one
2:08 PM Heather: I think the analyses you mentioned on your OWW page were a good start.
and we can take it from there in figuring out stats
ok. anything else to talk about here with that, Nic?
2:09 PM nicholas.m.weber: ok, but I should focus on a set of correlations that will make sense for the data that I've gathered and try to calculate those-- then present that work
Heather: either in stats, or in data cleanup and re-uploading, or in prelim abstract-writing?
hmmmm.
not quite sure what you mean by the above
nicholas.m.weber: as far as re-uploading, I'm going to put things back into google docs and link off my OWW page
Heather: great
2:10 PM nicholas.m.weber: I should be done with another hour of work
Heather: I wouldn't suggest going into R and doing correlations just yet
Let's talk more about doing those correlations first I think
nicholas.m.weber: ok
2:11 PM Heather: I'm free for the next few hours, so we can find a good time.
nicholas.m.weber: I'm free for the rest of the day, so whenever is convenient
2:12 PM todd.vision: should we be careful about how much effort goes into the data analysis before it is fully cleaned up
Heather: ok. I need to check on something, then let me ping you after this chat about exact timing.
before the data is fully cleaned up, you mean?
todd.vision: yes
Heather: yeah, so mostly I think we are planning what analysis makes sense to run
and learning the R for it
2:13 PM todd.vision: right
that's good
Heather: and sticking initially to the firm fields like impact factor
but I agree, before putting stock in stats that come from looser fields we should make sure the data is clean
2:14 PM todd.vision: and putting effort into recruiting folks to help clean it now while there is still time
Heather: right
nicholas.m.weber: Ok, so as soon as I get them finished this afternoon I can post something on the blog and put them into my OWW
Heather: great.
nicholas.m.weber: and hopefully at least our data_citation group can weigh in and say which fields or particular cells should be revised
2:15 PM todd.vision: 'them' being links to google spreadsheets, or how did you decide to do that collaboratively ?
nicholas.m.weber: I think google spreadsheets will work best for editing
me: I'm inclined to agree
2:16 PM nicholas.m.weber: fusion tables has some nice functionality for commenting, but its really hard to edit in that format
todd.vision: and then ask people to send you what changes they made and why?
Heather: We experimented with fusion but found them hard to edit
todd.vision: too bad
me: yeah, fusion is good for some things, but not this
nicholas.m.weber: maybe for this, I could just upload, allow people to make comments and then redownload them and make changes as necessary
2:17 PM todd.vision: that might be better so you can see what needed cleaning
nicholas.m.weber: that might make things easier - rather than expecting people to notify me about what changes they've made
2:18 PM by upload I mean upload to Fusion Tables
todd.vision: fine
nicholas.m.weber: ok
Heather: ok, then Nic make that your priority for today and we'll touch base about more stats
2:19 PM nicholas.m.weber: great
Heather: Any other questions?
nicholas.m.weber: not from me
me: I think I'm good for now as well.
nicholas.m.weber: questions that is
Heather: ok. Let's have another chat later this week.
me: I'll read the articles you linked and start revising
Heather: Individually, obviously, but also maybe as a group since we're getting close to the end
2:20 PM todd.vision: excellent progress guys
me: thanks for all the guidance
Heather: great, Valerie! Looking forward to seeing it.
nicholas.m.weber: Todd you mentioned a survey or some form of feedback after Knoxville
Heather: bye for now, all.
nicholas.m.weber: is that something we'll do at the end?
todd.vision: I was expecting it to happen about now
2:21 PM Rebecca Koskela will be putting it on surveymonkey soon, I think
me: ah sweet
nicholas.m.weber: ok, just curious ( wanted to make sure I wasn't missing something)
todd.vision: all good with the checks?
nicholas.m.weber: yup
2:22 PM me: I haven't gotten mine yet, but I got in contact with the office who cleared up the whole mistake about how I was in the system as an international employee.
so it'll probably be today or sometime this week
todd.vision: ok, let me know if it doesn't
me: ok, thanks again
nicholas.m.weber: ok, thanks for the meeting today
todd.vision: ciao!
nicholas.m.weber: bye
me: thanks
bye
Heather: bye


 * }