DataONE:meeting notes:15 June 2010 chat

Chat with hpiwowar@gmail.com
hpiwowar@gmail.com, Tue, Jun 15, 2010 at 4:47 PM
To: walker.sarah.3@gmail.com
3:35 PM Heather: Hi Sarah! Sorry, I had the window behind things.
3:38 PM Sarah: that's fine. i did just now too
Heather: Hi
Sarah: hi. i just had a few, hopefully quick, questions about journal sampling design and some data extraction fields I'm debating
Heather: Sounds good!
3:39 PM Sarah: and, just so you know, it's just as easy for me to extract data sharing info as data reuse...I have to read through the materials and methods in detail anyways. I'm timing about 15 min for data extraction (with some, but not complete, coding of the data). so, first off...in terms of taking a time vs. random sample
Heather: great, good news.
3:40 PM it will probably speed up too, after you've done oodles.
Sarah: i hope so! and I'm hoping the coding won't be that bad, it's just quicker to do extraction first and coding in a later batch
3:41 PM Heather: yes, that makes sense. And definitely keep thinking about ways that we can make the extraction more efficient / distributed / automated as it makes sense
3:42 PM Sarah: of course. unfortunately it's looking like assessing the data reuse practices will need to be done manually...there is horrible inconsistency in how datasets are cited, but i think we all knew that going in.
Heather: yes, time vs random sample. what are your thoughts?
Sarah: yeah. so.
3:43 PM I'm thinking my initial "first 25-50 articles in 2010" sampling will still be useful. let me explain. i'm thinking that i can go at this from a few different angles. 1. a "snapshot" of current practices, i.e. the first issue for the major journals in 2010. this will get at the % that cite data and that do it properly.
3:44 PM also it provides qualitative (and quantitative) data on the current state of things
Heather: yes, true
Sarah: also, it makes my preliminary data extraction, which was primarily to determine fields, not go to waste
Heather: yes. that is always nice :)
Sarah: then, we're also comparing all the major journals at the exact same moment in time
3:45 PM 2.
Heather: yes
Sarah: a random sample. but i want this to consider change over time and journal (so I can stick to the established journals, which makes it easier, and for disciplinary coverage). so I'm thinking that I'd block my sampling by year and by journal
3:46 PM then take a set number of articles from each journal per year
Heather: yes, because some journals have many more articles than others, and you don't want them overweighted, correct?
Sarah: perhaps starting with 25-50 per journal per year as i work my way back in time, then going back and adding more if i have time. yes. exactly
3:47 PM Heather: the reason to block by year isn't quite as clear to me
Sarah: sysbio publishes way less, ecology way more
Heather: because I would imagine the number of articles in any given year in any given journal would be fairly stable?
Sarah: by year to get at change in practices over time
Heather: yes, so totally agreed that changes over time are important
Sarah: and so i'm not sampling articles pre-internet
3:48 PM Heather: hmmm.
Sarah: i'll obviously start with the most recent years and get back as far as i can
Heather: I'm not against the idea of blocking by year, just making sure I understand your thinking.
Sarah: just to make sure i get the same # per year for equal sampling for change over time assessments
3:49 PM and to avoid pre-1990 articles and pre-data sharing policies
Heather: yeah, ok....
I think it would come out mostly in the wash if it were random too, because number of articles per year is pretty stable within a journal, and you could keep the articles post 1990
Sarah: this could also be used to correlate with the time of data sharing policy by each individual journal
Heather: yup. so I think you could accomplish the same thing without blocking by year, but I don't see any big reason not to block by year, and
3:50 PM blocking by year might make the randomization easier
Sarah: i guess mostly i don't want my sample to yield 5 articles from 2009 and 20 from 2001 for sysbio, but then 20 from 2009 and 5 from 2001 for amnat. that would bias the results
Heather: yeah. well, a little. but mostly not, because it would be random.....
3:51 PM Sarah: benefitting amnat in this case since datasharing discussions have been more intense in the past decade
Heather: anyway I'll drop it because we are mostly in agreement. so yes, that sounds good :)
Sarah: ok...blocking by year shouldn't lose anything and it should only help equal sampling, right?
3:52 PM Heather: mostly yes, I think. Blocking does have a few disadvantages.... it can introduce its own kind of biases
3:53 PM because by forcing equal distribution you are, say, oversampling results pre 2005 because there were fewer articles then relative to 2010. but I don't think it is an effect worth worrying about, and I think you are right it will help in other ways, so I think either way is fine for our purposes
3:54 PM Sarah: ok. any tips for randomly selecting within my blocks? I've only used random numbers for randomization...well and GIS applications, but that's unrelated.
Heather: nope, I'd do random numbers too
Sarah: i think random #s would work by knowing the range of doi and randomly selecting within those
3:55 PM Heather: I don't know of any sophisticated way, and I don't think any is really needed. yes, agreed
Sarah: so, i just look at the first and last article for that journal that year and generate random numbers in that range of doi? correct? i haven't experimented with it yet
3:56 PM Heather: Hmm. What I would probably do is get a list of all articles for that journal that year. Generate a list of N random numbers between 1 and the length of the list of articles, and then just pick the articles that are in the random-numbered place in the list. I don't think that DOIs are always exactly consecutive so I wouldn't count on that
3:57 PM Sarah: okay. that makes sense. will do.
Heather: It would be a bit manually intensive unless you can get a well-formatted list of articles.... but probably not too bad.
3:58 PM Sarah: okay, next series of questions about a few data extraction fields
Heather: yup. just a sec. had a quick meeting with Todd after our meeting today and he had some feedback that is relevant as you choose your random set of articles and figure out which to annotate first
3:59 PM of course ideally you will annotate "all" of them that you intend to, but sometimes life happens or the annotation takes longer than intended. so all else being equal, he'd vote for you looking for longitudinal trends within a journal first. so rather than go by year across your 5 (or whatever) journals, go within one journal first, through all of time, then start on the next journal. something like that.
4:00 PM Sarah: ok. yeah. i was thinking i would do my snapshot first and then work back through each year for that journal
Heather: whatcha think?
Sarah: mostly b/c the journals have their own styles which i get used to for data extraction
Heather: yup, snapshot first is great, yes, totally agreed on getting used to format :) cool!
that'll be great
Sarah: but, that leads me to a related question i just thought of
Heather: yup
4:01 PM Sarah: does he/dataone have any preference of which journal i start with? and approx how many articles should i extract per year?
Heather: good question, I don't know. I think the best idea will be to make a really explicit plan, picking something
Sarah: i was thinking of starting small then working up as time permitted
Heather: and then send it around to mentors saying "this is what I am going to do."
4:02 PM Sarah: i.e. go through ecology with 25 articles per year
Heather: "if you think I should focus on another journal first, let me know by tomorrow"
Sarah: then go through sysbio, etc. then do another set of 25 per year for ecology as time permitted
Heather: yeah, a really clear plan like that would be great.
Sarah: ok
Heather: then they can critique. give them a day or so to respond (and tell them that).
Sarah: i'll type that up tomorrow, after we meet with nic/valerie
4:03 PM Heather: also post on your oww, but I think an email would be useful in that context too
Sarah: should i give different options and what questions they answer differently? and also my current rate of extraction?
Heather: oh, reminds me... meeting tomorrow at 10am pacific, right? changed from 9am. I need to clarify that...
Sarah: to put it all in perspective. oh. i dunno
4:04 PM Heather: I mean I think we'll make it 10am but I need to tell you guys that :) I'll send an email ASAP
Sarah: fine with me
Heather: yeah, perspective is good. that said, I'd make your plan self-contained and clear and unambiguous, and make the options separate at the bottom
4:05 PM such that if you don't get feedback, you know what you are going to do: it is the clear stuff at the top. something like that....
Sarah: ok. i'll figure that out.
4:06 PM so, data extraction questions now?
Heather: yeah, I'm sure it'll be fine. yup, ready.
have you talked data extraction fields with valerie yet, just to give me context?
Sarah: no. was planning on tomorrow
Heather: ok. sounds good
Sarah: does she have a google doc? i couldn't find one but i thought she mentioned one.
4:07 PM never mind, i'll ask her
Heather: um, yes I think so.
Sarah: ok. anyways. 1. should i be collecting author metadata? # of authors, discipline of author, resident country of author, etc. resident country probably isn't important
Heather: Good question.
Sarah: but easy to obtain from the addresses provided
Heather: Yeah, hmmm.
Sarah: so is general discipline
4:08 PM Heather: I think no, don't bother. I think there must be a way to get that in a more automated, metadata fashion on the web
Sarah: i think discipline might be impt because it may influence attitude towards data sharing
Heather: I don't know what way yet, but it must exist. Oh, from ISI or something, that would work. Yes, disciplines of all authors might be useful for you to capture
4:09 PM because I don't think that would be available in citation databases. corresponding author and address is though, so I think don't spend time collecting that.
Sarah: zotero, which i'm currently using, only records authors, but not any metadata about them. the article contains their address, which has the name of their department, which usually indicates ecology vs. molecular, etc
Heather: yes. the dept won't be the same for all authors, but it is a good clue
4:10 PM (and it is hard to figure out how to use the departments for all authors, anyway)
Sarah: usually it lists them for all
Heather: really? that's handy.
Sarah: but we could just do primary and assume all
Heather: yeah, so I'd say no, don't collect that.
Sarah: yeah, with the asterisk behind their names. sorry, i meant just do it for the primary author. was that clear?
Heather: as part of our meeting tomorrow maybe we could figure out which of those data fields are easily extracted from ISI or other sources
4:11 PM how about this...
hold off on deciding for sure about those fields
Sarah: ok. zotero lets you customize some of the extraction, but i can't get it to grab things like open-access or not, author affiliation, etc.
Heather: and we'll look tomorrow to see which ones we can easily get.
Sarah: ok. i'll put it on my list of questions for tomorrow
Heather: right. are your articles open-access on an article-by-article basis?
4:12 PM = in the journals you are looking at, do the authors have the option to pay to make their particular article OA?
Sarah: yep
Heather: because I think Alex is capturing whether the journal itself is OA.
Sarah: amnat, for instance, carries approx 2 open access articles and 12 subscription articles per issue
4:13 PM Heather: yeah, ok then. I'm guessing we'll have to capture whether the article itself is OA at the article level... so yes, please extract that. Guessing it will be an important variable.
Sarah: i think that's important to know b/c open access could very well relate to more transparent data sharing
Heather: And not available in citation databases. exactly. my prelim thesis data suggests it is, as you might expect :)
Sarah: yeah. for now i copy the issue page which lists all the articles to extract that. clunky, but what can you do?
Heather: great
4:14 PM Sarah: it probably also correlates at a journal level
Heather: "clunky, but what can you do?" love it. that captures multiple months of my research life.
4:15 PM Sarah: anyways, #2 editor info? i'm thinking not relevant
Heather: I agree
Sarah: ditto for length from submission to publication?
Heather: agreed
Sarah: I thought that might be an indicator of the amount of "hassle" the author is put through
4:16 PM in the publication process
Heather: some of these things might be possibly slightly relevant, but we have to draw the line somewhere
Sarah: but maybe that's a long shot
Heather: to make it a reasonable data collection process. and the more variables we try to correlate, the more datapoints we need, esp if the associations are weak
4:17 PM Sarah: those times are easy to collect, right under the abstract, it just might take a little work to automate the calculation
Heather: I'd guess editor and time to publication fall into the weak category.
Sarah: no, i agree we have to be careful with the stats and not "searching for significance"
Heather: right.
Sarah: ok. agreed, just thought i would check
Heather: though I hear you on "easy to collect". there is something to that, so feel free to go with your gut when that is true
4:18 PM Sarah: i'll collect the time one, but just not calculate it unless it seems relevant. editors is a pain to collect
Heather: and it might be true for some journals and not others.... mostly, I vote for it not being a key, important variable, and not worth much time.
Sarah: ok #3
4:19 PM which i think is more important. sorry, i had them listed in oww, but apparently not in order of importance
Heather: no problem
Sarah: model/software usage or production. in our chat yesterday you mentioned that there is a depository for models/software. is that correct?
Heather: right. good question.
4:20 PM I don't know about software, I don't think there is one well defined place. I do know there are depositories for models, but I don't know how applicable they are to this domain. for what it is worth, this came up in my conversation with Todd. I told him we were just looking at data (not software). He was ok with that.
Sarah: it's really hard to tell (in terms of reuse/sharing at least) if the author made it reusable or is just using a cd he bought
4:21 PM Heather: Though, clearly, there could be value in looking at software.
Sarah: so is a model software or not?
Heather: Right. So if it is really hard to tell, then skip it for now. Let's keep to our knitting and not bite off too much.
Sarah: i'm inclined to say it's software b/c usually it's associated with a gui. yeah. i agree
Heather: I think we get to decide. Is model info easy to extract or not?
4:22 PM Sarah: especially the molecular papers are horrendous to decide if they are using software vs. equation vs. theoretical idea. um... i'm thinking it will be easy for papers that are purely models (i.e. no data reuse/sharing)
Heather: horrendous to decide = don't do it
Sarah: but not for ones that make a model as one step in a bigger analysis
Heather: yeah. then my take is we decide it is out of scope for the current paper
4:23 PM one approach is to have a field that says "might use or generate a model?" and you could put "yes" or "maybe", and then if someone wanted to go back and look they'd have a great headstart
Sarah: okay, but maybe mention in the 'future directions' of the discussion that it should be looked at, based on anecdotal experience with a few papers
Heather: agreed
Sarah: yeah. i can do that in my "data produced" fields
4:24 PM b/c they aren't typically deposited (at least not clearly in the few i've seen). i'm dividing those two categories per our group chat yesterday
Heather: we'll do better with more papers abstracted than model or software fields, given our plans. so keep that in mind in case it helps you make these tradeoffs
Sarah: ok. i'm still interested in the model depository url if you have it by chance, but no biggie
4:25 PM Heather: yeah, let me look. I'll get back to you....
Sarah: ok. my last question is...
typically, anything about sharing/reuse is found in the materials and methods section
Heather: yup
4:26 PM Sarah: can i just extract (i.e. read over) that section, and neglect the rest (and state that clearly in my methods), or should i account for the few times when it's also in the intro/results
Heather: sometimes acknowledgements, are you finding that?
Sarah: nope, mostly just funding there, but i haven't coded my acknowledgement extractions yet
Heather: some journals like PNAS have a separate "data" section (that isn't what they call it... availability? something like that). hmmm. no, I think given the nature of this study, it is worth reading the whole paper
4:27 PM "reading" = skimming or whatever
Sarah: some have a supplementary data section (internally deposited data), but i haven't encountered a data section yet. ok.
Heather: then your paper can be the background info that everyone else uses when they get to just read the methods section :)
Sarah: i've had a few weird ones today that cite genbank in crazy ways and differently throughout the entire paper, only to suddenly use an accession number in the results
4:28 PM very strange and highlights the lack of best citation practices
Heather: yikes! yeah. I think you are keeping track of the section, right?
Sarah: yeah. pretty bad. i'm making note of them for "whizbang" discussion. yeah, i keep full text of all relevant sections
Heather: so for example if they use a doi, you will know if they cite the data doi in the body of the paper, in the references list, or both?
4:29 PM Sarah: then code them as i go or save for later
Heather: perfect.
Sarah: yep. i keep all relevant citations too
Heather: I haven't looked yet.. do you have any of your coded examples online yet?
Sarah: i keep the full text of relevant parts so i don't have to dig it out again
Heather: or do you think you can put a few up before our talk tomorrow?
Sarah: nope
4:30 PM yeah.
mostly, my spreadsheet is in disarray right now because i keep adding fields and doing long annotations to myself to fix it, rather than just fixing it. i'm hoping to resolve that today
Heather: I hear you. Even if it isn't resolved, if you can put it up I think it would be helpful
4:31 PM concrete examples always help
Sarah: yeah. i agree
Heather: and we have all created our share of messy spreadsheets, so no fear there
Sarah: sorry i've been slow on that
Heather: heh, that's life.
Sarah: which reminds me of one complication you might have an opinion on
4:32 PM i'm running into a few papers that cite multiple datasets (i.e. a gene and then a GIS dataset)
Heather: yeah. interesting.
Sarah: i've been putting it all in one field, which is messy. think i'll parse it out later and i don't think it would be that bad, but should i have a set of fields for "dataset 1" and then the same set for "dataset 2" etc
4:33 PM i.e. "Depository of dataset 1" and then "depository of dataset 2". that then treats them almost as separate records
Heather: oh I see.
Sarah: at least from a database perspective. rather than article being the smallest nested entity, then dataset it
Heather: here's how I would do it, I think.
Sarah: *is not "it"
4:34 PM Heather: I'm guessing the same dataset types and repositories keep coming up? Or that 5 types account for 80% of datatypes or something? if that is true, then I'd have columns for each of those five types
4:35 PM Sarah: so like "gene cited" y/n, then "GIS cited" y/n
Heather: so one column for "has DNA sequence data?" and another for "deposited in Genbank?" etc for each of the 5 types
Sarah: followed by "gene cited with accession" y/n etc
Heather: then a misc
Sarah: ok
Heather: yes
Sarah: that's a bit painful in the extraction process, but would be easy enough in coding
Heather: that format will make it easiest to analyze later
Sarah: i've been doing "dropdown menu" type
4:36 PM i.e. "type of dataset cited"
Heather: gotcha.
Sarah: then my standard categories, "gene, GIS, phylo, etc" and I've been sticking pretty well to those categories, but then i get GIS and gene or what have you
Heather: yeah, so if that method is easiest for you in capture it is easy enough to translate it automatically into columns
4:37 PM right, but it doesn't work well if there are multiple. so your call whether you want to switch your data input method to be more generalizable or how you want to handle it
Sarah: but, this will complicate things even more....with these nutty genbank papers today, they cite genbank improperly in mat and meth, then give an accession # for one of their genes, but not another, and only in the results section
4:38 PM Heather: yup. arg.
Sarah: so they are citing different genes
Heather: heh!
Sarah: but citing them in totally different ways and using the cited genes for totally different purposes
Heather: so an idea then could be to just flag it as a nutty paper in some ways, since you and valerie are both abstracting similar things
Sarah: then, do i give it credit for having used an accession number even if it didn't mention it in the best place? or cite in the biblio
4:39 PM Heather: and you are unlikely to abstract that one the same way, and there aren't tooo many like that... there might be something to flagging it for revisiting later, after you have more experience and ideas about how to handle it
Sarah: i.e. i have a "how cited" field which is either self, correspondence, url, depository, or accession, with accession being the "best"
Heather: yup, gotcha.
4:40 PM Sarah: ok. yeah, i think someone else looking at the paper would do it differently
Heather: so I think maybe if there are lots of papers like this
Sarah: yeah....at least i encountered a bunch today
Heather: it might tell us that those need to be multiple columns too, so that more than one of them could be true at the same time....
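[Editor's sketch] The idea discussed above, capturing dataset types in one dropdown-style field and translating it into one y/n column per type plus a misc column, might look roughly like this in Python. The category list, the `has_...` column names, and the semicolon separator are assumptions made for illustration; they are not the actual spreadsheet layout.

```python
# Hypothetical category list; the chat only mentions "gene, GIS, phylo, etc".
KNOWN_TYPES = ["gene", "GIS", "phylo"]


def to_indicator_columns(cell, known_types=KNOWN_TYPES, sep=";"):
    """Turn one multi-valued cell such as "gene; GIS" into y/n columns."""
    values = [v.strip() for v in cell.split(sep) if v.strip()]
    # Match case-insensitively but keep the canonical spelling in column names.
    lookup = {t.lower(): t for t in known_types}
    row = {f"has_{t}": "n" for t in known_types}
    misc = []
    for value in values:
        canonical = lookup.get(value.lower())
        if canonical is not None:
            row[f"has_{canonical}"] = "y"
        else:
            misc.append(value)  # anything outside the known types goes to misc
    row["misc"] = sep.join(misc)
    return row


# A paper citing both a gene and a GIS dataset sets two columns to "y".
example = to_indicator_columns("gene; GIS")
```

Because each type gets its own column, a paper that cites several kinds of dataset no longer has to be forced into a single category. The same approach would apply to a "how cited" field if more than one value can be true for a paper.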
Sarah: yeah...i'll talk with valerie and see what she's running into
4:41 PM okay, i just remembered one more (sorry!)
Heather: fyi this sort of stuff is great. It is exactly the sort of hassles you should be running into at this point. Just to make sure you know that. shoot.
4:42 PM Sarah: oh good, so..do "methods" count as datasets? let me explain...it might take a minute. some authors include additional methods as part of their supplementary material that they post with the journal. this includes explicit protocols
Heather: interesting
Sarah: i've read in some of the datasharing articles that good metadata about methods/processing is needed just as much as the dataset itself
Heather: I'd say no, at this point.
Sarah: so do i count that additional methods info as dataset?
4:43 PM Heather: Agreed, they could be considered data. And it could definitely be interesting to look at from that perspective.
Sarah: yeah, but maybe another "future directions" thing to shuffle into the discussion of a manuscript
Heather: But I think it is far enough removed that we don't have the right headspace to think about all of the different variables that we might want to capture to study that well....
4:44 PM so, yes, I would not bite off that part for now. if it were easy to annotate "has more methods in supplementary information" then that could be helpful for future researchers
4:45 PM Sarah: yeah, especially when that's all the author makes available of their data, i note that
Heather: but I wouldn't do more than that. yup.
Sarah: ok. well, i think that's all i had
4:47 PM Heather: and messy spreadsheets are fine :) Get them up so we can have a good stare at them together :)
Sarah: yep. that's my goal for the end of today
Heather: great! ok, Sarah, thanks for finding me. Talk to you tomorrow!
4:48 PM Sarah: great. talk to you then
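
[Editor's sketch] The selection procedure discussed at 3:56 PM (list all articles for a journal and year, draw N random positions between 1 and the length of the list, take the articles at those positions), combined with blocking by journal and year, could be sketched as follows. This is a minimal illustration in Python and was not part of the chat; the function names, the made-up article listing, and the fixed seed are all assumptions.

```python
import random


def sample_block(articles, n, rng):
    """Pick up to n articles at random from one journal-year list."""
    k = min(n, len(articles))
    # Draw k distinct positions between 1 and the length of the list,
    # then take the articles sitting at those positions.
    positions = rng.sample(range(1, len(articles) + 1), k)
    return [articles[p - 1] for p in sorted(positions)]


def sample_by_journal_and_year(listing, per_block, seed=0):
    """listing maps (journal, year) to a list of article identifiers."""
    rng = random.Random(seed)  # fixed seed so the same draw can be repeated
    return {
        block: sample_block(articles, per_block, rng)
        for block, articles in sorted(listing.items())
    }


# Made-up listing for illustration only: one big and one small journal-year.
listing = {
    ("ecology", 2009): [f"ecology-2009-{i}" for i in range(1, 301)],
    ("sysbio", 2009): [f"sysbio-2009-{i}" for i in range(1, 61)],
}
picked = sample_by_journal_and_year(listing, per_block=25)
```

Sampling positions in a list rather than DOI ranges avoids relying on DOIs being consecutive, and drawing the same number per block keeps a large journal from outweighing a small one.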