DataONE:meeting notes:15 June 2010 chat

From OpenWetWare

3:35 PM Heather: Hi Sarah! Sorry, I had the window behind things. 3:38 PM Sarah: that's fine.

 i did just now too
Heather: Hi
Sarah: hi. i just had a few, hopefully quick questions about journal sampling design and some data extraction fields I'm debating
Heather: Sounds good!

3:39 PM Sarah: and, just so you know, it's just as easy for me to extract data sharing info as data reuse...I have to read through the materials and methods in detail anyways. I'm timing about 15 min for data extraction (with some, but not complete coding of the data)

 so, first, in terms of taking a time vs. random sample
Heather: great, good news.

3:40 PM it will probably speed up too, after you've done oodles.

Sarah: i hope so! and I'm hoping the coding won't be that bad, it's just quicker to do extraction first and coding in a later batch

3:41 PM Heather: yes, that makes sense. And definitely keep thinking about ways that we can make the extraction more efficient / distributed / automated as it makes sense

3:42 PM Sarah: of course. unfortunately it's looking like assessing data reuse practices will need to be done manually...there is horrible inconsistency in how datasets are cited, but i think we all knew that going in.

Heather: yes, time vs random sample. what are your thoughts?
Sarah: yeah.

3:43 PM I'm thinking my initial "first 25-50 articles in 2010" sampling will still be useful

 let me explain
 i'm thinking that i can go at this from a few different angles
 1. a "snapshot" of current practices. i.e. the first issue for the major journals in 2010
 this will get at the % that cite data and that do it properly.

3:44 PM also it provides qualitative (and quantitative) data on the current state of things

Heather: yes, true
Sarah: also, it makes my preliminary data extraction, which was primarily to determine fields, not go to waste
Heather: yes. that is always nice :)
Sarah: then, we're also comparing all the major journals at the exact same moment in time

3:45 PM 2.

Heather: yes
Sarah: a random sample
 but i want this to consider change over time and journal (so I can stick to the established journals, which makes it easier, and for disciplinary coverage)
 so I'm thinking that I'd block my sampling by year and by journal

3:46 PM then take a set number of articles from each journal per year

Heather: yes, because some journals have many more articles than others, and you don't want them overweighted, correct?
Sarah: perhaps starting with 25-50 per journal per year as i work my way back in time, then going back and adding more if i have time
 yes. exactly

3:47 PM Heather: the reason to block by year isn't quite as clear to me

Sarah: sysbio publishes way less, ecology way more
Heather: because I would imagine the number of articles in any given year in any given journal would be fairly stable?
Sarah: by year to get at change in practices over time
Heather: yes, so totally agreed that changes over time are important
Sarah: and so i'm not sampling articles pre-internet

3:48 PM Heather: hmmm.

Sarah: i'll obviously start with the most recent years and get back as far as i can
Heather: I'm not against the idea of blocking by year, just making sure I understand your thinking.
Sarah: just to make sure i get the same # per year
 for equal sampling
 for change over time assessments

3:49 PM and to avoid pre-1990 articles

 and pre-data sharing policies
Heather: yeah, ok.... I think it would come out mostly in the wash if it were random too, because the number of articles per year is pretty stable within a journal
 and you could keep the articles post 1990
Sarah: this could also be used to correlate with the time of data sharing policy by each individual journal
Heather: yup.
 so I think you could accomplish the same thing without blocking by year, but I don't see any big reason not to block by year, and

3:50 PM blocking by year might make the randomization easier

Sarah: i guess mostly i don't want my sample to yield 5 articles from 2009 and 20 from 2001 for sysbio, but then 20 from 2009 and 5 from 2001 for amnat
 that would bias the results
Heather: yeah. well, a little. but mostly not, because it would be random.....

3:51 PM Sarah: benefitting amnat in this case since datasharing discussions have been more intense in the past decade

Heather: anyway I'll drop it because we are mostly in agreement
 so yes, that sounds good :)
Sarah: ok...blocking by year shouldn't lose anything and it should only help equal sampling, right?

3:52 PM Heather: mostly yes, I think. Blocking does have a few disadvantages....

 it can introduce its own kind of biases

3:53 PM because by forcing equal distribution you are, say, oversampling results pre 2005 because there were fewer articles then relative to 2010

 but I don't think it is an effect worth worrying about
 and I think you are right it will help in other ways
 so I think either way is fine for our purposes
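The blocked design discussed above, a fixed quota of articles per journal per year, can be sketched as follows. The journal names, year range, and quota here are hypothetical placeholders, not the final study plan:

```python
import random

# Hypothetical blocks: the established journals and a post-1990 year range
JOURNALS = ["Ecology", "SysBio", "AmNat"]
YEARS = range(1990, 2011)
N_PER_BLOCK = 25  # articles sampled per journal per year

def blocked_sample(articles_by_block, seed=2010):
    """Draw the same number of articles from every (journal, year) block.

    articles_by_block maps (journal, year) -> list of article IDs.
    Blocks with fewer than N_PER_BLOCK articles contribute all they have.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = {}
    for journal in JOURNALS:
        for year in YEARS:
            pool = articles_by_block.get((journal, year), [])
            sample[(journal, year)] = rng.sample(pool, min(N_PER_BLOCK, len(pool)))
    return sample
```

Equal per-block quotas keep a high-volume journal from swamping a low-volume one, at the cost of the mild oversampling of thin years noted above.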

3:54 PM Sarah: ok. any tips for randomly selecting within my blocks? I've only used random numbers for randomization...well and GIS applications, but that's unrelated.

Heather: nope, I'd do random numbers too
Sarah: i think random #s would work by knowing the range of DOIs and randomly selecting within those

3:55 PM Heather: I don't know of any sophisticated way, and I don't think any is really needed

 yes, agreed
Sarah: so, i just look at the first and last article for that journal that year and generate random numbers in that DOI range, correct?
 i haven't experimented with it yet

3:56 PM Heather: Hmm. What I would probably do is get a list of all articles for that journal that year

 Generate a list of N random numbers between 1 and the length of the articles
 and then just pick the articles that are in the random-numbered place in the list
 I don't think that DOIs are always exactly consecutive
 so I wouldn't count on that

3:57 PM Sarah: okay. that makes sense. will do.

Heather: It would be a bit manually intensive unless you can get a well-formatted list of articles....
 but probably not too bad.
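The selection procedure described above can be sketched in a few lines; assume the article list for one journal-year has already been gathered:

```python
import random

def select_by_position(article_list, n, seed=42):
    """Pick n articles by generating n distinct random positions between
    1 and the length of the list, then taking the articles at those spots.
    This avoids relying on DOIs being consecutive, which they often aren't."""
    rng = random.Random(seed)  # fixed seed so the draw can be reported and repeated
    positions = rng.sample(range(1, len(article_list) + 1), n)
    return [article_list[p - 1] for p in sorted(positions)]

# hypothetical example: pick 25 of 140 articles
articles = [f"article-{i:03d}" for i in range(1, 141)]
chosen = select_by_position(articles, 25)
```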

3:58 PM Sarah: okay, next series of questions about a few data extraction fields

Heather: yup just a sec
 had a quick meeting with Todd after our meeting today
 and he had some feedback that is relevant
 as you choose your random set of articles and figure out which to annotate first

3:59 PM of course ideally you will annotate "all" of them that you intend to, but sometimes life happens or the annotation takes longer than intended

 so all else being equal, he'd vote for you looking for longitudinal trends within a journal first
 so rather than go by year across your 5 (or whatever) journals
 go within one journal first, through all of time, then start on the next journal.
 something like that.

4:00 PM Sarah: ok. yeah. i was thinking i would do my snapshot first and then work back through each year for that journal

Heather: whatcha think?
Sarah: mostly b/c the journals have their own styles which i get used to for data extraction
Heather: yup, snapshot first is great,
 yes, totally agreed on getting used to format :)
 cool! that'll be great
Sarah: but, that leads me to a related question i just thought of
Heather: yup

4:01 PM Sarah: does he/dataone have any preference of which journal i start with?

 and approx how many articles should i extract per year?
Heather: good question, I don't know.
 I think the best idea will be to make a really explicit plan, picking something
Sarah: i was thinking of starting small then working up as time permitted
Heather: and then send it around to mentors saying "this is what I am going to do."

4:02 PM Sarah: i.e. go through ecology with 25 articles per year

Heather: "if you think I should focus on another journal first, let me know by tomorrow"
Sarah: then go through sysbio, etc
 then do another set of 25 per year for ecology as time permitted
Heather: yeah, a really clear plan like that would be great.
Sarah: ok
Heather: then they can critique. give them a day or so to respond (and tell them that).
Sarah: i'll type that up tomorrow, after we meet with nic/valerie

4:03 PM Heather: also post on your oww, but I think an email would be useful in that context too

Sarah: should i give different options and what questions they answer differently?
 and also my current rate of extraction?
Heather: oh, reminds me... meeting tomorrow at 10am pacific, right? changed from 9am. I need to clarify that...
Sarah: to put it all in perspective
 oh. i dunno

4:04 PM Heather: I mean I think we'll make it 10am but I need to tell you guys that :)

 I'll send an email ASAP
Sarah: fine with me
Heather: yeah, perspective is good.
 that said, I'd make your plan self-contained and clear and unambiguous, and make the options separate at the bottom

4:05 PM such that if you don't get feedback, you know what you are going to do, it is the clear stuff at the top

 something like that....
Sarah: ok. i'll figure that out.

4:06 PM so, data extraction questions now?

Heather: yeah, I'm sure it'll be fine.
 yup, ready.
 have you talked data extraction fields with valerie yet,
 just to give me context?
Sarah: no
 was planning on tomorrow
Heather: ok.
 sounds good
Sarah: does she have a google doc? i couldn't find one
 but i thought she mentioned one.

4:07 PM never mind, i'll ask her

Heather: um, yes I think so.
Sarah: ok. anyways
 should i be collecting author metadata?
 # of authors
 discipline of author
 resident country of author
 resident country probably isn't important
Heather: Good question.
Sarah: but easy to obtain from the addresses provided
Heather: Yeah, hmmm.
Sarah: so is general discipline

4:08 PM Heather: I think no, don't bother

 I think there must be a way to get that in a more automated, metadata fashion on the web
Sarah: i think discipline might be impt because it may influence attitude towards data sharing
Heather: I don't know what way yet, but it must exist. Oh, from ISI or something, that would work.
 Yes, disciplines of all authors might be useful for you to capture

4:09 PM because I don't think that would be available in citation databases

 corresponding author and address is though, so I think don't spend time collecting that.
Sarah: zotero, which i'm currently using, only records authors, but not any metadata about them. the article contains their address, which has the name of their department, which usually indicates ecology vs. molecular, etc
Heather: yes. the dept won't be the same for all authors, but it is a good clue

4:10 PM (and it is hard to figure out how to use the departments for all authors, anyway)

Sarah: usually it lists them for all
Heather: really? that's handy.
Sarah: but we could just do primary and assume all
Heather: yeah, so I'd say no, don't collect that.
Sarah: yeah, with the asterisk behind their names
 sorry, i meant just do it for the primary author. was that clear?
Heather: as part of our meeting tomorrow maybe we could figure out which of those data fields are easily extracted from ISI or other sources

4:11 PM how about this... hold off on deciding for sure about those fields

Sarah: ok. zotero lets you customize some of the extraction, but i can't get it to grab things like open-access or not, author affiliation, etc.
Heather: and we'll look tomorrow to see which ones we can easily get.
Sarah: ok. i'll put it on my list of questions for tomorrow
Heather: right.
 are your articles open-access on an article-by-article basis?

4:12 PM = in the journals you are looking at, do the authors have the option to pay to make their particular aritcle OA?

Sarah: yep
Heather: because I think Alex is capturing whether the journal itself is OA.
Sarah: amnat, for instance, carries approx 2 open access articles and 12 subscription articles per issue

4:13 PM Heather: yeah, ok then. I'm guessing we'll have to capture whether the article itself is OA at the article level... so yes, please extract that. Guessing it will be an important variable.

Sarah: i think that's important to know b/c open access could very well relate to more transparent data sharing
Heather: And not available in citation databases
 my prelim thesis data suggests it is, as you might expect :)
Sarah: yeah. for now i copy the issue page which lists all the articles to extract that. clunky, but what can you do?
Heather: great

4:14 PM Sarah: it probably also correlates at a journal level

Heather: "clunky, but what can you do?" love it. that captures multiple months of my research life.

4:15 PM Sarah: anyways, #2

 editor info?
 i'm thinking not relevant
Heather: I agree
Sarah: ditto for length from submission to publication?
Heather: agreed
Sarah: I thought that might be an indicator of the amount of "hassle" the author is put through

4:16 PM in the publication process

Heather: some of these things might be possibly slightly relevant, but we have to draw the line somewhere
Sarah: but maybe that's a long shot
Heather: to make it a reasonable data collection process
 and the more variables we try to correlate, the more datapoints we need, esp if the associations are weak

4:17 PM Sarah: those times are easy to collect, right under the abstract, it just might take a little work to automate the calculation

Heather: I'd guess editor and time to publication fall into the weak category.
Sarah: no, i agree we have to be careful with the stats and not "searching for significance"
Heather: right.
Sarah: ok. agreed, just thought i would check
Heather: though I hear you on "easy to collect"
 there is something to that
 so feel free to go with your gut when that is true

4:18 PM Sarah: i'll collect the time one, but just not calculate it unless it seems relevant. editors is a pain to collect

Heather: and it might be true for some journals and not others....
 mostly, I vote for it not being a key, important variable, and not worth much time.
Sarah: ok

4:19 PM which i think is more important

 sorry, i had them listed in oww, but apparently not in order of importance
Heather: no problem
Sarah: model/software usage or production
 in our chat yesterday you mentioned that there is a depository for models/software
 is that correct?
Heather: right. good question.

4:20 PM I don't know about software, I don't think there is one well defined place

 I do know there are depositories for models, but I don't know how applicable they are to this domain.
 for what it is worth, this came up in my conversation with Todd
 I told him we were just looking at data (not software)
 He was ok with that.
Sarah: it's really hard to tell (in terms of reuse/sharing at least) if the author made it reusable or is just using a cd he bought

4:21 PM Heather: Though, clearly, there could be value in looking at software.

Sarah: so is a model software or not?
Heather: Right. So if it is really hard to tell, then skip it for now.
 Let's keep to our knitting and not bite off too much.
Sarah: i'm inclined to say it's software b/c usually it's associated with a gui
 yeah. i agree
Heather: I think we get to decide.
 Is model info easy to extract or not?

4:22 PM Sarah: especially the molecular papers are horrendous to decide if they are using software vs. equation vs. theoretical idea

 i'm thinking it will be easy for papers that are purely models (i.e. no data reuse/sharing)
Heather: horrendous to decide = don't do it
Sarah: but not for ones that make a model as one step in a bigger analysis
Heather: yeah. then my take is we decide it is out of scope for the current paper

4:23 PM one approach is to have a field that says "might use or generate a model?"

 and you could put "yes" or "maybe"
 and then if someone wanted to go back and look
 they'd have a great headstart
Sarah: okay, but maybe mention in the 'future directions' of the discussion that it should be looked at, based on anecdotal experience with a few papers
Heather: agreed
Sarah: yeah. i can do that in my "data produced" fields

4:24 PM b/c they aren't typically deposited (at least not clearly in the few i've seen)

 i'm dividing those two categories per our group chat yesterday
Heather: we'll do better with more papers abstracted than model or software fields, given our plans.
 so keep that in mind in case it helps you make these tradeoffs
Sarah: ok.
 i'm still interested in the model depository url if you have it by chance, but no biggie

4:25 PM Heather: yeah, let me look. I'll get back to you....

Sarah: ok.
 my last question is...
 typically, anything about sharing/reuse is found in the materials and methods section
Heather: yup

4:26 PM Sarah: can i just extract (i.e. read over) that section, and neglect the rest (and state that clearly in my methods), or should i account for the few times when it's also in the intro/results

Heather: sometimes acknowledgements, are you finding that?
Sarah: nope, mostly just funding there
 but i haven't coded my acknowledgement extractions yet
Heather: some journals like PNAS have a separate "data" section (that isn't what they call it... availability? something like that)
 no, I think given the nature of this study, it is worth reading the whole paper

4:27 PM "reading" = skimming or whatever

Sarah: some have a supplementary data section (internally deposited data), but i haven't encountered a data section yet
Heather: then your paper can be the background info that everyone else uses when they get to just read the methods section :)
Sarah: i've had a few weird ones today that cite genbank in crazy ways, and differently throughout the entire paper, only to suddenly use an accession number in the results

4:28 PM very strange and highlights the lack of best citation practices

Heather: yikes!
 I think you are keeping track of the section, right?
Sarah: yeah. pretty bad. i'm making note of them for "whizbang" discussion
 i keep full text of all relevant sections
Heather: so for example if they use a doi, you will know if they cite the data doi in the body of the paper, in the references list, or both?

4:29 PM Sarah: then code them as i go or save for later

Heather: perfect.
Sarah: yep
 i keep all relevant citations too
Heather: I haven't looked yet.. do you have any of your coded examples online yet?
Sarah: i keep the full text of relevant parts so i don't have to dig it out again
Heather: or do you think you can put a few up before our talk tomorrow?
Sarah: nope

4:30 PM yeah. mostly, my spreadsheet is in disarray right now because i keep adding fields and doing long annotations to myself to fix it, rather than just fixing it

 i'm hoping to resolve that today
Heather: I hear you. Even if it isn't resolved, if you can put it up I think it would be helpful

4:31 PM concrete examples always help

Sarah: yeah. i agree
Heather: and we have all created our share of messy spreadsheets, so no fear there
Sarah: sorry i've been slow on that
Heather: heh, that's life.
Sarah: which reminds me of one complication you might have an opinion on

4:32 PM i'm running into a few papers that cite multiple datasets (i.e. a gene and then a GIS dataset)

Heather: yeah. interesting.
Sarah: i've been putting it all in one field, which is messy. i think i'll parse it out later
 and i don't think it would be that bad, but should i have a set of fields for "dataset1" and then the same set for "dataset 2" etc

4:33 PM i.e. "Depository of dataset 1" and then "depository of dataset 2"

 that then treats them almost as separate records
Heather: oh I see.
Sarah: at least from a database perspective
 rather than article being the smallest nested entity, then dataset is
Heather: here's how I would do it, I think.

4:34 PM Heather: I'm guessing the same dataset types and repositories keep coming up?

 Or that 5 types account for 80% of datatypes or something?
 if that is true, then I'd have columns for each of those five types

4:35 PM Sarah: so like "gene cited" y/n, then "GIS cited" y/n

Heather: so one column for "has DNA sequence data?" and another for "deposited in Genbank?"
 etc for each of the 5 types
Sarah: followed by "gene cited with accession" y/n
Heather: then a misc
Sarah: ok
Heather: yes
Sarah: that's a bit painful in the extraction process, but would be easy enough in coding
Heather: that format will make it easiest to analyze later
Sarah: i've been doing "dropdown menu" type

4:36 PM i.e. "type of dataset cited"

Heather: gotcha.
Sarah: then my standard categories, "gene, GIS, phylo, etc"
 and I've been sticking pretty well to those categories, but then i get GIS and gene
 or what have you
Heather: yeah, so if that method is easiest for you for capture, it is easy enough to translate it automatically into columns

4:37 PM right, but it doesn't work well if there are multiple

 so your call whether you want to switch your data input method to be more generalizable
 or how you want to handle it
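The translation mentioned above, from a single dropdown-style "type of dataset cited" field into one yes/no column per type, could look like this. The category names mirror the examples in the chat, and the comma-separated field format is an assumption:

```python
# Hypothetical common dataset types; anything else falls into "misc"
DATASET_TYPES = ["gene", "GIS", "phylo"]

def type_field_to_columns(cited_field):
    """Turn e.g. "gene, GIS" into per-type boolean columns, so papers
    citing multiple dataset types no longer overflow a single field."""
    cited = {t.strip() for t in cited_field.split(",") if t.strip()}
    row = {f"{t} cited": (t in cited) for t in DATASET_TYPES}
    row["misc cited"] = bool(cited - set(DATASET_TYPES))
    return row

row = type_field_to_columns("gene, GIS")
```

One column per type means more than one can be true at the same time, which a single dropdown field cannot express.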
Sarah: but, this will complicate things even more....with these nutty genbank papers today,
 they cite genbank improperly in mat and meth
 then give an accession # for one of their genes, but not another
 and only in the results section

4:38 PM Heather: yup. arg.

Sarah: so they are citing different genes
Heather: heh!
Sarah: but citing them in totally different ways
 and using the cited genes for totally different purposes
Heather: so an idea then could be to just flag it as a nutty paper
 in some ways, since you and valerie are both abstracting similar things
Sarah: then, do i give it credit for having used an accession number
 even if it didn't mention it in the best place?
 or cite in the biblio

4:39 PM Heather: and you are unlikely to abstract that one the same way, and there aren't too many like that... there might be something to flagging it for revisiting later, after you have more experience and ideas about how to handle it

Sarah: i.e. i have a "how cited" field which is either self, correspondence, url, depository, or accession
 with accession being the "best"
Heather: yup, gotcha.
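Since the "how cited" categories have a natural quality ordering, with accession as the best practice, coding them as ordinals makes later comparison easy. A minimal sketch, assuming these five categories are the full set:

```python
# Ordinal coding of the "how cited" field; higher = better citation practice
HOW_CITED_RANK = {"self": 0, "correspondence": 1, "url": 2, "depository": 3, "accession": 4}

def best_citation_method(methods):
    """For a paper that cites the same dataset several ways, keep the
    best-practice method observed."""
    return max(methods, key=HOW_CITED_RANK.__getitem__)

best = best_citation_method(["url", "accession", "self"])
```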

4:40 PM Sarah: ok. yeah, i think someone else looking at the paper would do it differently

Heather: so I think maybe if there are lots of papers like this
Sarah: at least i encountered a bunch today
Heather: it might tell us that those need to be multiple columns too, so that more than one of them could be true at the same time....
Sarah: yeah...i'll talk with valerie and see what she's running into

4:41 PM okay, i just remembered one more (sorry!)

Heather: fyi this sort of stuff is great. It is exactly the sort of hassles you should be running into at this point. Just to make sure you know that.

4:42 PM Sarah: oh, do "methods" count as datasets? this might take a minute. some authors include additional methods as part of their supplementary

 material that they post with the journal
 this includes explicit protocols
Heather: interesting
Sarah: i've read in some of the datasharing articles that good metadata about methods/processing is needed just as much as the dataset itself
Heather: I'd say no, at this point.
Sarah: so do i count that additional methods info as dataset?

4:43 PM Heather: Agreed, they could be considered data.

 And it could definitely be interesting to look at from that perspective.
Sarah: yeah, but maybe another "future directions" thing to shuffle into the discussion of a manuscript
Heather: But I think it is far enough removed that we don't have the right headspace to think about all of the different variables that we might want to capture to study that well....

4:44 PM so, yes, I would not bite off that part for now.

 if it were easy to annotate "has more methods in supplementary information" then that could be helpful for future researchers

4:45 PM Sarah: yeah, especially when that's all the author makes available of their data, i note that

Heather: but I wouldn't do more than that.
Sarah: ok. well, i think that's all i had

4:47 PM Heather: and messy spreadsheets are fine :) Get them up so we can have a good stare at them together :)

Sarah: yep. that's my goal for the end of today
Heather: great! ok, Sarah, thanks for finding me. Talk to you tomorrow!

4:48 PM Sarah: great. talk to you then