DataONE:meeting notes:14 June 2010 email

From OpenWetWare

Chat with Nicholas, 1 message

Sarah Walker Judson, Mon, Jun 14, 2010 at 11:20 AM

In the chat room: Nicholas Weber

8:56 AM hpiwowar: You've been invited to this chat room! has joined

8:57 AM hpiwowar: Hi guys. Just checking to make sure I know how to do a group chat...

valerie.janae.enriquez: hello
Sarah Judson: good morning. looks like we're all here. are we doing voice or text?
hpiwowar: preferences?
valerie.janae.enriquez: I sort of prefer text because it keeps a more accurate record.
Sarah Judson: text for record, chat for ease
valerie.janae.enriquez: but if people prefer voice, I could
Sarah Judson: i'm good either way

8:58 AM hpiwowar: Nic, you there yet? I know we are a few minutes early....

Nicholas: yup, just saw the chat box

8:59 AM hpiwowar: would you prefer text or voice chat? some votes for text, but worth making sure....

Nicholas: I think text is easiest yet?

9:00 AM hpiwowar: sounds good, let's do chat.

 text chat rather
valerie.janae.enriquez: ok, sounds good
hpiwowar: if at some point in the next 1.5 months anyone would prefer voice, then just say, otherwise we'll default to text chat
 agenda items:

9:01 AM 1. how is the open wet ware experience going? open notebook science in general? things we can learn from each other? problems? etc

9:02 AM 2. how are the projects going? everyone know what everyone else is aiming for? everyone know what they are going for?

 3. interaction and overlap... what data collection are we doing, how can we do it efficiently together
 4. mini goals for end of june. well defined? well documented?
 That's what I have. Other things I haven't included?

9:04 AM Nicholas: 5. Blog, we talked about it, but I wasn't sure if we were going to start

hpiwowar: Yes! great
Sarah Judson: sounds like everything to me
valerie.janae.enriquez: yes

9:05 AM hpiwowar: ok.

 Where to start? OWW stuff? In general, how is that going for people?
Nicholas: I have a question about wetware... I think that I set up a notebook for my project under my username, but we should set one up under the lab DataONE right?
Sarah Judson: ditto
 for me

9:06 AM hpiwowar: Good question, Nic, and I don't really know.

 Pros and cons either way.
 Anyone else have an opinion about it?
Sarah Judson: I've made sure mine is linked to the main page and vice versa
 i keep all my personal/raw notes there (well, I'm working on getting them all up)
valerie.janae.enriquez: I accidentally have one in my name and the other in DataONE. I think as long as there are links tying everything together, it should be ok, right?
hpiwowar: Yeah, I think as long as it is linked it doesn't really matter.

9:07 AM Nicholas: ok

hpiwowar: Advantage to your own name is it makes it clear to you and everyone that it is your work, you get to keep working on it, publishing on it, etc after the summer is over
 From what I understand, if you change your mind about where you want to host it you can always move it later.

9:08 AM So just pick a place and run with it

Nicholas: yes, it seems like moving pages is easy
valerie.janae.enriquez: that's good
hpiwowar: Valerie, how has it been going to take daily notes via the Notebook "Calendar"?

9:09 AM valerie.janae.enriquez: I've been liking it

hpiwowar: It looks like a great way to organize things. working for you?
valerie.janae.enriquez: it has a standard format and people can easily see my progress
 (or lack thereof)

9:10 AM Sarah Judson: i'm not familiar with that, how does it work? i'm grouping my notes by category currently

hpiwowar: LOL. yeah, well, don't worry too much about that. Progress is often in spurts, we all know that, no problem.
valerie.janae.enriquez: if you click a date on the calendar, you can make an entry for that date
Sarah Judson: ok. does it work for reminders? i.e. setting a projected schedule
valerie.janae.enriquez: I'm not sure. I don't know if there's a way for it to send you automatic reminders.

9:11 AM hpiwowar: Not that I know of

valerie.janae.enriquez: I've just been using it as a log.
Sarah Judson: ok.
hpiwowar: Though you could probably click on a date in the future and add a placeholder on it, "the date I plan to be finished with X"
 if that helps you
Sarah Judson: ok. just wondering.

9:12 AM my main question is, did we decide a standard way for commenting on pages that are not our own? should we primarily use Talk and Timestamps?

valerie.janae.enriquez: the dates highlight if there's an entry, so you can kind of see if there's something coming up
Sarah Judson: ok thanks.
hpiwowar: Or a combo of Talk and main-pages with timestamps?
Nicholas: I was just trying to find a "tasks" function, but it seems like what Heather mentioned (adding tasks to future dates) is the only way

9:13 AM hpiwowar: what would you all prefer?

Sarah Judson: i like the timestamp, but am not sure about the separate talk page
valerie.janae.enriquez: I like the timestamp and the talk page (since it keeps discussion separate from the main page)
Sarah Judson: it's nice that it doesn't mess with the original text, but sometimes cumbersome to tell what section someone is commenting on
Nicholas: I like the timestamp as well

9:14 AM hpiwowar: agreed sarah

valerie.janae.enriquez: I see your point there.
Sarah Judson: maybe we could copy the original page onto the talk page, or a specified amount of the section you are commenting on, e.g. a short direct quote
hpiwowar: yup, could do that.

9:15 AM valerie.janae.enriquez: there is a quote function, so that makes sense

Sarah Judson: i dunno, that might also make it cumbersome for the commenter....i think i can follow heather's comments on the research questions well enough.
hpiwowar: one down side about the talk page is that it isn't immediately apparent. people have to know about OWW (or wikis) to look for it.
Sarah Judson: how does the quote function work?

9:16 AM valerie.janae.enriquez: I know it accidentally turns up if i copy/paste conversations with indentations.

hpiwowar: LOL
Sarah Judson: do we attract "outsiders" to the page or do we just assume everyone that's on our team knows how to use a wiki/talk page.
Nicholas: I made the same discovery as Valerie
 I think it requires a double indentation

9:17 AM Sarah Judson: ok..i'll play with it

hpiwowar: hmmm. well, I think for most pages we don't necessarily want to go out of our way to attract attention to them... they will be pretty standard research-in-progress
 that said there will be pages and/or individual questions that we do want answers to or attention
 so one experiment that seemed to work pretty well was that box on the main page, "questions for the community" or whatever

9:18 AM makes me think we could have a few centralized places for "questions" and "active results"

Sarah Judson: maybe we could link those to the talk page with a "comment here" hyperlink
hpiwowar: and then you guys could keep them live as you generate content?
valerie.janae.enriquez: that sounds good

9:19 AM hpiwowar: ok, so we are thinking we'll go with Talk pages as the main place for commenting, for now?

valerie.janae.enriquez: with quotes from the main page for reference?
hpiwowar: yes
valerie.janae.enriquez: ok, that sounds good
Sarah Judson: yeah. i think it will keep clutter to a minimum and also allow crossreferencing
Nicholas: sounds good
hpiwowar: and also updates from the main person the comments are addressed to, so that the talker knows the comments have been read?

9:20 AM valerie.janae.enriquez: sure

9:21 AM Sarah Judson: yeah..all with timestamps for frame of reference

hpiwowar: well, I don't mean immediately after you've read them necessarily, but some feedback saying "hmmm I don't understand" or "yes, I'll do that eventually" or something would help with the communication loop I think
Sarah Judson: yeah. sounds good
valerie.janae.enriquez: ok
hpiwowar: Nic, works for you?

9:22 AM Nicholas: yes

 yes, sorry my connection dropped for a moment

9:23 AM hpiwowar: (btw I don't know if you guys noticed, but when you create a post, right below the edit boxes there is the funky *'Sarah Judson 14:29, 14 June 2010 (EDT) link

 or whatever it is to do the name and timestamp. if you just click on that link, poof, the characters are all inserted for you. handy)
valerie.janae.enriquez: that has been superhelpful

9:24 AM Nicholas: ahhh

hpiwowar: so as you guys experiment with dated entries and we do Talk pages and we dig into it more, do feel free to aggregate the links... be it in "Look at this! This is cool!" boxes at the top of a main page,
Nicholas: just realized that

9:25 AM hpiwowar: or a summary section on your notebook that says "Spent many days collecting data. See pages June 7-June 28 for details" or whatever.

 We'll figure it out... but I think lots of links will probably help.

9:26 AM Sarah Judson: agreed

valerie.janae.enriquez: ok
Sarah Judson: maybe when you take care of a comment, link to the data/link where you resolved it
hpiwowar: ok. In general you are still on board with doing research out in the open?

9:27 AM valerie.janae.enriquez: yes

Sarah Judson: yes. i like it even just for myself to keep track of things
 otherwise i lose my notes in emails or scraps of paper
Nicholas: yes
hpiwowar: I know I've been sending emails, which isn't exactly in the spirit, so I'll try to move more of my commenting to the wiki :)
valerie.janae.enriquez: it does get confusing with all of the back and forth emails
 I'm never sure if I've gotten back to everyone.
 so the talk pages are nice to have as a visual reference

9:28 AM hpiwowar: agreed. good. Well, feel free to redirect any conversation you think could better happen another way, ok?

 I'm very up for that.... I'm learning right along with you guys.
Sarah Judson: are the other mentors best to get a hold of through email or will they be using the OWW?

9:29 AM also, what's the best way to "watch" pages? seems kind of difficult to follow them all even in the watch updates and RSS

hpiwowar: good question. I know most of our direct participants have OWW accounts by now, but I don't know how actively they are watching.
valerie.janae.enriquez: if they have their profile pages up, would we contact them through the talk section of their profile?
hpiwowar: Maybe we could have one central place where we keep questions and update links active for them? Or ?

9:30 AM yes, Valerie, we could.... I'm not sure if they are looking though. We could confirm on the Tuesday meeting, how's that?

valerie.janae.enriquez: ok, sounds good

9:31 AM hpiwowar: One thing that will help with watches is to figure out the main Notebook pages that you will all be updating, so that we know what to make sure we have watches set for.

 Maybe update this page
 everybody with the "Active project pages" links?
Sarah Judson: i think i have all my watches set, it's just cumbersome to look at all the updates in OWW...the RSS feed is nicer but only covers the main page

9:32 AM hpiwowar: yikes, is that true. I hadn't figured that out yet.


9:33 AM wait, I think there is another way, hold on....

 you can set watches by category
 so if you put on all the new pages you create, I think at least all of the

9:34 AM changes would be caught by those watches

 (the fact that they are cumbersome to read is a different issue though....)
valerie.janae.enriquez: ok
hpiwowar: or we could make a category like DataONE_citationsummer2010 or something
Nicholas: put on the entry page? page?

9:35 AM hpiwowar: yes Nic I think it would have to go on every page you create.

valerie.janae.enriquez: it seems to be automatically generated on some pages
Nicholas: I think this goes back to how you set up a notebook

9:36 AM hpiwowar: oh, ok, good. I'm guessing that might be because you created your notebook under the DataONE lab, maybe?

Sarah Judson: is that included when we add the dataone header to our pages?
hpiwowar: yes
valerie.janae.enriquez: ah
Sarah Judson: ok. that'll make it easier. and then it will go to the main RSS feed?

9:37 AM it's easier to read through the changes in google reader than OWW

hpiwowar: I think yes, it will go to this RSS feed:
Nicholas: so if you go to your main entry page, and inset the it will automatically generate for each new entry you start

9:38 AM hpiwowar: Nic, is that a question, or do you know that is true?

Nicholas: I just did it, it's true

9:39 AM hpiwowar: oh good! thanks for the info.

Nicholas: select "customize your entry page"
hpiwowar: by the main entry page do you mean the main "Notebook" page?
 ah hah! thanks

9:40 AM ok. that works then?

valerie.janae.enriquez: neat
hpiwowar: btw is everyone ok with RSS feed of changes, or do those who prefer email know how to set that up?
Sarah Judson: i like the rss

9:41 AM valerie.janae.enriquez: I could do the rss

hpiwowar: Valerie and Nic, you are good with monitoring changes?

9:42 AM Nicholas: can you rss certain notebook or a lab?

hpiwowar: I think so. See info here:
Nicholas: ah, thanks... I set one up initially, but I was getting a lot more than just DataONE updates

9:43 AM hpiwowar: ok. well, let us know if you can't figure out something that works well.

 ok, anything else on OWW or similar before we move on?

9:44 AM Sarah Judson: i'm good. i'll make sure the category with the main RSS catches my changes today

hpiwowar: We'll confirm tomorrow what is the best way to communicate with other mentors.
valerie.janae.enriquez: ok, cool

9:45 AM hpiwowar: For now, if there is something that deserves highlighting, maybe add a link (or a new box, or whatever) here...

 ok. want to talk blog now, or blog after we talk about projects?

9:46 AM valerie.janae.enriquez: maybe projects first?

Sarah Judson: blog now since it's more of a bookkeeping thing
valerie.janae.enriquez: ok
Sarah Judson: i think we just need to confirm the schedule nic sent out
 Monday - - Nic
 Tuesday - - Sarah
 Wednesday - - Valerie
 Thursday - - Guest Post from a Mentor
 Friday - - Some special topic

 Does that work?

9:47 AM valerie.janae.enriquez: I'm fine with Wednesday

hpiwowar: yup
valerie.janae.enriquez: how would the special topic friday work?
Nicholas: I think this was Sarah's suggestion, to use our first entry as an introduction to our individual projects
valerie.janae.enriquez: that makes sense
Sarah Judson: yeah, and lots of links to dataone and the oww
 so people can find both. we should each put a link to the blog on our notebooks

9:48 AM so, for fridays....nic, were you picturing something in particular?

 maybe a q/a among us and the mentors/community?
 or an outside guest post?
Nicholas: Fridays? I was just more thinking of some topic we each posted on, or something we were interested in thinking about over the next week
 nothing in particular

9:49 AM guest posts would be nice

hpiwowar: could be a brief summary of a particularly relevant article, sometimes.

9:50 AM guest posts might be hard to get once a week... worthy goal, but also have something in your back pocket that doesn't take too much work and/or is in your control

Sarah Judson: ok. who will manage the thurs/friday posts?
Nicholas: I can do this

9:51 AM Sarah Judson: either recruiting mentors, posting questions..mostly making sure it happens

9:52 AM Nicholas: I'll try to set up a schedule of mentors and send out some sort of email reminder about days to post

Sarah Judson: great. i can be your backup or your friday helper for deciding what to post then
Nicholas: also, I'll add mentor info to the blog...if there are other things we should link to on our sidebar or elsewhere let me know

9:53 AM Sarah Judson: ok, are you still doing the twitter feeds?

Nicholas: also, we should probably fill in our profiles
valerie.janae.enriquez: any relevant blogs we might find could be on a side blogroll?
Sarah Judson: yeah, i have a few from heather about open science, and then we can work on digging up some others
hpiwowar: sounds good. one more plug for brainstorming a format that is meaningful but also easy. For example, Fridays could be where you ask a question of the community. People (us + mentors + the whole world) could give their take via blog comments / twitter /friendfeed.....

9:54 AM Nicholas: (twitter) yes, can everyone respond with their twitter handle so I can add those feeds to the blog

valerie.janae.enriquez: ok
 that sounds great
hpiwowar: @researchremix
Sarah Judson: i don't have twitter, but thought you had said something about the dataone/datacite feeds
 agreed about friday
hpiwowar: I think don't spend too much time getting a blog roll

9:55 AM really hard to be comprehensive, and others have put time and energy into this....

Nicholas: is there a dataone twitter account?
hpiwowar: I think so
valerie.janae.enriquez: the feed would work for a blogroll (since sometimes blogs don't update regularly and it's best to show things that are recent)

9:56 AM hpiwowar: one idea is to focus on just a few things at once. this week, research kickstart + blog posts etc

valerie.janae.enriquez: great
hpiwowar: next week could focus on finding related twitter accounts, publicising, etc
 do the essential things first

9:57 AM Nicholas: yes

hpiwowar: ok, so Nic you are going to write the first post today, is that right?
Nicholas: yes, I'll post something this afternoon

9:58 AM valerie.janae.enriquez: so we're just doing an intro entry about what we're each doing for this week?

hpiwowar: great! email around the link when it is up.

9:59 AM Nicholas: I was just going to introduce my project in general and say something about where I am beginning

Sarah Judson: sounds good

10:00 AM valerie.janae.enriquez: ok

hpiwowar: I'm happy to take this Thursday.
Nicholas: great
hpiwowar: ok. any more blog things for now?
valerie.janae.enriquez: neat, thanks!

10:01 AM Sarah Judson: nope, i think that's it

hpiwowar: This is turning into a long (though useful) meeting. Everyone available for the next hour???

10:02 AM Sarah Judson: yep

valerie.janae.enriquez: sure
Nicholas: I need to duck out by 1pm central (11 your time)...
hpiwowar: ok. 1 more hour it is, then.

10:03 AM projects and overlap and june.

 Sarah, I was wondering if you've had a chance to put your spreadsheet up on google docs or anything yet?
Sarah Judson: i'm planning on doing it today

10:04 AM i've worked out most of my fields now and am back to the glorious land of consistent internet

hpiwowar: I hear you :)
 ok. Can you easily upload your fields in any way while we talk?
 either posting a list here or ???

10:05 AM Sarah Judson: yeah, i'll hop on docs real quick and post a link once it's up

hpiwowar: perfect. thanks.
Sarah Judson: the spreadsheet also has my questions about overlap/integration
hpiwowar: great
 ok, where shall we start. do you guys feel like you understand each other's projects?
 (do you feel like you understand your own projects?)

10:06 AM valerie.janae.enriquez: I feel like I understand my project better, but haven't gotten around to look at everyone else's work.

Sarah Judson: i think maybe, given the emails i received last week, that we need to clarify the big picture a bit.

10:07 AM valerie.janae.enriquez: yes. what are we ultimately aiming for with our individual projects and combining them?

Sarah Judson: i envisioned it as me with an article approach, valerie with a similar approach but coming from the depository searches, and nic over journal metadata
hpiwowar: ok, then let's start this way
valerie.janae.enriquez: searching for article citations within the actual depositories?
Sarah Judson: i think i'm different from valerie in that i look at the articles regardless of whether they cite data
hpiwowar: if each of you could give a quick 2-3 (?) sentence summary of the way you understand your projects

10:08 AM Sarah Judson: ok.

hpiwowar: I can give a summary of how I see them overlapping
valerie.janae.enriquez: ok

10:09 AM hpiwowar: have at it :)

valerie.janae.enriquez: I am searching through databases and journals for articles that specifically cite data hosted in an online repository, focusing on TreeBASE to start. I have been logging my search terms as well as my results in my lab notebook. There will be a spreadsheet and report following.

10:10 AM Nicholas: I am collecting policy statements on sharing and reuse of data from journals, funding sources, and repositories... along the way I'm also collecting metadata about those entities

Sarah Judson: I am taking a sampling of articles and recording whether or not they cite a data set. this yields % statistics about what types of data/scientists/journals are citing datasets. Then, if a journal does cite data, I note how they cite it (paper, url, dryad, treebase, etc). I see the other projects overlapping with mine by nic collecting journal metadata that can help analyze article trends and valerie collecting data about how the depositories work/what articles are in the depositories and if they are being cited...we may collect similar info about articles in this or valerie may hand off articles to me to analyze.

10:11 AM hpiwowar: great!

 so here are overlaps as I see them:

10:12 AM Sarah and Valerie are looking at similar questions (how is data cited?) but using different datapoints and search strategies.

Sarah Judson: also, different entry strategies
hpiwowar: Sarah is using all articles in given journal issues as her dataset-for-consideration (yes, entry)
Sarah Judson: valerie from depositories and me from article samples

10:13 AM hpiwowar: whereas Valerie is searching across many journal issues for citation of a given repository.

 yes, exactly sarah
valerie.janae.enriquez: ok
hpiwowar: This will give different pictures of the same topic

10:14 AM valerie.janae.enriquez: I knew there was a reason I wanted to export the citations of my search results to connotea/citeulike, I wanted to send those articles to Sarah.

hpiwowar: And it does require quite drastically different annotation/search strategies.
Sarah Judson: yes
hpiwowar: But the data citation data will be similar, so worth coordinating annotation to make sure the same things are being extracted the same way
 so that the data can be integrated as it makes sense

10:15 AM Sarah Judson: mine is a sample (random or time) and valerie's is based on what datasets are published on dryad/etc or cite them in scirus/etc

hpiwowar: yes.
Sarah Judson: agreed. i'm getting my fields posted right now
 and am developing quicker annotation techniques
 i.e. full text extraction and text searching of particular sections
hpiwowar: and sarah yours will be comprehensive for a given journal and time period, whereas valerie's is less likely to be comprehensive

10:16 AM because it is really hard to know how to find all data citations (which is exactly why we are doing this study)

Sarah Judson: will post methods soon and coordinate with valerie. or possibly she forwards articles to me and I extract since my article retrieval is easier.
 that's something we should decide
valerie.janae.enriquez: ok
hpiwowar: yes. let's hold off on deciding who annotates what for a bit longer here....
 because related to nic's too.....
Sarah Judson: yeah sure

10:17 AM valerie.janae.enriquez: ok

hpiwowar: one more link to make is that
 valerie is coming up with search strategies to find data citations
 they won't possibly find all data citations

10:18 AM but sarah's results will help us estimate what % of citations have fallen through the cracks

Sarah Judson: and i'm full text indexing mine and associating them with a tag of "cited dataset" or not, which can allow for test searching and accuracy of search terms
hpiwowar: for example, if sarah's approach finds that 45% of all data citations to TreeBASE use dois, and Valerie does a search that would find all citations using DOIs

10:19 AM Sarah Judson: yeah. see my above comment that came through at the same time.

hpiwowar: then we know that she's finding at least 45% of the citations (assuming similar articles, etc...)
valerie.janae.enriquez: ok
hpiwowar: yup, exactly.
 so Valerie, Sarah's results will be interesting at the end, to figure out the coverage
 and also in the beginning and middle, to help inform what search terms are likely to be useful
 so keep an eye on her spreadsheet :)
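
The coverage reasoning above can be sketched roughly as follows, assuming each data citation in Sarah's comprehensive sample has been tagged with the way it cites the repository. The function names, the style labels ("doi", "url", "accession") and the sample counts are hypothetical, chosen only to reproduce the 45% example; they are not project data.

```python
from collections import Counter


def style_shares(citation_styles):
    """Return the fraction of sampled data citations that use each style."""
    counts = Counter(citation_styles)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {style: n / total for style, n in counts.items()}


def estimated_coverage(citation_styles, findable_styles):
    """Estimate the minimum share of data citations a search can find.

    citation_styles: one style label per data citation in the
        comprehensive sample (hypothetical labels).
    findable_styles: the styles the search strategy is assumed to catch.
    Returns None when the sample holds no data citations.
    """
    shares = style_shares(citation_styles)
    if not shares:
        return None
    return sum(share for style, share in shares.items() if style in findable_styles)


# Hypothetical sample: 20 data citations, 9 of them by DOI.
sample = ["doi"] * 9 + ["url"] * 6 + ["accession"] * 5
print(f"{estimated_coverage(sample, {'doi'}):.0%}")  # 45%
```

The estimate is only a lower bound, and only holds if the sampled articles resemble the ones being searched, as noted in the chat.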

10:20 AM valerie.janae.enriquez: I've been keeping a detailed log on my notebook page.

Sarah Judson: for instance, i only have one article so far that used dryad and a hdl retrieval number, but many cite datasets or treebase/etc in other ways
 yeah, i saw your notes.
 i have a few suggestions that I'll post on that page, though you may have attempted them already
valerie.janae.enriquez: ok, cool. thanks!

10:21 AM hpiwowar: great.

 ok, so now to how does Nic's stuff fit into this?
 Nic is looking at policies
 and obviously the policies inform (though perhaps not as directly as the policy-makers would like) the author's behaviour... and Sarah and Valerie are tracking author behaviour

10:22 AM so we'd really like to correlate attributes of the policies that Nic is digging up with how the authors behave

 within a given journal, funded by their funder, using/sharing data from a given repository

10:23 AM Nic's stuff is a bit different though

 because there aren't many policies on data reuse out there
 mostly data sharing
 and Valerie and Sarah your goals are data reuse annotations
Sarah Judson: i'm also looking at data sharing
hpiwowar: so.... it probably makes sense to couple some data sharing annotations in to the data collection steps

10:24 AM Sarah Judson: i.e. if an article says they posted their data internally (with the journal) or on treebase etc

 so, i'm doing reuse and sharing
 b/c it's easy enough to extract both while in an article
hpiwowar: great, sarah. exactly. so for what it is worth, I didn't envision that as your main endpoint :)
 but I agree it is easy to do if you are already there
 and valuable in and of itself
 and coupled with Nic's stuff

10:25 AM Sarah Judson: i bet it will also come up with valerie's search terms

hpiwowar: so I'm glad you are doing it :)
Sarah Judson: some papers that cite data in the same sentence will say they posted theirs
 anyways, nic....
Nicholas: ... a common thread I'm finding is that journals require an author to submit data for peer review

10:26 AM but not for actual publication

hpiwowar: interesting
Nicholas: it's a bit of a grey area
hpiwowar: definitely make notes on it
Nicholas: to see where the data set will be included in the publication or whether it's just used by the reviewer and then discarded
hpiwowar: could be that correlates with post-publication too, since authors will have already made the effort to find data, document it, etc

10:27 AM yes.

Nicholas: yes
hpiwowar: now there are a few subtleties in the type of data that Nic will need to do correlations with data sharing
Nicholas: I'm wondering, Sarah, are you noting the place in the article that you are finding the citation?
 (abstract, methods etc)

10:28 AM Sarah Judson: yep.

hpiwowar: to see if data sharing % increases with different policies, need to know both a) how many shared data, and also b) how many created data and didn't share it
Sarah Judson: just added that field
Nicholas: great
Sarah Judson: but most is methods
valerie.janae.enriquez: should I add that field as well?
Sarah Judson: yeah. see my fields posted at:

10:29 AM they are a little overkill (editor, etc I might not end up collecting), but I'm whittling it down to the most relevant

hpiwowar: Sarah's approach could capture "b) how many created data and didn't share it" whereas Valerie's search approach won't capture that
Sarah Judson: i think nic's data will help answer "why" questions
valerie.janae.enriquez: ah
hpiwowar: sarah, thanks for the SS. great.
Sarah Judson: for instance, if sysbio is citing more datasets, what is their journal policy?
 or does the nature of the science mandate datasharing?

10:30 AM Nicholas: exactly sarah

Sarah Judson: i'm hoping we can tease that out
hpiwowar: so that is totally fine, it just means that Sarah's comprehensive approach will be the only one that we'll use to correlate with that part of nic's data
Sarah Judson: say, what if sysbio and amnat lead the way on datasharing?
 but, they have totally different scientist clientele?
 maybe then, it's the journal policy that is driving their datasharing
hpiwowar: yup, exactly
Sarah Judson: but we can't tease that apart with just article level data

10:31 AM i.e. extracted info

valerie.janae.enriquez: hm... true
hpiwowar: is this making sense to everybody? or questions?

10:32 AM I'm sure it isn't all making sense... so ask questions....

10:33 AM sarah, looks like some of the columns you were capturing are journal metadata type columns, right?

 so those could be captured as part of nic's work?
Sarah Judson: yep, but notice the comments below about nic/valerie capturing that info instead
 that's to remind me about database linkages
hpiwowar: sorry, didn't get that far....
Sarah Judson: for analysis purposes

10:34 AM hpiwowar: cool.....

 ok. question for you, sarah.
Sarah Judson: it's colored in my version, not in docs yet
hpiwowar: are you comfortable capturing data on "the investigators created a dataset that could have been shared in TreeBASE but they didn't"?
Sarah Judson: you mean a notes section?
hpiwowar: this can be a bit hard to tease out, and a bio background definitely helps

10:35 AM Sarah Judson: categorizing the data myself?

Nicholas: Valerie, is your starting point TreeBASE? So you begin with the data and find out how it is linked to in an associated publication?
Sarah Judson: i'm doing some of that...
valerie.janae.enriquez: I tried doing that but wasn't sure where to get started
 instead, I looked for any article mentioning data in TreeBASE
 and am in the process of narrowing down the search terms
 to exclude terms like "deposited"
hpiwowar: ok, hold on my question while we dig into Nic and Valerie's conversation....
valerie.janae.enriquez: or "submitted"

10:36 AM Nicholas: ok

valerie.janae.enriquez: then I searched for articles mentioning "study or matrix accession numbers"
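
A rough sketch of the narrowing step described above: keep sentences that mention the repository but drop those using deposit-style wording, since those more likely describe sharing than reuse. The function name, the naive sentence splitter, and the sample text are all assumptions for illustration; this is not the search actually run for the project.

```python
import re

# Terms mentioned in the chat as ones to exclude from the search.
EXCLUDE_TERMS = ("deposited", "submitted")


def reuse_candidate_sentences(text, repository="TreeBASE", exclude_terms=EXCLUDE_TERMS):
    """Return sentences that mention the repository without an excluded term.

    Sentence splitting here is deliberately naive (split after ., ! or ?),
    so abbreviations such as "e.g." will produce false breaks.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        mentions_repo = repository.lower() in lowered
        looks_like_deposit = any(term in lowered for term in exclude_terms)
        if mentions_repo and not looks_like_deposit:
            hits.append(sentence.strip())
    return hits


# Hypothetical article text, written for this example only.
text = (
    "Alignments were deposited in TreeBASE. "
    "We obtained the matrix from TreeBASE for reanalysis. "
    "Trees were built with standard methods."
)
print(reuse_candidate_sentences(text))
```

A filter like this will still miss articles that cite only the main website or list datasets in supplementary tables, which is the limitation raised in the conversation.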

10:37 AM hpiwowar: Nic, the thought is that following a citation trail for each dataset individually is very labour intensive. So trying to capture lots at a time with generalized searches instead.

 Does that make sense, or do you have suggestions?

10:38 AM Nicholas: no, that makes sense, I was just curious... a lot of the policies also label data or datasets as "supplementary material" ... I need to look more closely at the structure of some of those articles to see whether I could suggest any search terms

hpiwowar: sounds good
valerie.janae.enriquez: what's interesting is some articles only cite the main website "" without mentioning the accession numbers

10:39 AM and you are right, other articles have supplementary tables of the datasets they used

hpiwowar: yes. arg, eh? good luck ever getting a computer to be able to do any massive automated mining :)
valerie.janae.enriquez: lol, yes.
 things that are machine readable aren't always human friendly and vice versa

10:40 AM hpiwowar: ok, back to "dataset sharing" conversation?

Sarah Judson: I read an article that dealt with this problem...treebase was cited, but the author explicitly stated that they could only find 50% of the datasets they wanted to use
hpiwowar: yikes. do you have a ref for that, sarah? very relevant!
Sarah Judson: valerie, here's the article:

10:41 AM very interesting, even if more anecdotal. i can send the pdf

hpiwowar: super, thanks for that, sarah!
valerie.janae.enriquez: neat, thanks
hpiwowar: yeah, let's keep track of this stuff. can make for some great pointers in discussion sections :)

10:42 AM maybe Sarah could you add a link to it from our main summer page, in an "interesting findings" box or something??? (just an idea)

Sarah Judson: here's a direct quote from the article: "To gauge the frequency of the long-tree problem in the published phylogenetic literature, I conducted an informal survey of Internet-accessible files using Google Scholar ( and a keyword set consisting of “partition,” “MrBayes,” “mitochondrial,” “codon,” “phylogeography,” and “TreeBASE.” The latter 2 terms were intended to bias the sample toward studies with larger numbers of taxa and easily accessed data sets. I examined the first 24 studies that fit these criteria; many proved unusable here because the data sets were not accessible, the character sets were not clearly specified in the TreeBASE files (, or the partitioning details (e.g., parameter linking) were not entirely specified."

 will do
hpiwowar: perfecto
 ok. the data sharing denominator, as I often call it....

10:43 AM because it is the base of any data sharing frequency calculation

 needs an estimate of the number of similar datasets that could have been shared but weren't.
 I've come up with ways to establish this in other domains, but not in this domain

10:44 AM and I don't know how hard/easy it is (But I bet it isn't all that easy).
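
A minimal sketch of the "data sharing denominator" idea described above, assuming the two counts can be obtained somehow (the chat notes that the second one is the hard part). The function name and the example counts are hypothetical and are not taken from any DataONE analysis.

```python
def sharing_frequency(shared: int, not_shared_but_could_have: int) -> float:
    """Fraction of shareable datasets that were actually shared.

    The denominator is every dataset that could have been shared:
    those that were shared plus those that were not.
    """
    denominator = shared + not_shared_but_could_have
    if denominator == 0:
        raise ValueError("no shareable datasets were counted")
    return shared / denominator


# Made-up counts, for illustration only.
print(sharing_frequency(shared=30, not_shared_but_could_have=90))  # 0.25
```

Without an estimate of the second argument, only the numerator is known and no frequency can be reported.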

Sarah Judson: i'm game to try
hpiwowar: Sarah, as the biologist in the group and the one who has read the most articles....
 how doable do you think this would be?
Sarah Judson: i've loosely classified most of the articles independent of keywords
hpiwowar: yup.
Sarah Judson: so, all i would need to do is assess what depository is most relevant

10:45 AM hpiwowar: might help to target a few specific types of data

Sarah Judson: treebase =phylo. genbank = genetics, etc
hpiwowar: yup. so Treebase type data?
 genbank sequence data
 others on the short list?
Sarah Judson: dryad is broad
 the journals have internal "supplementary data"
 at least am nat and sys bio
hpiwowar: yes, and new.
Sarah Judson: daac = gis

10:46 AM hpiwowar: good.

Sarah Judson: right? i haven't looked at the data there yet
 anyways, i was hoping maybe valerie could collect some metadata on the depositories, but i might be better suited
 i dunno

10:47 AM hpiwowar: I think Nic is going to give it a shot

Nicholas: wait so you're trying to quantify the number of related datasets (present in a repository) that were not cited in a publication?
hpiwowar: Nope
Nicholas: ha
hpiwowar: The number of datasets published about that were not deposited in a repository but could have been

10:48 AM valerie.janae.enriquez: which is a lot, I imagine

Sarah Judson: meaning they are published internally or with a url, but not on the depositories?
hpiwowar: So ideally Sarah would add a few columns to her spreadsheet along the lines of:
 "Study produced some data?" "Study produced DNA sequence data? (could go in genbank)"
Sarah Judson: indeed
 kind of already doing this
hpiwowar: "STudy produced phylogrees?" etc
Sarah Judson: phylogenies...yep
Nicholas: gotcha

10:49 AM hpiwowar: great. if you can make it formal that would be super

Sarah Judson: currently, all the data i extracted for cited datasets i also extract for the dataset produced by the paper
 i.e. dataset mention, how cited, where published, where cited in the paper, etc
hpiwowar: let us know if it slows you down excessively to try to quantify/categorize the un-deposited/unshared datasets

10:50 AM Sarah Judson: ok.

 but, do we assume every paper has a dataset?
hpiwowar: because it might... and then we would have to figure out a different solution....
Sarah Judson: or only look into it if they mention their dataset?
Nicholas: we can also determine the number of articles published in an ISI category... for instance, of the ecology publications, 1900 cited a dataset... but 2700 produced data...
hpiwowar: nope, might be papers that didn't create data (in some ways)

10:51 AM like metaanalyses

Sarah Judson: that's only a few of the ones i'm reading, mostly ones that built models rather than collected data
hpiwowar: ah hah!
Sarah Judson: yep, metaanalyses
 or model creation
hpiwowar: ok, then keep track of that
 the models
Sarah Judson: those are the only ones that don't produce data
hpiwowar: because there are places they could be sharing their models ;)
valerie.janae.enriquez: ah
Sarah Judson: but, often they validate or proof their models with real data
 are there model depositories?

10:52 AM hpiwowar: yes, I think so

Sarah Judson: one way i've been trying to get at this is if they cite opensource software
 or say they created an R (statistics) package for running the data
 but i haven't pursued this heavily yet
hpiwowar: interesting

10:53 AM Nicholas: (shoot I am really sorry, but I have to run to a doctor's appointment)... I will check the log of the conversation as soon as I get back

hpiwowar: yeah, that might be correlated....
 ok, bye Nic
Nicholas: sorry guys!
valerie.janae.enriquez: ok, talk to you tomorrow Nic
Nicholas has left
hpiwowar: so Sarah I'd maybe try to come up with the most frequent 5-8 types of data

10:54 AM and include models as one of them.

Sarah Judson: will do
 yeah. i'll re-analyze some of the papers i already finished to think about how to get at this best

10:55 AM hpiwowar: then keep a free text field for strange situations :)
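
One way the annotation columns discussed above could be laid out, sketched under assumptions: a record per paper with the data types it produced, the types found in a repository, and a free-text field for strange situations. The class name, field names, and the three example papers are all hypothetical; the actual spreadsheet columns are not shown in this chat.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PaperAnnotation:
    paper_id: str
    produced: set = field(default_factory=set)   # data types the study produced
    deposited: set = field(default_factory=set)  # data types found in a repository
    notes: str = ""                              # free text for odd cases


def sharing_by_type(annotations):
    """Per data type: deposited count divided by produced count."""
    produced = Counter()
    deposited = Counter()
    for a in annotations:
        produced.update(a.produced)
        # Only count a deposit if the paper also produced that type.
        deposited.update(a.deposited & a.produced)
    return {t: deposited[t] / produced[t] for t in produced}


# Invented example papers, for illustration only.
papers = [
    PaperAnnotation("p1", {"sequence", "phylogeny"}, {"sequence"}),
    PaperAnnotation("p2", {"sequence"}),
    PaperAnnotation("p3", {"model"}, {"model"}, notes="model validated with field data"),
]
rates = sharing_by_type(papers)
```

Treating models as one of the data types, as suggested in the chat, keeps papers that only built models inside the denominator.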

Sarah Judson: you said you did a paper on this in biomed/pubmed...can you send me the doi? i think i have most of your articles, but want to look over that specific one for methods
hpiwowar: before you do that, sync up with Valerie to see if you guys can share columns....
Sarah Judson: yeah, all my fields have a full text field and accompanying coded field
hpiwowar: actually I mostly did a thesis on this :)

10:56 AM Sarah Judson: that way i don't have to go back to the paper

hpiwowar: and alas most of the methods aren't very applicable
Sarah Judson: that's what i thought
valerie.janae.enriquez: neat
hpiwowar: because they rely on PubMed sorts of resources
 but I will send you a related paper that did a manual review
Sarah Judson: yeah, anything would be great so i'm not reinventing the wheel

10:57 AM hpiwowar: Here it is
Sarah Judson: thanks
valerie.janae.enriquez: thanks
hpiwowar: Let me know if you need the PDF
 there is also another
 on genbank

10:58 AM That said, the main focus of this DataONE summer is data citation rather than data reuse....

 so although the two are related, and I think we will be doing a bit of delving into data sharing....
valerie.janae.enriquez: ok
hpiwowar: keep your eyes on the reuse ball, mostly :)
 whooops! in the above
Sarah Judson: yeah, i'm focusing on the dataset citation, but it takes me only a few more minutes (with my previous method) to extract sharing info
hpiwowar: I meant to say That said, the main focus of this DataONE summer is data citation rather than data SHARING....

10:59 AM super. yes, that's great then

 hrm, Valerie, did I just confuse you?

11:00 AM You get why we won't be asking you to extract data sharing or data creation info?

valerie.janae.enriquez: uh...
hpiwowar: Because you mostly won't be finding it with your repository-based search terms :)
valerie.janae.enriquez: ok

11:01 AM hpiwowar: or your doi-search terms, or whatever else.

 you may stumble across data-creation/depositing/sharing, but you aren't looking for them systematically, so the instances you find will not be comprehensive, so they are mostly noise for the purposes of your project.

11:02 AM that make sense? if not, feel free to ask more questions now, or in a day or two as it settles in.

valerie.janae.enriquez: all right. I'm sure as I go on, it will make sense, or I will know what questions to ask
hpiwowar: Sarah, to tie one more link for you
 when you asked about whether it made sense for you to annotate Valerie's papers, my guess is mostly no

11:03 AM because valerie's search terms are mostly not perfect and she captures data-sharing papers as well as data reuse papers

 but she doesn't know which until she opens the full text... and if she is in the full text anyway she might as well extract the reuse details
 That's my assumption, anyway.
 It could be that I'm wrong
valerie.janae.enriquez: like the sentence where it came from

11:04 AM hpiwowar: and so feel free to reorganize as you guys dig in......

11:05 AM valerie.janae.enriquez: ok. I'm sure my searches will change often

hpiwowar: good.
 ok, shall we call it the end of a long chat, then?

11:06 AM I think the only thing we haven't covered much is June deliverables.

 Mostly that is up to you guys
Sarah Judson: ok. i was just meaning that since valerie's search process is time consuming, i could do the standard data extraction on it.
valerie.janae.enriquez: yeah, I'm still a bit unclear about what I will have by then
hpiwowar: Gotcha.
Sarah Judson: or we at least need to coordinate our methods
 okay, sorry, back to June goals

11:07 AM hpiwowar: yes, agreed on coordinate. And it could be that there are benefits to streamlining annotation.... if so, then do it for sure

 Valerie, yes, your project has less clear mini-goals
valerie.janae.enriquez: ok

11:08 AM hpiwowar: one idea could be to have explored

 "variations on words + repository name and doi searches" for TreeBASE and DAAC and Pangaea

11:09 AM or something like that?

 figure out some initial piece that will help you understand feasibility of the project and start to narrow down generalizable methods
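
A small sketch of how "variations on words + repository name" searches might be enumerated before trying them by hand. The repository names come from the chat; "submitted" and "accession number" echo search terms mentioned earlier in the conversation, while "deposited" and the function name are assumptions added for illustration.

```python
from itertools import product

repositories = ["TreeBASE", "DAAC", "Pangaea"]
terms = ["submitted", "deposited", "accession number"]  # illustrative term list


def build_queries(repositories, terms):
    """Pair every repository name with every term as a quoted-phrase query."""
    return [f'"{repo}" "{term}"' for repo, term in product(repositories, terms)]


queries = build_queries(repositories, terms)
for q in queries:
    print(q)
```

Listing the combinations up front makes it easier to record which queries have been tried for each repository and how noisy each one was.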

11:10 AM valerie.janae.enriquez: ok, that makes sense

hpiwowar: decide you will spend 3 days on each of the repositories or something... see what you can get :)
valerie.janae.enriquez: I think I am nearing the limits of what I can find for TreeBASE

11:11 AM hpiwowar: sarah, could you revamp your section of the main research plan page a bit to make it a bit easier to read?

valerie.janae.enriquez: so moving on to Pangaea and DAAC is a good idea
Sarah Judson: yep. that's on the list for today
hpiwowar: great

11:12 AM Valerie, good to know on TreeBASE

Sarah Judson: i'll have the research questions and my personal page/notebook more navigable
hpiwowar: could be that as you move on to other repositories you will get inspired with new ideas for TreeBASE for later. or not.
valerie.janae.enriquez: this is true
hpiwowar: Also browsing Sarah's findings (now, and on an ongoing basis) will probably inspire some ideas
valerie.janae.enriquez: since I keep coming up with new terms
hpiwowar: yeah, I hear you.
valerie.janae.enriquez: I just happened to find a lot of overlaps and false hits

11:13 AM neat

hpiwowar: I find sometimes that experimenting with searches in Google Scholar can help me figure out the signal/noise of a given search.
 (I won't tell you how much time I have spent doing that)

11:14 AM Sarah Judson: once I have annotated a few more articles for which I have confirmed treebase/daac, etc. citations, i can pass them along to you for test searching

 which might help
 i'm keeping an eye for commonly used terms that indicate a dataset
hpiwowar: ok, I'll let you guys figure out the info flow.
valerie.janae.enriquez: ok, excellent
hpiwowar: make use of each other's open notes best you can.

11:15 AM I think I'll just drop off now.....

 keep talking if you want
valerie.janae.enriquez: I've added Sarah's page on my watchlist
hpiwowar: find me if/when you have questions
 and I'll try to be more wiki-based in my comments :)
valerie.janae.enriquez: will do, thanks again!
 should one of us post the conversation somewhere?

11:16 AM hpiwowar: I'm mostly AWOL tomorrow morning due to other meetings, fyi

valerie.janae.enriquez: ok, I'll keep that in mind. we have the big meeting tomorrow at 3 anyway
hpiwowar: yup, and I'll be around for that.
valerie.janae.enriquez: awesome

11:17 AM hpiwowar: yes, either in the correspondence section here
Sarah Judson: i'll take care of posting, i've claimed that as my responsibility for emails and correspondence per emails earlier last week
hpiwowar: or on the calendar pages?
 great, thanks sarah.
valerie.janae.enriquez: neat, thanks
hpiwowar: ok, bye!
valerie.janae.enriquez: talk to you both later
has left

11:18 AM Sarah Judson: see ya. valerie, if you look at my spreadsheet and have additional questions, let me know

valerie.janae.enriquez: ok, sure. let me know if there's anything you think should be in my spreadsheet too.
 (and if you need any help with anything)

11:20 AM Sarah Judson: ok. talk to you tomorrow. I'll probably be online most of the day..even if i'm not visible in gchat (which i'm usually not), so you can ping me

 good luck.
valerie.janae.enriquez: ok, cool. you too!
has left