DataONE:meeting notes:14 June 2010 email

Chat with hpiwowar@gmail.com, Nicholas, valerie.janae.enriquez@gmail.com
Sarah Walker Judson, Mon, Jun 14, 2010 at 11:20 AM
To: walker.sarah.3@gmail.com
Cc: walker.sarah.3@gmail.com, valerie.janae.enriquez@gmail.com, hpiwowar@gmail.com, nicholas.m.weber@gmail.com

In the chat room: hpiwowar@gmail.com, Nicholas Weber (nicholas.m.weber@gmail.com)

8:56 AM hpiwowar: You've been invited to this chat room! valerie.janae.enriquez@gmail.com has joined 8:57 AM hpiwowar: Hi guys. Just checking to make sure I know how to do a group chat... valerie.janae.enriquez: hello Sarah Judson: good morning. looks like we're all here. are we doing voice or text? hpiwowar: preferences? valerie.janae.enriquez: I sort of prefer text because it keeps a more accurate record. Sarah Judson: text for record, chat for ease valerie.janae.enriquez: but if people prefer voice, I could Sarah Judson: i'm good either way 8:58 AM hpiwowar: Nic, you there yet? I know we are a few minutes early.... Nicholas: yup, just saw the chat box hi 8:59 AM hpiwowar: would you prefer text or voice chat? some votes for text, but worth making sure.... Nicholas: I think text is easiest yet? yes? 9:00 AM hpiwowar: sounds good, let's do chat. text chat rather valerie.janae.enriquez: ok, sounds good hpiwowar: if at some point in the next 1.5 months anyone would prefer voice, then just say, otherwise we'll default to text chat agenda items: 9:01 AM how is the open wet ware experience going? open notebook science in general? things we can learn from each other? problems? etc 9:02 AM 2. how are the projects going? everyone know what everyone else is aiming for? everyone know what they are going for? 3. interaction and overlap... what data collection are we doing, how can we do it efficiently together 4. mini goals for end of june. well defined? well documented? That's what I have. Other things I haven't included? 9:04 AM Nicholas: 5. Blog, we talked about it, but I wasn't sure if we were going to start hpiwowar: Yes! great Sarah Judson: sounds like everything to me valerie.janae.enriquez: yes 9:05 AM hpiwowar: ok. Where to start? OWW stuff? In general, how is that going for people? Nicholas: I have a question about wetware... I think that I set up a notebook for my project under my username, but we should set one up under the lab DataONE right? 
Sarah Judson: ditto for me 9:06 AM hpiwowar: Good question, Nic, and I don't really know. Pros and cons either way. Anyone else have an opinion about it? Sarah Judson: I've made sure mine is linked to the main page and vice versa i keep all my personal/raw notes there (well, I'm working on getting them all up) valerie.janae.enriquez: I accidentally have one in my name and the other in DataONE. I think as long as there are links tying everything together, it should be ok, right? hpiwowar: Yeah, I think as long as it is linked it doesn't really matter. 9:07 AM Nicholas: ok hpiwowar: Advantage to your own name is it makes it clear to you and everyone that it is your work, you get to keep working on it, publishing on it, etc after the summer is over From what I understand, if you change your mind about where you want to host it you can always move it later. 9:08 AM So just pick a place and run with it Nicholas: yes, it seems like moving pages is easy valerie.janae.enriquez: that's good hpiwowar: Valerie, how has it been going to take daily notes via the Notebook "Calendar"? 9:09 AM valerie.janae.enriquez: I've been liking it hpiwowar: It looks like a great way to organize things. working for you? valerie.janae.enriquez: it has a standard format and people can easily see my progress (or lack thereof) 9:10 AM Sarah Judson: i'm not familiar with that, how does it work? i'm grouping my notes by category currently hpiwowar: LOL. yeah, well, don't worry too much about that. Progress is often in spurts, we all know that, no problem. valerie.janae.enriquez: if you click a date on the calendar, you can make an entry for that date Sarah Judson: ok. does it work for reminders? i.e. setting a projected schedule valerie.janae.enriquez: I'm not sure. I don't know if there's a way for it to send you automatic reminders. 9:11 AM hpiwowar: Not that I know of valerie.janae.enriquez: I've just been using it as a log. Sarah Judson: ok.
hpiwowar: Though you could probably click on a date in the future and add a placeholder on it, "the date I plan to be finished with X" if that helps you Sarah Judson: ok. just wondering. 9:12 AM my main question is, did we decide a standard way for commenting on pages that are not our own? should we primarily use Talk and Timestamps? valerie.janae.enriquez: the dates highlight if there's an entry, so you can kind of see if there's something coming up Sarah Judson: ok thanks. hpiwowar: Or a combo of Talk and main-pages with timestamps? Nicholas: I was just trying to find a "tasks" function, but it seems like what Heather mentioned (adding tasks to future dates) is the only way 9:13 AM hpiwowar: what would you all prefer? Sarah Judson: i like the timestamp, but am not sure about the separate talk page valerie.janae.enriquez: I like the timestamp and the talk page (since it keeps discussion separate from the main page) Sarah Judson: it's nice that it doesn't mess with the original text, but sometimes cumbersome to tell what section someone is commenting on Nicholas: I like the timestamp as well 9:14 AM hpiwowar: agreed sarah valerie.janae.enriquez: I see your point there. Sarah Judson: maybe we could copy the original page onto the talk page, or a specified amount of the section you are commenting on...like a short direct quote hpiwowar: yup, could do that. 9:15 AM valerie.janae.enriquez: there is a quote function, so that makes sense Sarah Judson: i dunno, that might also make it cumbersome for the commenter....i think i can follow heather's comments on the research questions well enough. hpiwowar: one down side about the talk page is that it isn't immediately apparent. people have to know about OWW (or wikis) to look for it. Sarah Judson: how does the quote function work? 9:16 AM valerie.janae.enriquez: I know it accidentally turns up if i copy/paste conversations with indentations.
hpiwowar: LOL Sarah Judson: heather...how do we attract "outsiders" to the page or do we just assume everyone that's on our team knows how to use a wiki/talk page. Nicholas: I made the same discovery as Valerie I think it requires a double indentation 9:17 AM Sarah Judson: ok..i'll play with it hpiwowar: hmmm. well, I think for most pages we don't necessarily want to go out of our way to attract attention to them... they will be pretty standard research-in-progress that said there will be pages and/or individual questions that we do want answers to or attention so one experiment that seemed to work pretty well was that box on the main page, "questions for the community" or whatever 9:18 AM makes me think we could have a few centralized places for "questions" and "active results" Sarah Judson: maybe we could link those to the talk page with a "comment here" hyperlink hpiwowar: and then you guys could keep them live as you generate content? true valerie.janae.enriquez: that sounds good 9:19 AM hpiwowar: ok, so we are thinking we'll go with Talk pages as the main place for commenting, for now? valerie.janae.enriquez: with quotes from the main page for reference? hpiwowar: yes valerie.janae.enriquez: ok, that sounds good Sarah Judson: yeah. i think it will keep clutter to a minimum and also allow crossreferencing Nicholas: sounds good hpiwowar: and also updates from the main person the comments are addressed to, so that the talker knows the comments have been read? 9:20 AM valerie.janae.enriquez: sure 9:21 AM Sarah Judson: yeah..all with timestamps for frame of reference hpiwowar: well, I don't mean immediately after you've read them necessarily, but some feedback saying "hmmm I don't understand" or "yes, I'll do that eventually" or something would help with the communication loop I think Sarah Judson: yeah. sounds good valerie.janae.enriquez: ok hpiwowar: Nic, works for you?
9:22 AM Nicholas: yes yes, sorry my connection dropped for a moment 9:23 AM hpiwowar: (btw I don't know if you guys noticed, but when you create a post, right below the edit boxes there is the funky *'Sarah Judson 14:29, 14 June 2010 (EDT) link or whatever it is to do the name and timestamp. if you just click on that link, poof, the characters are all inserted for you. handy) valerie.janae.enriquez: that has been superhelpful 9:24 AM Nicholas: ahhh hpiwowar: so as you guys experiment with dated entries and we do Talk pages and we dig into it more, do feel free to aggregate the links... be it in "Look at this! This is cool!" boxes at the top of a main page, Nicholas: just realized that 9:25 AM hpiwowar: or a summary section on your notebook that says "Spent many days collecting data. See pages June 7-June 28 for details" or whatever. We'll figure it out... but I think lots of links will probably help. 9:26 AM Sarah Judson: agreed valerie.janae.enriquez: ok Sarah Judson: maybe when you take care of a comment, link to the data/link where you resolved it hpiwowar: ok. In general you are still on board with doing research out in the open? 9:27 AM valerie.janae.enriquez: yes Sarah Judson: yes. i like it even just for myself to keep track of things otherwise i lose my notes in emails or scraps of paper Nicholas: yes hpiwowar: I know I've been sending emails, which isn't exactly in the spirit, so I'll try to move more of my commenting to the wiki :) cool valerie.janae.enriquez: it does get confusing with all of the back and forth emails  I'm never sure if I've gotten back to everyone.  so the talk pages are nice to have as a visual reference 9:28 AM hpiwowar: agreed. good. Well, feel free to redirect any conversation you think could better happen another way, ok?  I'm very up for that.... I'm learning right along with you guys. Sarah Judson: are the other mentors best to get a hold of through email or will they be using the OWW?
9:29 AM also, what's the best way to "watch" pages...it seems kind of difficult to follow them all even in the watch updates and RSS hpiwowar: good question. I know most of our direct participants have OWW accounts by now, but I don't know how actively they are watching. valerie.janae.enriquez: if they have their profile pages up, would we contact them through the talk section of their profile? hpiwowar: Maybe we could have one central place where we keep questions and update links active for them? Or ? 9:30 AM yes, Valerie, we could.... I'm not sure if they are looking though. We could confirm on the Tuesday meeting, how's that? valerie.janae.enriquez: ok, sounds good 9:31 AM hpiwowar: One thing that will help with watches is to figure out the main Notebook pages that you will all be updating, so that we know what to make sure we have watches set for. Maybe update this page http://www.openwetware.org/wiki/DataONE:Notebook/Summer_2010 everybody with the "Active project pages" links? Sarah Judson: i think i have all my watches set, it's just cumbersome to look at all the updates in OWW...the RSS feed is nicer but only covers the main page 9:32 AM hpiwowar: yikes, is that true. I hadn't figured that out yet. hmmmm. ideas? 9:33 AM wait, I think there is another way, hold on.... you can set watches by category so if you put on all the new pages you create, I think at least all of the 9:34 AM changes would be caught by those watches (the fact that they are cumbersome to read is a different issue though....) valerie.janae.enriquez: ok true hpiwowar: or we could make a category like DataONE_citationsummer2010 or something Nicholas: put on the entry page? page? 9:35 AM hpiwowar: yes Nic I think it would have to go on every page you create. valerie.janae.enriquez: it seems to be automatically generated on some pages Nicholas: I think this goes back to how you set up a notebook 9:36 AM hpiwowar: oh, ok, good. 
I'm guessing that might be because you created your network under the DataONE lab, maybe? Sarah Judson: is that included when we add the dataone header to our pages? hpiwowar: yes valerie.janae.enriquez: ah Sarah Judson: ok. that'll make it easier. and then it will go to the main RSS feed? 9:37 AM it's easier to read through the changes in google reader than OWW hpiwowar: I think yes, it will go to this RSS feed: http://openwetware.org/index.php?filter=DataONE&feed=rss&title=Special:Recentchanges Nicholas: so if you go to your main entry page, and insert the it will automatically generate for each new entry you start 9:38 AM hpiwowar: Nic, is that a question, or do you know that is true? Nicholas: I just did it, it's true 9:39 AM hpiwowar: oh good! thanks for the info. Nicholas: select "customize your entry page" hpiwowar: by the main entry page do you mean the main "Notebook" page? ah hah! thanks 9:40 AM ok. that works then? valerie.janae.enriquez: neat hpiwowar: btw is everyone ok with RSS feed of changes, or do those who prefer email know how to set that up? Sarah Judson: i like the rss 9:41 AM valerie.janae.enriquez: I could do the rss hpiwowar: Valerie and Nic, you are good with monitoring changes? 9:42 AM Nicholas: can you rss certain notebook or a lab? hpiwowar: I think so. See info here: http://www.openwetware.org/wiki/Help:RSS Nicholas: ah, thanks... I set one up initially, but I was getting a lot more than just DataONE updates 9:43 AM hpiwowar: ok. well, let us know if you can't figure out something that works well. ok, anything else on OWW or similar before we move on? 9:44 AM Sarah Judson: i'm good. i'll make sure the category with the main RSS catches my changes today hpiwowar: We'll confirm tomorrow what is the best way to communicate with other mentors.
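A note on the feed discussed above: if a feed still carries more than just DataONE updates, it can also be narrowed on the reader's side. The following is only a minimal sketch, not something OWW provides. The sample feed text, its page titles and links, and the helper name `filter_items` are made up for illustration, and it assumes the feed is plain RSS 2.0 with `item`, `title`, and `link` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a downloaded recent-changes feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Recent changes</title>
    <item>
      <title>DataONE:Notebook/Summer 2010</title>
      <link>http://example.org/a</link>
    </item>
    <item>
      <title>Some other lab page</title>
      <link>http://example.org/b</link>
    </item>
    <item>
      <title>DataONE:Notebook/Data Citation</title>
      <link>http://example.org/c</link>
    </item>
  </channel>
</rss>"""


def filter_items(feed_xml, prefix):
    """Return (title, link) pairs for feed items whose title starts with prefix."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        if title.startswith(prefix):
            hits.append((title, item.findtext("link", default="")))
    return hits


for title, link in filter_items(SAMPLE_FEED, "DataONE:"):
    print(title, link)
```

Filtering on the page-title prefix is just one possible choice; a category or author field would work the same way if the feed exposes one.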
valerie.janae.enriquez: ok, cool 9:45 AM hpiwowar: For now, if there is something that deserves highlighting, maybe add a link (or a new box, or whatever) here...http://www.openwetware.org/wiki/DataONE:Notebook/Summer_2010 ok. want to talk blog now, or blog after we talk about projects? 9:46 AM valerie.janae.enriquez: maybe projects first? Sarah Judson: blog now since it's more of a bookkeeping thing valerie.janae.enriquez: ok Sarah Judson: i think we just need to confirm the schedule nic sent out Monday - - Nic Tuesday - - Sarah Wednesday - - Valerie Thursday - - Guest Post from a Mentor Friday - - Some special topic Does that work? 9:47 AM valerie.janae.enriquez: I'm fine with Wednesday hpiwowar: yup valerie.janae.enriquez: how would the special topic friday work? Nicholas: I think this was Sarah's suggestion, to use our first entry as an introduction to our individual projects valerie.janae.enriquez: that makes sense Sarah Judson: yeah, and lots of links to dataone and the oww so people can find both. we should each put a link to the blog on our notebooks 9:48 AM so, for fridays....nic, were you picturing something in particular? maybe a q/a among us and the mentors/community? or an outside guest post? Nicholas: Fridays? I was just more thinking of some topic we each posted on, or something we were interested in thinking about over the next week nothing in particular 9:49 AM guest posts would be nice hpiwowar: could be a brief summary of a particularly relevant article, sometimes. 9:50 AM guest posts might be hard to get once a week... worthy goal, but also have something in your back pocket that doesn't take too much work and/or is in your control Sarah Judson: ok. who will manage the thurs/friday posts?
Nicholas: I can do this 9:51 AM Sarah Judson: either recruiting mentors, posting questions..mostly making sure it happens 9:52 AM Nicholas: I'll try to set up a schedule of mentors and send out some sort of email reminder about days to post Sarah Judson: great. i can be your backup or your friday helper for deciding what to post then Nicholas: also, I'll add mentor info to the blog...if there are other things we should link to on our sidebar or elsewhere let me know 9:53 AM Sarah Judson: ok, are you still doing the twitter feeds? Nicholas: also, we should probably fill in our profiles valerie.janae.enriquez: any relevant blogs we might find could be on a side blogroll? Sarah Judson: yeah, i have a few from heather about open science, and then we can work up digging up some others hpiwowar: sounds good. one more plug for brainstorming a format that is meaningful but also easy. For example, Fridays could be where you ask a question of the community. People (us + mentors + the whole world) could give their take via blog comments / twitter /friendfeed..... 9:54 AM Nicholas: (twitter) yes, can everyone respond with their twitter handle so I can add those feeds to the blog valerie.janae.enriquez: ok that sounds great hpiwowar: @researchremix Sarah Judson: i don't have twitter, but thought you had said something about the dataone/datacite feeds agreed about friday hpiwowar: I think don't spend too much time getting a blog roll 9:55 AM really hard to be comprehensive, and others have put time and energy into this.... Nicholas: is there a dataone twitter account? hpiwowar: I think so valerie.janae.enriquez: the feed would work for a blogroll (since sometimes blogs don't update regularly and it's best to show things that are recent) 9:56 AM hpiwowar: one idea is to focus on just a few things at once.
this week, research kickstart + blog posts etc valerie.janae.enriquez: great hpiwowar: next week could focus on finding related twitter accounts, publicising, etc do the essential things first 9:57 AM Nicholas: yes hpiwowar: ok, so Nic you are going to write the first post today, is that right? Nicholas: yes, I'll post something this afternoon 9:58 AM valerie.janae.enriquez: so we're just doing an intro entry about what we're each doing for this week? hpiwowar: great! email around the link when it is up. 9:59 AM Nicholas: I was just going to introduce my project in general and say something about where I am beginning Sarah Judson: sounds good 10:00 AM valerie.janae.enriquez: ok hpiwowar: I'm happy to take this Thursday. Nicholas: great hpiwowar: ok. any more blog things for now? valerie.janae.enriquez: neat, thanks! 10:01 AM Sarah Judson: nope, i think that's it hpiwowar: This is turning into a long (though useful) meeting. Everyone available for the next hour??? 10:02 AM Sarah Judson: yep valerie.janae.enriquez: sure Nicholas: I need to duck out by 1pm central (11 your time)... hpiwowar: ok. 1 more hour it is, then. 10:03 AM projects and overlap and june. Sarah, I was wondering if you've had a chance to put your spreadsheet up on google docs or anything yet? Sarah Judson: i'm planning on doing it today 10:04 AM i've worked out most of my fields now and am back to the glorious land of consistent internet hpiwowar: I hear you :) ok. Can you easily upload your fields in any way while we talk?  either posting a list here or ??? 10:05 AM Sarah Judson: yeah, i'll hop on docs real quick and post a link once it's up hpiwowar: perfect. thanks. Sarah Judson: the spreadsheet also has my questions about overlap/integration hpiwowar: great  ok, where shall we start. do you guys feel like you understand each other's projects?  (do you feel like you understand your own projects?)
10:06 AM valerie.janae.enriquez: I feel like I understand my project better, but haven't gotten around to look at everyone else's work. Sarah Judson: i think maybe, given the emails i received last week, that we need to clarify the big picture a bit. 10:07 AM valerie.janae.enriquez: yes. what are we ultimately aiming for with our individual projects and combining them? Sarah Judson: i envisioned it as me with an article approach, valerie with a similar approach but coming from the depository searches, and nic over journal metadata hpiwowar: ok, then let's start this way valerie.janae.enriquez: searching for article citations within the actual depositories? Sarah Judson: i think i'm different from valerie in that i look at the articles regardless of whether they cite data hpiwowar: if each of you could give a quick 2-3 (?) sentence summary of the way you understand your projects 10:08 AM Sarah Judson: ok. hpiwowar: I can give a summary of how I see them overlapping valerie.janae.enriquez: ok 10:09 AM hpiwowar: have at it :) valerie.janae.enriquez: I am searching through databases and journals for articles that specifically cite data hosted in an online repository, focusing on TreeBASE to start. I have been logging my search terms as well as my results in my lab notebook. There will be a spreadsheet and report following. 10:10 AM Nicholas: I am collecting policy statements on sharing and reuse of data from journals, funding sources, and repositories... along the way I'm also collecting metadata about those entities Sarah Judson: I am taking a sampling of articles and recording whether or not they cite a data set. this yields % statistics about what types of data/scientists/journals are citing datasets. Then, if a journal does cite data, I note how they cite it (paper, url, dryad, treebase, etc).
I see the other projects overlapping with mine by nic collecting journal metadata that can help analyze article trends and valerie collecting data about how the depositories work/what articles are in the depositories and if they are being cited...we may collect similar info about articles in this or valerie may hand off articles to me to analyze. 10:11 AM hpiwowar: great! so here are overlaps as I see them: 10:12 AM Sarah and Valerie are looking at similar questions (how is data cited?) but using different datapoints and search strategies. Sarah Judson: also, different entry strategies hpiwowar: Sarah is using all articles in given journal issues as her dataset-for-consideration (yes, entry) Sarah Judson: valerie from depositories and me from article samples 10:13 AM hpiwowar: whereas Valerie is searching across many journal issues for citation of a given repository. yes, exactly sarah valerie.janae.enriquez: ok hpiwowar: This will give different pictures of the same topic 10:14 AM valerie.janae.enriquez: I knew there was a reason I wanted to export the citations of my search results to connotea/citeulike, I wanted to send those articles to Sarah. hpiwowar: And it does require quite drastically different annotation/search strategies. Sarah Judson: yes hpiwowar: But the data citation data will be similar, so worth coordinating annotation to make sure the same things are being extracted the same way so that the data can be integrated as it makes sense 10:15 AM Sarah Judson: mine is a sample (random or time) and valerie's is based on what datasets are published on dryad/etc or cite them in Scirus/etc hpiwowar: yes. Sarah Judson: agreed. i'm getting my fields posted right now and am developing quicker annotation techniques i.e.
full text extraction and text searching of particular sections hpiwowar: and sarah yours will be comprehensive for a given journal and time period, whereas valerie's is less likely to be comprehensive 10:16 AM because it is really hard to know how to find all data citations (which is exactly why we are doing this study) Sarah Judson: will post methods soon and coordinate with valerie. or possibly she forwards articles to me and I extract since my article retrieval is easier. that's something we should decide valerie.janae.enriquez: ok hpiwowar: yes. let's hold off on deciding who annotates what for a bit longer here.... because related to nic's too..... Sarah Judson: yeah sure 10:17 AM valerie.janae.enriquez: ok hpiwowar: one more link to make is that valerie is coming up with search strategies to find data citations they won't possibly find all data citations 10:18 AM but sarah's results will help us estimate what % of citations have fallen through the cracks Sarah Judson: and i'm full text indexing mine and associating them with a tag of "cited dataset" or not, which can allow for test searching and accuracy of search terms hpiwowar: for example, if sarah's approach finds that 45% of all data citations to TreeBASE use dois, and Valerie does a search that would find all citations using DOIs 10:19 AM Sarah Judson: yeah. see my above comment that came through at the same time. hpiwowar: then we know that she's finding at least 45% of the citations (assuming similar articles, etc...) valerie.janae.enriquez: ok hpiwowar: yup, exactly. so Valerie, Sarah's results will be interesting at the end, to figure out the coverage and also in the beginning and middle, to help inform what search terms are likely to be useful so keep an eye on her spreadsheet :) 10:20 AM valerie.janae.enriquez: I've been keeping a detailed log on my notebook page.
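The coverage argument above (if 45% of data citations use DOIs, and a search catches every DOI citation, then that search finds at least 45% of all citations) can be written down as a small calculation. This is a rough sketch under the same assumption of similar article samples. The function name and every share other than the 45% figure from the chat are hypothetical placeholders, not results.

```python
def coverage_lower_bound(style_shares, styles_caught):
    """Lower bound on the fraction of all data citations a search finds.

    style_shares: fraction of citations using each citation style, as
        estimated from a comprehensive article sample.
    styles_caught: styles the search is assumed to retrieve completely.
    """
    return sum(share for style, share in style_shares.items()
               if style in styles_caught)


# Only the 0.45 figure comes from the discussion; the rest is placeholder.
shares = {"doi": 0.45, "accession_number": 0.30, "url_only": 0.25}

doi_only = coverage_lower_bound(shares, {"doi"})
doi_and_accession = coverage_lower_bound(shares, {"doi", "accession_number"})
print(f"DOI search alone: at least {doi_only:.0%} of citations")
print(f"DOI + accession number searches: at least {doi_and_accession:.0%}")
```

It is a lower bound because a search may also pick up some citations in styles it does not fully cover.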
Sarah Judson: for instance, i only have one article so far that used dryad and a hdl retrieval number, but many cite datasets or treebase/etc in other ways yeah, i saw your notes.  i have a few suggestions that I'll post on that page, though you may have attempted them already valerie.janae.enriquez: ok, cool. thanks! 10:21 AM hpiwowar: great.  ok, so now to how does Nic's stuff fit into this?  Nic is looking at policies  and obviously the policies inform (though perhaps not as directly as the policy-makers would like) the author's behaviour... and Sarah and Valerie are tracking author behaviour 10:22 AM so we'd really like to correlate attributes of the policies that Nic is digging up with how the authors behave  within a given journal, funded by their funder, using/sharing data from a given repository 10:23 AM Nic's stuff is a bit different though because there aren't many policies on data reuse out there mostly data sharing and Valerie and Sarah your goals are data reuse annotations Sarah Judson: i'm also looking at data sharing hpiwowar: so.... it probably makes sense to couple some data sharing annotations in to the data collection steps 10:24 AM Sarah Judson: i.e. if an article says they posted their data internally (with the journal) or on treebase etc so, i'm doing reuse and sharing b/c it's easy enough to extract both while in an article hpiwowar: great, sarah. exactly. so for what it is worth, I didn't envision that as your main endpoint :) but I agree it is easy to do if you are already there  and valuable in and of itself  and coupled with Nic's stuff 10:25 AM Sarah Judson: i bet it will also come up with valerie's search terms hpiwowar: so I'm glad you are doing it :) Sarah Judson: some papers that cite data in the same sentence will say they posted theirs anyways, nic.... Nicholas: ...
a common thread I'm finding is that journals require an author to submit data for peer review 10:26 AM but not for actual publication hpiwowar: interesting Nicholas: it's a bit of a grey area hpiwowar: definitely make notes on it Nicholas: to see where the data set will be included in the publication or whether it's just used by the reviewer and then discarded hpiwowar: could be that correlates with post-publication too, since authors will have already made the effort to find data, document it, etc 10:27 AM yes. Nicholas: yes hpiwowar: now there are a few subtleties in the type of data that Nic will need to do correlations with data sharing Nicholas: I'm wondering, Sarah, are you noting the place in the article that you are finding the citation? (abstract, methods etc) 10:28 AM Sarah Judson: yep. hpiwowar: to see if data sharing % increases with different policies, need to know both a) how many shared data, and also b) how many created data and didn't share it Sarah Judson: just added that field Nicholas: great Sarah Judson: but most is methods valerie.janae.enriquez: should I add that field as well? Sarah Judson: yeah. see my fields posted at: https://spreadsheets.google.com/ccc?key=0Am4hbt8Ef8WXdENmZU83dTRUbW5fNFg3RjFFa1Z0LUE&hl=en 10:29 AM they are a little overkill (editor, etc I might not end up collecting), but I'm whittling it down to the most relevant hpiwowar: Sarah's approach could capture "b) how many created data and didn't share it" whereas Valerie's search approach won't capture that Sarah Judson: i think nic's data will help answer "why" questions valerie.janae.enriquez: ah hpiwowar: sarah, thanks for the SS. great. Sarah Judson: for instance, if sysbio is citing more datasets, what is their journal policy? or does the nature of the science mandate datasharing?
10:30 AM Nicholas: exactly sarah Sarah Judson: i'm hoping we can tease that out hpiwowar: so that is totally fine, it just means that Sarah's comprehensive approach will be the only one that we'll use to correlate with that part of nic's data Sarah Judson: say, what if sysbio and amnat lead the way on datasharing?  but, they have totally different scientist clientele?  maybe then, it's the journal policy that is driving their datasharing hpiwowar: yup, exactly Sarah Judson: but we can't tease that apart with just article level data 10:31 AM i.e. extracted info valerie.janae.enriquez: hm... true hpiwowar: is this making sense to everybody? or questions? 10:32 AM I'm sure it isn't all making sense... so ask questions.... 10:33 AM sarah, looks like some of the columns you were capturing are journal metadata type columns, right? so those could be captured as part of nic's work? Sarah Judson: yep, but notice the comments below about nic/valerie capturing that info instead that's to remind me about database linkages hpiwowar: sorry, didn't get that far.... Sarah Judson: for analysis purposes 10:34 AM hpiwowar: cool..... ok. question for you, sarah. Sarah Judson: it's colored in my version, not in docs yet hpiwowar: are you comfortable capturing data on "the investigators created a dataset that could have been shared in TreeBASE but they didn't"? Sarah Judson: you mean a notes section? hpiwowar: this can be a bit hard to tease out, and a bio background definitely helps 10:35 AM Sarah Judson: categorizing the data myself? Nicholas: Valerie, is your starting point TreeBASE? So you begin with the data and find out how it is linked to in an associated publication? Sarah Judson: i'm doing some of that...
valerie.janae.enriquez: I tried doing that but wasn't sure where to get started instead, I looked for any article mentioning data in TreeBASE and am in the process of narrowing down the search terms to exclude terms like "deposited" hpiwowar: ok, hold on my question while we dig into Nic and Valerie's conversation.... valerie.janae.enriquez: or "submitted" 10:36 AM Nicholas: ok valerie.janae.enriquez: then I searched for articles mentioning "study or matrix accession numbers" 10:37 AM hpiwowar: Nic, the thought is that following a citation trail for each dataset individually is very labour intensive. So trying to capture lots at a time with generalized searches instead. Does that make sense, or do you have suggestions? 10:38 AM Nicholas: no that makes sense I was just curious... a lot of the policies also label data or datasets as "supplementary material" ... I need to look more closely at the structure of some of those articles to see whether I could suggest any search terms hpiwowar: sounds good valerie.janae.enriquez: what's interesting is some articles only cite the main website "http://www.treebase.org" without mentioning the accession numbers 10:39 AM and you are right, other articles have supplementary tables of the datasets they used hpiwowar: yes. arg, eh? good luck ever getting a computer to be able to do any massive automated mining :) valerie.janae.enriquez: lol, yes. things that are machine readable aren't always human friendly and vice versa 10:40 AM hpiwowar: ok, back to "dataset sharing" conversation? Sarah Judson: I read an article that dealt with this problem...treebase was cited, but the author explicitly stated that they could only find 50% of the datasets they wanted to use hpiwowar: yikes. do you have a ref for that, sarah? very relevant! Sarah Judson: valerie, here's the article:  10.1093/sysbio/syp080 10:41 AM very interesting, even if more anecdotal. i can send the pdf hpiwowar: super, thanks for that, sarah!
valerie.janae.enriquez: neat, thanks hpiwowar: yeah, let's keep track of this stuff. can make for some great pointers in discussion sections :) 10:42 AM maybe Sarah could you add a link to it from our main summer page, in an "interesting findings" box or something??? (just an idea) Sarah Judson: here's a direct quote from the article: "To gauge the frequency of the long-tree problem in the published phylogenetic literature, I conducted an informal survey of Internet-accessible files using Google Scholar (scholar.google.com) and a keyword set consisting of “partition,” “MrBayes,” “mitochondrial,” “codon,” “phylogeography,” and “TreeBASE.” The latter 2 terms were intended to bias the sample toward studies with larger numbers of taxa and easily accessed data sets. I examined the first 24 studies that fit these criteria; many proved unusable here because the data sets were not accessible, the character sets were not clearly specified in the TreeBASE files (http://www.treebase.org/treebase/index.html), or the partitioning details (e.g., parameter linking) were not entirely specified." will do hpiwowar: perfecto  ok. the data sharing denominator, as I often call it.... 10:43 AM because it is the base of any data sharing frequency calculation  needs an estimate of the number of similar datasets that could have been shared but weren't.  I've come up with ways to establish this in other domains, but not in this domain 10:44 AM and I don't know how hard/easy it is (But I bet it isn't all that easy). Sarah Judson: i'm game to try hpiwowar: Sarah, as the biologist in the group and the one who has read the most articles....  how doable do you think this would be? Sarah Judson: i've loosely classified most of the articles independent of keywords hpiwowar: yup. Sarah Judson: so, all i would need to do is assess what depository is most relevant 10:45 AM hpiwowar: might help to target a few specific types of data Sarah Judson: treebase =phylo. genbank = genetics, etc hpiwowar: yup. 
so Treebase type data? genbank sequence data others on the short list? Sarah Judson: dryad is broad the journals have internal "supplementary data" at least am nat and sys bio hpiwowar: yes, and new. Sarah Judson: daac = gis 10:46 AM hpiwowar: good. Sarah Judson: right? i haven't looked at the data there yet anyways, i was hoping maybe valerie could collect some metadata on the depositories, but i might be better suited i dunno 10:47 AM hpiwowar: I think Nic is going to give it a shot Nicholas: wait so you're trying to quantify the number of related datasets (present in a repository) that were not cited in a publication? hpiwowar: Nope Nicholas: ha hpiwowar: The number of datasets published about that were not deposited in a repository but could have been 10:48 AM valerie.janae.enriquez: which is a lot, I imagine Sarah Judson: meaning they are published internally or with a url, but not on the depositories? hpiwowar: So ideally Sarah would add a few columns to her spreadsheet along the lines of: "Study produced some data?" "Study produced DNA sequence data? (could go in genbank)" Sarah Judson: indeed kind of already doing this hpiwowar: "Study produced phylogrees?" etc Sarah Judson: phylogenies...yep Nicholas: gotcha 10:49 AM hpiwowar: great. if you can make it formal that would be super Sarah Judson: currently, all the data i extracted for cited datasets i also extract for the dataset produced by the paper  i.e. dataset mention, how cited, where published, where cited in the paper, etc hpiwowar: let us know if it slows you down excessively to try to quantify/categorize the un-deposited/unshared datasets 10:50 AM Sarah Judson: ok.  but, do we assume every paper has a dataset? hpiwowar: because it might... and then we would have to figure out a different solution.... Sarah Judson: or only look into it if they mention their dataset? 
Nicholas: we can also determine the number of articles published in an ISI category ...so for instance of the ecology publications 1900 cited a dataset... but 2700 produced data... hpiwowar: nope, might be papers that didn't create data (in some ways) 10:51 AM like metaanalyses Sarah Judson: that's only a few of the ones i'm reading, mostly ones that built models rather than collected data hpiwowar: ah hah! Sarah Judson: yep, metaanalyses or model creation hpiwowar: ok, then keep track of the models Sarah Judson: those are the only ones that don't produce data hpiwowar: because there are places they could be sharing their models ;) valerie.janae.enriquez: ah Sarah Judson: but, often they validate or proof their models with real data are there model depositories? 10:52 AM hpiwowar: yes, I think so Sarah Judson: one way i've been trying to get at this is if they cite opensource software  or say they created an R (statistics) package for running the data  but i haven't pursued this heavily yet hpiwowar: interesting 10:53 AM Nicholas: (shoot I am really sorry, but I have to run to a doctor's appointment)... I will check the log of the conversation as soon as I get back hpiwowar: yeah, that might be correlated....  ok, bye Nic Nicholas: sorry guys! valerie.janae.enriquez: ok, talk to you tomorrow Nic Nicholas has left hpiwowar: so Sarah I'd maybe try to come up with the most frequent 5-8 types of data 10:54 AM and include models as one of them. Sarah Judson: will do  yeah. i'll re-analyze some of the papers i already finished to think about how to get at this best 10:55 AM hpiwowar: then keep a free text field for strange situations :) good. Sarah Judson: you said you did a paper on this in biomed/pubmed...can you send me the doi? i think i have most of your articles, but want to look over that specific one for methods hpiwowar: before you do that, sync up with Valerie to see if you guys can share columns....  yes. 
Sarah Judson: yeah, all my fields have a full text field and accompanying coded field hpiwowar: actually I mostly did a thesis on this :) 10:56 AM Sarah Judson: that way i don't have to go back to the paper hpiwowar: and alas most of the methods aren't very applicable Sarah Judson: that's what i thought valerie.janae.enriquez: neat hpiwowar: because they rely on PubMed sorts of resources but I will send you a related paper that did a manual review Sarah Judson: yeah, anything would be great so i'm not reinventing the wheel 10:57 AM hpiwowar: Here it is http://www.nature.com/nmeth/journal/v5/n12/full/nmeth1208-991.html Sarah Judson: thanks valerie.janae.enriquez: thanks hpiwowar: Let me know if you need the PDF there is also another on genbank http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0040228 10:58 AM That said, the main focus of this DataONE summer is data citation rather than data reuse.... so although the two are related, and I think we will be doing a bit of delving into data sharing.... valerie.janae.enriquez: ok hpiwowar: keep your eyes on the reuse ball, mostly :) whooops! in the above Sarah Judson: yeah, i'm focusing on the dataset citation, but it takes me only a few more minutes (with my previous method) to extract sharing info hpiwowar: I meant to say That said, the main focus of this DataONE summer is data citation rather than data SHARING.... 10:59 AM super. yes, that's great then  hrm, Valerie, did I just confuse you? 11:00 AM You get why we won't be asking you to extract data sharing or data creation info? valerie.janae.enriquez: uh... hpiwowar: Because you mostly won't be finding it with your repository-based search terms :) valerie.janae.enriquez: ok 11:01 AM hpiwowar: or your doi-search terms, or whatever else. you may stumble across data-creation/depositing/sharing, but you aren't looking for them systematically, so the instances you find will not be comprehensive, so they are mostly noise for the purposes of your project. 
11:02 AM that make sense? if not, feel free to ask more questions now, or in a day or two as it settles in. valerie.janae.enriquez: all right. I'm sure as I go on, it will make sense, or I will know what questions to ask hpiwowar: Sarah, to tie one more link for you when you asked about whether it made sense for you to annotate Valerie's papers, my guess is mostly no 11:03 AM because valerie's search terms are mostly not perfect and she captures data-sharing papers as well as data reuse papers but she doesn't know which until she opens the full text... and if she is in the full text anyway she might as well extract the reuse details That's my assumption, anyway. It could be that I'm wrong valerie.janae.enriquez: like the sentence where it came from 11:04 AM hpiwowar: and so feel free to reorganize as you guys dig in...... 11:05 AM valerie.janae.enriquez: ok. I'm sure my searches will change often hpiwowar: good. ok, shall we call it the end of a long chat, then? 11:06 AM I think the only thing we haven't covered much is June deliverables. Mostly that is up to you guys Sarah Judson: ok. i was just meaning that since valerie's search process is time consuming, i could do the standard data extraction on it. valerie.janae.enriquez: yeah, I'm still a bit unclear about what I will have by then hpiwowar: Gotcha. Sarah Judson: or we at least need to coordinate our methods okay, sorry, back to June goals 11:07 AM hpiwowar: yes, agreed on coordinate. And it could be that there are benefits to streamlining annotation.... if so, then do it for sure Valerie, yes, your project has less clear mini-goals valerie.janae.enriquez: ok 11:08 AM hpiwowar: one idea could be to have explored "variations on words + repository name and doi searches" for TreeBASE and DAAC and Pangaea 11:09 AM or something like that? 
figure out some initial piece that will help you understand feasibility of the project and start to narrow down generalizable methods 11:10 AM valerie.janae.enriquez: ok, that makes sense hpiwowar: decide you will spend 3 days on each of the repositories or something... see what you can get :) valerie.janae.enriquez: I think I am nearing the limits of what I can find for TreeBASE 11:11 AM hpiwowar: sarah, could you revamp your section of the main research plan page a bit to make it a bit easier to read? valerie.janae.enriquez: so moving on to Pangaea and DAAC is a good idea Sarah Judson: yep. that's on the list for today hpiwowar: great 11:12 AM Valerie, good to know on TreeBASE Sarah Judson: i'll have the research questions and my personal page/notebook more navigable hpiwowar: could be that as you move on to other repositories you will get inspired with new ideas for TreeBASE for later. or not. valerie.janae.enriquez: this is true hpiwowar: Also browsing Sarah's findings (now, and on an ongoing basis) will probably inspire some ideas valerie.janae.enriquez: since I keep coming up with new terms hpiwowar: yeah, I hear you. valerie.janae.enriquez: I Just happened to find a lot of overlaps and false hits 11:13 AM neat hpiwowar: I find sometimes that experimenting with searches in Google Scholar can help me figure out the signal/noise of a given search. (I won't tell you how much time I have spent doing that) 11:14 AM Sarah Judson: once I have annotated a few more articles that I have confirmed treebase/daac, etc citations, i can pass them along to you for test searching which might help i'm keeping an eye for commonly used terms that indicate a dataset hpiwowar: ok, I'll let you guys figure out the info flow. valerie.janae.enriquez: ok, excellent hpiwowar: make use of each other's open notes best you can. 11:15 AM I think I'll just drop off now..... 
keep talking if you want valerie.janae.enriquez: I've added Sarah's page on my watchlist hpiwowar: find me if/when you have questions and I'll try to be more wiki-based in my comments :) valerie.janae.enriquez: will do, thanks again! should one of us post the conversation somewhere? 11:16 AM hpiwowar: I'm mostly AWOL tomorrow morning due to other meetings, fyi valerie.janae.enriquez: ok, I'll keep that in mind. we have the big meeting tomorrow at 3 anyway hpiwowar: yup, and I'll be around for that. valerie.janae.enriquez: awesome 11:17 AM hpiwowar: yes, either in the correspondence section here  http://www.openwetware.org/wiki/DataONE:Notebook/Summer_2010 Sarah Judson: i'll take care of posting, i've claimed that as my responsibility for emails and correspondence per emails earlier last week hpiwowar: or on the calendar pages?  great, thanks sara.  sarah. valerie.janae.enriquez: neat, thanks hpiwowar: ok, bye! valerie.janae.enriquez: talk to you both later hpiwowar@gmail.com has left 11:18 AM Sarah Judson: see ya. valerie, if you look at my spreadsheet and have additional questions, let me know valerie.janae.enriquez: ok, sure. let me know if there's anything you think should be in my spreadsheet too. (and if you need any help with anything) 11:20 AM Sarah Judson: ok. talk to you tomorrow. I'll probably be online most of the day..even if i'm not visible in gchat (which i'm usually not), so can ping me good luck. valerie.janae.enriquez: ok, cool. you too! valerie.janae.enriquez@gmail.com has left
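
Note on the "data sharing denominator" from the chat above: the idea is that a data sharing frequency for a given type of data is the number of studies that deposited that data divided by the number of studies that produced it (deposited or not). The following is only a minimal sketch of that arithmetic in Python. The column names (`produced_data_type`, `deposited`), the data type labels, and the sample rows are hypothetical illustrations, not the actual spreadsheet columns or any real results from the project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    """One annotated article. Field names are hypothetical spreadsheet columns."""
    # Type of data the study produced, e.g. "phylogeny", "dna_sequence", "model".
    # None means the study produced no data (e.g. a meta-analysis).
    produced_data_type: Optional[str]
    # True if the produced data was deposited in a repository.
    deposited: bool


def sharing_frequency(articles: List[Article], data_type: str) -> Optional[float]:
    """Share of studies producing `data_type` that also deposited it.

    The denominator counts every study that produced this type of data,
    whether or not it was deposited. Returns None when no study produced it.
    """
    eligible = [a for a in articles if a.produced_data_type == data_type]
    if not eligible:
        return None
    shared = sum(1 for a in eligible if a.deposited)
    return shared / len(eligible)


# Made-up rows for illustration only.
sample = [
    Article("phylogeny", True),
    Article("phylogeny", False),
    Article("dna_sequence", True),
    Article(None, False),  # no data produced, so excluded from every denominator
]

phylogeny_rate = sharing_frequency(sample, "phylogeny")
```

Keeping the "produced" column separate from the "deposited" column is what makes the denominator countable: articles that produced no data drop out, and articles that produced data but never deposited it still count against the rate.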