Talk:DataONE/Summer 2010/Research questions

Link to the main Research Question page (anyone know a better way to navigate back to the main page from its talk version? - Just click the "Page" tab Sarah Judson 12:22, 14 June 2010 (EDT))

Response to Community Questions
Community Questions
 * The students don't have access to all the journals they need through their home institutions (e.g. Simmons College). Can we set up some guest access to other DataONE-affiliated resources?
 * What is a good GIS/earth journal for analysis?
 * Use a journal that is well represented in Pangaea? And/or one affiliated with GSA?  --Todd Vision 10:46, 10 June 2010 (EDT)
 * Recommendation for a specific paleontology journal?
 * I would recommend 'Paleobiology' as having broad interest, high impact papers --Todd Vision 10:46, 10 June 2010 (EDT)

Response to Research Questions
Research Questions

re: number of samples

 * Heather A Piwowar 09:45, 10 June 2010 (EDT): Sarah, stats. Let's drill into this once you've shared your spreadsheet so I can see what specific data elements you are proposing to collect.  Agreed, I think simple stats may suffice.  If you can make an estimate as to the number of articles you'll be able to annotate, that would be very helpful.  Guessing it is too early to do that now... maybe make estimating that one of your goals for June 30th?  I bet you are ahead of me, and this is part of what you meant by "Evaluate continued article sampling (random, time-scale, by topic)"
 * Sarah Judson 15:29, 14 June 2010 (EDT): I'm thinking probably 500 articles by June 30. Is this high or low? Annotating is a bit more time consuming than I previously thought, but I think it will get faster and I'm hoping to develop some text searching techniques. Any suggestions on software?
 * Heather A Piwowar 07:48, 15 June 2010 (EDT): Sarah, I think 500 would be enough to learn some interesting things, for sure. And I think it will be a lot, given my assumptions about how time consuming the extraction will be.  Keep us updated as you do more about how long it takes.
 * Heather A Piwowar 07:48, 15 June 2010 (EDT): One way to think about stats and number of datapoints: play around with assumptions and look at the size of the confidence intervals for different sample sizes.  So for example,  you might be trying to estimate the percentage of articles that cited data according to best practices for all papers that reuse data.  Assume you annotate 500 articles, maybe half reuse data so 250, and you find ten percent of them do it by best practices.  Then, using this online calculator http://statpages.org/confint.html you could put in numerator=25, denominator=250...  the estimate of the proportion of papers that cite data properly out of all papers that reused data is between 6.6 and 14.4%.  That is a pretty narrow band, which suggests the sample size was large enough.  If you only collected 50 datapoints, though, with the same assumptions it would be 2/25 or from 1% to 25%.  Less informative.  Play around with the assumptions and number of data points and see if the confidence intervals are narrow enough to be informative.  I agree, it is only part of the answer because "informative" is subjective... but it starts to quantify the impact of the choices.
 * Heather A Piwowar 07:48, 15 June 2010 (EDT): Another approach is to think through the multivariate analysis you might want to do... a rule of thumb is to have 30-50 datapoints for each variable in your multivariate regression...
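The confidence-interval exercise suggested above can also be scripted, which makes it easier to try many sample sizes at once. The following is a minimal sketch, assuming the online calculator reports exact (Clopper-Pearson) binomial intervals; the function names are made up for illustration and only the Python standard library is used.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root(f, iterations=60):
    """Bisection on [0, 1] for a monotone f whose sign differs at the two ends."""
    lo, hi = 0.0, 1.0
    lo_positive = f(lo) > 0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(successes, n, confidence=0.95):
    """Clopper-Pearson interval for a proportion, returned as (lower, upper)."""
    alpha = 1 - confidence
    lower = 0.0
    if successes > 0:
        # Lower limit: the p at which P(X >= successes) equals alpha / 2.
        lower = _root(lambda p: 1 - binom_cdf(successes - 1, n, p) - alpha / 2)
    upper = 1.0
    if successes < n:
        # Upper limit: the p at which P(X <= successes) equals alpha / 2.
        upper = _root(lambda p: binom_cdf(successes, n, p) - alpha / 2)
    return lower, upper


# The two scenarios from the discussion: 25 of 250 and 2 of 25.
for successes, n in [(25, 250), (2, 25)]:
    low, high = exact_ci(successes, n)
    print(f"{successes}/{n}: {low:.1%} to {high:.1%}")
```

Changing the list of (successes, n) pairs shows how quickly the interval widens as the number of annotated articles drops.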

re: time vs random sample

 * Heather A Piwowar 09:45, 10 June 2010 (EDT): Sarah, your journal selection criteria look good. Firm up the list (you can always change it later if necessary).
 * Sarah Judson 15:35, 14 June 2010 (EDT):Does anyone have an opinion on time sample vs. random sample? Or recommended journals beyond what I have already mentioned?
 * Sarah Judson 15:35, 14 June 2010 (EDT): Heather (on behalf of Todd I think) recommended a time sample to look at change in practices over time. My only concern with a time sample is that I may run into "Special Issues" from time to time, which means the issue won't have a mix of articles from different disciplines. I'm extracting data relative to this and haven't run into one yet, but just know it could happen, which might bias the data (i.e. if the discipline associated with the special issue has a propensity for data reuse or not, or if it's a "tribute" issue to a person that focuses less on science). Maybe it won't be as big of a problem as I think. Also, the other concern is unequal sampling, i.e. SysBio has approx 8 articles per issue whereas AmNat has 12+. The time sample is good for now in getting a sense of necessary fields, etc., but I'd like your opinions on the move to a random sample (could be blocked by time).
 * Heather A Piwowar 07:55, 15 June 2010 (EDT): Good point on special issues. Random is better in general, agreed.  The disadvantage in this context is the added complexity in selecting/retrieving papers to annotate.  Do you think it would slow things down much?
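One way to picture the "random sample blocked by time" idea is to draw the same number of articles at random from each journal-and-year block, so that a journal with more articles per issue does not dominate. The sketch below is only an illustration: the article listing, the years, the field names, and the function name are all hypothetical, and the real list would come from the journals' tables of contents.

```python
import random
from collections import defaultdict


def blocked_random_sample(articles, per_block, seed=0):
    """Draw up to per_block articles at random from each (journal, year) block."""
    rng = random.Random(seed)  # fixed seed so the same draw can be reproduced
    blocks = defaultdict(list)
    for article in articles:
        blocks[(article["journal"], article["year"])].append(article)
    sample = []
    for key in sorted(blocks):
        pool = blocks[key]
        # Take the whole block if it holds fewer than per_block articles.
        sample.extend(rng.sample(pool, min(per_block, len(pool))))
    return sample


# Hypothetical listing: 8 SysBio and 12 AmNat articles for each of two years.
articles = [
    {"journal": journal, "year": year, "id": f"{journal}-{year}-{i}"}
    for journal, size in [("SysBio", 8), ("AmNat", 12)]
    for year in (2000, 2010)
    for i in range(size)
]
sample = blocked_random_sample(articles, per_block=5)
print(len(sample))  # 4 blocks x 5 articles = 20
```

Because the draw is random within each block, a special issue contributes at most its share of one block rather than a whole run of consecutive articles.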

other

 * Heather A Piwowar 09:45, 10 June 2010 (EDT): Good start, guys. Now make it your own.  Rephrase the research questions as you see fit, cut out my seed questions on things your proposal should cover, and restructure and rewrite your proposals/research plan sections so that this page will provide a good summary of what you're up to for the next few weeks.


 * Heather A Piwowar 09:45, 10 June 2010 (EDT): Sarah makes a great point about the research questions being very broad. Feel free to modify them or supplement them with more specific questions.
 * Sarah Judson 15:35, 14 June 2010 (EDT):I'll keep the broad questions and put narrow ones nested beneath them.


 * Heather A Piwowar 09:45, 10 June 2010 (EDT): I do indeed have more details on the survey!  Will be getting the data and manuscript-drafts-in-progress up on a corner of the DataONE site later this month.
 * Sarah Judson 15:29, 14 June 2010 (EDT):Let me know when it's available.


 * Heather A Piwowar 09:45, 10 June 2010 (EDT): Sarah, can you import your spreadsheet into Google docs, make it publicly accessible, and include a link to the spreadsheet in your proposal and future research notes.  Others who have been doing open science on wikis for a while have found that to be a pretty useful integration.
 * Sarah Judson 15:29, 14 June 2010 (EDT): Rough version at: ArticleMetadataDataONE


 * Heather A Piwowar 09:45, 10 June 2010 (EDT): All:  Sarah's idea of a centralized database sounds like a great one.  Shall we start with centralized/shared google spreadsheet and then migrate to a more complicated database setup if necessary?  Someone kick it off and create something, make it globally editable (fairly risk-free since Google docs keeps a revision history), and share a link on the project pages.
 * Sarah Judson 15:29, 14 June 2010 (EDT): Will work on this today so we can share and update field suggestions. Will send link via email, and post here and on my notebook.


 * Heather A Piwowar 09:45, 10 June 2010 (EDT): Sarah, in general it looks like you are well on your way to scoping the project. Reformat your plan description so it is more easily read and finalize a few of the details (we can always change them if there is reason to later), then we'll highlight it with other mentors to get their feedback.
 * Sarah Judson 15:29, 14 June 2010 (EDT): Again, a task for today. Sorry I didn't meet last Friday's deadline on this. My internet situation was difficult last week. I was working off my desktop mostly, with the occasional good spurt of internet at a local convention center for mass article downloading and keeping in the loop with everyone.

Comments to Valerie via June 9, 2010 email from Heather
A few more things:
 * do these databases or repositories have "accession numbers"? If so, what is the format of the accession numbers? For example, for NCBI's GEO database, the accession number format is GDSxxxxxx or GSExxxxxx and sometimes people just cite data by mentioning the accession number, so we need to be able to search the article full text for GDS* or GSE*
 * maybe it would be interesting to add columns for Dryad, Genbank, NCBI's Gene Expression Omnibus Database, and the ArrayExpress database. I say this because it would help us draw comparisons to those sources, even though we probably won't be looking for citations to these databases in the literature
 * it might be worth copying the full text of the databases' reuse policies into the spreadsheet, for reference. A link to the page where they discuss their policies would definitely be helpful.
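The accession-number search mentioned in the first point above could look roughly like this. It is a sketch under one assumption: that the "x" placeholders in GDSxxxxxx and GSExxxxxx stand for digits. The function name and the sample sentence are made up for illustration, and other repositories would need their own patterns.

```python
import re

# Assumption: the "x" placeholders in GDSxxxxxx / GSExxxxxx are digits.
GEO_ACCESSION = re.compile(r"\b(?:GDS|GSE)\d+\b")


def find_geo_accessions(full_text):
    """Return the distinct GEO accession numbers found, in order of first appearance."""
    seen = []
    for match in GEO_ACCESSION.findall(full_text):
        if match not in seen:
            seen.append(match)
    return seen


# Made-up sentence, for illustration only.
text = "Expression data were taken from GEO (GSE1234 and GDS567); see GSE1234."
print(find_geo_accessions(text))  # ['GSE1234', 'GDS567']
```

Requiring digits after the prefix keeps unrelated strings such as "GSEA" from being counted as accession numbers.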

Comment from Sarah to Valerie: I have some ideas on repository inventory that I haven't been able to explore yet, we should talk about ideas/approaches...I'll post more later, email me if I don't by June 14ish or if you want to collaborate sooner!!!! - Sarah


 * Heather A Piwowar 14:13, 11 June 2010 (EDT): Valerie, your initial spreadsheet looks great. I have a few suggestions... I'll try to catch you in google chat.


 * Heather A Piwowar 14:13, 11 June 2010 (EDT): Valerie, I've copied the correspondence parts from the project plan area into this talk page, so now feel free to delete the correspondence from your research plan section, and replace everything from "User: Valerie" on down with a list of research questions you plan to tackle (may be different than those currently listed?) and a reformatted project plan along the lines of what Nic has done.


 * Heather A Piwowar 14:13, 11 June 2010 (EDT): Valerie, maybe add a "Phase 2" section to your area with initial ideas on how you might continue later... for example, you could mention the other repositories here

Nic,
 * Heather A Piwowar 14:15, 11 June 2010 (EDT): I like your reformat. Can you add a "research questions" area back in, to be explicit about what you are trying to accomplish?  They don't have to be exactly what I  proposed, tweak them as you see fit.  If you want to see again the initial idea, you can look at the page history here: http://www.openwetware.org/index.php?title=DataONE/Summer_2010/Research_questions&action=history