DataONE/Summer 2010/Research questions

=Research Questions and Research Plans=

Let's start brainstorming formal research questions; then you can flesh out the scope and add your research plans for a June 30th mini-deliverable.

Open Questions for mentors and the community
Please comment on questions on the related Talk page or by editing this section of the wiki. Please leave your name and time with all comments (use the "Sign your username" link while editing).
 * The students don't have access to all the journals they need through their home institutions (e.g., Simmons College). Can we set up some guest access to other DataONE-affiliated resources?
 * What is a good GIS/earth journal for analysis?
 * Use a journal that is well represented in Pangaea? And/or one affiliated with GSA?  --Todd Vision 10:46, 10 June 2010 (EDT)
 * Recommendation for a specific paleontology journal?
 * I would recommend 'Paleobiology' as having broad interest, high impact papers --Todd Vision 10:46, 10 June 2010 (EDT)

Data citation practice inventory within journals (articles)
Owner: Sarah

Guiding Research Questions

 * 1) What are the various practices for data citation within academic papers? How prevalent is each variety?
   * Do authors tend to cite the dataset itself or a related paper?
   * How did the author obtain the dataset (i.e., past study, buddy, search, known database)?
 * 2) How do these practices vary across discipline, journal, data type, data source?
   * Are data citation practices influenced more by the attitude of the discipline towards data sharing or by journal policy?
   * Associated with Nic's Metadata Questions
 * 3) How have these practices varied across time?
   * Does increased data reuse/sharing correlate with changes in journal policy?
   * Does data reuse/sharing simply increase with time since the advent of the internet?
Scope and Plan

 * Phase 1: Data Extraction
 * Journals
 * Dryad "Top Three"
 * Justification:
 * 1. Most currently posted datasets... are they really being reused?
 * 2. Known "High Impact" Journals
 * 3. Cover target disciplines
 * Systematic Biology (SysBio; discipline: Systematics, Phylo-genetics/-geography, Molecular Evolution)
 * American Naturalist (AmNat; discipline: Behavior, Natural History, Ecology)
 * Molecular Ecology (MolEco; discipline: Genetics, Molecular Evolution)
 * ESA family
 * Justification:
 * 1. Known high impact
 * 2. Known involvement with early data-reuse/sharing discussions (ADD LINK!!)
 * Ecology
 * Others (open to suggestions)
 * Is Evolution affiliated with ESA?
 * Discipline Coverage
 * Justification:
 * 1. To get full disciplinary coverage on "environmental and earth" related articles which are the primary interest of DataONE.
 * 2. Target journals that are good at or lacking data re-use.
 * Paleobiology
 * Pangaea-associated journal (SUGGESTIONS PLEASE!)
 * GIS/Biogeography associated journal (SUGGESTIONS PLEASE!)
 * Broad Coverage (if time!)
 * Justification:
 * 1. Wider variety of disciplines in the same journal
 * 2. Disciplinary overlap with the other journals, which should help elucidate the influence of journal policies.
 * Evolution (if not ESA)
 * Nature
 * Science


 * Article sampling
 * Time series vs. random sample
 * Time: Issue by issue, starting with 2010 and moving back. Possible confounding factors are special issues and unequal sampling, because some journals have fewer articles per issue and per year.
 * I'm currently working under this method, primarily for working out needed extraction fields and differences between journal formats.
 * Random: Controlled "random" search of 100 articles per year or per discipline. Could still do within a time block.
 * I appreciate suggestions on Time vs. Random sampling.
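
To make the "Random" option concrete, here is a minimal sketch of drawing up to 100 articles per year from a list of candidate articles. The article records, the `sample_per_year` helper, and the fixed seed are all assumptions for illustration; the real list would come from each journal's table of contents.

```python
import random
from collections import defaultdict

def sample_per_year(articles, per_year=100, seed=2010):
    """Draw up to `per_year` articles at random from each publication year."""
    rng = random.Random(seed)  # fixed seed so the same draw can be repeated
    by_year = defaultdict(list)
    for article in articles:
        by_year[article["year"]].append(article)
    sample = {}
    for year, pool in sorted(by_year.items()):
        k = min(per_year, len(pool))  # years with few articles contribute all of them
        sample[year] = rng.sample(pool, k)
    return sample

# Hypothetical article list: 250 articles from 2009 and 40 from 2010.
articles = (
    [{"id": f"2009-{i}", "year": 2009} for i in range(250)]
    + [{"id": f"2010-{i}", "year": 2010} for i in range(40)]
)
picked = sample_per_year(articles)
print({year: len(rows) for year, rows in picked.items()})
```

Sampling within each year keeps the time dimension while avoiding the special-issue problem of going issue by issue; the same helper could group by discipline instead of year.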


 * Data Extraction
 * Primarily information about author, discipline, Y/N dataset citation (reuse and sharing), details on dataset citation
 * [|Preliminary list of extracted fields]


 * Phase 2: Analysis
 * Statistics:
 * Sample Size: 500-1000 articles (Any opinions on what is realistic in a month period?)
 * Percentage: % Data reuse per issue/year, % Proper dataset citation (depository)
 * Correlations: Time, Journal, Discipline
 * This could expand to Multivariate Cluster or NMDS analysis
 * Database Integration
 * Set up an Access/Excel or open-source database so that all of our collected metadata can be linked together
 * NOTE, we need to solidify fields (especially link fields) up front to prevent problems later on. See this spreadsheet (still need to link, yell at me if I haven't).
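
As a rough sketch of the "Percentage" statistic above, the following computes the percent of articles reusing data for each journal and year. The field names (`journal`, `year`, `reuses_data`) and the example rows are placeholders, not the final extraction fields and not real results.

```python
from collections import defaultdict

def percent_reuse(records):
    """Return {(journal, year): percent of articles that reuse data}."""
    totals = defaultdict(int)
    reused = defaultdict(int)
    for rec in records:
        key = (rec["journal"], rec["year"])
        totals[key] += 1
        if rec["reuses_data"]:
            reused[key] += 1
    return {key: 100.0 * reused[key] / totals[key] for key in totals}

# Invented placeholder rows, one per coded article.
records = [
    {"journal": "SysBio", "year": 2010, "reuses_data": True},
    {"journal": "SysBio", "year": 2010, "reuses_data": False},
    {"journal": "AmNat", "year": 2010, "reuses_data": False},
    {"journal": "AmNat", "year": 2009, "reuses_data": True},
]
for key, pct in sorted(percent_reuse(records).items()):
    print(key, f"{pct:.1f}%")
```

The same grouping works for "% proper dataset citation" by swapping the Y/N field that is counted.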

Deliverables

 * June 30th:
 * Bulk of data extraction. Hoping for at least 500 articles. Still trying to determine if this is realistic based on the time it takes for data extraction.
 * Develop and standardize extraction methods: text searching, Zotero/EndNote/etc. integration, standardized fields, coding.
 * End of Project
 * Statistical results
 * Database or at least dataset with good metadata for future use by DataONE. This is the whole point of the project, eh? To encourage and be an example of good data sharing and reuse.
 * Summary paper (at least documentation, hopefully moving towards publishable manuscript). Coordinate with Nic and Valerie.
 * Suggestions for best practices for authors, journals, depositories, DataONE.
 * Include anecdotal information I have obtained from authors I know who reuse data (Sites, Peck, SysBio TreeBASE paper, etc.) and from people I've talked to about their databases (Miller, NEON, etc.).

Data sharing and citation policies for journals, funding sources and repositories
Owner: Nic

Description
In this project I will investigate data management policies for the existence (or absence) of requirements for researchers to share and cite data. This will be accomplished in two phases. In phase I, I will collect data management policies from a number of journals, repositories and funding sources in order to quantitatively assess data sharing and citation requirements. In phase II, I will try to determine the impact of the policies based on correlations with Sarah and Valerie's data.

My specific research questions include:


 * 1) What are the data sharing and citation policies applicable to authors, from funders, journals, and repositories?
 * 2) How do these policies differ by discipline, journal, data type, data source?
 * 3) How has the spectrum of applicable policies changed over time? (Need more thought on how to track this)
 * 4) How do the applicable policies correlate with data sharing behavior?
 * 5) How do the applicable policies correlate with citing data?

Scope and Plan
Project will be carried out in two phases

 Phase 1 : Collecting and "quantifying" various attributes of policies

I'll use the following sources (please add sources):


 * Journals: SysBio, AmNat, MolEco
 * Repositories: TreeBASE, GenBank, Pangaea
 * Foundations / Funding Bodies: NSF, JISC, AU ANDS

I will collect the following elements from each source (linked to Google Docs spreadsheets):

Metadata: Journals, Repositories + Funders

Policy Data

Phase 2 : Determining Impact of Policies

This will be done by correlating my quantified policy data with Valerie and Sarah's reuse data. (More to come)
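
One possible shape for this step, sketched with Python's built-in `sqlite3` module: policy records and article records share a link field (here the journal abbreviation), and reuse rates are compared across policy categories. The table names, the `requires_sharing` and `reuses_data` fields, and all rows are invented placeholders, not collected data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE policy (journal TEXT PRIMARY KEY, requires_sharing INTEGER);
    CREATE TABLE article (id INTEGER PRIMARY KEY, journal TEXT, reuses_data INTEGER);
""")
# Placeholder values only; 1 = yes, 0 = no.
conn.executemany("INSERT INTO policy VALUES (?, ?)",
                 [("SysBio", 1), ("AmNat", 0)])
conn.executemany("INSERT INTO article (journal, reuses_data) VALUES (?, ?)",
                 [("SysBio", 1), ("SysBio", 1), ("SysBio", 0),
                  ("AmNat", 0), ("AmNat", 1)])

# Join on the shared link field, then compare reuse rates by policy category.
rows = conn.execute("""
    SELECT p.requires_sharing, COUNT(*) AS n, AVG(a.reuses_data) AS reuse_rate
    FROM article AS a JOIN policy AS p ON a.journal = p.journal
    GROUP BY p.requires_sharing
    ORDER BY p.requires_sharing
""").fetchall()
for requires_sharing, n, rate in rows:
    print(requires_sharing, n, round(rate, 2))
```

The join only works if both datasets spell the link field identically, which is one reason to agree on journal abbreviations before extraction goes much further.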

Deliverables
As of 6/30--(If scope seems narrow please comment)


 * policies retrieved and data / metadata extracted for sources
 * comprehensive list of funders and metadata for resources Sarah and Valerie are working with

Data citation practice inventory for repositories
Owner: Valerie (very similar to Sarah's project, above)
Comment: I have some ideas on repository inventory that I haven't been able to explore yet, so we should talk about ideas and approaches. I'll post more later; email me if I haven't by about June 14, or if you want to collaborate sooner! --Sarah
 * 1) What are all the ways that data housed in given repositories are cited or attributed?
 * 2) How do these practices vary across discipline, journal, data type, data source?
 * 3) How have these practices varied across time?

Link to repository public spreadsheet on Google Docs

Scope and Plan

 * 1) Which repositories? For the June 30, 2010 midpoint, I will mostly focus on TreeBASE. Future repositories to examine include: Pangaea and the ORNL DAAC archive
 * 2) How will you bound the problem? The search will be limited to articles in English published in 2008 or later within ISI Web of Science, Scirus, Nature, SysBio, and Google Scholar.
 * 3) What methods will be used to search for citation and attributions? Using which search resources? I will use full-text search when available. If full text is not available, I will look for relevant articles using keyword search and search through the reference pages for any mention of TreeBASE or respective study accession numbers. Naturally, my methods will change as I find what works and what does not work, and I will keep regular entries in this lab notebook.
 * 4) What is the estimated coverage of these methods? Since my methods may change depending on what works and does not work, I cannot provide an estimate at this time.
 * 5) What stats will you run? What is your statistical power? Once again, this will depend on the initial data set I draw in the next week. Please check the lab notebook for further developments.
 * 6) What do you plan to have complete by June 30th? A full spreadsheet containing articles reusing data from TreeBASE and documentation for how that data was cited in those articles; a report summarizing findings based on this spreadsheet.
 * 7) Plans for integration with other intern work? Since Sarah's project is similar to mine, I will keep in touch with her regarding methodology and data integration.
 * 8) Plans for integration/parallel analysis with Heather's NCBI GEO work? Since many articles I have found so far are also in PubMed and BioMed, I expect a lot of overlap to occur with Heather's work and look forward to further collaboration with her.

June 10, 2010: I will start looking for the following ways that TreeBASE is cited in articles:
 * 1) Mention of TreeBASE or TreeBANK
 * 2) DOI or URI
 * 3) Full citation as per TreeBASE recommendations.
 * 4) Mention of data author only
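
For full-text searching, the first two citation forms above could be flagged automatically, leaving the other two for manual coding. This is only a sketch: the regular expressions, the label names, and the `citation_signals` helper are assumptions, and a matched DOI is not necessarily a dataset DOI, so every hit still needs a human check.

```python
import re

# Assumed patterns; adjust after inspecting real articles.
NAME = re.compile(r"\btree\s?(base|bank)\b", re.IGNORECASE)
DOI_OR_URI = re.compile(
    r"\b(10\.\d{4,9}/\S+|https?://\S*treebase\S*)", re.IGNORECASE
)

def citation_signals(text):
    """Return labels for the citation forms detected in a passage of text."""
    signals = []
    if NAME.search(text):
        signals.append("name_mention")   # form 1: TreeBASE or TreeBANK named
    if DOI_OR_URI.search(text):
        signals.append("doi_or_uri")     # form 2: a DOI or a TreeBASE URI
    # Forms 3 and 4 (full citation, data author only) need manual review.
    return signals or ["manual_review"]

# Hypothetical passages for illustration.
print(citation_signals("Alignments were deposited in TreeBASE."))
print(citation_signals("Data are available at http://www.treebase.org/ online."))
print(citation_signals("We thank J. Smith for sharing the alignment."))
```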

I will look in the following databases and journals:
 * 1) ISI Web of Science
 * 2) Scirus
 * 3) Nature
 * 4) SysBio
 * 5) Google Scholar


 * This information will go into a spreadsheet housed here: TreeBASE Citations
 * My observations will also be posted here: Reuse of repository data

Milestones

 * 1) 6/30/2010: completed spreadsheet and report summarizing findings

Phase 2
Where should this program go after this initial investigation? Other repositories to examine in depth later:
 * GODAE
 * ORNL DAAC archive
 * Pangaea
 * STD-DOI project