This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.
Telecon Meeting Notes June 3 2010
ORNL funded by NASA (remote sensing, etc.)
Citizen science!! Cornell ornithology, phenology network (FIND OUT MORE…future job?)
NEON through NSF
To read: the 4th paradigm
Train scientists and next generation how to manage data
Based on previous Google Summer of Code project, help from NESCent
If you have trouble getting hold of mentors when you need input, contact Rebecca Koskela: email@example.com
To access etherpad old notes: http://epad.dataone.org/_index
Data Conservancy is a similar project at Johns Hopkins, funded at the same time
Citation Practices project
How should datasets be cited?
How do people cite datasets?
Expect something from Bruce soon
Main goal: paper on best practices
Start with lit review?
Make note of how to reserve or partition spots for undergraduates or lower level grad students (so such students can participate)
Use a blog to generate continuous discussion about the projects, or ONS (open notebook science) to post research notes online (OpenWetWare; "wetware" is colloquial for your brain, to distinguish it from software and hardware). Also Google Wave for collaborative document development (can be open to the public or not)
Strive to always communicate with the whole group, not just a single mentor or student
If you need access to a particular resource, ask mentors (can get temporary access to library resources, especially with Todd Vision) or otherwise pull strings
Goals of the project
Goal: reuse data and change how data is used
Our project is especially encouraged to have a conference presentation or published paper
Dryad is a partner of DataONE: http://www.datadryad.org/
Dryad's development wiki is here... good for exploring: https://www.nescent.org/wg_dryad/Main_Page
Project in conjunction with Heather:
1. Understand the degree to which data is reused in studies within the Dryad partner journals (see website) but not robustly cited
2. Inventory the data citation practices in the literature by studies that reuse data from a few primary data repositories, perhaps Treebase, the ORNL DAAC, or economics replication data.
Goals in question format:
- given a journal (or, for example the set of biology journals that are partners of the Dryad data repository), what proportion of articles reuse data but do not attribute it using data citation best practices? How is the data attributed instead?
- given a repository or datatype (for example, Treebase data, the ORNL DAAC archive, or economics replication datasets) what are all the different ways data reuse is attributed? What are the most common ways?
Main output would be some publications describing these findings.
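A minimal sketch of how the two questions above could be tallied once articles have been read and hand-coded. The records, journal names, attribution labels, and the choice of what counts as "best practice" are all made-up placeholders for illustration, not project data or an agreed definition.

```python
from collections import Counter

# Hypothetical hand-coded records, one per article that was read and annotated.
# "attribution" is how reused data was credited (None if no data was reused).
articles = [
    {"journal": "Journal A", "reuses_data": True, "attribution": "cites original paper"},
    {"journal": "Journal A", "reuses_data": True, "attribution": "dataset DOI or accession"},
    {"journal": "Journal A", "reuses_data": False, "attribution": None},
    {"journal": "Journal B", "reuses_data": True, "attribution": "mentions repository by name"},
    {"journal": "Journal B", "reuses_data": True, "attribution": "cites original paper"},
]

# Assumption: which attribution labels count as "best practice" is still an
# open question for the project; this set is only a placeholder.
BEST_PRACTICE = {"dataset DOI or accession"}


def reuse_summary(records):
    """Return (proportion of all articles that reuse data without a
    best-practice citation, Counter of attribution methods among reusers)."""
    reusing = [r for r in records if r["reuses_data"]]
    not_best = [r for r in reusing if r["attribution"] not in BEST_PRACTICE]
    proportion = len(not_best) / len(records) if records else 0.0
    ways = Counter(r["attribution"] for r in reusing)
    return proportion, ways


proportion, ways = reuse_summary(articles)
print(f"{proportion:.0%} of articles reuse data without a best-practice citation")
for how, n in ways.most_common():
    print(n, how)
```

The same function can be run per journal or per repository by filtering the records first, which matches the "given a journal" and "given a repository or datatype" framing of the questions.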
From ESA 2009: Michael Whitlock (University of British Columbia and former editor of The American Naturalist) discussed data sharing in evolution journals. It is important for researchers to share their data in order to provide an avenue for error checking, to allow new methods for meta-analysis and new interpretation of existing data, and to increase citations. However, most ecology and evolution journals cannot require data archiving because the data repositories do not exist yet. There is an initiative to create a data archive, Dryad (www.datadryad.org), which is currently accepting data by email while it is in beta version. Five major evolution journals adopted a common data sharing policy at the same time so no journal would suffer from being the sole source to encourage data sharing. The draft policy states, "The [journal] requires, as a condition for publication, that data used in the paper should be archived in an appropriate public archive, such as GenBank, Treebase, or Dryad. The data should be given with sufficient details that, together with the contents of the paper, allow each result in the published paper to be re-created." This policy provides for three accommodations: 1) data can be archived with a one-year embargo on public access if desired by the author(s); longer embargos can be considered upon application to the journal editor; 2) archived data should be cited fairly, and journals should encourage citation of the original paper, not accession numbers; and 3) archiving is required only for data used in the paper being published.
To staunch this loss, Dryad serves as a repository for tables, spreadsheets, flatfiles, and all other kinds of published data that do not currently have a home. A major design consideration with these data is to avoid placing an undue burden of metadata generation on individual researchers while at the same time capturing sufficient metadata to enable data discovery and reuse. A special section of the repository named Dryad Lab hosts datasets of particular educational value for use in undergraduate and graduate training.
Dryad participates in the DataONE network (the Data Observation Network for Earth, http://dataone.org), and is actively developing partnerships with other international data networks and scholarly publishing organizations.
Why are there so few published dataset citations even though journals are “encouraging” it? What is Treebase doing differently than Dryad? What are Systematic Bio, American Naturalist, and Molecular biology doing differently (the only journals with more than 1-2 datasets on Dryad)?
What kinds of data are being generated by labs?
What kinds of procedures (techniques, software programs) are being used by labs to generate those data?
Does the typical group of researchers who generate (raw) data overlap with the group of researchers who are currently utilizing data repositories and semantic web functions?
What kinds of data might be included in, or attached to, papers submitted to publications?
What kinds of data are in existing (targeted) databases/repositories?
What is the focus of each of the following research areas, what kind of data do they typically generate, and what type of software/equipment do they typically utilize?
How do I include references to published materials?
Referencing published books and papers is easy in OpenWetWare. See OpenWetWare:Biblio for instructions on how to automatically generate formatted citations from just the ISBN for books and the PMID for papers.
LSID - Life Science Identifiers (Heavily used in oceans community)
URI: Uniform Resource Identifier (URI) is a string of characters used to identify a name or a resource on the Internet. Such identification enables interaction with representations of the resource over a network (typically the World Wide Web) using specific protocols. Schemes specifying a concrete syntax and associated protocols define each URI.
GUIDs - Globally Unique Identifiers
SEEK - Science Environment for Ecological Knowledge, a project developing tools for archiving, processing, and analyzing ecological data.
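A small sketch of what the three identifier types above look like in practice, using only the Python standard library. The LSID value is made up for illustration (LSIDs follow the pattern urn:lsid:authority:namespace:object, with an optional revision), and `parse_lsid` is a hypothetical helper, not part of any library.

```python
import uuid
from urllib.parse import urlparse


def parse_lsid(lsid):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


# Made-up LSID, for illustration only.
example_lsid = "urn:lsid:example.org:specimens:12345"
lsid_parts = parse_lsid(example_lsid)
print(lsid_parts)

# A URI splits into a scheme (which implies the protocol) plus the resource location.
uri = urlparse("http://www.datadryad.org/")
print(uri.scheme, uri.netloc, uri.path)

# A GUID/UUID can be generated locally with no central registry.
guid = uuid.uuid4()
print(guid)
```

The practical difference for data citation: a GUID is unique but says nothing about where the data lives, while a URI or LSID carries an authority that can be asked to resolve it.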
Knowledge Network for Biocomplexity (KNB)
Michener, Wilson, Trisha, etc. papers
Heather’s publications: http://www.researchremix.org/wordpress/publications/
National Center for Ecological Analysis and Synthesis (NCEAS)
Ecological Metadata Language (EML)
How is DataONE different?
DataCite is an international consortium to
• establish easier access to scientific research data on the Internet, to
• increase acceptance of research data as legitimate, citable contributions to the scientific record, and to
• support data archiving that will permit results to be verified and re-purposed for future study.
ESA data registry
Only 37 available datasets
- not only a problem for datasets, but also sometimes for conference presentations, which are a 10 min blip in time and do not always end in a paper in the following years. Often the abstract is not indicative of what was learned in the presentation.
Example: Luyssaert et al Net Primary Productivity (NPP)
Data at ORNL DAAC (doi:10.3334/ORNLDAAC/949)
Article at doi:10.1111/j.1365-2486.2007.01439.x
Drawn from many sources (very well documented)
Future work using Luyssaert dataset can’t compare it to any of the underlying data
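The two DOIs in the Luyssaert example can be turned into resolvable links by prefixing them with the public DOI resolver at https://doi.org/. A minimal sketch; `doi_to_url` is a hypothetical helper name, and no network request is made here.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Turn 'doi:10.xxxx/suffix' (or a bare DOI) into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Every DOI starts with the "10." directory indicator and has a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slashes.
    return DOI_RESOLVER + quote(doi, safe="/")


dataset_url = doi_to_url("doi:10.3334/ORNLDAAC/949")
article_url = doi_to_url("doi:10.1111/j.1365-2486.2007.01439.x")
print("Dataset:", dataset_url)
print("Article:", article_url)
```

Having separate DOIs for the dataset and the article is what makes it possible to cite the data itself rather than only the paper that describes it.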
Advantages of citing (and properly storing) a dataset:
- meta-analysis uses the dataset, not the authors’ published conclusions about it
- if need to use a dataset, can find it rather than contacting authors
- author can see who is using their dataset
- sometimes data is collected but not used or not used to its full extent (especially the case with under/grad researchers)
- if permanent, URL does not expire when PI moves to a different university (I’ve lost web space in this instance, or the website was redone when I wasn’t involved, so others didn’t know the importance of my data and buried it somewhere else). Question: are there any publishers/journals that store the raw dataset?
Governmental: data are for public benefit (most U.S. federal data is in the public domain), but state data is not, e.g. USU buglab BLM data. Ask Scott about how data sharing works with their database.