Manuscript version on Google Docs
- Your comments, suggestions, and critiques are appreciated.
- Extremely Rough Draft
- Sarah Judson 19:32, 2 August 2010 (EDT):Draft updated today with writeup of dataset-level analysis methods and results. Comments are greatly appreciated!
Abstract (July 20 2010)
- this will probably be expanded into the introduction instead
- With the advent of the internet, data has literally been placed at our fingertips. Most commonly, we see this in the form of factoids and novel visualizations. It seems like anything imaginable can be retrieved and that there is an inundation of under-utilized information. Yet, in the scientific community, a plethora of data is produced and remains obscure (could give potential reasons why). In response to this high production without requisite dissemination, editorials have been published calling for more data sharing between scientists. Starting in ** with Genbank?, a number of data repositories have been formed to facilitate sharing of raw data. In conjunction, many repositories and journals have issued policies for reusing and sharing data. However, many of the journal policies lack information on how to cite such data in subsequent reuses. It is unclear whether these policy recommendations are followed and to what degree depositories are utilized for both data reuse and sharing. Therefore, we investigated incidents of data reuse and sharing in **300** articles spanning journals in ecological, evolutionary, and environmental disciplines. Six focal journals were selected based on corresponding repository, discipline, and impact factor. All journals were additionally expected to produce non-standard biological data that could be deposited in the newer Dryad depository. Fifty articles from each journal, half from 2000 and half from 2010, were evaluated for incidence of reuse or sharing, resolvability of the data citation (i.e. provision of an accession number), and proper attribution to the original data author(s). Out of the #% of articles that reused data, #% mentioned the depository, #% gave an accession number, and #% cited the original data authors, other than themselves, with a full bibliographic citation. **scoring** Of the # articles that shared data, #% mentioned the depository, #% gave an accession number, and #% shared all of their produced data.**scoring. 
The majority of data reuses lacked accession numbers even when the utilized depository was mentioned, and many gave either an author citation or an accession number, but not both. Data sharing was most common with datasets conducive to Genbank (i.e. gene sequences), but severely lacking with ecological and phylogenetic datasets. In particular, the systematic and molecular communities (represented by the journals Systematic Biology and Molecular Ecology, respectively) often produced gene alignments and/or phylogenetic trees, but these were often not posted to Treebase despite obvious general knowledge of its availability and recommendations by the journal policies to do so. Biological (organismic) and ecological (community) datasets across the board were often not shared, and most instances of reuse involved extraction of information from the published literature. General problems encountered were non-persistent or difficult-to-access datasets, supplementary archives utilized for additional result outputs rather than raw data, and non-compliance with journal and repository citation guidelines. Upon assessment of current citation practices in journals, we recommend a standardized bibliographic format for data citation and clearer, enforced author instructions in journal and repository policies.
Outline (July 20 2010)
loose manuscript ideas
intro
Since the advent of the internet, scientists have been grappling with the simultaneous plethora and drought of data. While a multitude of data is theoretically available, much of it with a single click, there is still much data that is unobtainable without significant effort....
Similar studies (Heather) have used citation tracking to assess incidences of reuse. However, data handling numbers and the tracking thereof are not consistently given for ecological/environmental/evolutionary studies. This observation was confirmed by Valerie's work (unpublished data). Therefore, I manually searched full-text articles and extracted excerpts relating to data reuse, sharing, and production.
methods
The last paragraph of the intro, the methods, and the first paragraph of the results were read in most detail; other sections were skimmed and checked by searching the full text for keywords (depositories, author, data set, etc.).
journal selection
A set of candidate journals was first determined from the journals with the most posted datasets on Dryad (set up to host non-standard biological datasets). This set was then expanded to include other journals that corresponded with currently available depositories other than Dryad. Journals were chosen based on corollary depository, impact factor, and high deposition rates in Dryad:
- gcb: DAAC
- paleo: paleodb
- molecEco: genbank (treebase)
- sysbio: treebase (genbank)
- ecology: ecological archives
- amnat: dryad
We expected these journals to have a high rate of deposition in the corollary depositories, as well as incidental deposition in internal (journal) depositories. We were therefore investigating both these rates of deposition as compared to data produced and incidences of reuse of already deposited datasets. Furthermore, we assessed the citation quality of each reuse and sharing (in methods, give criteria).
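The citation-quality criteria still need to be written up. As a placeholder, here is a minimal sketch that counts how many of the three criteria named in the abstract a reuse citation meets (depository mentioned, accession number given, original data authors cited). The record layout, the names `ReuseCitation` and `citation_quality`, and the equal weighting are all assumptions for illustration, not the scoring actually used in this study.

```python
from dataclasses import dataclass


@dataclass
class ReuseCitation:
    """One data-reuse incident extracted from an article (hypothetical layout)."""
    mentions_depository: bool
    gives_accession_number: bool
    cites_original_authors: bool


def citation_quality(citation: ReuseCitation) -> int:
    """Return how many of the three criteria are met (0 to 3).

    Equal weighting of the criteria is an assumption, not the study's scheme.
    """
    return sum([
        citation.mentions_depository,
        citation.gives_accession_number,
        citation.cites_original_authors,
    ])


# Made-up example: depository named and authors cited, but no accession number.
example = ReuseCitation(
    mentions_depository=True,
    gives_accession_number=False,
    cites_original_authors=True,
)
print(citation_quality(example))  # prints 2
```

An ordinal score like this would fit the ordinal regression idea, but the weighting could just as well be made journal or repository specific.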
dataset types (maybe a table)
The bio, eco, and earth categories could be refined. We fully acknowledge that the current types are ambiguous and are assigned on a subjective level. We invite re-coding and re-analysis of this dataset in this and other regards.
- bio: measurements taken at the organismic level, ranging from morphological characters to chemical analysis, but excluding genetic sequences.
- earth: abiotic measurements, ranging from meteorological data to soil analysis.
results possibilities:
Percentages
- % extinct urls from personal/other share list (illustrates one reason depositories should be employed)
- % reuse per journal per year (or per discipline, funder, etc)
- % sharing per journal per year
- % sharing vs. % produced
- % that could have been put in relevant depository but weren't (especially treebase)
Scoring
- "quality" reuse citation (journal/repository specific?)
- % sharing vs. % produced
Correlations
- dataset type (or journal, discipline, nationality, funding type) to YN/quality reused/shared
- open access to data reuse/sharing
- something with multiple datasets
statistics: ordinal regression, anova, clustering, correlation
additional statistics to do:
- percent improvement from 2000 to 2010 (would be a better measure for journals like ecology which have few reuses); could do this for citation quality, incidences of reuse, and incidences of sharing/or sharing/production %
- also (similar) % increase in utilization of depositories (esp. treebase) --> but then in discussion, state that treebase is still underutilized judging by the amount of pt and ga produced that aren't posted. (? could depositories be more active and contact authors to deposit in them after they see a paper published? or accepted....this could capitalize on relationships with editorial boards of journals)
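A minimal sketch of the percentage statistics above, assuming each article has been coded as a (journal, year, reused-data yes/no) record. The records below are made up for illustration and are not results. "Percent improvement" is shown two ways, because a relative change is undefined when the 2000 baseline is zero, which may happen for journals with few reuses.

```python
from collections import defaultdict

# Hypothetical coded records: (journal, year, reused_data). Not real results.
records = [
    ("sysbio", 2000, False),
    ("sysbio", 2000, True),
    ("sysbio", 2010, True),
    ("sysbio", 2010, True),
    ("ecology", 2000, False),
    ("ecology", 2000, False),
    ("ecology", 2010, True),
    ("ecology", 2010, False),
]


def percent_by_group(rows):
    """Return {(journal, year): percent of articles flagged True}."""
    counts = defaultdict(lambda: [0, 0])  # (journal, year) -> [hits, total]
    for journal, year, flag in rows:
        counts[(journal, year)][0] += int(flag)
        counts[(journal, year)][1] += 1
    return {key: 100.0 * hits / total for key, (hits, total) in counts.items()}


def point_change(pct, journal, start=2000, end=2010):
    """Change in percentage points between the two sample years."""
    return pct[(journal, end)] - pct[(journal, start)]


def relative_change(pct, journal, start=2000, end=2010):
    """Percent change relative to the start year; None if the baseline is zero."""
    baseline = pct[(journal, start)]
    if baseline == 0:
        return None
    return 100.0 * (pct[(journal, end)] - baseline) / baseline


pct = percent_by_group(records)
for journal in ("sysbio", "ecology"):
    print(journal, point_change(pct, journal), relative_change(pct, journal))
```

The same functions would work for sharing or for sharing/production by swapping the yes/no flag that is coded for each article.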
discussion
Need to address the problem of where and how to store highly variable bio/eco data. Obviously authors aren't sure and feel that their data is so local that it isn't relevant/worth sharing (I don't have evidence to say it that way...but try to get at the point that bio/eco are undershared and the likely reasons).
changes to subsequent studies or reanalysis of this data (use the invite lingo...open science will dig that)
- omit non-regular articles (notes and comments, points of view, etc)
- more detailed analysis of funders, nationality, etc
- determine different bio/eco/ea data types
add to future directions:
- just look at meta-analyses, which should have a higher incidence of reuse
- broader impact journals: nature (too many confounding factors for this study)
Outline (June 30 2010)
- state that this has been studied for molecular/medical but not eco/evo
- term: "original data author"
- Time sample snapshot
- First 25-50 articles from 2010
- for Percentage stats, current state of things
- Random sample
- blocked by journal and by year
- Phylogenetic trees (PT) and gene alignments (GA) not assumed to be on Treebase unless explicitly stated (even if the other was)
- Suggested best practices (see knoxville ppt)
- Internal (journal) supplementary data used more as a dump than for reusable data
- Specifically, sysbio problems (see Knoxville ppt and 7/13 blog)
- differences between journals
- examples of haphazard citations
- NABS: talk with NEON, Alain, Miller
- Author frustrated with Treebase: 10.1093/sysbio/syp080
- Genbank oddities (cited a bazillion different ways in the same paper = need for best practices): http://dx.doi.org.erl.lib.byu.edu/10.1111/j.1365-294X.2009.04411.x 10.1111/j.1365-294X.2009.04411.x
AND http://dx.doi.org.erl.lib.byu.edu/10.1111/j.1365-294X.2009.04433.x 10.1111/j.1365-294X.2009.04433.x
- Future directions (i.e. didn't have time to explore)
- Software/model reuse and sharing (R-packages, GUIs for math equations, etc)
- Method metadata (additional supplementary data about methods, i.e. explicit protocols or analysis steps that help make the data re-usable)
- also, the reusability of the file format
- Track the cited or shared datasets
- Databases (all the rage, but not standardized. at least the metadata and cache could be stored at a repository)
- Recommendations for future studies
- only extract methods
- more in-depth: time series, snapshots of more journals, funding analysis, other factor analysis
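The "random sample, blocked by journal and by year" item in the outline could be drawn along these lines. This is a sketch under stated assumptions: the article pool, the identifiers, the name `blocked_sample`, and the fixed seed are hypothetical, and 25 articles per journal-year block is taken from the abstract's "fifty articles from each journal, half from 2000 and half from 2010".

```python
import random
from collections import defaultdict


def blocked_sample(articles, per_block, seed=0):
    """Draw up to `per_block` articles at random from each (journal, year) block.

    `articles` is an iterable of (article_id, journal, year) tuples.
    A fixed seed keeps the draw reproducible for later re-analysis.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for article_id, journal, year in articles:
        blocks[(journal, year)].append(article_id)
    sample = {}
    for key, ids in sorted(blocks.items()):
        # If a block holds fewer articles than requested, take all of them.
        sample[key] = rng.sample(ids, min(per_block, len(ids)))
    return sample


# Hypothetical pool: 100 placeholder article ids per journal per year.
journals = ["amnat", "ecology", "gcb", "molecEco", "paleo", "sysbio"]
pool = [
    (f"{journal}-{year}-{i}", journal, year)
    for journal in journals
    for year in (2000, 2010)
    for i in range(100)
]

sample = blocked_sample(pool, per_block=25)
print(len(sample), sum(len(ids) for ids in sample.values()))  # prints 12 300
```

Blocking keeps every journal and both years equally represented, which matters for the per-journal, per-year percentages. The "first 25-50 articles from 2010" snapshot is a different, non-random design and would not need this step.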