DataONE:Notebook/Reuse of repository data/2010/06/29

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.


Reuse of Repository Data


Notes for June 29, 2010

Goals/tasks for today:

Go through previous entries, spreadsheets and data
Write initial outline summarizing findings so far based on above review
Create PowerPoint presentation based on outline

Outline for Meeting

I. Motivations and initial questions

Data deposit vs. data reuse
How difficult is it to find data citations and why?
How do the citations vary across discipline, repository and publication?
What is the most common form of citation: repository name, data author name, or a unique identifier like a study number or DOI?

II. Methods

Initial search process: Test searches for TreeBASE, which yielded sample articles, study accession numbers and data author names to search for later.
Focused search

Repositories

TreeBASE
Pangaea
ORNL DAAC

Databases

ISI Web of Science Cited Reference Search
Scirus
Google Scholar

Limits

Date range: 2008-2010

Language: English

Journal articles only

Repository-specific search terms

TreeBASE: repository name, study accession number, data author name
Pangaea: repository name, DOI prefix, data author name
ORNL DAAC: repository name, DOI prefix, data author name, project name (BOREAS, FLUXNET, etc.)

Raw data input

1. Search comparison spreadsheet hosted here

Search methods, search terms and the datasets used to construct those terms were captured, along with the total number of results and the respective hits and misses.
Percentages of hits vs. misses calculated within the spreadsheet.
Reasons for miss captured
Reasons for hit captured

2. Shared fields template from Sarah with my input data hosted here

Hosts data about individual articles, including DOIs as applicable, metadata and coding for hits and misses.
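The percentage calculation done in the search comparison spreadsheet can be sketched in a few lines of Python. This is only an illustration: the function name `hit_rate` and the "hit"/"miss" labels are assumptions, not the actual spreadsheet formulas or coding scheme.

```python
from collections import Counter

def hit_rate(coded_results):
    """Return the percentage of hits among results coded "hit" or "miss".

    The labels "hit" and "miss" are hypothetical stand-ins for the
    coding used in the spreadsheet.
    """
    counts = Counter(coded_results)
    total = counts["hit"] + counts["miss"]
    if total == 0:
        return 0.0  # no coded results, so avoid dividing by zero
    return round(100.0 * counts["hit"] / total, 1)

# Made-up example: one search that returned five results, two of them hits.
example_search = ["hit", "miss", "miss", "hit", "miss"]
print(hit_rate(example_search))
```

Keeping the raw hit and miss counts next to the percentage makes it easier to tell a strong search from one that simply returned very few results.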

Interpretation
- Browse through observations made within OpenWetWare journal entries
- Look through Search Comparisons spreadsheet for percentage of hits versus misses as well as the types of hits.

III. Stumbles

Finding focus and the difficulty of going beyond the obvious
- Mention of repository could mean either data was deposited there or downloaded from there.
- TreeBASE study accession numbers cited in article may have changed over time (from StudyID to LegacyID after study publication).
- “Pangaea” can refer to either Pangaea.de data repository or the Pangaea supercontinent. How do I exclude these results?
- Sometimes narrowing search terms with Boolean operators or “-” exclusion returned no results at all, while broadening back out returned too many results to read through manually.
- Google Scholar does not distinguish between published journal articles and dissertations deposited in academic repositories.
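One possible way around the Pangaea ambiguity is to keep the search broad and sort the returned snippets afterwards, instead of excluding terms in the query. The sketch below is an assumption-heavy illustration: the cue lists and the function `classify_snippet` are hypothetical and were not part of the searches described in this entry.

```python
# Hypothetical cue words; these would need tuning against real results.
GEOLOGY_CUES = ("supercontinent", "gondwana", "laurasia")
REPOSITORY_CUES = ("pangaea.de", "doi", "data set", "dataset")

def classify_snippet(text):
    """Guess which sense of "Pangaea" a result snippet uses.

    Uses plain substring matching, which is crude: "doi" would also
    match inside a word such as "doing".
    """
    lowered = text.lower()
    if "pangaea" not in lowered:
        return "unrelated"
    repo = any(cue in lowered for cue in REPOSITORY_CUES)
    geo = any(cue in lowered for cue in GEOLOGY_CUES)
    if repo and not geo:
        return "repository"
    if geo and not repo:
        return "supercontinent"
    return "review manually"  # mixed or missing cues need a human look

print(classify_snippet("Data were downloaded from www.pangaea.de"))
print(classify_snippet("The breakup of the supercontinent Pangaea"))
```

This does not remove the manual reading step, but it could shrink the pile by setting aside snippets that are clearly about geology.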

“Missing” searches (use Search Methodology Table as visual aid in slideshow)
- For the sake of thoroughness, I intended to go through each possible search combination.
- Not all searches worked, and I did not record the failed ones in my notebook. However, it is important to record these “failures” for future reference.
- Also, using the above-linked table helped show me that I missed some possible combinations.
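The check for missed combinations could also be automated. The sketch below enumerates every repository, database and search-term combination listed under Methods and reports the ones with no recorded search. The names `all_combinations` and `missing_searches` and the tuple layout are assumptions made for illustration, not the structure of the actual table.

```python
from itertools import product

# Search-term types per repository, as listed under Methods.
TERMS_BY_REPOSITORY = {
    "TreeBASE": ["repository name", "study accession number", "data author name"],
    "Pangaea": ["repository name", "DOI prefix", "data author name"],
    "ORNL DAAC": ["repository name", "DOI prefix", "data author name", "project name"],
}
DATABASES = ["ISI Web of Science", "Scirus", "Google Scholar"]

def all_combinations():
    """Return every (repository, database, term type) combination as a set."""
    return {
        (repo, db, term)
        for repo, terms in TERMS_BY_REPOSITORY.items()
        for db, term in product(DATABASES, terms)
    }

def missing_searches(recorded):
    """Return the combinations that have no recorded search, sorted."""
    return sorted(all_combinations() - set(recorded))

# Hypothetical log with a single recorded search, including failures later.
recorded = [("TreeBASE", "Scirus", "repository name")]
print(len(all_combinations()), len(missing_searches(recorded)))
```

Logging failed searches in the same list as successful ones would keep the “failures” on record instead of leaving them as gaps.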

“Like trying to find someone on Facebook only knowing their hair color and favorite breakfast cereal.”

IV. Findings by Repository

1. TreeBASE

ISI
- Most effective: Searching for author name and citations of the original article in which the dataset was used.
- Least effective: Searching by mention of repository name (ISI also did not allow searching by study accession number).
Scirus
- Most effective: Search for mention of TreeBASE with controlled vocabulary.
- Least effective: Author name or study accession number.
Google Scholar
- Most effective: This should more accurately be called "least ineffective."
- Least effective: Even with controlled vocabulary, searching by mention of TreeBASE was not helpful. Neither was searching for study accession numbers or data author names.

2. Pangaea

ISI
- Most effective: Search by individual author name and mention of Pangaea in Cited Author or Cited Work fields.
- Least effective: Some DOIs turned up in search results, but the search fields themselves do not accept a DOI.
Scirus
- Most effective: Searching by DOI prefix with "*" wildcard.
- Least effective: Searching by author name.
Google Scholar
- Most effective: Possibly searching by DOI prefix with controlled vocabulary (not the same controlled vocabulary as used with TreeBASE, however).
- Least effective: Everything else.

3. ORNL DAAC

ISI
- Most effective: Search by data author name/original publication.
- Least effective: Once again, DOI cannot be used in search fields.
Scirus
- Most effective: Search by DOI prefix with "*" wildcard and search by author name with ORNL project name (FLUXNET, BOREAS, etc.). Search for mentions of ORNL DAAC also yielded solid hits.
- Least effective: Surprisingly, none.
Google Scholar
- Most effective: Search for DOI prefix.
- Least effective: Search for mention of repository name (ORNL DAAC) even with controlled vocabulary.

V. Conclusions

VI. Future Plans

Article
Other repositories, search terms and databases
Compare data with other interns
