DataONE:Notebook/Reuse of repository data/2010/06/29


This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.




Notes for June 29, 2010

  • Goals/tasks for today:
  1. Go through previous entries, spreadsheets and data
  2. Write initial outline summarizing findings so far based on above review
  3. Create PowerPoint presentation based on the outline

Outline for Meeting

I. Motivations and initial questions

  • Data deposit vs. data reuse
  • How difficult is it to find data citations and why?
  • How do the citations vary across discipline, repository and publication?
  • What is the most common citation? Repository name? Data author name? Unique identifier like a study number or DOI?

II. Methods

  • Initial search process: Test searches of TreeBASE yielded a set of sample articles; their study accession numbers and data author names were recorded to search for later.
  • Focused search
    • Repositories
      1. TreeBASE
      2. Pangaea
      3. ORNL DAAC
    • Databases
      1. ISI Web of Science Cited Reference Search
      2. Scirus
      3. Google Scholar
    • Limits
      • Date range: 2008-2010
      • Language: English
      • Journal articles only
    • Repository-specific search terms
      1. TreeBASE: repository name, study accession number, data author name
      2. Pangaea: repository name, DOI prefix, data author name
      3. ORNL DAAC: repository name, DOI prefix, data author name, project name (BOREAS, FLUXNET, etc.)
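As a sketch, the repository-specific term types listed above could be held in a small structure for assembling queries. The repository names and term types come from this entry; the quoted-phrase query format is an assumption, since ISI, Scirus and Google Scholar each have their own field codes and operators.

```python
# Repository-specific term types recorded in this entry. The plain
# quoted-phrase query syntax below is an assumption, not any one
# database's real syntax.
SEARCH_TERM_TYPES = {
    "TreeBASE": ["repository name", "study accession number", "data author name"],
    "Pangaea": ["repository name", "DOI prefix", "data author name"],
    "ORNL DAAC": ["repository name", "DOI prefix", "data author name", "project name"],
}

def build_query(repository, term_type, value=""):
    """Build a simple phrase query for one repository/term-type pair."""
    if term_type not in SEARCH_TERM_TYPES[repository]:
        raise ValueError(f"{term_type!r} was not used for {repository}")
    if term_type == "repository name":
        return f'"{repository}"'
    return f'"{repository}" "{value}"'

# "10.0000/EXAMPLE" is a placeholder, not a real DOI prefix.
print(build_query("Pangaea", "DOI prefix", "10.0000/EXAMPLE"))
```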
  • Raw data input
  1. Search comparison spreadsheet hosted here
    • Captured for each search: the search method, the search terms, the datasets used to construct those terms, the total number of results, and the respective hits and misses.
    • Percentages of hits vs. misses calculated within the spreadsheet.
    • Reasons for each hit and miss captured.
  2. Shared fields template from Sarah with my input data hosted here
    • Hosts data about individual articles, including DOIs as applicable, metadata, and coding for hits and misses.
  • Interpretation
    • Browse through observations made within OpenWetware journal entries
    • Look through Search Comparisons spreadsheet for percentage of hits versus misses as well as the types of hits.
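The hit/miss bookkeeping described above can be sketched in a few lines. The record layout here is a hypothetical stand-in for the actual spreadsheet columns, with made-up example rows.

```python
# Hypothetical stand-in for Search Comparisons spreadsheet rows: each
# record notes the search used and whether a result was a hit (actual
# data reuse) or a miss, with the reason captured. Rows are invented.
records = [
    {"search": "TreeBASE + author name", "hit": True,  "reason": "dataset reused"},
    {"search": "TreeBASE + author name", "hit": False, "reason": "deposit only"},
    {"search": "Pangaea + DOI prefix",   "hit": True,  "reason": "dataset reused"},
    {"search": "Pangaea + DOI prefix",   "hit": False, "reason": "supercontinent"},
    {"search": "Pangaea + DOI prefix",   "hit": True,  "reason": "dataset reused"},
]

def hit_rate(rows):
    """Percentage of hits among all results, as the spreadsheet computes it."""
    hits = sum(1 for r in rows if r["hit"])
    return 100.0 * hits / len(rows)

print(f"{hit_rate(records):.1f}% hits")  # 3 of 5 -> 60.0% hits
```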

III. Stumbles

  • Finding focus and the difficulty of going beyond the obvious
    • Mention of repository could mean either data was deposited there or downloaded from there.
    • TreeBASE study accession numbers cited in article may have changed over time (from StudyID to LegacyID after study publication).
    • “Pangaea” can refer to either Pangaea.de data repository or the Pangaea supercontinent. How do I exclude these results?
    • Sometimes narrowing search terms with Boolean operators or “-” exclusion returned no results at all, while broadening back out returned too many results to read through manually.
    • Google Scholar does not distinguish between published journal articles and dissertations deposited in academic repositories.
  • "Missing” searches (use Search Methodology Table as visual aid in slideshow)
    • For the sake of thoroughness, I intended to go through each possible search combination.
    • Not all searches worked, and I did not record the failed ones in my notebook. However, it is important to record these “failures” for future reference.
    • Also, using the above-linked table helped show me that I missed some possible combinations.
  • “Like trying to find someone on Facebook only knowing their hair color and favorite breakfast cereal.”
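To avoid the missed combinations noted above, the full repository × database × term-type grid could be enumerated up front as a checklist. This is a sketch: the three lists come from the Methods section, with accession number and DOI prefix collapsed into one identifier category for illustration.

```python
import itertools

# Repositories, databases and term types from the Methods section above.
repositories = ["TreeBASE", "Pangaea", "ORNL DAAC"]
databases = ["ISI Web of Science", "Scirus", "Google Scholar"]
term_types = ["repository name", "accession number / DOI prefix", "data author name"]

# Every possible search combination, so none is skipped or forgotten.
checklist = list(itertools.product(repositories, databases, term_types))
print(len(checklist))  # 3 * 3 * 3 = 27 combinations to record, hit or miss
```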

IV. Findings by Repository

1. TreeBASE

  • ISI
    • Most effective: Searching for author name and citations of the original article in which the dataset was used.
    • Least effective: Searching by mention of repository name (searching by study accession number was also not possible).
  • Scirus
    • Most effective: Search for mention of TreeBASE with controlled vocabulary.
    • Least effective: Author name or study accession number.
  • Google Scholar
    • Most effective: This should more accurately be called "least ineffective."
    • Least effective: Even with controlled vocabulary, searching by mention of TreeBASE not helpful. Neither was searching for study accession numbers or data author names.

2. Pangaea

  • ISI
    • Most effective: Search by individual author name and mention of Pangaea in Cited Author or Cited Work fields.
    • Least effective: Some DOIs turned up in search results, but the DOI itself could not be used as a search field.
  • Scirus
    • Most effective: Searching by DOI prefix with "*" wildcard.
    • Least effective: Searching by author name.
  • Google Scholar
    • Most effective: Possibly searching by DOI prefix with controlled vocabulary (not the same controlled vocabulary as used with TreeBASE, however).
    • Least effective: Everything else.

3. ORNL DAAC

  • ISI
    • Most effective: Search by data author name/original publication.
    • Least effective: Once again, the DOI cannot be used in search fields.
  • Scirus
    • Most effective: Search by DOI prefix with "*" wildcard and search by author name with ORNL project name (FLUXNET, BOREAS, etc.). Search for mentions of ORNL DAAC also yielded solid hits.
    • Least effective: Surprisingly, none.
  • Google Scholar
    • Most effective: Search for DOI prefix.
    • Least effective: Search for mention of repository name (ORNL DAAC) even with controlled vocabulary.

V. Conclusions

VI. Future Plans

  • Article
  • Other repositories, search terms and databases
  • Compare data with other interns