DataONE:Notebook/Reuse of repository data/2010/06/29

{| width="800"
|-
| style="background-color: #EEE" |[[Image:owwnotebook_icon.png|128px]] Reuse of Repository Data
| style="background-color: #F2F2F2" align="center" |Main project page
|-
| colspan="2" |

Notes for June 29, 2010

 * Goals/tasks for today:
 * Go through previous entries, spreadsheets and data
 * Write initial outline summarizing findings so far based on above review
 * Create PowerPoint presentation based on the outline

Outline for Meeting
I. Motivations and initial questions
 * Data deposit vs. data reuse
 * How difficult is it to find data citations, and why?
 * How do the citations vary across discipline, repository and publication?
 * What is the most common form of citation? Repository name? Data author name? A unique identifier such as a study number or DOI?

II. Methods
 * Initial search process: test searches for TreeBASE, with the resulting sample articles providing study accession numbers and data author names to search for later.
 * Focused search
 * Repositories
 * TreeBASE
 * Pangaea
 * ORNL DAAC


 * Databases
 * ISI Web of Science Cited Reference Search
 * Scirus
 * Google Scholar


 * Limits
 * Date range: 2008-2010
 * Language: English
 * Journal articles only


 * Repository-specific search terms
 * TreeBASE: repository name, study accession number, data author name
 * Pangaea: repository name, DOI prefix, data author name
 * ORNL DAAC: repository name, DOI prefix, data author name, project name (BOREAS, FLUXNET, etc.)


 * Raw data input
 * 1. Search comparison spreadsheet hosted here
 * Search methods, terms, and the datasets used to construct search terms were captured, along with the total number of results and the respective hits and misses.
 * Percentages of hits vs. misses were calculated within the spreadsheet.
 * Reasons for each miss were captured.
 * Reasons for each hit were captured.


 * 2. Shared fields template from Sarah with my input data hosted here
 * Hosts data about individual articles, including DOIs where applicable, along with metadata and hit/miss coding.
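The hit-vs.-miss percentages captured in the search comparison spreadsheet amount to a simple ratio per search. A minimal sketch of that calculation (the search labels and counts below are invented for illustration, not taken from the actual spreadsheet):

```python
# Minimal sketch of the hit/miss percentage calculation done in the
# search comparison spreadsheet. Labels and counts are invented examples.
searches = {
    "TreeBASE via ISI (author name)": {"hits": 12, "misses": 3},
    "Pangaea via Scirus (DOI prefix)": {"hits": 40, "misses": 10},
}

for label, counts in searches.items():
    total = counts["hits"] + counts["misses"]
    pct_hits = 100.0 * counts["hits"] / total if total else 0.0
    print(f"{label}: {total} results, {pct_hits:.1f}% hits")
```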


 * Interpretation
 * Browse through observations made within OpenWetware journal entries
 * Look through Search Comparisons spreadsheet for percentage of hits versus misses as well as the types of hits.

III. Stumbles


 * Finding focus and the difficulty of going beyond the obvious
 * Mention of repository could mean either data was deposited there or downloaded from there.
 * TreeBASE study accession numbers cited in article may have changed over time (from StudyID to LegacyID after study publication).
 * “Pangaea” can refer to either Pangaea.de data repository or the Pangaea supercontinent. How do I exclude these results?
 * Narrowing search terms with Boolean operators or “-” exclusion sometimes returned no results at all, while broadening them again returned too many results to read through manually.
 * Google Scholar does not distinguish between published journal articles and dissertations deposited in academic repositories.
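The Pangaea name collision above can be partly handled by post-filtering retrieved titles and abstracts rather than by exclusion operators alone. A rough sketch (the clue keyword lists are guesses for illustration, not a validated classifier; `10.1594/pangaea` is the repository's DataCite DOI prefix):

```python
# Rough post-filter for the "Pangaea repository vs. Pangaea supercontinent"
# ambiguity noted above. Keyword lists are illustrative guesses only.
REPO_CLUES = ("pangaea.de", "10.1594/pangaea", "data repository")
GEO_CLUES = ("supercontinent", "gondwana", "paleogeography", "rifting")

def looks_like_repository_hit(text):
    """True if the text likely refers to the Pangaea data repository,
    False if it likely refers to the supercontinent, None if unclear."""
    t = text.lower()
    if any(clue in t for clue in REPO_CLUES):
        return True
    if any(clue in t for clue in GEO_CLUES):
        return False
    return None

print(looks_like_repository_hit("Data archived at PANGAEA.de (doi:10.1594/PANGAEA.12345)"))
```

Ambiguous results (those returning None) would still need manual review, but the filter shrinks the pile that broadened searches produce.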


 * “Missing” searches (use the Search Methodology Table as a visual aid in the slideshow)
 * For the sake of thoroughness, I intended to go through each possible search combination.
 * Not all searches worked, and I did not record them in my notebook; however, it is important to record these “failures” for future reference.
 * Also, the above-linked table helped show me that I had missed some possible combinations.
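The missed-combination problem is just the cross product of databases, repositories, and the repository-specific term types listed under Methods. A sketch of generating the full checklist, so untried searches stand out the way the Search Methodology Table made them visible:

```python
# Sketch: enumerate every repository x database x term-type combination,
# mirroring the Search Methodology Table, so skipped searches are visible.
from itertools import product

databases = ["ISI Web of Science", "Scirus", "Google Scholar"]
terms = {
    "TreeBASE": ["repository name", "study accession number", "data author name"],
    "Pangaea": ["repository name", "DOI prefix", "data author name"],
    "ORNL DAAC": ["repository name", "DOI prefix", "data author name", "project name"],
}

checklist = [
    (repo, db, term)
    for repo, repo_terms in terms.items()
    for db, term in product(databases, repo_terms)
]
print(len(checklist))  # 3*3 + 3*3 + 4*3 = 30 combinations
```

Checking each attempted search off this list would have flagged both the unrecorded “failures” and the combinations never tried at all.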


 * “Like trying to find someone on Facebook only knowing their hair color and favorite breakfast cereal.”

IV. Findings by Repository

1. TreeBASE
 * ISI
 * Most effective: Searching for author name and citations of the original article in which the dataset was used.
 * Least effective: Searching by mention of repository name (ISI also did not allow searching by study accession number).
 * Scirus
 * Most effective: Search for mention of TreeBASE with controlled vocabulary.
 * Least effective: Author name or study accession number.
 * Google Scholar
 * Most effective: This should more accurately be called "least ineffective."
 * Least effective: Even with controlled vocabulary, searching by mention of TreeBASE was not helpful; neither was searching by study accession number or data author name.

2. Pangaea
 * ISI
 * Most effective: Search by individual author name and mention of Pangaea in Cited Author or Cited Work fields.
 * Least effective: Some DOIs turned up in search results, but DOIs could not actually be used as a search field.
 * Scirus
 * Most effective: Searching by DOI prefix with "*" wildcard.
 * Least effective: Searching by author name.
 * Google Scholar
 * Most effective: Possibly searching by DOI prefix with controlled vocabulary (not the same controlled vocabulary as used with TreeBASE, however).
 * Least effective: Everything else.

3. ORNL DAAC
 * ISI
 * Most effective: Search by data author name/original publication.
 * Least effective: Once again, DOIs could not be used in search fields.
 * Scirus
 * Most effective: Search by DOI prefix with "*" wildcard and search by author name with ORNL project name (FLUXNET, BOREAS, etc.). Search for mentions of ORNL DAAC also yielded solid hits.
 * Least effective: Surprisingly, none.
 * Google Scholar
 * Most effective: Search for DOI prefix.
 * Least effective: Search for mention of repository name (ORNL DAAC) even with controlled vocabulary.

V. Conclusions

VI. Future Plans


 * Article
 * Other repositories, search terms and databases
 * Compare data with other interns


|}