Revision as of 13:29, 14 July 2010

Identify reuses of GEO datasets

Aim

To collect data on the reuse of datasets in the published literature. This proposal focuses on reuses of gene expression microarray datasets stored in NCBI's Gene Expression Omnibus (GEO) repository and tracks reuses attributed through accession numbers.

Background

Little research has been done on the patterns and prevalence of data reuse. A few superstar success stories need no analysis: Data from Genbank and the Protein Data Bank are reused, heavily, successfully. They have generated important science that would not have been possible otherwise.

They are so successful, though, that people discount them as special cases.

So what does the reuse behaviour look like for other datasets?

We don’t know. It is difficult to track reuse. There have been a few surveys, but they suffer from limited scope and self-reporting biases. I gather that download stats are poorly correlated with perceived value. So let’s track reuse in the published literature.

Protocol Overview

Query GEO for all GSE and GDS accession numbers for datasets
Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
Enumerate the PMC papers that reused GEO data
Estimate what percent of these papers depended on the GEO data for their scientific contribution
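
The enumeration step can be sketched with plain set operations. This is only an illustration, not the protocol's actual code: the function names and the shape of the inputs (a dict of accession number to PMC IDs, plus a set of PMC IDs for data creation papers) are assumptions, and the IDs in the usage note are made up.

```python
def enumerate_reuses(hits_by_accession, data_creation_ids):
    """Map each accession to the PMC papers that mention it without creating it.

    hits_by_accession: dict of accession -> iterable of PMC IDs whose full
        text mentions that accession (assumed input shape).
    data_creation_ids: PMC IDs of papers identified as depositing data.
    """
    creators = set(data_creation_ids)
    reuses = {}
    for accession, pmcids in hits_by_accession.items():
        reusing = set(pmcids) - creators  # drop data creation papers
        if reusing:
            reuses[accession] = reusing
    return reuses


def reusing_papers(reuses):
    """Return the distinct PMC IDs that reuse at least one dataset."""
    return set().union(*reuses.values())
```

For example, with hypothetical hits `{"GSE7572": {"A", "B"}, "GDS100": {"B", "C"}}` and data creation papers `{"A"}`, `reusing_papers` returns the two papers `B` and `C`.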

Assumptions

Materials

Online connection

eUtils

Installed software

python
- eutils library <<link here to github>>
- pypub library <<link here to github>>

Procedure

Accession number formats

look at both GSE and GDS accession numbers
use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
search for the number both directly after the prefix and with one space in between, so "GSE7572" and "GSE 7572"
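
A minimal sketch of generating these search variants follows. It assumes the raw ID is nine digits, with a leading "2" followed by the zero-padded accession number (as in 200007572 for 7572). The function names are hypothetical, and attaching the prefix to the raw ID as well as the stripped number is a guess at what the search should cover.

```python
def stripped_number(raw_id):
    """Return the accession number without the leading 200... prefix.

    Assumption: a nine-digit ID starting with "2" is a raw ID whose
    remaining digits are the zero-padded accession number.
    """
    raw = str(raw_id)
    if len(raw) == 9 and raw.startswith("2"):
        return str(int(raw[1:]))
    return str(int(raw))


def search_terms(raw_id, prefix):
    """List the strings to search for, given one raw ID and a prefix."""
    raw = str(raw_id)
    numbers = dict.fromkeys((raw, stripped_number(raw)))  # ordered, no repeats
    terms = []
    for number in numbers:
        terms.append(f"{prefix}{number}")   # number right beside the prefix
        terms.append(f"{prefix} {number}")  # one space in between
    return terms
```

Calling `search_terms("200007572", "GSE")` yields the raw and stripped forms, each with and without the space; the same call with `"GDS"` covers the other prefix.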

Exclude data creation studies

spot-check to make sure the accession number appears in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
- do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
- could use query from my BioLINK paper:

 (geo OR omnibus) 
 AND microarray 
 AND "gene expression"       
 AND accession
 NOT (databases 
        OR user OR users
        OR (public AND accessed) 
        OR (downloaded AND published))

- or the simpler:

 "gene expression omnibus" AND (submitted OR deposited)

- to do this transparently, query PMC results for each of these words:
  - submitted
  - deposited
  - user*
  - public
  - accessed
  - downloaded
  - published
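
One way to run these per-word checks is to build one E-utilities ESearch request per context word and compare the hit counts. The sketch below only constructs the query strings and URLs; it does not send any requests. The helper names are hypothetical, and combining the base query with each word via AND is an assumption about how the check would be phrased.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Context words listed in the protocol.
CONTEXT_WORDS = ["submitted", "deposited", "user*", "public",
                 "accessed", "downloaded", "published"]


def context_queries(base_query, words=CONTEXT_WORDS):
    """Return one query per context word, each narrowing the base query."""
    return {word: f"({base_query}) AND {word}" for word in words}


def esearch_url(term, db="pmc"):
    """Build an ESearch URL that asks only for the number of hits."""
    params = {"db": db, "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)
```

For instance, `context_queries('"GSE7572" OR "GSE 7572"')["deposited"]` gives `("GSE7572" OR "GSE 7572") AND deposited`, which can be passed to `esearch_url` and fetched with any HTTP client.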

Estimate the fraction of all papers that are in PMC

use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
- restrict from 2007 to 2009
- result:

 number of articles in PMC:  6311, 
 number of articles in PubMed:  21569, 
 so PMC contains 29.26% of related papers

so we should multiply our number of scientific papers by about 3.4 (21569 / 6311) to get an estimate for all of scientific publishing
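
The arithmetic behind this scaling can be written out as a short sketch. The two counts come from the result above; the function name and the example count of 100 papers are purely illustrative.

```python
PMC_ARTICLES = 6311      # "gene expression profiling"[mesh] hits in PMC, 2007-2009
PUBMED_ARTICLES = 21569  # same query in PubMed, 2007-2009

pmc_fraction = PMC_ARTICLES / PUBMED_ARTICLES   # share of related papers in PMC
scale_factor = PUBMED_ARTICLES / PMC_ARTICLES   # multiplier to extrapolate


def extrapolate(pmc_reuse_count):
    """Scale a count of reuse papers found in PMC up to all of PubMed.

    Assumes reuse papers are in PMC at the same rate as the papers
    matched by the MeSH query.
    """
    return pmc_reuse_count * scale_factor


print(f"PMC share: {pmc_fraction:.2%}")       # PMC share: 29.26%
print(f"Scale factor: {scale_factor:.2f}")    # Scale factor: 3.42
```

Under this assumption, a hypothetical 100 reuse papers found in PMC would extrapolate to roughly 342 papers overall.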

Validation

To Do: compare results to GEO's list of 3rd party reuses: http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html

Limitations

This is a conservative estimate because:

our estimates do not consider reuses after our study timeframe
- many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
- could estimate this impact if we examine data deposited 7 years ago?
Many papers not in PubMed Central
- using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
our methods do not find studies that both create and reuse data
- to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
- we don't have an estimate of how many this is, would require manual inventory
Many data citations not attributed using accession numbers
- don't have a good way to estimate this yet
- would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
- maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
Deposits into PMC not stable over time, distribution may change over time

Application

Potential uses

could use this data to see how many publications use any one dataset
could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year

Known uses

draft manuscript

Possible Enhancements

Use Author-ity clusters to disambiguate authors

Torvik VI and Smalheiser NR. Author Name Disambiguation in MEDLINE. ACM Trans Knowl Discov Data. 2009 Jul 1;3(3). DOI:10.1145/1552303.1552304 | PubMed ID:20072710 | HubMed [Authority2009]

Keep track of GDS and GSE overlaps

Notes

Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input! Please sign your name to your note by adding '''*~~~~''': to the beginning of your tip.

List troubleshooting tips here.
Anecdotal observations that might be of use to others can also be posted here.

Related references

blog intro
[Piwowar-blogGauntlet]
Piwowar HA and Chapman WW. Identifying data sharing in biomedical literature. AMIA Annu Symp Proc. 2008 Nov 6;2008:596-600. PubMed ID:18998887 | HubMed [Piwowar-AMIA2008]
BioLINK paper
[Piwowar-AMIA2008]

Contact

Protocol created by Heather Piwowar. Contact me if you have questions or suggestions!
or instead discuss this protocol on the associated talk page.

