DataONE:Protocols/Find GEO reuses

=Identify reuses of GEO datasets=

Aim
The aim of this protocol is to collect data on the reuses of datasets in the published literature. This particular protocol focuses on reuses of gene expression microarray datasets stored in NCBI's Gene Expression Omnibus (GEO) repository and tracks reuses attributed through accession numbers within the full text of articles in PubMed Central.

Background
Little research has been done on the patterns and prevalence of data reuse. A few superstar success stories need no analysis: data from GenBank and the Protein Data Bank are reused heavily and successfully. They have generated important science that would not have been possible otherwise.

They are so successful, though, that people discount them as special cases.

So what does the reuse behaviour look like for other datasets?

We don’t know. There have been a few surveys, but they suffer from limited scope and self-reporting biases. Download stats are poorly correlated with perceived value. So let’s track reuse in the published literature.

Unfortunately, there are no well-established attribution formats and standards for data to facilitate the sort of automated citation analysis that bibliometricians perform with journal articles. Following the track of data is difficult in several additional ways: datasets do not have unambiguous identifiers, attribution is often within full text and thus difficult to query across journals and disciplines, and it is difficult to disambiguate the mention of a dataset in the context of reuse from the mention of a dataset deposit.

Restricting our focus to gene expression microarray data helps to address several of these issues. First, most shared gene expression microarray data is shared in one central repository: the NCBI's Gene Expression Omnibus (GEO). It is common practice to refer to datasets by their GEO accession numbers, and the GEO accession numbers have a fairly distinctive format. Furthermore, most creations and reuses of gene expression microarray data in the published literature are indexed by PubMed and are increasingly (as per NIH mandate) available for full-text query in PubMed Central. The coordinated Entrez databases and eUtils web service mean that full text can be queried automatically, links between articles and datasets can be monitored, and standard indexing metadata can be collected. All disciplines should be so lucky.

Below, then, is a protocol for using these resources to collect information on reuse. Please note the limitations section, and contribute if you have other ideas!

Protocol Overview

 * Query GEO for all GDS and GSE accession numbers for datasets deposited within specified date range
 * Determine which GSE accession numbers are contained within which GDS numbers, to estimate total number of data packages
 * Query PubMed Central for each of these accession numbers, using eutils to search full text of papers available through PubMed Central
 * Exclude the PMC papers that created the GEO data, using Entrez links and guided manual inspection

Optionally:
 * Extrapolate to all of PubMed, using yearly proportion of articles with the MeSH term "gene expression profiling" in PMC vs all of PubMed
 * Estimate what percent of reuse papers have authors in common with the corresponding data creation paper, using last names, institutions, and manual inspection
 * Estimate what percent of reuse papers use data for meta-analysis, using MeSH
 * Estimate what percent of reuse papers use data for tool and method validation, using MeSH and journal title keywords
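
The PubMed Central query step in the overview can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this protocol: it only builds the eUtils esearch URL and does not fetch anything. The helper names `build_esearch_url` and `pmc_reuse_query` are made up for this example, and the way the accession number is combined with the pmc_gds[filter] exclusion is an assumption.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=100):
    """Return an esearch URL for one Entrez database and one query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def pmc_reuse_query(accession):
    """Full-text query for one accession, excluding papers linked to GEO.

    Assumption: the exclusion is expressed with NOT pmc_gds[filter],
    the filter named elsewhere in this protocol.
    """
    return f'"{accession}" NOT pmc_gds[filter]'


# Hypothetical usage: one URL per accession number, searched against PMC.
url = build_esearch_url("pmc", pmc_reuse_query("GSE7572"))
print(url)
```

Fetching the URL returns XML listing the matching PMC IDs; parsing and paging through results is left out here.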

Online connection

 * eUtils

Installed software
Python source code used: NOTE: I'm still getting my git together, so the code at the links below may not be fully standalone or easily run by others. I'm working on it... in the meantime, feel free to email me if you want details!
 * eutils
 * pypub
 * geo data collection script

Accession number formats

 * look at both GSE and GDS accession numbers
 * use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
 * search for both accession number right beside the prefix, and with one space in between, so "GSE 7572" and "GSE7572"
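
A rough sketch of how these variants could be generated and recognised. The function names are hypothetical, and the rule for stripping the prefix (a 9-digit ID starting with 2) is an assumption based on the single example 200007572 above.

```python
import re


def strip_raw_id(raw_id):
    """Turn a raw ID such as '200007572' into the bare number '7572'.

    Assumption: raw IDs are 9 digits starting with '2'; anything else
    is returned unchanged.
    """
    if len(raw_id) == 9 and raw_id.isdigit() and raw_id.startswith("2"):
        return str(int(raw_id[1:]))
    return raw_id


def number_forms(raw_id):
    """Both forms of the number to search for: raw and stripped."""
    return [raw_id, strip_raw_id(raw_id)]


def search_terms(prefix, number):
    """Accession written with and without one space after the prefix."""
    return [f"{prefix}{number}", f"{prefix} {number}"]


# Matches GSE or GDS followed by digits, with at most one whitespace between.
ACCESSION_RE = re.compile(r"\b(GSE|GDS)\s?(\d+)\b")


def find_accessions(text):
    """Return normalised accessions (no space) mentioned in a text."""
    return [prefix + digits for prefix, digits in ACCESSION_RE.findall(text)]


print(number_forms("200007572"))
print(search_terms("GSE", strip_raw_id("200007572")))
print(find_accessions("Data from GSE 7572 and GDS1234 were reanalysed."))
```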

Exclude data creation studies
 * spot-check to make sure accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
 * do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
 * could use query from my BioLink paper:
    (geo OR omnibus) AND microarray AND "gene expression" AND accession NOT (databases OR user OR users OR (public AND accessed) OR (downloaded AND published))
 * or the simpler:
    "gene expression omnibus" AND (submitted OR deposited)
 * to do this transparently, query PMC results for each of these words:
 * submitted
 * deposited
 * user*
 * public
 * accessed
 * downloaded
 * published
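
One way to make the word screen transparent is to record which of the listed words each paper contains. The sketch below does this on text already in hand, which is a simplification: the protocol itself queries PMC for each word. `DEPOSIT_WORDS` mirrors the list above, the function names are hypothetical, and treating a trailing * as "any suffix" is an assumption about how the wildcard was meant.

```python
import re

# The screening words listed above; 'user*' keeps its wildcard.
DEPOSIT_WORDS = ["submitted", "deposited", "user*", "public",
                 "accessed", "downloaded", "published"]


def word_pattern(word):
    """Case-insensitive whole-word regex; a trailing * allows any suffix."""
    if word.endswith("*"):
        return re.compile(r"\b" + re.escape(word[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)


def deposit_word_hits(text):
    """Return the screening words found in the text, in list order."""
    return [w for w in DEPOSIT_WORDS if word_pattern(w).search(text)]


# Made-up sentence for illustration only.
sample = "The data were deposited in GEO and are available to users."
print(deposit_word_hits(sample))
```

A paper with hits is only a candidate data creation study; it still needs the spot-check described above.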

Estimate what percentage of reusers weren't the original authors

 * see if AND pubmed_gds and NOT pmc_gds have any author overlaps? (note AND should be pubmed!)
 * other idea: institution comparison using medline info
 * better than submitter, because submitter not the whole story
 * better than institution, because institution not precise in submission

Is the PMC paper by the same investigators as those who originally created the data?
 * first pass: automatically extracted a column that contained the last names at the intersection of the PMC reuse paper and those in the original data-creation paper and those in the GEO submission list
 * if there was a lot of author overlap, coded it as a "CREATOR REUSE" paper
 * also automatically extracted the institution of the PMC reuse paper and the original data-creation paper. If there was overlap and some evidence of author overlap, coded it a "CREATOR REUSE" paper
 * if there was no overlap in author or institution, coded it as NOT a "CREATOR REUSE" paper
 * for ambiguous cases where there was an author in common between the two papers but it was a common name or the corresponding author addresses were different, I manually examined the PMC reuse paper and the data-creation paper to determine whether the common authors had the same initials and institutions. If yes, I coded it as a "CREATOR REUSE" paper, otherwise I coded it as NOT a "CREATOR REUSE" paper
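
The coding rules above could be sketched like this. It is an illustration under stated assumptions, not the code that produced the results: author strings are assumed to look like "Lastname Initials", "a lot of overlap" is arbitrarily taken to mean two or more shared last names (`min_overlap`), institutions are compared as exact strings, and the "MANUAL REVIEW" label is a placeholder for the cases that were inspected by hand.

```python
def last_names(authors):
    """Lower-cased last names from 'Lastname Initials' strings (assumed format)."""
    return {a.split()[0].lower() for a in authors if a.strip()}


def code_reuse(reuse_authors, creator_authors, reuse_inst, creator_inst,
               min_overlap=2):
    """Code one reuse paper against its data-creation paper.

    creator_authors may also include names from the GEO submission list.
    """
    shared = last_names(reuse_authors) & last_names(creator_authors)
    same_inst = reuse_inst.strip().lower() == creator_inst.strip().lower()
    if len(shared) >= min_overlap:
        return "CREATOR REUSE"      # a lot of author overlap
    if shared and same_inst:
        return "CREATOR REUSE"      # institution overlap plus some author overlap
    if not shared and not same_inst:
        return "NOT CREATOR REUSE"  # no overlap in author or institution
    return "MANUAL REVIEW"          # ambiguous: check initials and addresses by hand


# Made-up names and institutions for illustration only.
print(code_reuse(["Smith J", "Lee K", "Wong A"], ["Smith J", "Lee K"], "Univ A", "Univ A"))
print(code_reuse(["Smith J"], ["Smith J", "Lee K"], "Univ A", "Univ B"))
print(code_reuse(["Chen L"], ["Smith J"], "Univ A", "Univ B"))
```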

Extrapolate from PubMed Central to PubMed
 * use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
 * restrict from 2007 to 2009
 * result: number of articles in PMC: 6311, number of articles in PubMed: 21569, so PMC contains 29.26% of related papers
 * so we should multiply our number of scientific papers by about 3.4 (21569/6311) to get an estimate for all of PubMed
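
The arithmetic behind the extrapolation, as a small worked sketch. The two counts are the ones reported in this section; the `extrapolate` helper and its example input of 100 papers are made up for illustration.

```python
# Counts for "gene expression profiling"[mesh], 2007 to 2009, as reported above.
pmc_count = 6311
pubmed_count = 21569

# Share of the relevant PubMed papers that are also in PMC.
pmc_share = pmc_count / pubmed_count

# Factor by which a PMC-based count is scaled up to all of PubMed.
scale_factor = pubmed_count / pmc_count


def extrapolate(pmc_reuse_papers):
    """Scale a count of reuse papers found in PMC up to all of PubMed."""
    return pmc_reuse_papers * scale_factor


print(f"PMC share: {pmc_share:.2%}")
print(f"Scale factor: {scale_factor:.2f}")
print(f"100 PMC reuse papers -> about {extrapolate(100):.0f} in PubMed")
```

This assumes reuse papers are as likely to be in PMC as other gene expression profiling papers, which may not hold.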

Validation

 * To Do: compare results to GEO's list of 3rd party reuses:  http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html

Reuses of ArrayExpress datasets

 * as with GEO datasets, but gather ArrayExpress accession numbers through screen scrape of ArrayExpress website (is there a better way?).
 * used this url: http://www.ebi.ac.uk/microarray-as/ae/browse.html with Display=500 and click "Detailed view" in the header.  Warning, this is slow.
 * most expedient data extraction I could easily figure out: actually copy the raw data from within the frame and paste into a text file
 * didn't use any variants of ArrayExpress accession numbers. A quick Google Scholar exploration suggested that people are pretty consistent with the E-XXXX-nnnn formatting.
 * obviously the "NOT pmc_gds[filter]" isn't going to do much because it captures links between PMC and GEO not ArrayExpress. Left it in there anyway, since a large proportion of ArrayExpress content is pulled from GEO, might exclude some of the data creation articles
 * expect a higher proportion of data creation articles in resulting set, because no very effective automated filter
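
Since only the E-XXXX-nnnn form is searched, a single pattern is enough to pick ArrayExpress accessions out of pasted text. A minimal sketch, assuming the XXXX part is always four capital letters and the number is any run of digits; the accessions in the example sentence are made up.

```python
import re

# E-, four capital letters, -, digits (the E-XXXX-nnnn form noted above).
AE_RE = re.compile(r"\bE-[A-Z]{4}-\d+\b")


def find_arrayexpress_accessions(text):
    """Return the distinct ArrayExpress-style accessions in a text, sorted."""
    return sorted(set(AE_RE.findall(text)))


# Made-up sentence for illustration only.
sample = "We reanalysed E-MEXP-1234 and E-GEOD-7572; E-MEXP-1234 was normalised first."
print(find_arrayexpress_accessions(sample))
```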

Example data
Extracted this raw data, one row for every (GEO accession number:PMCID of paper that includes the accession number) pair:
 * raw data

Potential uses

 * is the PMC paper actually about data sharing into GEO rather than data reuse?
 * is the PMC paper by the same investigators as those who originally created the data?
 * if reuse, is it in the context of developing a method or tool?
 * could use this data to see how many publications use any one dataset
 * could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
 * can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
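
For the "how many publications use any one dataset" question, the raw (accession number, PMCID) pairs can be tallied directly. A small sketch, assuming the rows have already been read into a list of tuples; the function name and the example rows are made up.

```python
from collections import defaultdict


def papers_per_dataset(pairs):
    """Count distinct PMCIDs per accession from (accession, pmcid) pairs."""
    papers = defaultdict(set)
    for accession, pmcid in pairs:
        papers[accession].add(pmcid)  # a set ignores duplicate rows
    return {accession: len(ids) for accession, ids in papers.items()}


# Made-up rows for illustration only.
rows = [
    ("GSE7572", "PMC0000001"),
    ("GSE7572", "PMC0000002"),
    ("GSE7572", "PMC0000002"),
    ("GDS1234", "PMC0000003"),
]
print(papers_per_dataset(rows))
```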

Known uses

 * A first snapshot of this data is included in a manuscript-in-progress
 * What else on OpenWetWare links to this page

Assumptions, Limitations, and Unknowns
This protocol captures a subset of all dataset reuses because of several limitations:
 * Many data citations are attributed without using accession numbers
 * don't have a good way to estimate this yet
 * would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
 * Many papers are not in PubMed Central
 * we can estimate what percentage are and then try to extrapolate to all papers. For example, using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
 * many datasets will continue to be used in the future... these reuses are obviously not counted in our estimate
 * our methods do not find studies that both create and reuse data
 * to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
 * we don't have an estimate of how many this is, would require manual inventory
 * Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)

Furthermore, extrapolations based on this data may be biased:
 * Papers in PubMed Central may not be representative
 * Deposits into PMC not stable over time, distribution may change over time, may be skewed based on open-access uptake or NIH-funding levels in various communities
 * our estimates do not consider reuses after our study timeframe

Open Questions

 * How to efficiently estimate what percent of these papers depended on the GEO data for their scientific contribution?

Possible Enhancements

 * Use Author-ity clusters to disambiguate authors
 * 1) Authority2009 pmid=20072710
 * Keep track of GDS and GSE overlaps

Related references

 * 1) Piwowar-blogGauntlet Piwowar, HA. Studying Reuse Of GEO Datasets In The Published Literature.  Research Remix.  July 5 2010.  blog post
 * 2) Piwowar-AMIA2008 pmid=18998887
 * 3) Piwowar-BioLINK2008 Piwowar, Wendy W Chapman (2008) Linking database submissions to primary citations with PubMed Central. BioLINK 2008, Toronto Canada. Full text

Contact

 * Protocol created by Heather Piwowar. Contact me if you have questions or suggestions!
 * or instead discuss this protocol on the associated talk page.