DataONE:GEO reuse study/Phase 1

From OpenWetWare
Revision as of 16:19, 20 June 2010 by Heather A Piwowar (talk | contribs) (remove brainstorming section, adding it to its own page)

Research Plan


  • Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
  • Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
  • Enumerate the PMC papers that reused GEO data
  • Estimate what percent of these papers depended on the GEO data for their scientific contribution

Query details

  • accession number formats:
    • look at both GSE and GDS accession numbers
    • use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
    • search for the accession number both directly after the prefix and with one space in between, so "GSE7572" and "GSE 7572"
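
The format rules above can be sketched as a small helper that expands one raw ID into every string to search for. This is only an illustration: the function name `accession_variants` is hypothetical, and the stripping rule (a nine-digit ID starting with 200, with zero padding removed) is an assumption generalized from the single example 200007572.

```python
def accession_variants(uid: str, prefix: str = "GSE") -> list:
    """Return the search strings for one GEO identifier.

    Assumption: a raw ID is nine digits and starts with "200"; the
    stripped form drops that prefix and any zero padding after it.
    Anything else is treated as already stripped.
    """
    if len(uid) == 9 and uid.startswith("200"):
        stripped = uid[3:].lstrip("0")
    else:
        stripped = uid
    variants = [
        uid,                      # raw ID, e.g. 200007572
        stripped,                 # stripped ID, e.g. 7572
        f"{prefix}{stripped}",    # prefix right beside the number
        f"{prefix} {stripped}",   # prefix, one space, number
    ]
    # Drop duplicates while keeping the order above.
    return list(dict.fromkeys(variants))


print(accession_variants("200007572"))
```

The same call with `prefix="GDS"` would give the GDS forms, if the stripping assumption also holds for those IDs.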

Exclude data creation studies

  • spot-check to make sure each accession number appears in the context of reuse... looks like there may be a few mentions in the context of deposit where the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
    • do this for all the PMC article hits? a few appear to be missing the filter, and that matters because counting them would erroneously inflate our reuse estimate
    • could use query from my BioLink paper:
 (geo OR omnibus) 
 AND microarray 
 AND "gene expression"       
 AND accession
 NOT (databases 
        OR user OR users
        OR (public AND accessed) 
        OR (downloaded AND published)) 
    • or the more simple:
 "gene expression omnibus" AND (submitted OR deposited) 
    • to do this transparently, query PMC results for each of these words:
      • submitted
      • deposited
      • user*
      • public
      • accessed
      • downloaded
      • published
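
One way to keep the per-word filtering transparent is to build a separate query for each word, so each count can be reported on its own. The sketch below only constructs the query strings; it does not contact PMC. The name `per_word_queries` and the base query are hypothetical, and the boolean syntax is assumed to match the queries shown above.

```python
# Words taken from the list above.
FILTER_WORDS = [
    "submitted", "deposited", "user*", "public",
    "accessed", "downloaded", "published",
]


def per_word_queries(base_query: str, words=FILTER_WORDS) -> dict:
    """Map each filter word to a query that ANDs it with the base query."""
    return {word: f"({base_query}) AND {word}" for word in words}


# Hypothetical base query for one accession number.
queries = per_word_queries('"GSE7572" OR "GSE 7572"')
for word, query in queries.items():
    print(word, "->", query)
```

Recording the hit count for each of these queries separately would show how much each word contributes to the exclusion.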

Estimate time lag for reuse

To estimate time lag:

  • extract year
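
A minimal sketch of the year extraction and the resulting lag, assuming the submission and publication dates arrive as free-text strings that contain a four-digit year. The function names and the example date formats are assumptions, not something recorded in these notes.

```python
import re


def extract_year(date_text: str):
    """Return the first standalone four-digit year (19xx or 20xx), or None."""
    match = re.search(r"\b(19|20)\d{2}\b", date_text)
    return int(match.group(0)) if match else None


def reuse_lag_years(submission_date: str, publication_date: str):
    """Years between dataset submission and the reusing paper, or None."""
    submitted = extract_year(submission_date)
    published = extract_year(publication_date)
    if submitted is None or published is None:
        return None
    return published - submitted


# Hypothetical date strings, for illustration only.
print(reuse_lag_years("2007/03/15", "2009 Jun 12"))
```

Because only the year is used, the lag is accurate to within about a year in either direction.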

Estimate what percentage of reusers weren't the original authors

  • see whether the articles matched with AND pubmed_gds and the articles matched with NOT pmc_gds have any authors in common? (note: the AND filter should be the pubmed one!)
  • other idea: institution comparison using medline info
  • author overlap is better than comparing submitters, because the submitter is not the whole story
  • author overlap is better than comparing institutions, because the institution is not recorded precisely in the submission
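
The author-overlap idea can be sketched as a set intersection over normalized names. This assumes author names are available as strings with the surname first (for example "Doe, Jane A" or "Doe JA"); the normalization to surname plus first initial is a deliberately crude assumption and will mis-handle multi-word surnames and namesakes.

```python
def normalize_author(name: str) -> str:
    """Reduce 'Doe, Jane A' or 'Doe JA' to 'doe j' (surname + first initial)."""
    parts = name.replace(",", " ").lower().split()
    if not parts:
        return ""
    surname = parts[0]
    initial = parts[1][0] if len(parts) > 1 else ""
    return f"{surname} {initial}".strip()


def shared_authors(creator_authors, reuser_authors) -> set:
    """Normalized names that appear on both the data-creating and reusing paper."""
    creators = {normalize_author(a) for a in creator_authors}
    reusers = {normalize_author(a) for a in reuser_authors}
    return (creators & reusers) - {""}


# Hypothetical author lists, for illustration only.
print(shared_authors(["Doe, Jane A", "Smith J"], ["Doe JA", "Jones K"]))
```

An empty result would suggest the reuse was by a third party, subject to the name-matching caveats above.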

Estimate what percent of reuse created "new science"

  • classify if methods or informatics:
    • journal name has informatics
    • mesh term for methods?
  • look at mesh overlap?
  • look for metaanalysis mesh term?
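
A rough sketch of the first classification idea, assuming each paper comes with a journal name and a list of MeSH terms written in the "Descriptor/qualifier" style. The function name and the matching rules (substring "informatics" in the journal name, "methods" as a qualifier) are assumptions for illustration and would need checking against real records.

```python
def looks_like_methods_paper(journal: str, mesh_terms) -> bool:
    """Flag a paper as methods/informatics by journal name or MeSH qualifier."""
    # Journal name contains "informatics" (also matches "bioinformatics").
    journal_hit = "informatics" in journal.lower()
    # Any MeSH term carries the "methods" qualifier, e.g. ".../methods".
    mesh_hit = any(
        term.lower().split("/")[-1] == "methods"
        for term in mesh_terms
        if "/" in term
    )
    return journal_hit or mesh_hit


# Hypothetical records, for illustration only.
print(looks_like_methods_paper("BMC Bioinformatics", []))
print(looks_like_methods_paper("Cancer Research", ["Neoplasms/genetics"]))
```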

Estimate what percent of these papers depended on the GEO data for their scientific contribution

    • Any good ideas on how to do this efficiently?
      • find those which are/are not in informatics journals
      • that use "methods" MeSH terms
      •  ??

Estimate the fraction of all papers that are in PMC

  • use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
    • restrict from 2007 to 2009
    • result:
 number of articles in PMC:  6311, 
 number of articles in PubMed:  21569, 
 so PMC contains 29.26% of related papers
  • so we should multiply our number of scientific papers by about 3.4 (21569 / 6311) to get an estimate for all of scientific publishing
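
The arithmetic behind the coverage estimate, written out with the two counts reported above. The function name `pmc_scale` is hypothetical; the only inputs are the PMC and PubMed article counts.

```python
def pmc_scale(pmc_count: int, pubmed_count: int):
    """Return (fraction of PubMed papers that are in PMC, scale-up factor)."""
    fraction = pmc_count / pubmed_count
    factor = pubmed_count / pmc_count  # same as 1 / fraction
    return fraction, factor


fraction, factor = pmc_scale(6311, 21569)
print(f"PMC share: {fraction:.2%}")      # 29.26%
print(f"scale-up factor: {factor:.2f}")  # 3.42
```

This scaling assumes reuse is as common among papers outside PMC as among papers inside it, which these notes have not tested.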


  • could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider
  • could use this data to see how many publications use any one dataset
  • can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
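
Counting how many publications use any one dataset could look like the sketch below, assuming the query results have been reduced to (PMCID, accession) pairs. The pair format, the function name, and the example IDs are all made up for illustration.

```python
from collections import Counter


def papers_per_dataset(hits) -> Counter:
    """Count distinct papers per accession from (pmcid, accession) pairs."""
    distinct_pairs = set(hits)  # a paper mentioning a dataset twice counts once
    return Counter(accession for _pmcid, accession in distinct_pairs)


# Made-up hits, for illustration only.
hits = [
    ("PMC1", "GSE7572"),
    ("PMC2", "GSE7572"),
    ("PMC1", "GSE7572"),  # repeated mention in the same paper
    ("PMC3", "GSE1000"),
]
print(papers_per_dataset(hits).most_common())
```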


  • this approach of querying PMC full text is similar (as per correspondence with the GEO team) to the one the GEO team used to compile the 3rd party reuse page:
  • would be nice to figure out how to write all of these columns to google docs directly from code

Open Questions


Important for argument

This is a conservative estimate because:

  • our estimates do not consider reuses after our study timeframe
    • many datasets we are considering will continue to be used in the future... these reuses are not included in our estimate
    • could estimate this impact if we examine data deposited 7 years ago?
  • Many papers not in PubMed Central
    • using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains about 30% of all related papers in PubMed
  • our methods do not find studies that both create and reuse data
    • to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
    • we don't have an estimate of how many such studies there are; getting one would require a manual inventory
  • Many data citations not attributed using accession numbers
    • don't have a good way to estimate this yet
    • would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
    • maybe out-of-scope to get this estimate for this project, just admit it is an underestimate

Less important for argument

  • Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
  • Deposits into PMC not stable over time, distribution may change over time