|
|
| ==Research Plan== | | ==Aim== |
| ===Overview===
| |
| * Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007
| |
| * Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009
| |
| * Enumerate the PMC papers that reused GEO data
| |
| * Estimate what percent of these papers depended on the GEO data for their scientific contribution
| |
|
| |
|
| ===Query details=== | | ==Background== |
| * accession number formats:
| |
| ** look at both GSE and GDS accession numbers
| |
| ** use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
| |
| ** search for the accession number both directly after the prefix and with one space in between, so "GSE7572" and "GSE 7572"
| |
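The accession number formats above can be expanded mechanically. The following is a minimal sketch, not the project's actual script: the function name <code>accession_query_terms</code> is hypothetical, and it assumes that the raw ID is the stripped number with a leading "200" and zero padding, and that both the raw and stripped numbers are searched with each prefix.

```python
def accession_query_terms(raw_id, prefixes=("GSE", "GDS")):
    """Return search strings for one GEO ID in each format listed above."""
    raw_id = str(raw_id)
    # Assumption: raw IDs look like "200" + zero padding + the short number.
    if raw_id.startswith("200"):
        stripped = raw_id[3:].lstrip("0") or raw_id
    else:
        stripped = raw_id
    terms = []
    # dict.fromkeys removes duplicates while keeping order.
    for number in dict.fromkeys([raw_id, stripped]):
        for prefix in prefixes:
            terms.append(f"{prefix}{number}")   # prefix right beside the number
            terms.append(f"{prefix} {number}")  # one space in between
    return terms


terms = accession_query_terms("200007572")
print(terms)
```

Each returned string would then be submitted as a separate full-text search term.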
|
| |
|
| ===Exclude data creation studies=== | | ==Methods== |
| * spot-check to make sure accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644) | | ===Overview=== |
| ** do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate | | * Using the method outlined at [[DataONE:Protocols/Find_GEO_reuses]]: |
| ** could use query from my BioLink paper:
| | ** Query GEO for all GSE and GDS accession numbers for datasets submitted in 2007 |
| (geo OR omnibus)
| | ** Query PubMed Central for these accession numbers in the full text of PMC papers published between 1900 and 2009 |
| AND microarray
| | ** Enumerate the PMC papers that reused GEO data |
| AND "gene expression"
| | ** Estimate what percent of these papers depended on the GEO data for their scientific contribution |
| AND accession
| |
| NOT (databases
| |
| OR user OR users
| |
| OR (public AND accessed)
| |
| OR (downloaded AND published))
| |
| ** or the simpler:
| |
| "gene expression omnibus" AND (submitted OR deposited)
| |
| ** to do this transparently, query PMC results for each of these words: | |
| *** submitted
| |
| *** deposited
| |
| *** user*
| |
| *** public
| |
| *** accessed
| |
| *** downloaded
| |
| *** published
| |
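One way to make the per-word filtering transparent is to build a separate query for each word and record its hits in its own column. This is only a sketch: <code>per_word_queries</code> is a hypothetical helper, and combining each word with a quoted accession term via AND is an assumption about how the queries would be phrased.

```python
# The filter words listed above, in the same order.
FILTER_WORDS = [
    "submitted", "deposited", "user*", "public",
    "accessed", "downloaded", "published",
]


def per_word_queries(accession_term, words=FILTER_WORDS):
    """Map each filter word to a query that pairs it with one accession term."""
    return {word: f'"{accession_term}" AND {word}' for word in words}


for word, query in per_word_queries("GSE7572").items():
    print(word, "->", query)
```

Keeping one query per word means the final exclusion rule can be changed later without rerunning the searches.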
| | |
| ===Estimate time lag for reuse===
| |
| To estimate time lag:
| |
| * extract year
| |
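A rough sketch of the year extraction and lag calculation follows. It assumes the publication date arrives as free text containing a four-digit year (for example "2009 Mar 15") and that every dataset was submitted in 2007, as in the overview; the function names are hypothetical.

```python
import re

SUBMISSION_YEAR = 2007  # all datasets in this study were submitted in 2007


def extract_year(date_text):
    """Return the first four-digit year (19xx or 20xx) in the text, or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", date_text)
    return int(match.group(0)) if match else None


def reuse_lag_years(pub_date_text, submission_year=SUBMISSION_YEAR):
    """Years between dataset submission and the reusing publication."""
    year = extract_year(pub_date_text)
    return None if year is None else year - submission_year


print(reuse_lag_years("2009 Mar 15"))
```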
| | |
| ===Estimate what percentage of reusers weren't the original authors===
| |
| * see if AND pubmed_gds and NOT pmc_gds have any author overlaps? (note AND should be pubmed!)
| |
| * other idea: institution comparison using medline info | |
| * better than submitter, because submitter not the whole story | |
| * better than institution, because institution not precise in submission
| |
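The author-overlap check could be sketched as a set intersection over normalized names. This assumes author lists are already available as plain strings for both the data-creation paper and the PMC paper; the helper names are hypothetical, and the normalization is deliberately crude (it will not reconcile "Smith J" with "Smith John").

```python
def normalize(name):
    """Lowercase a name and drop periods, commas, and extra whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def shared_authors(creator_authors, paper_authors):
    """Return the normalized names that appear in both author lists."""
    creators = {normalize(a) for a in creator_authors}
    reusers = {normalize(a) for a in paper_authors}
    return creators & reusers


# An empty result suggests reuse by investigators other than the creators.
print(shared_authors(["Smith J", "Lee K."], ["smith j", "Chen L"]))
```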
| | |
| ===Estimate what percent of reuse created "new science"===
| |
| * classify if methods or informatics:
| |
| ** journal name has informatics
| |
| ** mesh term for methods?
| |
| * look at mesh overlap? | |
| * look for metaanalysis mesh term? | |
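The classification ideas above could be prototyped as a simple heuristic. This is a sketch under assumptions: it treats a paper as methods or informatics if its journal name contains "informatics" or any of its MeSH terms contains the word "methods". Whether a suitable MeSH term exists is still an open question in these notes, so the second test is a placeholder.

```python
def is_methods_or_informatics(journal_name, mesh_terms):
    """Heuristic flag: informatics journal name or a 'methods' MeSH term."""
    if "informatics" in journal_name.lower():
        return True
    # Assumption: methods-oriented papers carry "methods" somewhere in a term.
    return any("methods" in term.lower() for term in mesh_terms)


print(is_methods_or_informatics("BMC Bioinformatics", []))
print(is_methods_or_informatics("Cell", ["Neoplasms/genetics"]))
```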
| | |
| ===Estimate what percent of these papers depended on the GEO data for their scientific contribution===
| |
| ** Any good ideas on how to do this efficiently?
| |
| *** find those which are/are not in informatics journals
| |
| *** that use "methods" MeSH terms
| |
| *** ??
| |
| | |
| ===Estimate the fraction of all papers that are in PMC===
| |
| * use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
| |
| ** restrict from 2007 to 2009
| |
| ** result:
| |
| number of articles in PMC: 6311,
| |
| number of articles in PubMed: 21569,
| |
| so PMC contains 29.26% of related papers
| |
| * so we should multiply our number of scientific papers by about 3.4 (21569 / 6311) to estimate the total for all of scientific publishing
| |
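The arithmetic behind the PMC coverage estimate can be checked directly from the two counts reported above. The variable names are illustrative only; the scaling factor printed here is slightly more precise than the rounded figure in the text.

```python
pmc_count = 6311      # "gene expression profiling"[mesh] articles in PMC, 2007-2009
pubmed_count = 21569  # same query in PubMed, 2007-2009

pmc_fraction = pmc_count / pubmed_count    # share of related papers that are in PMC
scale_factor = pubmed_count / pmc_count    # multiplier from PMC counts to all papers

print(f"PMC share: {pmc_fraction:.2%}")
print(f"Scaling factor: {scale_factor:.2f}")
```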
| | |
| ==Limitations==
| |
| ===Important for argument===
| |
| This is a conservative estimate because:
| |
| * our estimates do not consider reuses after our study timeframe
| |
| ** many datasets we are considering will continue to be used in the future... these reuses are obviously not counted in our estimate
| |
| ** could estimate this impact if we examine data deposited 7 years ago?
| |
| * Many papers not in PubMed Central
| |
| ** using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
| |
| * our methods do not find studies that both create and reuse data
| |
| ** to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
| |
| ** we don't have an estimate of how many this is, would require manual inventory
| |
| * Many data citations not attributed using accession numbers
| |
| ** don't have a good way to estimate this yet
| |
| ** would require a manual inventory, similar to Sarah's data citation inventory in DataONE summer 2010 project
| |
| ** maybe out-of-scope to get this estimate for this project, just admit it is an underestimate
| |
| | |
| ===Less important for argument===
| |
| * Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)
| |
| * Deposits into PMC not stable over time, distribution may change over time
| |
| | |
| ==Additional uses for this data collection==
| |
| * could use this data to see how many publications use any one dataset
| |
| * could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
| |
| * can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
| |
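Counting how many publications use any one dataset is a straightforward grouping over the (accession number, PMCID) pairs. The sketch below assumes the raw data has been loaded as a list of two-element tuples; the function name is hypothetical and the sample rows are made-up placeholders, not values from the spreadsheet.

```python
from collections import defaultdict


def publications_per_dataset(rows):
    """Count distinct PMCIDs for each accession number."""
    papers = defaultdict(set)
    for accession, pmcid in rows:
        papers[accession].add(pmcid)  # a set ignores repeated pairs
    return {accession: len(ids) for accession, ids in papers.items()}


# Placeholder rows for illustration only.
sample_rows = [
    ("GSE0001", "PMC0000001"),
    ("GSE0001", "PMC0000002"),
    ("GSE0001", "PMC0000002"),
    ("GDS0002", "PMC0000003"),
]
print(publications_per_dataset(sample_rows))
```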
| | |
| ==Open Questions==
| |
| None right now
| |
| | |
| ==Notes==
| |
| * this approach of querying PMC full text is similar (as per correspondence with the GEO team) to the one the GEO team uses to compile the 3rd party reuse page: http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html
| |
| * would be nice to figure out how to write all of these columns to google docs directly from code
| |
| | |
| ==Early results==
| |
|
| |
|
| ===Data collection=== | | ===Details=== |
| Used python source code:
| | * see [[DataONE:Protocols/Find_GEO_reuses]] |
| *[http://github.com/hpiwowar/eutils eutils] | |
| *[http://github.com/hpiwowar/pypub pypub]
| |
| *[http://gist.github.com/448371 geo data collection script]
| |
|
| |
|
| Extracted this raw data, one row for every (GEO accession number:PMCID of paper that includes the accession number) pair:
| | ==Results== |
| * [http://spreadsheets.google.com/ccc?key=0Ai0SDlWE5_VYdHRDN0Q2WTV5T0RzM0dYME5OS09IRlE&hl=en raw data]
| |
|
| |
|
| Now adding derived columns to characterize each row:
| | ==Discussion== |
| * is the PMC paper actually about data sharing into GEO rather than data reuse?
| |
| * if reuse, is the PMC paper by the same investigators as those who originally created the data?
| |
| * if reuse, is it in the context of developing a method or tool?
| |