DataONE:Protocols/Find GEO reuses: Difference between revisions

From OpenWetWare
(start with protocol template content)
 
(add ArrayExpress variant)
 
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
<center>
=Identify reuses of GEO datasets=
{|style="width: 48em; background: #FF6677;"
| align="center"|'''This page is a template and should not be edited.'''<br><span style="font-size:90%">Click [http://openwetware.org/index.php?title={{PAGENAME}}&action=edit here], copy the source, and paste it into your page.</span>
|}
</center>


==Aim==
The aim of this protocol is to collect data on the reuses of datasets in the published literature.  This particular protocol focuses on reuses of gene expression microarray datasets stored in NCBI's Gene Expression Omnibus (GEO) repository and tracks reuses attributed through accession numbers within the full text of articles in PubMed Central.


==Background==
Little research has been done on the patterns and prevalence of data reuse. A few superstar success stories need no analysis: Data from Genbank and the Protein Data Bank are reused, heavily, successfully. They have generated important science that would not have been possible otherwise.


They are so successful, though, that people discount them as special cases.


So what does the reuse behaviour look like for other datasets?


'''Interested in posting a protocol on OpenWetWare? Here is a template to help you do so.''' 
We don’t know. There have been a few surveys, but they suffer from limited scope and self-reporting biases. Download stats are poorly correlated with perceived value <<citation?>>. So let’s track reuse in the published literature.


'''Click the view source tab and copy everything below this line.  Paste it into your new protocol page. Then replace the text in this page with your own protocol. Feel free to add or delete sections as appropriate.'''
Unfortunately, there are no well-established attribution formats and standards for data to facilitate the sort of automated citation analysis that bibliometricians perform with journal articles. Following the track of data is difficult in several additional ways: datasets do not have unambiguous identifiers, attribution is often within full text and thus difficult to query across journals and disciplines, and it is difficult to disambiguate the mention of a dataset in the context of reuse from the mention of a dataset deposit.
[[Category:Template]]
<!-- COPY EVERYTHING BELOW HERE TO START YOUR OWN PROTOCOL!  -->


==Overview==
Restricting our focus to gene expression microarray data helps to address several of these issues.  First, most shared gene expression microarray data is shared in one central repository:  the NCBI's Gene Expression Omnibus (GEO).  It is common practice to refer to datasets by their GEO accession numbers, and the GEO accession numbers have a fairly distinctive format.  Furthermore, most creations and reuses of gene expression microarray data in the published literature are indexed by PubMed and are increasingly (as per NIH mandate) available for full-text query in PubMed Central.  The coordinated Entrez databases and eUtils web service mean that full text can be queried automatically, links between articles and datasets can be monitored, and standard indexing metadata can be collected.  All disciplines should be so lucky.


This is a protocol designed for T-regulatory assay for Mouse allergen (Mus m1) study.
Below, then, is a protocol for using these resources to collect information on reuse. Please note the limitations section, and contribute if you have other ideas!


==Materials==
==Protocol Overview==
*List reagents, supplies and equipment necessary to perform the protocol here. 
*Query GEO for all GDS and GSE accession numbers for datasets deposited within specified date range
{| border="1" cellpadding="4" cellspacing="0" style="border:#c9c9c9 1px solid; margin: 1em 1em 1em 0; border-collapse: collapse; width:710px" <!--
*Determine which GSE accession numbers fall within which GDS numbers, to estimate the total number of data packages
*Query PubMed Central for each of these accession numbers, using eutils to search full text of papers available through PubMed Central
*Exclude the PMC papers that created the GEO data, using Entrez links and guided manual inspection


This line here formats your table for you.  Change the code to change the formatting of your table.-->
Optionally:
| align="center" style="background:#f0f0f0;"|'''Isolation of Mononuclear Cells'''
*Extrapolate to all of PubMed, using yearly proportion of articles with the MeSH term "gene expression profiling" in PMC vs all of PubMed
| align="center" style="background:#f0f0f0;"|'''PBMC Antigen Stimulation Assay'''
*Estimate what percent of reuse papers have authors in common with the corresponding data creation paper, using last names, institutions, and manual inspection
| align="center" style="background:#f0f0f0;"|'''Collection of Supernatant'''
*Estimate what percent of reuse papers use data for metaanalysis, using MeSH
| align="center" style="background:#f0f0f0;"|'''T-Regulatory Cells Surface Staining'''
*Estimate what percent of reuse papers use data for tool and method validation, using MeSH and journal title keywords
<!-- Copy and paste one of the lines above to create a new column in the schedule table.  Alternatively, you can also delete lines to reduce the number of columns.-->
|--
|On average, 18mL of whole heparinized blood/patient will be given.
|8 x 10^6 (8 million) cells for 7 day culture (for measurement of frequency of PBMC-derived, Mus M1-specific CD4+ CD25+ FoxP3+ T cells)
|Sterile 5mL polystyrene round-bottom tubes.
|Phosphate buffered saline (PBS)
|--
|--
|Ficoll Paque Plus [Endotoxin tested], room temperature.
|CFSE (5μM aliquots).
|Staining buffer (PBS + 5g BSA/L + 2mM EDTA)
|Staining buffer (PBS + 5g BSA/L + 2mM EDTA)
|--
|--
|Sterile phosphate buffered saline (PBS), room temperature
|AIM-V Media [add 500μL Amphotericin (Fungizone), 5mL PS (PCN streptomycin), and 5mL Glutamine].
|Cluster tubes or Cryovial tubes and racks.
|Cell separation buffer (10X stock; prepare 1X using distilled water)
|--
|--
|0.2% and 0.4% Trypan solution.
|IL-2 (approximately 20 units/μL; 1 μg = 2.4x10^3 units) [To make a 10μg vial of IL-2, reconstitute in 1mL sterile PBS + 1% human AB serum to get a final concentration of 10 μg/mL. Place in aliquots of 50 μL and freeze at -80°].
|
|Cocktail preparation [need 50μL of cocktail/flow tube; volume of mAb needed per flow tube: CD25-PCy5 (10uL), CD4-PC7 (5uL), CD3-APC7 (5uL), Aqua live/ dead (2μL),CD127-PE (1uL); add staining buffer to “cocktail tube” to reach a total volume of 50μl per flow tube. Vortex. Store in fridge (light-sensitive) until ready for use]
|--
|--
|Sterile conical tubes (15mL, 50mL)
|CD3, CD28 expander beads
|
|1X Facs Lysis Buffer
|--
|--
|Sterile, graduated transfer pipettes.
|Mus M1 (50µl) aliquot tube.
|
|A solution of 20% DMSO + PBS.
|--
|--
|Hemocytometer or disposable slides for the automated counter or manual counter
|Tetanus Toxoid (stock 2 mg/mL).
|
|Fix/Perm Concentrate
|--
|--
|Accuspin tubes
|24 well tissue culture plates.
|
|Fix/Perm Diluent
|--
|--
|
|Sterile 5mL polypropylene round-bottom tubes
|
|FoxP3-Alexa 647  Antibody.
|--
|--
|
|
|
|15mL conical tubes.
|--
|--
|
|
|
|5mL polystyrene tubes
|--
<!-- To add another row to the table copy and paste everything from the |-- line to just above this line.-->
|}


=='''Procedure'''==
==Materials==
===Online connection===
* eUtils


===Installed software===
Used python source code:
*[http://github.com/hpiwowar/eutils eutils]
*[http://github.com/hpiwowar/pypub pypub]
*[http://gist.github.com/448371 geo data collection script]
NOTE:  I'm still getting my git together, so the code at the above links may not be fully standalone or easily run by others.  I'm working on it... in the meantime, feel free to email me if you want details!


==Procedure==


==Isolation of Mononuclear Cells==
===Summary===


* This is a sterile procedure and all steps should be performed in a hood.
===Accession number formats===
*look at both GSE and GDS accession numbers
*use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
*search for both accession number right beside the prefix, and with one space in between, so "GSE 7572" and "GSE7572"


# Turn on the hood. Bring Ficoll and PBS to room temperature in the hood.
===Exclude data creation studies===
* spot-check to make sure accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
** do this for all the PMC article hits?  looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
** could use query from my BioLink paper: 
  (geo OR omnibus)
  AND microarray
  AND "gene expression"     
  AND accession
  NOT (databases
        OR user OR users
        OR (public AND accessed)
        OR (downloaded AND published))
** or the more simple: 
  "gene expression omnibus" AND (submitted OR deposited)
** to do this transparently, query PMC results for each of these words:
*** submitted
*** deposited
*** user*
*** public
*** accessed
*** downloaded
*** published


# Obtain whole blood specimens collected in sodium heparin (green top) collection tubes and record subject information, i.e. ID #, date collected, date received.
===Estimate what percentage of reusers weren't the original authors===
* see if AND pubmed_gds and NOT pmc_gds have any author overlaps?  (note AND should be pubmed!)
* other idea:  institution comparison using medline info
* better than submitter, because submitter not the whole story
* better than institution, because institution not precise in submission


# If performing Basophil Activation assay on this sample, set aside 3mL of whole blood.
Is the PMC paper by the same investigators as those who originally created the data?
* first pass:  automatedly extracted a column that contained the last names at the intersection of the PMC reuse paper and those in the original data-creation paper and those in the GEO submission list
* if there was a lot of author overlap, coded it as a "CREATOR REUSE" paper
* also automatedly extracted the institution of the PMC reuse paper and the original data-creation paper.  If there was overlap and some evidence of author overlap, coded it a "CREATOR REUSE" paper
* if there was no overlap in author or institution, coded it as NOT a "CREATOR REUSE" paper
* for ambiguous cases where there was an author in common between the two papers but it was a common name or the corresponding author addresses were different, I manually examined the PMC reuse paper and the data-creation paper to determine whether the common authors had the same initials and institutions. If yes, I coded it as a "CREATOR REUSE" paper, otherwise I coded it as NOT a "CREATOR REUSE" paper


# Dilute remaining blood at 1:1 with PBS in 50 mL conical tubes, adding PBS first followed by blood.
===Extrapolate from PubMed Central to PubMed===
* use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
** restrict from 2007 to 2009
** result: 
  number of articles in PMC: 6311,
  number of articles in PubMed:  21569,  
  so PMC contains 29.26% of related papers
* so we should multiply our number of scientific papers by about 3 to get estimate for all of scientific publishing


# Place 15 mL of Ficoll in a 50 mL conical tube. Overlay with up to 30mL of diluted blood, adding it very slowly to make sure that the blood doesn’t mix with the Ficoll layer.
==Validation==
* To Do:  compare results to GEO's list of 3rd party reuses:  http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html


# Centrifuge at 500 g for 30 minutes at room temperature (slow acceleration, deceleration off to ensure no disruption of the density gradient).
== Variants ==
===Reuses of ArrayExpress datasets===
* as with GEO datasets, but gather ArrayExpress accession numbers through screen scrape of ArrayExpress website (is there a better way?).
** used this url:  http://www.ebi.ac.uk/microarray-as/ae/browse.html with Display=500 and click "Detailed view" in the header.  Warning, this is slow.
** most expedient data extraction I could easily figure out: actually copy the raw data from within the frame and paste into a text file
* didn't use any variants of ArrayExpress accession numbers.  A quick google scholar exploration suggested that people are pretty consistent with the E-XXXX-nnnn formatting.
* obviously the "NOT pmc_gds[filter]" isn't going to do much because it captures links between PMC and GEO not ArrayExpress.  Left it in there anyway, since a large proportion of ArrayExpress content is pulled from GEO, might exclude some of the data creation articles
* expect a higher proportion of data creation articles in resulting set, because no very effective automated filter


# Using a sterile transfer pipette, aspirate the buffy coat (peripheral blood mononuclear cells [PBMCs]) into a new 50 mL conical tube. (Avoid aspirating the Ficoll.) Add PBS to bring to a minimum of 2x the volume, inverting up and down to mix.
==Application==
===Example data===
Extracted this raw data, one row for every (GEO accession number:PMCID of paper that includes the accession number) pair:
* [http://spreadsheets.google.com/ccc?key=0Ai0SDlWE5_VYdHRDN0Q2WTV5T0RzM0dYME5OS09IRlE&hl=en raw data]


# Centrifuge at 500 g for 20 minutes at room temperature (maximum acceleration and deceleration).
===Potential uses===
* is the PMC paper actually about data sharing into GEO rather than data reuse?
* is the PMC paper by the same investigators as those who originally created the data?
* if reuse, is it in the context of developing a method or tool?
* could use this data to see how many publications use any one dataset
* could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
* can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year


# Aspirate and discard the supernatant. Resuspend the cell pellet first by tapping the tube until no clumps are visible, then adding 1 mL of PBS. Set aside a 10 µL aliquot of cells for counting as follows: place the 10 µL of cells into 90 µL of PBS in a small sterile eppendorf tube, and add 100 µL of Trypan Blue (0.2%) into the eppendorf tube.  Mix well, and let it sit for 1 minute before counting.
===Known uses===
* A [[DataONE:GEO_reuse_study/Phase_1|first snapshot]] of this data is included in a manuscript-in-progress
* [[Special:WhatLinksHere/DataONE:Protocols/Find_GEO_reuses|What else on OpenWetWare links to this page]]


# Add 19 mL of PBS to the cells in 990 µL to make a total volume of 19.99 mL and centrifuge at 300 g for 15 minutes at room temperature (maximum acceleration and deceleration).
==Assumptions, Limitations, and Unknowns==
This protocol captures a subset of all dataset reuses because of several limitations:
* Many data citations are attributed without using accession numbers
** don't have a good way to estimate this yet
** would require a manual inventory, similar to [[User:Sarah_Judson/Notebook/DataOne_DataCitationPractices|Sarah's data citation inventory in DataONE summer 2010 project]]
* Many papers are not in PubMed Central
** we can estimate what percentage are and then try to extrapolate to all papers.  For example, using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
** many datasets will continue to be used in the future... these reuses are obviously not counted in our estimate
* our methods do not find studies that both create and reuse data
** to narrow down our query results, we automatedly eliminate studies that create data... even though these same studies may also reuse data
** we don't have an estimate of how many this is, would require manual inventory
* Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)


# In order to determine the volume to use for resuspending the PBMCs after the wash, the total number of cells in the sample must be determined.
Furthermore, extrapolations based on this data may be biased:
* Papers in PubMed Central may not be representative
* Deposits into PMC not stable over time, distribution may change over time, may be skewed based on open-access uptake or NIH-funding levels in various communities
*  our estimates do not consider reuses after our study timeframe


# Carefully introduce 10 µL of the stained cells into the notch of a hemocytometer and record cell counts using a hand-held counter. Calculate the number of cells, taking into account the dilution factors and sample volume used. Or place 20 µL of the stained cells onto a disposable slide and count using the automated cell counter.
===Open Questions===
* How to efficiently estimate what percent of these papers depended on the GEO data for their scientific contribution?


# After centrifugation is completed, aspirate and discard the supernatant. Resuspend the cell pellet by tapping the tube until no clumps are visible. Suspend PBMCs with PBS at 10 x 10^6/ml in 15 ml PP conical tube.
==Possible Enhancements==
* Use Author-ity clusters to disambiguate authors
<biblio>
#Authority2009 pmid=20072710
</biblio>
* Keep track of GDS and GSE overlaps


=='''PBMC Antigen Stimulation Assay'''==
==Related references==
==*This is a sterile procedure and all steps should be performed in a hood.==
<biblio>
# Remove stimulants from the freezer and thaw.
#Piwowar-blogGauntlet Piwowar, HA. Studying Reuse Of GEO Datasets In The Published Literature. Research Remix. July 5 2010. [http://researchremix.wordpress.com/2010/07/05/studying-reuse-of-geo-datasets-in-the-published-literature/ blog post]
# Label 24 well plate with specimen ID and date (this is for the 7 days cell culture).  Label each well with the appropriate stimulant condition, '''ordered by priority''' (for cases where there are insufficient cells to test all stimulants).
#Piwowar-AMIA2008 pmid=18998887
## Musm1 (allergen):  @ 200μg/mL purified Musm1 protein in Aim-V.
#Piwowar-BioLINK2008 Piwowar, Wendy W Chapman (2008) Linking database submissions to primary citations with PubMed Central. BioLINK 2008, Toronto Canada. [http://www.researchremix.org/data/bioLINK2008%20Piwowar.doc Full text]
## AIM-V + IL-2 (negative control): AIM-V + IL-2 medium alone.
</biblio>
## Beads (positive control): 1μg/mL anti-Cd3/Cd28 beads.
## Tetanus:  200μg/mL tetanus in AIM-V.
# Add an equal volume (1:1 dilution) of freshly prepared 10μM CFSE (in PBS) to the tube of cells.  To make 1.5mL of PBS + CFSE (2x solution): add 3μL of stock CFSE (5mM) into 1.5mL of PBS.
# Incubate in 37°C water bath for 10 minutes.
# After incubation, wash the CFSE stained cells in 10mL of AIM-V @ 300g for 10 minutes. Aspirate supernatant after centrifugation.
# Resuspend CFSE stained cells in AIM-V medium @ 4 million cells/mL. For plating, each well should contain 2 million cells/mL.
* Begin the stimulation process by preparing AIM-V medium + a 2X solution of IL-2 by adding 2μL of IL-2 per mL of medium in a 15mL conical tube. For 5 stimulant conditions, you will need at least 2.5mL of AIM-V + IL-2. Vortex gently.
 
# Prepare solutions for each stimulant condition in 5 mL PP tubes as follows:
 
## AIM-V+IL-2 medium alone: place 500μL of AIM-V+IL-2 in the appropriately labeled tube and add 500μL of cells in AIM-V+IL-2. Pipette up and down and plate.
 
## 1μg/mL anti-CD3, anti-CD28 beads in AIM-V+IL-2: place 500μL of AIM-V in the appropriately labeled tube and add 2.5μL of CD3, CD28 expander beads. Then add 500μL of cells in AIM-V+IL-2. Pipette up and down and plate.
 
## 200μg/mL MusM1 in AIM-V+IL-2: place 450μL of AIM-V+IL-2 in the appropriately labeled tube and add 50μL of allergen (MusM1). Then add 500μL of cells in AIM-V+IL-2 + allergen. Pipette up and down, and plate.
 
## 200μg/mL tetanus in AIM-V+IL-2: place 475μL of AIM-V+IL-2 in the appropriately labeled tube and add 25μL of tetanus. Then add 500μL of cells in AIM-V+IL-2. Pipette up and down and plate.
 
## Place the tissue culture plate in the 37°C CO2 incubator for 7 days.
 
=='''COLLECTION OF SUPERNATANTS'''==
 
# Cell culture supernatants will be collected for cytokine measurement for 2 or 7 days cultures.
 
# At day 7, collect cells carefully by pipetting up and down to resuspend cells in the well, and
 
transfer the total volume in each well into separate 5ml polystyrene tubes.  Rinse each well
 
with 200μl of staining buffer and add to corresponding PS tubes. Tap gently to mix.
 
# Centrifuge tubes at 300g for 5 minutes at 25°C temperature.
 
# Obtain a cluster tube rack for the storage of supernatants.
 
# Label cluster tubes as follows:
 
## Specimen ID
 
## Date
 
## Supernatants - 7 day
 
## Stimulant condition 1-4
 
### Stimulant 1 = MusM1
 
### Stimulant 2 = AIM-V
 
### Stimulant 3 = Beads
 
### Stimulant 4 = Tetanus Toxoid
# For each stimulant, using a 1000μL pipette tip, transfer 800μL of supernatant from the PS culture tube into each corresponding cluster tube in the cluster rack. Be careful not to disturb the cell pellets.
 
# Cap the cluster tubes and store in the –80oC freezer (by the freight elevators).
 
=='''T REGULATORY CELL SURFACE STAINING'''==
 
# After the collection of supernatants according to the Collection of Supernatants Protocol. Resuspend cells in 80μL of cell separation buffer.
 
# Add 1mL of staining buffer to each tube and vortex.
 
# Wash cells at 300 g for 5 minutes at 4°C. Decant supernatant.
 
# Add 1mL of staining buffer to each tube and vortex.
 
# Wash cells at 300 g for 10 minutes at 4°C. Decant supernatant.
 
# Prepare cocktail preparation according to the T REGULATORY CELL SURFACE STAINING;    Materials. Add 50μL of cocktail to each tube.
 
# Incubate @ 4°C for 20-30 minutes.
 
# Add 3mL of staining buffer to each tube and vortex.
 
# Wash at 300 g for 10 minutes at 4°C. Decant supernatant.
 
* At this point in the experiment, one can fix/freeze the cells or continue with Intracellular cytokine staining.
 
# To fix/freeze the cells, pipette 500μl of 1X Facs Lysis Buffer into each tube and incubate @ 25°C in the dark for 15 minutes. After the incubation, pipette 500μl of the 20% DMSO + PBS solution into each tube.
 
# Transfer the 1ml of cells + 1X Facs Lysis Buffer + 20% DMSO/PBS solution into appropriately labeled cluster tubes, and freeze at -80°C until you are ready to conduct Intracellular cytokine staining.
 
# To Intracellular Cytokine stain the cell with Fox P3, prepare a Fix/Perm working solution as follow: dilute the Fix/Perm Concentrate (1 part) into the Fix/Perm Diluent (3 parts) to the desired volume of working solution (1mL per tube).
 
# Resuspend cell pellet with pulse vortex and add 1mL of Fix/Perm buffer to each sample tube. Pulse vortex again.
 
# Incubate at 4°C for 30-60 minutes in the dark (preferably for 60 minutes if possible).  Wash for 10 minutes at 800 g x 2 (4°C) with 2mL of 1X Perm Buffer (made from 10X solution using dH2O). Decant supernatant.
 
# Add 20μL of Foxp3 antibody into 100μL of 1X Perm Buffer. Incubate for at least 30 minutes (45 minutes if possible) at 4°C.
 
# Wash at 800g for 10 minutes x 2 with 2mL 1X Perm Buffer. Decant supernatant. Vortex and acquire using the Flow machine at 17-40.


==Notes==
==Notes==
Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input!
Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input!
Please sign your name to your note by adding <font face="courier"><nowiki>'''*~~~~''':</nowiki></font> to the beginning of your tip.
#List troubleshooting tips here.   
#List troubleshooting tips here.   
#You can also link to FAQs/tips provided by other sources such as the manufacturer or other websites.
#Anecdotal observations that might be of use to others can also be posted here.   
#Anecdotal observations that might be of use to others can also be posted here.   
Please sign your name to your note by adding <font face="courier"><nowiki>'''*~~~~''':</nowiki></font> to the beginning of your tip.
==References==
'''Relevant papers and books'''
<!-- If this protocol has papers or books associated with it, list those references here.  See the [[OpenWetWare:Biblio]] page for more information. -->
<biblio>
#Goldbeter-PNAS-1981 pmid=6947258
#Jacob-JMB-1961 pmid=13718526
#Ptashne-Genetic-Switch isbn=0879697164
</biblio>


==Contact==
==Contact==
*Who has experience with this protocol?
* Protocol created by [[User:Heather_A_Piwowar|Heather Piwowar]].  Contact me if you have questions or suggestions!
 
* or instead [[Talk:{{PAGENAME}}|discuss this protocol on the associated talk page]].  
or instead, [[Talk:{{PAGENAME}}|discuss this protocol]].
 
<!-- You can tag this protocol with various categories.  See the [[Categories]] page for more information. -->


<!-- Move the relevant categories above this line to tag your protocol with the label
[[Category:DataONE]]
[[Category:Protocol]]
[[Category:Protocol]]
[[Category:Needs attention]]
[[Category:In vitro]]
[[Category:In vivo]]
[[Category:In silico]]
[[Category:In silico]]
 
[[Category:Data analysis]]
[[Category:DNA]]
[[Category:Bibliometrics]]
 
[[Category:RNA]]
 
[[Category:Protein]]
 
[[Category:Chemical]]
 
[[Category:Escherichia coli]]
 
[[Category:Yeast]]
-->

Latest revision as of 13:38, 20 July 2010

Identify reuses of GEO datasets

Aim

The aim of this protocol is to collect data on the reuses of datasets in the published literature. This particular protocol focuses on reuses of gene expression microarray datasets stored in NCBI's Gene Expression Omnibus (GEO) repository and tracks reuses attributed through accession numbers within the full text of articles in PubMed Central.

Background

Little research has been done on the patterns and prevalence of data reuse. A few superstar success stories need no analysis: Data from Genbank and the Protein Data Bank are reused, heavily, successfully. They have generated important science that would not have been possible otherwise.

They are so successful, though, that people discount them as special cases.

So what does the reuse behaviour look like for other datasets?

We don’t know. There have been a few surveys, but they suffer from limited scope and self-reporting biases. Download stats are poorly correlated with perceived value <<citation?>>. So let’s track reuse in the published literature.

Unfortunately, there are no well-established attribution formats and standards for data to facilitate the sort of automated citation analysis that bibliometricians perform with journal articles. Following the track of data is difficult in several additional ways: datasets do not have unambiguous identifiers, attribution is often within full text and thus difficult to query across journals and disciplines, and it is difficult to disambiguate the mention of a dataset in the context of reuse from the mention of a dataset deposit.

Restricting our focus to gene expression microarray data helps to address several of these issues. First, most shared gene expression microarray data is shared in one central repository: the NCBI's Gene Expression Omnibus (GEO). It is common practice to refer to datasets by their GEO accession numbers, and the GEO accession numbers have a fairly distinctive format. Furthermore, most creations and reuses of gene expression microarray data in the published literature are indexed by PubMed and are increasingly (as per NIH mandate) available for full-text query in PubMed Central. The coordinated Entrez databases and eUtils web service mean that full text can be queried automatically, links between articles and datasets can be monitored, and standard indexing metadata can be collected. All disciplines should be so lucky.

Below, then, is a protocol for using these resources to collect information on reuse. Please note the limitations section, and contribute if you have other ideas!

Protocol Overview

  • Query GEO for all GDS and GSE accession numbers for datasets deposited within specified date range
  • Determine which GSE accession numbers fall within which GDS numbers, to estimate the total number of data packages
  • Query PubMed Central for each of these accession numbers, using eutils to search full text of papers available through PubMed Central
  • Exclude the PMC papers that created the GEO data, using Entrez links and guided manual inspection

Optionally:

  • Extrapolate to all of PubMed, using yearly proportion of articles with the MeSH term "gene expression profiling" in PMC vs all of PubMed
  • Estimate what percent of reuse papers have authors in common with the corresponding data creation paper, using last names, institutions, and manual inspection
  • Estimate what percent of reuse papers use data for metaanalysis, using MeSH
  • Estimate what percent of reuse papers use data for tool and method validation, using MeSH and journal title keywords

Materials

Online connection

  • eUtils

Installed software

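Used Python source code:

  • eutils: http://github.com/hpiwowar/eutils
  • pypub: http://github.com/hpiwowar/pypub
  • geo data collection script: http://gist.github.com/448371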

NOTE: I'm still getting my git together, so the code at the above links may not be fully standalone or easily run by others. I'm working on it... in the meantime, feel free to email me if you want details!

Procedure

Summary

Accession number formats

  • look at both GSE and GDS accession numbers
  • use both the raw ID number like 200007572 and the stripped version without the 200... prefix. For example, search for both 200007572 and 7572
  • search for both accession number right beside the prefix, and with one space in between, so "GSE 7572" and "GSE7572"
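
The variants above can be generated mechanically. Below is a minimal Python sketch of one reading of these bullets: it assumes a raw ID is a single type digit followed by a zero-padded number (inferred only from the 200007572 example), and that both the raw and stripped numbers are searched with and without a space after the prefix. The function names are made up for illustration and are not taken from the scripts linked under Materials.

```python
def strip_raw_id(raw_id):
    """Turn a raw ID such as 200007572 into the stripped number "7572".

    Assumption: the first digit is a type marker and the rest is the
    zero-padded accession number (based on the single example in the text).
    """
    return str(int(str(raw_id)[1:]))


def accession_search_terms(prefix, raw_id):
    """List the spellings to search for one accession (hypothetical helper)."""
    terms = []
    for number in (strip_raw_id(raw_id), str(raw_id)):
        terms.append(f"{prefix}{number}")   # prefix right beside the number
        terms.append(f"{prefix} {number}")  # one space in between
    return terms


print(accession_search_terms("GSE", 200007572))
```

If only the stripped spellings are wanted, keep just the first two entries of the returned list.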

Exclude data creation studies

  • spot-check to make sure accession number is in the context of reuse... looks like there may be a few mentions in the context of deposit in which the article is not tagged with pmc_gds[filter] (example: PMCID 2396644)
    • do this for all the PMC article hits? looks like there are a few missing the filter, and it matters because it would erroneously inflate our reuse estimate
    • could use query from my BioLink paper:
 (geo OR omnibus) 
 AND microarray 
 AND "gene expression"       
 AND accession
 NOT (databases 
        OR user OR users
        OR (public AND accessed) 
        OR (downloaded AND published)) 
    • or the more simple:
 "gene expression omnibus" AND (submitted OR deposited) 
    • to do this transparently, query PMC results for each of these words:
      • submitted
      • deposited
      • user*
      • public
      • accessed
      • downloaded
      • published
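
As a rough illustration of how the exclusion filter could be attached to a full-text query, here is a small Python sketch that only builds the search string and an E-utilities ESearch URL for PubMed Central; it does not send the request. The helper names and the retmax value are assumptions for this example, and the linked scripts may well build their queries differently.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pmc_reuse_query(terms, exclude_creators=True):
    """Combine accession spellings into one PMC search string (hypothetical helper)."""
    query = "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
    if exclude_creators:
        # Drop PMC articles that Entrez already links to a GEO record.
        query += " NOT pmc_gds[filter]"
    return query


def esearch_url(query, retmax=100):
    """Build the ESearch URL for a PMC search; retmax=100 is an arbitrary choice."""
    params = {"db": "pmc", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


query = pmc_reuse_query(["GSE7572", "GSE 7572"])
print(query)
print(esearch_url(query))
```

Hits that survive the filter would still need the spot-check described above, since some deposit mentions are not tagged with pmc_gds[filter].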

Estimate what percentage of reusers weren't the original authors

  • see if AND pubmed_gds and NOT pmc_gds have any author overlaps? (note AND should be pubmed!)
  • other idea: institution comparison using medline info
  • better than submitter, because submitter not the whole story
  • better than institution, because institution not precise in submission

Is the PMC paper by the same investigators as those who originally created the data?

  • first pass: automatically extracted a column that contained the last names at the intersection of the PMC reuse paper and those in the original data-creation paper and those in the GEO submission list
  • if there was a lot of author overlap, coded it as a "CREATOR REUSE" paper
  • also automatically extracted the institution of the PMC reuse paper and the original data-creation paper. If there was overlap and some evidence of author overlap, coded it as a "CREATOR REUSE" paper
  • if there was no overlap in author or institution, coded it as NOT a "CREATOR REUSE" paper
  • for ambiguous cases where there was an author in common between the two papers but it was a common name or the corresponding author addresses were different, I manually examined the PMC reuse paper and the data-creation paper to determine whether the common authors had the same initials and institutions. If yes, I coded it as a "CREATOR REUSE" paper, otherwise I coded it as NOT a "CREATOR REUSE" paper
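
The coding rules above can be sketched as a small function. This is an illustration, not the procedure actually used: the function name, the threshold of two shared last names for "a lot of author overlap", exact string matching of institutions, and pooling the data-creation authors with the GEO submitters are all assumptions. Cases the rules leave open are returned as "MANUAL REVIEW" rather than guessed.

```python
def normalise(names):
    """Lower-case and strip last names, dropping empty entries."""
    return {name.strip().lower() for name in names if name.strip()}


def code_creator_reuse(reuse_authors, creation_authors, submitters,
                       reuse_institution, creation_institution,
                       min_strong_overlap=2):
    """Code one (reuse paper, dataset) pair following the rules above."""
    reuse = normalise(reuse_authors)
    # Assumption: pool data-creation authors and GEO submitters as "creators".
    creators = normalise(creation_authors) | normalise(submitters)
    shared = reuse & creators
    same_institution = (
        bool(reuse_institution.strip())
        and reuse_institution.strip().lower() == creation_institution.strip().lower()
    )
    if len(shared) >= min_strong_overlap:
        return "CREATOR REUSE"        # a lot of author overlap
    if shared and same_institution:
        return "CREATOR REUSE"        # institution overlap plus some author overlap
    if not shared and not same_institution:
        return "NOT CREATOR REUSE"    # no overlap in author or institution
    return "MANUAL REVIEW"            # ambiguous: check initials and addresses by hand
```

A single shared last name with different institutions falls through to manual review, matching the ambiguous case described in the last bullet.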

Extrapolate from PubMed Central to PubMed

  • use "gene expression profiling"[mesh] query in PMC vs PubMed over time period in question to get relevant estimate
    • restrict from 2007 to 2009
    • result:
 number of articles in PMC:  6311, 
 number of articles in PubMed:  21569, 
 so PMC contains 29.26% of related papers
  • so we should multiply our number of reuse papers found in PMC by about 3.4 (21569/6311) to get an estimate for all of PubMed
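
The arithmetic behind this extrapolation, as a short sketch. The two counts are the ones reported above; the function names are hypothetical, and the calculation assumes reuse papers are as likely to be in PMC as any other "gene expression profiling" paper.

```python
# Counts reported above for "gene expression profiling"[mesh], 2007 to 2009.
PMC_ARTICLES = 6311
PUBMED_ARTICLES = 21569


def pmc_coverage(pmc_count=PMC_ARTICLES, pubmed_count=PUBMED_ARTICLES):
    """Fraction of related PubMed papers that are also in PMC."""
    return pmc_count / pubmed_count


def extrapolate_to_pubmed(reuses_found_in_pmc,
                          pmc_count=PMC_ARTICLES, pubmed_count=PUBMED_ARTICLES):
    """Scale a reuse count observed in PMC up to all of PubMed."""
    return reuses_found_in_pmc / pmc_coverage(pmc_count, pubmed_count)
```

With these counts the coverage is about 0.2926, so the scaling factor is about 3.42; a hypothetical 100 reuse papers found in PMC would extrapolate to roughly 342 in PubMed.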

Validation

Variants

Reuses of ArrayExpress datasets

  • as with GEO datasets, but gather ArrayExpress accession numbers through screen scrape of ArrayExpress website (is there a better way?).
    • used this url: http://www.ebi.ac.uk/microarray-as/ae/browse.html with Display=500 and click "Detailed view" in the header. Warning, this is slow.
    • most expedient data extraction I could easily figure out: actually copy the raw data from within the frame and paste into a text file
  • didn't use any variants of ArrayExpress accession numbers. A quick Google Scholar exploration suggested that people are pretty consistent with the E-XXXX-nnnn formatting.
  • obviously the "NOT pmc_gds[filter]" isn't going to do much because it captures links between PMC and GEO not ArrayExpress. Left it in there anyway, since a large proportion of ArrayExpress content is pulled from GEO, might exclude some of the data creation articles
  • expect a higher proportion of data creation articles in resulting set, because no very effective automated filter
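
Once the browse-page text has been pasted into a file, the accession numbers can be pulled out with a regular expression. This is a minimal sketch: the pattern assumes the E-XXXX-nnnn format noted above, with exactly four capital letters followed by digits, and the function name is hypothetical.

```python
import re

# Assumed pattern: "E-", four capital letters, "-", then one or more digits.
AE_ACCESSION = re.compile(r"\bE-[A-Z]{4}-\d+\b")


def find_arrayexpress_accessions(text):
    """Return the distinct accession numbers in text, in order of first appearance."""
    seen = []
    for accession in AE_ACCESSION.findall(text):
        if accession not in seen:
            seen.append(accession)
    return seen
```

The same function could be applied to article full text, though accession numbers written in any other style would be missed.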

Application

Example data

Extracted this raw data, one row for every (GEO accession number:PMCID of paper that includes the accession number) pair:

Potential uses

  • is the PMC paper actually about data sharing into GEO rather than data reuse?
  • is the PMC paper by the same investigators as those who originally created the data?
  • if reuse, is it in the context of developing a method or tool?
  • could use this data to see how many publications use any one dataset
  • could use this data to look at average elapsed time between data submission and reuse, but only have short time period to consider... better off with data deposited longer ago
  • can't use this particular data to see how many datasets each publication uses, because only looking at datasets from a given year
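
As an illustration of the "how many publications use any one dataset" idea, the (accession number, PMCID) rows can be grouped by accession. This is a sketch with a hypothetical function name; it assumes the rows have already been read into a list of pairs and may contain duplicates.

```python
from collections import defaultdict


def papers_per_dataset(pairs):
    """Count distinct PMC papers mentioning each accession number."""
    papers = defaultdict(set)
    for accession, pmcid in pairs:
        papers[accession].add(pmcid)  # a set ignores repeated rows
    return {accession: len(pmcids) for accession, pmcids in papers.items()}
```

Keeping the PMCIDs in a set means a paper that mentions the same accession more than once is counted only once.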

Known uses

Assumptions, Limitations, and Unknowns

This protocol captures a subset of all dataset reuses because of several limitations:

  • Many data citations are attributed without using accession numbers
  • Many papers are not in PubMed Central
    • we can estimate what percentage are and then try to extrapolate to all papers. For example, using "gene expression profiling"[mesh] query in PMC vs PubMed over 2007-2009 suggests PMC contains 30% of all related papers in PubMed
    • many datasets will continue to be used in the future... these reuses are obviously not included in our estimate
  • our methods do not find studies that both create and reuse data
    • to narrow down our query results, we automatically eliminate studies that create data... even though these same studies may also reuse data
    • we don't have an estimate of how many this is, would require manual inventory
  • Doesn't capture reuse outside the peer-reviewed literature (for example, reuse during training)

Furthermore, extrapolations based on this data may be biased:

  • Papers in PubMed Central may not be representative
  • Deposits into PMC are not stable over time: the distribution may change and may be skewed by open-access uptake or NIH-funding levels in various communities
  • our estimates do not consider reuses after our study timeframe

Open Questions

  • How to efficiently estimate what percent of these papers depended on the GEO data for their scientific contribution?

Possible Enhancements

  • Use Author-ity clusters to disambiguate authors
  1. Torvik VI and Smalheiser NR. Author Name Disambiguation in MEDLINE. ACM Trans Knowl Discov Data. 2009 Jul 1;3(3). DOI:10.1145/1552303.1552304 | PubMed ID:20072710
  • Keep track of GDS and GSE overlaps

Related references

  1. Piwowar HA. Studying Reuse Of GEO Datasets In The Published Literature. Research Remix. July 5 2010. Blog post.
  2. Piwowar HA and Chapman WW. Identifying data sharing in biomedical literature. AMIA Annu Symp Proc. 2008 Nov 6;2008:596-600. PubMed ID:18998887
  3. Piwowar HA and Chapman WW. Linking database submissions to primary citations with PubMed Central. BioLINK 2008, Toronto, Canada.

Notes

Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input! Please sign your name to your note by adding '''*~~~~''': to the beginning of your tip.

  1. List troubleshooting tips here.
  2. Anecdotal observations that might be of use to others can also be posted here.

Contact