DataONE:GEO reuse study/pilot

'''Note! These data are very preliminary and at this point should be taken as illustrations of the analyses that can be done, rather than as valid results.'''

=GEO reuse study, Exploratory Pilot=

Indexing terms used by articles that deposit or reuse GEO microarray datasets

 * Nodes = MeSH indexing terms and qualifiers
 * Edge brightness = how often the connecting indexing terms are used to annotate the same articles
 * Green nodes = indexing term used relatively often by articles that DEPOSITED data
 * Blue nodes = indexing term used relatively often by articles that REUSED data
 * Size of nodes = how often the indexing term is applied to this set of articles overall
 * Dataset… about 3800 articles from PMC had MeSH terms: 300 were reusers and the rest depositors. MeSH terms were hand-selected (guided by automated statistics) for high differentiation between reuse and deposit in this dataset. Only MeSH terms applied to at least 10 articles in this dataset were kept.
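
The automated statistics behind the term selection and the edge brightness are not recorded here, so the following is only a rough sketch of how they might be computed. The input layout (a list of label/term-set pairs), the function names, and the use of a simple "fraction of reuse articles" as the differentiation measure are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def term_stats(articles, min_articles=10):
    """Count how often each MeSH term is used, and by which kind of article.

    articles: list of (label, terms) pairs, where label is "deposit" or
    "reuse" and terms is an iterable of MeSH terms (hypothetical layout).
    Returns {term: {"n": total articles, "reuse_fraction": share from reusers}}
    for terms applied to at least min_articles articles.
    """
    totals = Counter()
    reuse = Counter()
    for label, terms in articles:
        for term in set(terms):
            totals[term] += 1
            if label == "reuse":
                reuse[term] += 1
    return {
        term: {"n": n, "reuse_fraction": reuse[term] / n}
        for term, n in totals.items()
        if n >= min_articles
    }


def cooccurrence(articles, kept_terms):
    """Count how often each pair of kept terms annotates the same article."""
    kept = set(kept_terms)
    edges = Counter()
    for _, terms in articles:
        present = sorted(set(terms) & kept)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges
```

In this sketch, `n` would drive node size, `reuse_fraction` would suggest green versus blue, and the pair counts from `cooccurrence` would drive edge brightness. The hand-selection step is not modelled.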



Authors of papers that deposit or reuse GEO microarray datasets

 * (visible) Nodes = authors
 * Edges = links from authors to (tiny, almost invisible) nodes representing articles that either deposited or reused data
 * Green nodes = authors that only authored DEPOSITING articles in this dataset
 * Green edges = edges connecting authors to (tiny) depositing-articles nodes
 * Blue nodes = authors that only authored REUSE articles in this dataset
 * Blue edges = edges connecting authors to (tiny) reusing-articles nodes
 * Red nodes = authors that authored both depositing and reuse articles in this dataset
 * Connected large nodes (through almost invisible article nodes) = co-authors
 * Dataset… Authors from 1000 randomly chosen papers out of the 4200 articles from PMC (subsampled to keep network from being too crowded and slow)
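
As an illustration of the colouring rules above, here is a minimal sketch that subsamples papers and assigns each author a colour. The input layout (a list of kind/author-list pairs), the function names, and the fixed random seed are assumptions for the example, not a record of how the network was actually built.

```python
import random
from collections import defaultdict


def subsample(papers, k, seed=0):
    """Pick k papers at random (or all of them if there are fewer than k)."""
    rng = random.Random(seed)
    return rng.sample(papers, min(k, len(papers)))


def author_colors(papers):
    """Assign each author a node colour from the kinds of papers they wrote.

    papers: list of (kind, authors) pairs, where kind is "deposit" or
    "reuse" (hypothetical layout).
    """
    kinds = defaultdict(set)
    for kind, authors in papers:
        for author in authors:
            kinds[author].add(kind)

    colors = {}
    for author, seen in kinds.items():
        if seen == {"deposit"}:
            colors[author] = "green"  # only depositing articles
        elif seen == {"reuse"}:
            colors[author] = "blue"   # only reuse articles
        else:
            colors[author] = "red"    # both kinds
    return colors
```

Note that an author's colour depends on the subsample: someone with both kinds of paper in the full set can appear green or blue if only one of their papers is drawn.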



Proportion of publications in PubMed Central that deposited or reused GEO microarray data

 * Deposit: Ratio of (articles in PubMed Central with links from primary-citation field in GEO) / number of papers in PMC
 * Reuse: Ratio of (articles in PMC with GEO accession numbers in their full text and no links from primary-citation field in GEO) / number of papers in PMC
 * Note: this is definitely an underestimate of absolute reuse from GEO, but it should permit a fair picture of trends over time
 * Note: deposits to GEO appear to be tapering off.  This is quite plausible, as focus shifts to inexpensive DNA sequencing.  However, before publishing this, I’d want to double-check with GEO that it reflects their experience and isn’t an artifact of, for example, the rapid increase of publications into PubMed Central.
 * (I collected the total-PMC data yearly: the graph could be smoother with a bit more data-collection effort)
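
The two ratios defined above could be computed along these lines. This is a sketch under stated assumptions: the accession pattern (GSE, GDS, GSM or GPL followed by digits), the function names, and the per-year dictionaries are illustrative choices and not necessarily what was used for the graph.

```python
import re

# Assumed pattern for GEO accession numbers (series, datasets, samples, platforms).
GEO_ACCESSION = re.compile(r"\bG(?:SE|DS|SM|PL)\d+\b")


def classify(pmid, full_text, primary_citation_pmids):
    """Label one PMC article as "deposit", "reuse", or None.

    primary_citation_pmids: set of article IDs linked from the
    primary-citation field in GEO (hypothetical input).
    """
    if pmid in primary_citation_pmids:
        return "deposit"
    if GEO_ACCESSION.search(full_text):
        return "reuse"  # mentions an accession but is not a primary citation
    return None


def yearly_proportions(deposit_counts, reuse_counts, pmc_totals):
    """Return {year: (deposit ratio, reuse ratio)} relative to all PMC papers."""
    proportions = {}
    for year, total in sorted(pmc_totals.items()):
        if total == 0:
            continue  # avoid dividing by zero for years with no PMC papers
        proportions[year] = (
            deposit_counts.get(year, 0) / total,
            reuse_counts.get(year, 0) / total,
        )
    return proportions
```

Articles that reuse data without quoting an accession number are missed by a text match like this, which is one reason the reuse ratio is an underestimate.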




 * Same graph, but secondary axis at 1/10th scale to superimpose shapes: