User:Heather A Piwowar/Notebook/PhD thesis


Summary

My PhD dissertation, Foundational studies for measuring the impact, prevalence, and patterns of publicly sharing biomedical research data, was defended in the Department of Biomedical Informatics in the School of Medicine at the University of Pittsburgh on March 24, 2010.

Committee:

  • Dissertation Advisor: Wendy W. Chapman, PhD, Assistant Professor, Department of Biomedical Informatics, University of Pittsburgh
  • Brian B. Butler, PhD, Associate Professor, Katz Graduate School of Business, University of Pittsburgh
  • Ellen G. Detlefsen, PhD, Associate Professor, School of Information Sciences, University of Pittsburgh
  • Gunther Eysenbach, MD, MPH, Associate Professor, Department of Health Policy, Management and Evaluation, University of Toronto
  • Madhavi Ganapathiraju, PhD, Assistant Professor, Department of Biomedical Informatics, University of Pittsburgh

Abstract

Many initiatives encourage research investigators to share their raw research datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp on the prevalence or patterns of data sharing and reuse. Previous survey methods for understanding data sharing patterns provide insight into investigator attitudes, but do not facilitate direct measurement of data sharing behavior or its correlates. In this study, we evaluate and use bibliometric methods to understand the impact, prevalence, and patterns with which investigators publicly share their raw gene expression microarray datasets after study publication. To begin, we analyzed the citation history of 85 clinical trials published between 1999 and 2003. Almost half of the trials had shared their microarray data publicly on the internet. Publicly available data was significantly (p=0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin.
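As a rough illustration of how a figure like "a 69% increase in citations" can be read off a regression, the sketch below assumes a linear model fitted to log-transformed citation counts. That model form is an assumption made for this example and is not stated in this summary. The function names are hypothetical. Under that assumption, a coefficient beta on a "data shared" indicator corresponds to a multiplicative change of exp(beta) in citations.

```python
import math


def percent_change_from_log_coefficient(beta: float) -> float:
    """Percent change in the outcome implied by a coefficient on ln(outcome).

    Assumes a model of the form ln(citations) = ... + beta * shared + ...
    (an illustrative assumption, not taken from the thesis summary).
    """
    return (math.exp(beta) - 1.0) * 100.0


def log_coefficient_from_percent_change(percent: float) -> float:
    """Inverse mapping: the log-scale coefficient implied by a percent change."""
    return math.log(1.0 + percent / 100.0)


if __name__ == "__main__":
    # Back out the log-scale coefficient that would correspond to a 69% increase.
    beta = log_coefficient_from_percent_change(69.0)
    print("implied coefficient:", beta)
    print("percent change:", percent_change_from_log_coefficient(beta))
```

The point of the sketch is only the conversion between a log-scale coefficient and a percent change; it does not reproduce the covariates or data of the study.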

Digging deeper into data sharing patterns required methods for automatically identifying data creation and data sharing. We derived a full-text query to identify studies that generated gene expression microarray data. Issuing the query in PubMed Central, Highwire Press, and Google Scholar found 56% of the data-creation studies in our gold standard, with 90% precision. Next, we established that searching ArrayExpress and the Gene Expression Omnibus databases for PubMed article identifiers retrieved 77% of associated publicly-accessible datasets.
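The recall and precision figures above compare what a query retrieves against a gold standard. A minimal sketch of that calculation follows. The article identifiers and the function name are hypothetical and chosen only for illustration; they are not the thesis data or code.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], gold_positive: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a retrieved set against gold-standard positives.

    precision = true positives / everything retrieved
    recall    = true positives / everything that should have been retrieved
    """
    true_positives = retrieved & gold_positive
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(gold_positive) if gold_positive else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical identifiers: five gold-standard data-creation studies,
    # and a query that finds three of them plus one false hit.
    gold = {"a1", "a2", "a3", "a4", "a5"}
    hits = {"a1", "a2", "a3", "b9"}
    p, r = precision_recall(hits, gold)
    print("precision:", p, "recall:", r)
```

The same calculation applies to the dataset lookup step, where the retrieved set would be the datasets found by searching the databases for PubMed identifiers.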

We used these methods to identify 11603 publications that created gene expression microarray data. Authors of at least 25% of these publications deposited their data in the predominant public databases. We collected a wide set of variables about these studies and derived 15 factors that describe their authorship, funding, institution, publication, and domain environments. In second-order analysis, authors with a history of sharing and reusing shared gene expression microarray data were most likely to share their data, and those studying human subjects and cancer were least likely to share.

We hope these methods and results will contribute to a deeper understanding of data sharing behavior and eventually more effective data sharing initiatives.

Full Text

PDF, Word docx, University of Pittsburgh [ETD entry].

Associated Publications, Data, and Source Code

iEvo Bio 2010 lightning talk submission summarizing my findings.

Proposal

Many initiatives encourage research data sharing in hopes of increasing research efficiency and quality, but the effectiveness of these early initiatives is not well understood. Reusing research data has many benefits for the scientific community. New research hypotheses can be tested more quickly and inexpensively when duplicate data collection is reduced. Shared data can be aggregated to study otherwise-intractable issues, and a more diverse set of scientists can become involved when analysis is opened beyond those who collected the original data. Publicly available data helps to identify errors, discourages fraud, and is useful for training new researchers.

Funders, publishers and academic organizations — eager to realize such benefits — have developed tools, resources and policies to encourage and require data-producing investigators to make their datasets publicly available. Despite these investments of time and money, we do not have a firm grasp on the prevalence or patterns of data sharing and reuse, the effectiveness of initiatives, or the costs, benefits, and impact of repurposing biomedical research data.

Previous methods for assessing data sharing prevalence have included manual curation and investigator self-reporting. Models of knowledge sharing have emerged from the information science and management of information systems communities, usually derived from case studies or survey instruments. These approaches provide insight into motivation, but they are subject to an intention-action gap and are labor-intensive to repeat across multiple subdisciplines and over time to monitor changes in behavior.

The proposed research will build on and supplement previous work through an analysis of observed variables, thereby providing an alternative perspective for understanding and monitoring data sharing behavior.

My research questions:

  1. Does data sharing benefit those who share?
  2. Can data sharing and withholding be systematically and automatically measured?
  3. How often is data shared? What predicts sharing? How can we model sharing behavior?


Various versions of my proposal:

Pilot Study

A pilot study for Aim 3 was presented at the recent Symposium on Informetrics and Scientometrics (my slides), and has been accepted for a special issue of the Journal of Informetrics:

  1. Piwowar HA, Chapman WW. Public sharing of research datasets: A pilot study of associations. Journal of Informetrics. Volume 4, Issue 2, April 2010, Pages 148-156. [joi2010]

Raw data and statistical code are available on github.

Aim 1

Completed and published:

  1. Piwowar HA, Day RS, and Fridsma DB. Sharing detailed research data is associated with increased citation rate. PLoS One. 2007 Mar 21;2(3):e308. DOI:10.1371/journal.pone.0000308 | PubMed ID:17375194 | HubMed [plosone2007]

Raw data and statistical scripts are available on github.

Aim 2a

Will be submitting to a journal for publication soon. Full-text analysis done using the custom steelir Python library, available as open source on github.

Aim 2b

Completed and published:

  1. Piwowar H and Chapman W. Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers. J Biomed Discov Collab. 2010 Mar 28;5:7-20. PubMed ID:20349403 | HubMed [disco2010]

Raw data and statistical scripts are available on github.

Aim 3

Will be submitting to a journal for publication soon.

Raw data and statistical code are available on github.

Data collection done manually and through the custom pypub library, available as open source on github.

Defense

My doctoral dissertation was successfully defended on March 24, 2010 in the Department of Biomedical Informatics, School of Medicine, University of Pittsburgh (slides).

Ongoing work

Check out my personal page at OpenWetWare.


