DataONE/Summer 2010/Research questions: Difference between revisions

From OpenWetWare
<div style="padding: 10px; width: 720px; border: 5px solid #2171B7;">
<div style="padding: 10px; width: 720px; border: 5px solid #2171B7;">
==Open Questions for mentors and the community==
Please comment on questions on the related [http://www.openwetware.org/wiki/Talk:DataONE/Summer_2010/Research_questions Talk page] or by editing this section of the wiki. Please leave your name and time with all comments (use the "Sign your username" link while editing).
* The students don't have access to all the journals they need through their home institutions (e.g. Simmons College).  Can we set up some guest access to other DataONE-affiliated resources?
* What is a good GIS/earth journal for analysis?
** Use a journal that is well represented in Pangaea?  And/or one affiliated with GSA?  --[[User:Todd Vision|Todd Vision]] 10:46, 10 June 2010 (EDT)


==Data citation practice inventory within journals (articles)==
Owner: [http://openwetware.org/wiki/User:Sarah_Judson Sarah]
 
===Guiding Research Questions===
# What are various practices for data citation within academic papers?  How prevalent is each variety?
#* Do authors tend to cite the dataset itself or a related paper?
#* How did the author obtain the dataset (e.g. past study, buddy, search, known database)?
# How do these practices vary across discipline, journal, data type, data source?
#* Are data citation practices influenced more by attitude of the discipline towards data sharing or journal policy?
#* Associated with [http://www.openwetware.org/wiki/DataONE/Summer_2010/Research_questions#Description Nic's Metadata Questions]
# How have these practices varied across time?
''good broad questions for now, i'm refining more specific questions and how they fit into the broader picture''
#* Does increased data reuse/sharing correlate with changes in journal policy?
#* Does data reuse/sharing simply increase with time since the advent of the internet?


===Scope and Plan===
* which journals? --> ''Starting with AmNat, SysBio, MolecularEco. Probably will then move to some of the ESA affiliated journals and a GIS/earth journal (need suggestion). - This will give a broad coverage of subject types (in previously mentioned order: behavioral/model, systematics/phylogeny, genetics, ecology, earth/GIS). Then maybe Evolution, Nature, Science b/c big names in biology, but these are more broad coverage, including the previously mentioned journals.''
*Phase 1: Data Extraction
** We have some survey results on scientist attitudes and behaviours that might sync up nicely with these results if we choose journals that reflect the scientists' fields. When asked "Which of the following best describes your <i>primary</i> field of concentration within evolutionary biology?" the top results were:
**Journals
***Behavior/Neurobiology 23%
***Dryad "Top Three"
***Development/Morphology 21%
****Justification:
***Ecology 17%
*****1. Most datasets currently posted...are they really being reused?
***Genetics/Genomics 14%
*****2. Known "High Impact" Journals
***Molecular evolution 8%
*****3. Cover target disciplines
***Paleontology 8%
****Systematic Biology (SysBio; discipline: Systematics, Phylo-genetics/-geography, Molecular Evolution)
***''Great! thanks for this list. Is there more data from this survey, I'm interested.''
****American Naturalist (AmNat; discipline: Behavior, Natural History, Ecology)
** I don't know which journals best sync up with these fields? ''see above. i may need a more specific paleontology journal.''
****Molecular Ecology (MolEco; discipline: Genetics, Molecular Evolution )
* which time periods? ''starting with 2010, moving back. probably annually through 2000 and then every five years before. For now, the first issue(s) or 25-50 journals published that year. Should move to random sampling to eliminate the effects of special topic issues. the 2010 preliminary dataset, though not random, is important for investigation of extracted data and trends within a single journal issue.''
***ESA family
* what data will you extract? ''Still determining fields. Right now, keywords, article topic, dataset citation (Y/N), how data cited, if data is readily accessible, author reciprocally posting their dataset (y/n, same nested questions as with dataset citation). I have an ever expanding spreadsheet...planning on a more refined database or google doc form soon.''
****Justification:
* how many datapoints do you expect? ''Many! Lots of articles. Planning on 100+ per journal, assuming we pick focal journals. Especially if I can dedicate my time more to this since Valerie has taken depositories which seemed like it was originally under my domain, and because Nic should be able to answer my journal-based questions with the data he is collecting.''
*****1. Known high impact
* what stats will you run?  what is your statistical power? ''Still need to think on this. Baseline = % of articles in a journal that cite a dataset, % that do it properly, % that also post their data, etc. Beyond that, mostly correlations between data citation (or lack thereof) and journal, time, topic (field of concentration), open access, etc. These are relatively simple but may suffice. I'm interested in a more sophisticated method, but am not familiar with traditional statistics in the social sciences. Perhaps some multivariate clustering to establish which parameters determine whether data are cited. Open to suggestions, especially on common methods in the social sciences and specifically in data citation (if there are any)! Statistical power should be good b/c of the large sample size (many more articles than journals), though there are some issues with "unequal sampling" b/c some journals have fewer publications per year/issue.''
*****2. Known involvement with early data-reuse/sharing discussions (ADD LINK!!)
* what do you plan to have complete by June 30th?''1. Establish WHAT information is collected from each article, 2. Establish HOW information is collected (expedite manual searching, possibly text searching and database automation), 3. Get through 2010 articles of SysBio, AmNat, MolecularEco, 4. Evaluate continued article sampling (random, time-scale, by topic)''
****Ecology
* plans for integration with other intern work?''I made brief comments below about collaboration which I hope to update soon. I think a central database would standardize data collection (i.e. fields, character states). Also, this would allow for ease of analysis because an article (or journal or repository) could be evaluated for journal, repository trends/metadata as well (and vice versa for each of our focal areas).''
****Others (open to suggestions)
****Is Evolution affiliated with ESA?
***Discipline Coverage
****Justification:
*****1. To get full disciplinary coverage on "environmental and earth" related articles which are the primary interest of DataONE.
*****2. Target journals that are good at or lacking data re-use.
****Paleobiology
****Pangaea-associated journal (SUGGESTIONS PLEASE!)
****GIS/Biogeography associated journal (SUGGESTIONS PLEASE!)
***Broad Coverage (if time!)
****Justification:
*****1. Wider variety of disciplines in the same journal
*****2. Disciplinary overlap with other journals, but then able to elucidate influence of journal policies.
****Evolution (if not ESA)
****Nature
****Science


==Data sharing and citation policies of journals==
**Article sampling
***Time series vs. random sample
****Time: Issue by Issue. Starting with 2010 moving back. Possible confounding factors of Special Issues and unequal sampling b/c some journals have fewer articles per issue and per year.
*****I'm currently working under this method, primarily for working out needed extraction fields and differences between journal formats.
****Random: Controlled "random" search of 100 articles per year or per discipline. Could still do within a time block.
***I appreciate suggestions on Time vs. Random sampling.
 
**Data Extraction
***Primarily information about author, discipline, Y/N dataset citation (reuse and sharing), details on dataset citation
***[https://spreadsheets.google.com/ccc?key=0Am4hbt8Ef8WXdENmZU83dTRUbW5fNFg3RjFFa1Z0LUE&hl=en Preliminary list of extracted fields]
 
 
*Phase 2: Analysis
**Statistics:
***Sample Size: 500-1000 articles (Any opinions on what is realistic in a month period?)
***Percentage: % Data reuse per issue/year, % Proper dataset citation (depository)
***Correlations: Time, Journal, Discipline
****This could expand to Multivariate Cluster or NMDS analysis
**Database Integration
***Set up Access/Excel or opensource database for our communication between all collected metadata
****NOTE, we need to solidify fields (especially link fields) up front to prevent problems later on. See this spreadsheet (still need to link, yell at me if I haven't).
 
===Deliverables===
*June 30th:
**Bulk of data extraction. Hoping for at least 500 articles. Still trying to determine if this is realistic based on the time it takes for data extraction.
**Develop and Standardize Extraction methods. Text searching, zotero/EndNote/etc integration, standardized fields, coding.
*End of Project
**Statistical results
**Database or at least dataset with good metadata for future use by DataONE. This is the whole point of the project, eh? To encourage and be an example of good data sharing and reuse.
**Summary paper (at least documentation, hopefully moving towards publishable manuscript). Coordinate with Nic and Valerie.
***Suggestions for best practices for authors, journals, depositories, DataONE.
***Include anecdotal information I have obtained from authors I know who reuse data (Sites, Peck, SysBIO treebase ppr, etc) and from people I've talked to about their databases (Miller, NEON, etc).
 
==Data sharing and citation policies for journals, funding sources and repositories==
Owner:  Nic
# What are the data sharing and citation policies applicable to authors, from funders, journals, institutions, and repositories?
# How are the collages of applicable policies different by discipline, journal, data type, data source?
# How have the collages of applicable policies changed across time?
# How do the applicable policies correlate with data sharing behaviour?
''Comment:  this may need narrowing down...''


==Scope and Plan==
===Description===
 
In this project I will be investigating data management policies for the existence (or absence of) requirements for researchers sharing and citing data. This will be accomplished in two phases. In phase I,  I  will collect data management policies from a  number of journals, repositories and funding sources in order to quantitatively assess data sharing and citation requirements.  In phase II, I will be trying to determine the impact of the policies based on correlations with Sarah and Valerie's data.
 
My specific research questions include:


* where to focus the research. Specific issues of specific journals? Same as Sarah's? -->**I'm starting with the "big three" on dryad...AmNat, SysBio, and Molecular Ecology (most posted datasets, broad subject coverage between them)-Sarah
# What are the data sharing and citation policies applicable to authors, from funders, journals, and repositories?
# How do these policies differ by discipline, journal, data type, data source?
# How has the spectrum of applicable policies changed over time? (Need more thought on how to track this)
# How do the applicable policies correlate with data sharing behavior?
# How do the applicable policies correlate with citing data?


I think Sarah and I should coordinate our research efforts, in so far as the journals she is mining for data reuse and citations should also be the journals where I am collecting Metadata and broader policies on data sharing and citation.
===Scope and Plan===


Our work should also overlap in that I can look for authors funding resources and institutional affiliation for further policies. A potential obstacle is that this isn’t necessarily going to give us clear boundaries for discipline specific data (other than place of publication) Data Types might help to parse this out a bit, but not reliably. 
Project will be carried out in two phases


* what data will you extract?
'''
Phase 1 : Collecting and "quantifying" various attributes of policies'''


(would love recommendations in each of these categories)


Metadata elements of Journals
I'll use the following sources (please add sources)
#Publisher
# Date of Publication
# Format of Pub (e-only or available in print)
# Society Affiliation
# Data Repository
# Open Access / Subscription
# Impact Factor
# Peer Reviewed
# Where Indexed / Abstracted
Policy of Sharing and Citations (For Institutions, Repositories, Journals and Funding Sources)
Metadata:
# Entity Name
# Physical Location / Affiliation (if any)
# Domain Affiliation / Included Disciplines  (if any)
# ???
Broader Data
# What are the elements of their Data Policies
## Institutions and Funding sources : Requirements of a  Data Management Plans  for researchers
## Repositories and Journals: Requirements for deposit / publication
# Specific language for sharing data and or citing data
# Suggestions on how to cite data
# ???
* how many datapoints do you expect?
Strongly depends on how broad the sample size is… (in short, I don’t know yet)


* what stats will you run? what is your statistical power?
*Journals : SysBio, AmNat, MolecularEco
* what do you plan to have complete by June 30th?
*Repositories: TreeBASE, GenBank, Pangaea
*Foundations / Funding Bodies :NSF , JISC, AU ANDS


* plans for integration with other intern work?


I think / hope that Sarah and I can coordinate our data gathering. Hopefully this will allow us to have more correlations in our data, and we can begin to see broader patterns of data citation w/r/t impact,  what effect these have on Question 2 of my research (the collage of applicable policies ) and vice versa.
I will collect the following elements from each source (linked to Google Docs spreadsheets)
 
Metadata: [https://spreadsheets.google.com/ccc?key=0AmL-EJ5-i7x7dGdPQ2ZqazBTWkE2LW1MeTBUSjZzc1E&hl=en Journals], [https://spreadsheets.google.com/ccc?key=0AmL-EJ5-i7x7dER5SXB1MV9IMFl5T3RXZi1mSVFjWmc&hl=en Repositories] + [https://spreadsheets.google.com/ccc?key=0AmL-EJ5-i7x7dHRHemZ3QU9OZ245Vjd3ZFlMVG5hT3c&hl=en Funders]
 
[https://spreadsheets.google.com/ccc?key=0AmL-EJ5-i7x7dDFZWWU0dldMeVpiUmtPeFE3TFpnWmc&hl=en Policy Data]
 
'''Phase 2 : Determining Impact of Policies'''
 
This will be done by correlating my quantified  policy data with Valerie and Sarah's reuse data. (More to come)
 
===Deliverables===
 
As of 6/30--(If scope seems narrow please comment)
 
*policies retrieved and data / metadata extracted for sources
*comprehensive list of funders and metadata for resources Sarah and Valerie are working with


==Data citation practice inventory for repositories==


===Scope and Plan===
* which repositories? [http://www.treebase.org/treebase-web/home.html TreeBASE], [http://www.pangaea.de/ Pangaea], the [http://daac.ornl.gov/ ORNL DAAC archive] ?
# '''Which repositories?''' For the June 30, 2010 midpoint, I will mostly focus on [http://www.treebase.org/treebase-web/home.html TreeBASE]. Future repositories to examine include: [http://www.pangaea.de/ Pangaea] and the [http://daac.ornl.gov/ ORNL DAAC archive]
* how will you bound the problem? a subset of repository entries?  a subset of journals for citation and attribution links?
# '''How will you bound the problem?''' The search will be limited to articles in English from the year 2008+ within ISI Web of Science, Scirus, Nature, Sysbio, and Google Scholar.
* what methods will be used to search for citation and attributions? using which search resources?
# '''What methods will be used to search for citation and attributions? Using which search resources?''' I will use fulltext search when available. If fulltext is not available, I will look for relevant articles using keyword search and search through the reference pages for any mention of TreeBASE or respective study accession numbers. Naturally, my methods will change as I find what works and what does not work, keeping regular entries in this [[DataONE:Notebook/Reuse_of_repository_data|lab notebook]].
* what is the estimated coverage of these methods? Could come from Sarah's project results.
# '''What is the estimated coverage of these methods?''' Since my methods may change depending on what works and does not work, I cannot provide an estimate at this time.
* how many datapoints do you expect?
# '''What stats will you run? What is your statistical power?''' Once again, this will depend on the initial data set I draw in the next week. Please check the [[DataONE:Notebook/Reuse_of_repository_data|lab notebook]] for further developments.
* what stats will you run? what is your statistical power?
# '''What do you plan to have complete by June 30th?''' A full spreadsheet containing articles reusing data from TreeBASE and documentation for how that data was cited in those articles; a report summarizing findings based on this spreadsheet.
* what do you plan to have complete by June 30th?
# '''Plans for integration with other intern work?''' Since Sarah's project is similar to mine, I will keep in touch with her regarding methodology and data integration.
* plans for integration with other intern work?
# '''Plans for integration/parallel analysis with Heather's NCBI GEO work?''' Since many articles I have found so far are also in PubMed and BioMed, I expect a lot of overlap to occur with Heather's work and look forward to further collaboration with her.
* plans for integration/parallel analysis with Heather's NCBI GEO work?
 
** ''I'll flesh out some background info on this and provide links... feel free to ask in the meantime''
June 10, 2010 I will start looking for the following ways that TreeBASE is cited in articles:
# Mention of TreeBASE or TreeBANK
# DOI or URI
# Full citation as per TreeBASE recommendations.
# Mention of data author only


''More via June 9, 2010 email from Heather ''
I will look in the following databases and journals:
# [http://www.isiknowledge.com/ ISI Web of Science]
# [http://scirus.com/srsapp/ Scirus]
# [http://www.nature.com Nature]
# [http://sysbio.oxfordjournals.org/ Sysbio]
# [http://scholar.google.com/ Google Scholar]


A few more things:
* This information will go into a spreadsheet housed here: [http://spreadsheets.google.com/ccc?key=0AgM1E1R2tI_6dE1LYlYtWHRXblNXa3ladXNNY3BDbEE&hl=en TreeBASE Citations]
* do these databases or repositories have "accession numbers"?  If so, what is the format of the accession numbers? For example, for NCBI's GEO database, the accession number format is GDSxxxxxx or GSExxxxxx and sometimes people just cite data by mentioning the accession number, so we need to be able to search the article full text for GDS* or GSE*
* My observations will also be posted here [[DataONE:Notebook/Reuse_of_repository_data|Reuse of repository data]]
* maybe it would be interesting to add columns for Dryad, GenBank, NCBI's Gene Expression Omnibus database, and the ArrayExpress database.  I say this because it would help us draw comparisons to those sources, even though we probably won't be looking for citations to these databases in the literature
* it might be worth copying the full paragraphs of full text of the databases' reuse policies into the spreadsheet, for reference.  Definitely a link to the page where they discuss their policies would be helpful.


===Milestones===
# 6/30/2010: completed spreadsheet and report summarizing findings


===Phase 2===
Where should this program go after this initial investigation? Other repositories to examine in depth later:
*[http://www.godae.org/ GODAE]
*[http://daac.ornl.gov/ ORNL DAAC archive]
*[http://www.pangaea.de/ Pangaea]
*[http://dc110dmz.gfz-potsdam.de/ STD-DOI project]
<!-- ##### DO NOT edit below this line unless you know what you are doing. ##### -->

Latest revision as of 10:55, 15 June 2010

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.



Research Questions and Research Plans

Let's start brainstorming formal research questions, then you can flesh out the scope and add your research plans for a June 30th mini-deliverable.

Open Questions for mentors and the community

Please comment on questions on the related Talk page or by editing this section of the wiki. Please leave your name and time with all comments (use the "Sign your username" link while editing).

  • The students don't have access to all the journals they need through their home institutions (e.g. Simmons College). Can we set up some guest access to other DataONE-affiliated resources?
  • What is a good GIS/earth journal for analysis?
    • Use a journal that is well represented in Pangaea? And/or one affiliated with GSA? --Todd Vision 10:46, 10 June 2010 (EDT)
  • Recommendation for a specific paleontology journal?
    • I would recommend 'Paleobiology' as having broad interest, high impact papers --Todd Vision 10:46, 10 June 2010 (EDT)

Data citation practice inventory within journals (articles)

Owner: Sarah

Guiding Research Questions

  1. What are various practices for data citation within academic papers? How prevalent is each variety?
    • Do authors tend to cite the dataset itself or a related paper?
    • How did the author obtain the dataset (e.g. past study, buddy, search, known database)?
  2. How do these practices vary across discipline, journal, data type, data source?
    • Are data citation practices influenced more by attitude of the discipline towards data sharing or journal policy?
    • Associated with Nic's Metadata Questions
  3. How have these practices varied across time?
    • Does increased data reuse/sharing correlate with changes in journal policy?
    • Does data reuse/sharing simply increase with time since the advent of the internet?

Scope and Plan

  • Phase 1: Data Extraction
    • Journals
      • Dryad "Top Three"
        • Justification:
          • 1. Most datasets currently posted...are they really being reused?
          • 2. Known "High Impact" Journals
          • 3. Cover target disciplines
        • Systematic Biology (SysBio; discipline: Systematics, Phylo-genetics/-geography, Molecular Evolution)
        • American Naturalist (AmNat; discipline: Behavior, Natural History, Ecology)
        • Molecular Ecology (MolEco; discipline: Genetics, Molecular Evolution )
      • ESA family
        • Justification:
          • 1. Known high impact
          • 2. Known involvement with early data-reuse/sharing discussions (ADD LINK!!)
        • Ecology
        • Others (open to suggestions)
        • Is Evolution affiliated with ESA?
      • Discipline Coverage
        • Justification:
          • 1. To get full disciplinary coverage on "environmental and earth" related articles which are the primary interest of DataONE.
          • 2. Target journals that are good at or lacking data re-use.
        • Paleobiology
        • Pangaea-associated journal (SUGGESTIONS PLEASE!)
        • GIS/Biogeography associated journal (SUGGESTIONS PLEASE!)
      • Broad Coverage (if time!)
        • Justification:
          • 1. Wider variety of disciplines in the same journal
          • 2. Disciplinary overlap with other journals, but then able to elucidate influence of journal policies.
        • Evolution (if not ESA)
        • Nature
        • Science
    • Article sampling
      • Time series vs. random sample
        • Time: Issue by Issue. Starting with 2010 moving back. Possible confounding factors of Special Issues and unequal sampling b/c some journals have fewer articles per issue and per year.
          • I'm currently working under this method, primarily for working out needed extraction fields and differences between journal formats.
        • Random: Controlled "random" search of 100 articles per year or per discipline. Could still do within a time block.
      • I appreciate suggestions on Time vs. Random sampling.
    • Data Extraction
      • Primarily information about author, discipline, Y/N dataset citation (reuse and sharing), details on dataset citation
      • Preliminary list of extracted fields
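
As a rough illustration of the "Random" option above, here is a minimal sketch of drawing up to 100 articles per year. The article records, the `year` field, the function name and the seed are all hypothetical placeholders, not the extraction fields actually in use.

```python
import random
from collections import defaultdict


def sample_articles_per_year(articles, per_year=100, seed=2010):
    """Draw up to `per_year` articles at random from each publication year."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    by_year = defaultdict(list)
    for article in articles:
        by_year[article["year"]].append(article)

    sample = {}
    for year, pool in sorted(by_year.items()):
        # Some journals publish fewer articles per year than the target size.
        k = min(per_year, len(pool))
        sample[year] = rng.sample(pool, k)  # sampling without replacement
    return sample


# Hypothetical article list: 250 articles from 2010 and 40 from 2009.
articles = [{"id": f"2010-{i}", "year": 2010} for i in range(250)]
articles += [{"id": f"2009-{i}", "year": 2009} for i in range(40)]

sample = sample_articles_per_year(articles, per_year=100)
print({year: len(chosen) for year, chosen in sample.items()})
```

Sampling within each year keeps the time-block structure while avoiding the special-issue effect of taking whole issues; the same grouping could be done per discipline by swapping the grouping field.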


  • Phase 2: Analysis
    • Statistics:
      • Sample Size: 500-1000 articles (Any opinions on what is realistic in a month period?)
      • Percentage: % Data reuse per issue/year, % Proper dataset citation (depository)
      • Correlations: Time, Journal, Discipline
        • This could expand to Multivariate Cluster or NMDS analysis
    • Database Integration
      • Set up Access/Excel or opensource database for our communication between all collected metadata
        • NOTE, we need to solidify fields (especially link fields) up front to prevent problems later on. See this spreadsheet (still need to link, yell at me if I haven't).
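
To make the planned percentages and correlations concrete, here is a small sketch using only the Python standard library. The records below are made-up toy values for illustration only (not results), and the field names `journal`, `year` and `reuses_data` are assumptions about how the extraction spreadsheet might be coded.

```python
from collections import defaultdict
from math import sqrt


def percent_by(records, key, flag):
    """Percentage of records with a true `flag`, grouped by `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        if rec[flag]:
            hits[rec[key]] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy records, invented purely to exercise the functions.
records = [
    {"journal": "SysBio", "year": 2008, "reuses_data": False},
    {"journal": "SysBio", "year": 2009, "reuses_data": True},
    {"journal": "SysBio", "year": 2010, "reuses_data": True},
    {"journal": "AmNat", "year": 2008, "reuses_data": False},
    {"journal": "AmNat", "year": 2009, "reuses_data": False},
    {"journal": "AmNat", "year": 2010, "reuses_data": True},
]

by_journal = percent_by(records, "journal", "reuses_data")
by_year = percent_by(records, "year", "reuses_data")

years = sorted(by_year)
r = pearson(years, [by_year[y] for y in years])
print(by_journal, by_year, r)
```

With only a handful of years per journal, a correlation like this is descriptive at best; the multivariate cluster or NMDS analysis mentioned above would need a dedicated statistics package.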

Deliverables

  • June 30th:
    • Bulk of data extraction. Hoping for at least 500 articles. Still trying to determine if this is realistic based on the time it takes for data extraction.
    • Develop and Standardize Extraction methods. Text searching, zotero/EndNote/etc integration, standardized fields, coding.
  • End of Project
    • Statistical results
    • Database or at least dataset with good metadata for future use by DataONE. This is the whole point of the project, eh? To encourage and be an example of good data sharing and reuse.
    • Summary paper (at least documentation, hopefully moving towards publishable manuscript). Coordinate with Nic and Valerie.
      • Suggestions for best practices for authors, journals, depositories, DataONE.
      • Include anecdotal information I have obtained from authors I know who reuse data (Sites, Peck, SysBIO treebase ppr, etc) and from people I've talked to about their databases (Miller, NEON, etc).

Data sharing and citation policies for journals, funding sources and repositories

Owner: Nic

Description

In this project I will be investigating data management policies for the existence (or absence of) requirements for researchers sharing and citing data. This will be accomplished in two phases. In phase I, I will collect data management policies from a number of journals, repositories and funding sources in order to quantitatively assess data sharing and citation requirements. In phase II, I will be trying to determine the impact of the policies based on correlations with Sarah and Valerie's data.

My specific research questions include:

  1. What are the data sharing and citation policies applicable to authors, from funders, journals, and repositories?
  2. How do these policies differ by discipline, journal, data type, data source?
  3. How has the spectrum of applicable policies changed over time? (Need more thought on how to track this)
  4. How do the applicable policies correlate with data sharing behavior?
  5. How do the applicable policies correlate with citing data?

Scope and Plan

Project will be carried out in two phases

Phase 1 : Collecting and "quantifying" various attributes of policies


I'll use the following sources (please add sources)

  • Journals : SysBio, AmNat, MolecularEco
  • Repositories: TreeBASE, GenBank, Pangaea
  • Foundations / Funding Bodies :NSF , JISC, AU ANDS


I will collect the following elements from each source (linked to Google Docs spreadsheets)


Metadata: Journals, Repositories + Funders

Policy Data

Phase 2 : Determining Impact of Policies

This will be done by correlating my quantified policy data with Valerie and Sarah's reuse data. (More to come)

Deliverables

As of 6/30--(If scope seems narrow please comment)

  • policies retrieved and data / metadata extracted for sources
  • comprehensive list of funders and metadata for resources Sarah and Valerie are working with

Data citation practice inventory for repositories

Owner: Valerie

  1. What are all the ways that data housed in given repositories are cited or attributed?
  2. How do these practices vary across discipline, journal, data type, data source?
  3. How have these practices varied across time?

(Very similar to Sarah's project, above.) I have some ideas on repository inventory that I haven't been able to explore yet; we should talk about ideas and approaches. I'll post more later. Email me if I haven't by about June 14, or if you want to collaborate sooner! - Sarah

Link to repository public spreadsheet on Google Docs

Scope and Plan

  1. Which repositories? For the June 30, 2010 midpoint, I will mostly focus on TreeBASE. Future repositories to examine include: Pangaea and the ORNL DAAC archive
  2. How will you bound the problem? The search will be limited to articles in English from the year 2008+ within ISI Web of Science, Scirus, Nature, Sysbio, and Google Scholar.
  3. What methods will be used to search for citation and attributions? Using which search resources? I will use fulltext search when available. If fulltext is not available, I will look for relevant articles using keyword search and search through the reference pages for any mention of TreeBASE or respective study accession numbers. Naturally, my methods will change as I find what works and what does not work, keeping regular entries in this lab notebook.
  4. What is the estimated coverage of these methods? Since my methods may change depending on what works and does not work, I cannot provide an estimate at this time.
  5. What stats will you run? What is your statistical power? Once again, this will depend on the initial data set I draw in the next week. Please check the lab notebook for further developments.
  6. What do you plan to have complete by June 30th? A full spreadsheet containing articles reusing data from TreeBASE and documentation for how that data was cited in those articles; a report summarizing findings based on this spreadsheet.
  7. Plans for integration with other intern work? Since Sarah's project is similar to mine, I will keep in touch with her regarding methodology and data integration.
  8. Plans for integration/parallel analysis with Heather's NCBI GEO work? Since many articles I have found so far are also in PubMed and BioMed, I expect a lot of overlap to occur with Heather's work and look forward to further collaboration with her.

June 10, 2010 I will start looking for the following ways that TreeBASE is cited in articles:

  1. Mention of TreeBASE or TreeBANK
  2. DOI or URI
  3. Full citation as per TreeBASE recommendations.
  4. Mention of data author only
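
A first pass at the first two citation styles in this list could be automated when full text is available. The sketch below is only an assumption about how that might look: the regular expressions, labels and function name are hypothetical, the DOI pattern matches any DOI rather than only dataset DOIs, and styles 3 and 4 (full recommended citation, data author only) would still need manual review.

```python
import re

# Style 1: the repository is named in the text (either spelling).
TREEBASE_NAME = re.compile(r"\btreeba(?:se|nk)\b", re.IGNORECASE)
# Style 2a: anything that looks like a DOI (not necessarily a dataset DOI).
DOI = re.compile(r"\b10\.\d{4,9}/\S+")
# Style 2b: a web address on the treebase.org domain.
TREEBASE_URI = re.compile(r"https?://\S*treebase\.org\S*", re.IGNORECASE)


def citation_styles(text):
    """Return the citation styles detected in a passage of article text."""
    styles = []
    if TREEBASE_NAME.search(text):
        styles.append("name mention")
    if DOI.search(text):
        styles.append("DOI")
    if TREEBASE_URI.search(text):
        styles.append("URI")
    return styles


# Hypothetical sentences, written only to exercise the patterns.
print(citation_styles("Trees were deposited in TreeBASE (http://www.treebase.org)."))
print(citation_styles("Data available at doi:10.1234/example"))
print(citation_styles("We collected new sequences."))
```

Note that a treebase.org address also counts as a name mention here, because the domain contains the repository name; whether to record those separately is a coding decision for the spreadsheet.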

I will look in the following databases and journals:

  1. ISI Web of Science
  2. Scirus
  3. Nature
  4. Sysbio
  5. Google Scholar

Milestones

  1. 6/30/2010: completed spreadsheet and report summarizing findings

Phase 2

Where should this program go after this initial investigation? Other repositories to examine in depth later:

  • GODAE
  • ORNL DAAC archive
  • Pangaea
  • STD-DOI project