DataONE:Notebook/ArticleCitationPractices/Protocol

Email to mentors. 6/16/2010.

In light of our technical difficulties yesterday and per Heather's recommendation, I'm sending all the mentors more details on my current plan of action. The content of this email is also posted on OWW here. I would prefer that you comment there (with your timestamp signature on the talk page) for better record keeping, but if it's easier to respond through email I understand and will post your email comments on OWW. Just as a reminder, my overarching research questions and plan are posted here and here. Feel free to explore and comment on these as well.

My two approaches:
 * 1) Snapshot: Look at the first 25 articles for each journal in 2010. This is partly for determining what fields should be extracted but also a look into the "current state of things". This is what I've been working on until now.
 * 2) Random Sampling. Look at 25 articles per journal per year (blocked random sampling) from 2005-2009. This will investigate trends over time, discipline, and journal. This is what I will start next week.

Focal Journals (listed with discipline and likely depository):
 * 1) Systematic Biology (phylogenetics/biodiversity/evolution; likely to use treebase, genbank, internal depository)
 * 2) American Naturalist (behavior/biology; all over the place in terms of depository, multiple datasets posted on dryad)
 * 3) Molecular Ecology (genetics/phylogenetics; likely to use treebase, genbank, internal depository)
 * 4) Ecology (ecology/environment; unknown depository preference but ESA led many early datasharing conferences)

Data collected from each article: Google Spreadsheet - Revised; Google Spreadsheet - Field Explanations; Google Spreadsheet - Old
 * In summary, info about what types of datasets are cited and how they are cited. "Datasets" refers to both reused data and shared/deposited data.
 * Sorry it's messy. I'm working on cleaning it up, posting more examples (it's easier to operate on my desktop right now), and standardizing fields with Valerie and Nic.

Current method: Randomly select 25 articles for a given journal for a given year. Probably proceed through time for a single journal. Extract data to the spreadsheet. Code the data once I have a batch of 100+. Repeat for the next journal (or year, depending on your answers to the questions below). Analyse the data (integrating with Nic and Valerie, possibly through a database).
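The random selection step above could be sketched roughly as follows. This is only an illustration, not the procedure actually used: the function name `blocked_sample`, the fixed seed, and the placeholder article lists are all assumptions, and real article lists would have to come from each journal's tables of contents.

```python
import random

# Journals and years taken from the plan above; journal #5 is still undecided.
JOURNALS = ["Systematic Biology", "American Naturalist", "Molecular Ecology", "Ecology"]
YEARS = range(2005, 2010)  # 2005 through 2009
SAMPLE_SIZE = 25


def blocked_sample(articles_by_block, k=SAMPLE_SIZE, seed=2010):
    """Draw up to k articles without replacement from each (journal, year) block.

    The seed is an arbitrary assumption; fixing it makes the draw repeatable.
    """
    rng = random.Random(seed)
    sample = {}
    for block in sorted(articles_by_block):
        articles = articles_by_block[block]
        # A block with fewer than k articles is taken in full.
        sample[block] = rng.sample(articles, min(k, len(articles)))
    return sample


# Placeholder data (hypothetical): 120 numbered articles per journal per year.
articles_by_block = {
    (journal, year): [f"{journal} {year} article {i}" for i in range(1, 121)]
    for journal in JOURNALS
    for year in YEARS
}

sample = blocked_sample(articles_by_block)
```

Sampling within each journal-year block keeps the 25-per-block quota fixed, so no journal or year is over- or under-represented by chance.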

Total articles collected: Assuming 5 journals and 25 articles per journal for each of 6 time frames (snapshot + 5 years), 5 x 25 x 6 = 750.

To put that in perspective, it takes me about 15 minutes to process an article (but not detailed coding, which I'll do later once I have a large batch). That comes to about 32 articles a day, assuming I spend all 8 workday hours on article processing. As it stands now, I spend about half my day on OWW bookkeeping, collaborating with other interns, simply downloading articles, and other tasks related to the project. This hopefully will lessen as we all get settled in, but I'll use this estimate as the conservative amount of what I can realistically get done over the course of this internship. So, 4 hours a day = 16 articles per day, = approx 100 a week, = 500 over the coming weeks (leaving time for coding, analysis, etc). So, baseline, I think 500 articles is realistic. I obviously plan to do more, but that should be factored in when determining the spread and depth of my data collection.
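For anyone who wants to check or adjust the workload estimate, here is a rough sketch of the arithmetic. The 15 minutes per article, the 8- and 4-hour days, and the 750 and 500 article totals come from this email; the 5-day work week is an assumption, and under it the weekly figure comes out at 80, a little below the rounded "approx 100" above.

```python
MINUTES_PER_ARTICLE = 15   # from the email
WORKDAYS_PER_WEEK = 5      # assumption, not stated in the email


def articles_per_day(hours):
    """Articles processed in a day at 15 minutes each (no detailed coding)."""
    return hours * 60 // MINUTES_PER_ARTICLE


full_day = articles_per_day(8)            # 32 articles
half_day = articles_per_day(4)            # 16 articles
per_week = half_day * WORKDAYS_PER_WEEK   # 80 articles

planned_total = 5 * 25 * 6                # journals x articles x time frames = 750
baseline_total = 500                      # the "realistic" baseline from the email

weeks_for_planned = planned_total / per_week    # 9.375 weeks
weeks_for_baseline = baseline_total / per_week  # 6.25 weeks

print(full_day, half_day, per_week, planned_total)
print(weeks_for_planned, weeks_for_baseline)
```

Changing `WORKDAYS_PER_WEEK` or the hours per day shows how sensitive the total is to the time lost on bookkeeping and other tasks.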

So with that in mind, I have some questions that I would like your input on:


 * 1) Is 25 articles per journal per time frame sufficient (150 total per journal, 125 total per year)? Do you think 750 articles is feasible both for statistical purposes and for the time frame of this internship?
 * 2) Should I proceed through an entire journal (all years) and then move on to the next journal, or sample all journals for a given time frame and then move back in time? Or do you have other sampling suggestions?
 * 3) What journals do you think I should prioritize (if possible, rank them and say why)? Am I hitting the right ones? What journals are most relevant to DataONE? What am I missing/neglecting (i.e. suggestions for journal #5)? Note that I've avoided Nature b/c it doesn't match a single discipline. On the other hand, it might be interesting to have a broad journal for comparison, especially considering that SysBio/MolecEco overlap and Ecology/AmNat overlap, so it might be interesting to compare those pairs to each other and then to Nature. That thought just occurred to me; take it for what it's worth, with the acknowledgement that it is fresh and not thought out.
 * 4) Any other suggestions on extracted fields, sampling, journals, etc.?

If you would like to discuss your suggestions or alternative methods "face to face" in chat, let me know.

Thank you for your input. Have a good day!

Sincerely, Sarah Walker Judson