From OpenWetWare

User:Sarah Judson/Notebook/DataOne DataCitationPractices:Notebook:Articles (06/21/2010)

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.

Article Notes

This is a running list of the comments I make on individual articles after extracting data from them. Please excuse typos and spelling errors; these are quick notes I make following analysis. I thought they might be interesting to the group/community b/c they contain my off-the-cuff observations and insights. I will try to post these at the end of each day. The notes (and the accompanying DOI) can be found at: Fields

Systematic Biology articles

  • This study was basically researching data reuse in genetics and whether accurate phylogenetic trees could be made from amalgamated GenBank data of differing quality. Very interesting; unfortunately, they didn't reuse or share data since it was a simulation. SysBio says explicitly in their data sharing section that they can help make arrangements to deposit large datasets from simulations, so the simulation iterations could have been made available. Also, this study cites the program R well. If someone later decides to look at software reuse, they could look at R alone in terms of how custom packages/scripts are credited and how the software itself is cited, b/c it varies widely. I would say this one did a good job with an in-text and bibliographic citation. Direct quotes: REUSE: "An increasing number of phylogenomic studies are published for data sets including more than 100 genes (e.g., Lerat et al. 2003; Rokas et al. 2003; Driskell et al. 2004; Philippe, Lartillot, et al. 2005; Fitzpatrick et al. 2006; Nishihara et al. 2007; Wildman et al. 2007; Dunn et al. 2008; Zou et al. 2008). Two opposite views have been proposed as to how to incorporate the growing amount of data to infer evolutionary relationships." SOFTWARE: "Significant differences in accuracy values between different datasets were assessed using a Pearson’s chi-squared test in the R Stats Package (R Development Core Team 2007)."
  • I was going to validate what was in the supplementary data, but it is VERY hard to find on the SysBio site. The link they provide within articles just goes to the main page. The usual smorgasbord of ways of citing: GenBank accessions recorded in supplementary tables, but TreeBASE with the URL and accession no. listed right in the text. Also, I counted the acquirement as "self" because the papers the genes were pulled from were by the authors of this study, hence no GenBank accession or reference for the reused data (though they likely are deposited there). This study also looked at how effective/accurate it is to piece together data from multiple sources: "When reconstructing phylogeny, some authors combine independent sources of data irrespective of conflict under the assumption that the combined analysis will maximize explanatory power of the phylogeny (“total evidence,” Kluge 1989)...Combined analysis is likely to lead to decreased, rather than increased, support for clades identified by one gene and contradicted by another (Bull et al. 1993; Lecointre and Deleporte 2005). In a worst-case scenario, combination may even result in the inference of spurious relationships supported by neither data partition individually (McDade 1992; Bull et al. 1993)."
  • Study repeatedly refers to "unpublished data" from a single source. However, this "data" is not cited in the biblio (even as personal correspondence with the author). It is unclear if this is actually unpublished data or just a proposed theory from a friend. Quote: "Their method is similar to another recently proposed technique (Meng and Kubatko 2009) in which methods for assessing the evidence for hybridization in the presence of incomplete lineage sorting using a sample of gene tree topologies were developed in both likelihood and Bayesian frameworks. Nakhleh L. (unpublished data) provides a unifying framework for the methods of Than et al. (2007) and Meng and Kubatko (2009)...A general expression for the likelihood of an arbitrary hybrid species tree has recently been given in Nakhleh L. (unpublished data) for models of this general form, although gene tree densities of the form in equation (2) are not explicitly discussed...The likelihood of any hybrid species phylogeny specified this way can be computed as described in the previous section, using an appropriate likelihood analogous to that shown for the example tree in equation (3) (see also Nakhleh L., unpublished data)."
  • The last few SysBio articles have been simulations. Is this common in SysBio? Are these all in the same issue? Could they have used recycled data to test their models? Why would an author not do this? Especially in this study, it seemed like they were using real sequences (references to "empirical data" stated clearly as opposed to "simulated data"), but they never said what species. It may not be "fair" to count unshared simulation iterations as unshared, but SysBio explicitly states in their data policy that these can be stored, which seems to imply that they should be.
  • This paper uses the supplementary data depository as a dump for the figures they couldn't use in the paper. Supp Table 1, however, does include accession numbers but never mentions GenBank. Also, I don't like how SysBio takes you to their entry page for the supplementary data. It implies that it is easy to find since they include it as a separate header in the paper, but you have to search for the article and, depending on your browser, may or may not be able to see the link for supplementary data. I found it, but it was cumbersome. The URL provided should be for the specific paper or indicate how to search for it, b/c there is no entry link on the main page for supplementary data. I like Molecular Ecology's approach better: a specific list of supplementary data at the end of the paper so you could better assess if you wanted to go dig it up. Tables 3 and 4 were not as useful as I would have hoped; I thought they might be gene alignment matrices. P.S. I know the editor of this paper (Marshal Hedin)…need to get the cave specimens back from him!
  • Odd: paper had no abstract and nonconventional headings but no indication that it was a special article format. References to genes don't clearly state reuse or even the taxa (awkward wording), but do cite authors. Alignments and trees not posted in supplementary data nor even within the paper (minimal figures).
  • This is what I was hoping to see with the model simulations…a simulated data set and then validation. This one is especially good b/c rather than attacking another method point-blank, it uses the same datasets used by the method it's attacking for a one-to-one comparison. That's good data reuse practice and simply good science. Also, to cite TreeBASE, they cite the original paper introducing it (novel from what I've seen) rather than the URL (more typical, I think in the spirit that people could then go there and use it). Also, simulation results are included in the supplementary data but not the iterated data sets they stemmed from; I counted this as data produced but not shared.
  • Good example of GenBank and TreeBASE reuse and sharing, but accession/authors not clearly credited for GenBank (maybe in a supplementary table?). Both alignments and trees went to TreeBASE and are posted as supplementary data!!
  • Data could have been posted to TreeBASE, rather than just internal. Need to figure out a way to code this and find out whether editors are instructed to police the SysBio policy on this.
  • Unclear if trees also posted at TreeBASE (could verify later), so this was treated as data produced but not shared. Also, GenBank/genes were cited in all kinds of crazy ways…most references were to the supplementary table which included all the accession nos., and others were vaguely to the authors or the depository alone w/out an associated accession or explicit statement of reuse.
  • XY data produced and recorded in supplementary table, but not mentioned in paper as full locality info. The authors reused their previous data, but at least gave the GenBank accession. In my opinion, the accession no. should have also been in the appendix table which contained other detailed info about each specimen.
  • The proposed method requires XY coordinates, but these are not indicated in the data reuse, likely because the reused data was originally produced by the author. Code sharing instance: "The above routine was implemented in the Java software PhyloMapper and is available at"

  • Unclear if the authors obtained sequence data from GenBank or sequenced the genes themselves. I think it was obtained, judging by frequent references to "BLAST", which as far as I know is the GenBank search engine. Therefore, the sequences could maybe be attributed as a data reuse in the GenBank table. But these aren't AGTC sequences, so where are they "supposed" to be found/posted? UPDATE: I could not find the indicated supplementary data. No links on the full text, etc. Also, BLAST is also run by NIH and has similar search abilities to GenBank, but for proteins. Shouldn't the citation recommendations be the same?
  • No GenBank accession given directly in text, but in tables. Accessions given for sequences generated from this study, but not for the reused primers. As is typical, gene alignments and phylo trees not posted to TreeBASE or otherwise.
  • This article does not have any reuse/sharing, but is about coping with large phylogenetic datasets (presumably compiled from multiple studies). "Datasets used to reconstruct phylogenies are becoming increasingly complex. Not only is the size of trees increasing, but the number and heterogeneity of loci being used to infer phylogenies is also increasing steadily"

  • I really dislike the SysBio requirement to include after reference to an appendix....primarily b/c it does not lead to the article specifically, nor is there any link to supplementary data! As with 10.1080/10635150802555933, I could not find the referenced supplementary data in the full text or article search (as I am usually able to do for other articles)! Does the TreeBASE accession have the gene alignments (see treebase_sharing sheet)? Is appendix one raw data (dispersal distances)? Accession numbers given in table, but GenBank and accession not explicitly referenced in text (would this still come up in a full-text search?)
  • It was very difficult to discern that this article reused and shared data. The acknowledgments were most clear on this and caused me to retrace the word "dataset" throughout the article. Other than the acknowledgments, it is unclear how datasets were obtained, and they are definitely not cited but instead coded by the authors for their tracking purposes. The only credit the original data authors get is in the acknowledgments section, and this likely does not include all authors. Datasets were posted, but the URLs of both the dataset and the software no longer work = this should have been in a depository!!! Actually, they ended up working when accessed from the full text, but the PDF links/typing the http address does not work for some reason. Instance of software sharing: The consensus and best-scoring ML trees can be viewed and browsed online by using a customized version of the PHY.FI display engine, http://cgi-www. (Fredslund, 2006). The CIPRES RAxML Web server has been set up in an analogous way (see Availability:˜stamatak/index- Dateien/software/RAxML-VI-HPC-4.0.0.tar.gz Web Servers:

  • Refers to reused data as "datasets" and also reuses accompanying phylo trees. No indication of how they were obtained, even in the acknowledgments.
  • Again, unable to find referenced appendices. Assumed that GenBank numbers were given if this was mentioned in text (i.e. w/ the reuse, but not w/ the sharing). It is not clear if the genes were posted to GenBank or at all.
  • The dataset may have been shared in the previous study by the author, as indicated by the statement "see original publication for data set". I realized with this and the previous article that I've been classifying phylo trees present as figures in text as "produced", not shared, data. I will continue to do this, but indicate if a figure was present in case I want to change how this was done. I am inclined to leave it as produced b/c it should have gone to TreeBASE, but I will probably also classify internally shared data by what depository it should have gone to.
  • Databases, but not data authors, attributed. Authors use the term "BLAST" search, which I thought was exclusive to GenBank but perhaps has become colloquial for any sequence search. I was unable to track the TreeBASE accession number…it appears invalid. I could finally find it by using the article information, but it was difficult to distinguish the current paper from other work by the author. Furthermore, the provided accession number was incorrect: there is not even a possible SM prefix. S is the study ID prefix, and M is the matrix ID prefix. This paper was about how to cope with large datasets, presumably compiled through sharing and reuse: "One of the goals of phylogenomics, as we envision it, is to generate data sets that result in topologies that are robust to the addition of new data and stable to changes in assumptions of analyses. ...Second, there are several groups generating Expressed Sequence Tag (EST) libraries for diverse butterfly taxa; e.g., Pieris rapae (http://www., Heliconius melpomene (Jiggins et al., 2005), Bicyclus anynana (Beldade et al., 2006), and Melitaea cinxia (Vera et al., 2008). EST libraries provide DNA sequence of the mRNA of expressed genes...In this paper, we report a new method to search through genome databases for exons of suitable size (500 to 600 bp), comparing these exons to EST databases for related taxa of interest, and finally develop primers potentially universal across the taxa of interest."

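The prefix rule from the notes above (S for a study ID, M for a matrix ID, so a combined "SM" prefix cannot be valid) could be checked mechanically while extracting accession numbers. This is a minimal Python sketch, assuming an ID is a single prefix letter followed by digits; the function name and the digits-only pattern are illustrative assumptions, not TreeBASE's own specification.

```python
import re

# Assumption taken from the note: "S" marks a study ID, "M" a matrix ID.
# The "one letter + digits" shape is a hypothetical simplification.
TREEBASE_ID = re.compile(r"^([SM])(\d+)$")


def classify_treebase_id(accession: str) -> str:
    """Return 'study', 'matrix', or 'invalid' for a cited TreeBASE ID."""
    match = TREEBASE_ID.match(accession.strip().upper())
    if match is None:
        return "invalid"
    return "study" if match.group(1) == "S" else "matrix"
```

An ID flagged as "invalid" would still need a manual search by article information, as was done for this paper.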
  • conceptual theory illustration
  • GenBank directly indicated for reused data, but they do not explicitly say that their sequences were deposited in GenBank; instead, this is implied by consecutive accession numbers in Table 1 = "no" for depository referenced in GenBank sharing.
  • Good example of reuse with two exceptions: 1. The self-reused data was apparently not reposted, nor were accession numbers indicated for it. 2. The accession numbers were not directly given, but referred to in another publication. So the author benefitted from accession numbers, but did not give any in this paper. The article does clearly post both GA and PT to TreeBASE; however, it has another weird sn prefix number, which makes it difficult to actually retrieve the data.
  • authors do not indicate in their "empirical data" section how they obtained the data, then a few paragraphs later, they state that it was "kindly provided by M. Brandley", presumably with sequences already aligned.
  • According to Table 1, the authors did reuse some sequences…this was not clearly specified. The only indication is that the range of submitted GenBank accession numbers does not include about 20 of the sequences used, especially those for cyt b.
  • Unclear if the test data was empirical or simulated. If empirical, no comments were made about reuse/credit. If simulations (or if empirical), should have been shared. Instance of potential code sharing: The R code required to implement, in conjunction with PAUP*, the arcsine- ILD test, is available from or on request from the senior author and will be published elsewhere.

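Several notes above infer which sequences were newly deposited from runs of consecutive accession numbers in a table, and treat accessions outside the run as probable reuse. A rough Python sketch of that heuristic follows. It assumes accessions look like a letter prefix followed by digits, and the helper names are hypothetical. A break in numbering is only a hint of reuse, not proof.

```python
import re

# Assumed shape: letters then digits (e.g. a two-letter prefix and six digits).
ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def split_accession(accession):
    """Split an accession into (prefix, number); raise if it does not parse."""
    match = ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    return match.group(1), int(match.group(2))


def consecutive_runs(accessions):
    """Group accessions into runs of consecutive numbers per prefix.

    Returns a list of (prefix, first_number, last_number) tuples.
    Leading zeros are dropped because the numbers are compared as integers.
    """
    runs = []
    for prefix, number in sorted({split_accession(a) for a in accessions}):
        if runs and runs[-1][0] == prefix and number == runs[-1][2] + 1:
            runs[-1][2] = number  # extend the current run
        else:
            runs.append([prefix, number, number])  # start a new run
    return [tuple(run) for run in runs]
```

A long run is a candidate for "submitted by this study"; isolated accessions are candidates for reuse and worth checking against the text.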
  • Reuse is from a previous paper by the same author, but one that pulled from multiple articles which are indicated to be properly cited in the previous paper. In Table 1, some things are listed with a reference of "This study", but the study didn't mention any internal sequencing, though I think the authors introduce new datasets in text....which makes me wonder why they didn't also put them in the references column. Regardless, they are all present in the biblio. Again, I couldn't find the supplementary data online. Gave the authors the benefit of the doubt that the compiled data set was made available since they explicitly say "inputs" are available. Also, I think PT were produced, b/c the end (or intermediate) product of any phylo analysis is a tree. How does TreeBASE manage different trees for the same taxa?
  • Somewhat awkward wording referring to GenBank accession numbers. No indication of TreeBASE. Also, simulated data (i.e. simulated gene alignments?) were not published…remember, I'm back in SysBio = simulation inputs are encouraged.
  • Questionable if this paper "produced" data. It doesn't sound like sequences were aligned in any way. A compiled dataset could have been provided (i.e. input files), but they do document their reuse sources well. BLAST seems to be a surrogate term for search/compare: "and then to BLAST those sequences against one another to assess sequence similarity."
  • Unclear where the reused data came from; it is a reuse of a reuse...."data set previously examined by...". This, again, gives credit to the current/re-analyzing author rather than the original dataset. A weird PT produced, but a tree nonetheless that could have been posted to TreeBASE (I think). Code sharing: "A program that implements our model is available from Htm"
  • Data shared on personal website could not be accessed, nor could I locate supplementary data on SysBio = should have been in a depository!!!! But good effort at data sharing, and I assume it was raw data b/c it was well articulated. Unclear if the shared dataset also included the condensed datasets: "Gene resampling was performed on the data set over 8 fungal species and 106 genes by randomly selecting 50 gene sets of size 3, 50 gene sets of size 5, and 50 gene sets of size 10…."
  • The reuse seems to be from a previously reused dataset one of the authors had access to in a previous study. Intermediate trees may have been produced, but they were for method testing only. Article didn't produce GA b/c it was testing non-alignment methods: "In nonphylogenetic contexts, alignment-free methods are employed in tasks as diverse as sequence classification, database search, and detection of regulatory sequences; the literature on these applications is small but is growing at an increasing rate....We conducted a large-scale comparison (in a phylogenetic context) of 10 alignment-free methods, among them one new approach that does not calculate distances and a faster variant of the pattern-based approach.."

  • The simulated datasets produced were various extractions from the reused dataset, usually varying the number of taxa. General: like AmNat, appendices are commonly used for more detailed explanation of methods. Also, for Valerie: possible keywords to indicate a dataset reuse: "dataset", "data set", "empirical". I usually search for these terms, "simula" (for simulation), "iteration", and the depository names after reading through a paper to make sure I didn't miss anything.
  • Code sharing: Hybridization and genetic drift were simulated using a program written by the authors (available at http://lamar.colostate.eidu/~reevesp/Hybridize.html)

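The keyword check described in the notes above ("dataset", "data set", "empirical", "simula", "iteration", plus the depository names) can be sketched as a small Python helper that lists each hit with some surrounding text. The function name, the context width, and the particular depository list are illustrative assumptions; the search is a plain case-insensitive substring match, so "simula" also catches "simulated" and "simulation".

```python
import re

# Terms taken from the notes; the depository list is an assumed starting set.
KEYWORDS = ["dataset", "data set", "empirical", "simula", "iteration"]
DEPOSITORIES = ["genbank", "treebase", "dryad"]


def find_reuse_hints(text, terms=None, context=40):
    """Return (term, position, snippet) for every keyword hit, in text order."""
    if terms is None:
        terms = KEYWORDS + DEPOSITORIES
    hits = []
    for term in terms:
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            hits.append((term, match.start(), text[start:end]))
    return sorted(hits, key=lambda hit: hit[1])
```

The snippets are meant as a reminder of where to re-read, not as a replacement for reading the paper.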
  • Interesting: the paper refers to the taxon using a GenBank accession number. A little confusing, but I understand that it could point interested parties to relevant gene sequences. Does GenBank recommend doing this? A few authors have used almost the exact sentence: "data matrix and trees deposited on treebase with accession ##". I like it...clear, simple, gives all the necessary info. They also usually do it after a quick sentence about GenBank accession at the end of the sequencing/collection section of the paper...so good that it usually tempts me to skip the rest of the article.
  • Math model…much less common in SysBio than AmNat. In general, the 2007 SysBio had fewer "points of view" than usual and almost no models (except this one) that weren't also empirically tested in some way. Does SysBio tend to decline papers that don't have empirical testing?
  • It was difficult to discern if this paper used empirical or simulated data. There was no discussion of taxa used. In the final phylogeny, there were binomial taxa names. In this case, it is very unclear how/what sequences they obtained…they only reference the db and not even their search criteria beyond the algorithm used. Though they cited one of the db authors, I counted this as no biblio citation b/c they don't cite the primary data. Maybe I didn't understand it, but in any case, they did not attribute the original sequence authors at all. Authors do not share the data, but state that it and the code are available upon request to the authors. Then why not post it?
  • Good use of accession and author citation. A little unclear about which taxa they sampled and which they obtained from GenBank…a table could have helped. They also do not specify if they measured the morphological dataset or obtained it from others…in the table they state that this info was extracted from the literature...OK, they end up explaining it a page after the table...obnoxious. Another instance of a taxon being cited with a GenBank accession. treebase out....I don't think I've had many instances of TreeBASE reuse, but a good handful of sharing.
  • Though this paper did not have any reuse or sharing, the first sentence cites TreeBASE as an indication of interest in phylo trees: "The increased interest in tree reconstruction in recent years (e.g., "Tree of Life" project: http://tolweb.org/tree/phylogeny.html; phylogenetic database TreeBASE: http: / / www. treebase / index .html) prompts further methodological developments"

  • math model article
  • Simulated dataset produced, but mentioned offhand in the results section. Code sharing: The tree space was generated and evaluated by an ocaml (Chailloux et al., 2000) program whose source is available upon request. All of these tree shape statistics can be calculated for any trees in Newick format using the simple command-line software simmons available at http: //

  • Interesting: for primers, I've now seen two papers that cite the full primer sequence in text/table. Does GenBank not accept these as submissions? I think other papers did cite primer accessions. Does this count as data sharing? I'm counting internal (table) XY coordinates as sharing....I mean, the supplementary table is just as much a difficult-to-use PDF rather than .xls or other formatted data. also,
  • Another instance of a general db credit, but nothing for the original authors of the data. The common problem continues that data is supposedly posted to SysBio but unretrievable…luckily they also posted some of it to TreeBASE. Authors not explicit that all trees were posted to TreeBASE, but they were all posted internally, so I figure the likelihood is good. Most studies only give the matrix #. I didn't count a PT under TreeBASE, but also didn't count it against them as data produced. In general, there doesn't seem to be supplementary data in the 2006 papers....this is one of the few that cites it (but again, it couldn't be found).
  • Upon extracting the cited dataset, it looks like it was a reuse of a reuse.
  • Simulation paper
  • Table with GenBank accessions a little confusing…the authors don't clearly indicate if they were retrieved or produced. Then they say that they sequenced things themselves, but not all the GenBank numbers are sequential, so I think some were reused. Primers are again referenced in a table with full sequence and author citation. Why wasn't this done for reused gene sequences (at least the author part)? Trees were not clearly deposited in TreeBASE.
  • Code sharing: p://www.systematicbiology. org.

  • This article continues the trend of referring to taxa with an accession, but this time without even saying GenBank! Reuse of a reuse = accession not given in this paper (even though the authors had easy access to it!). Would be interesting to know how they credited data authors in their earlier articles. As such, biblio citation was counted as 0 b/c original authors were not clearly cited.
  • Data authors not credited with a reference, but accession numbers are given in a nice appendix table. However, the new sequences are indistinguishable from the reused ones in the table, and the text does not give a range of accessions for the new/shared data.
  • When did the SysBio policy on simulation data start? Nothing I've come across obeys this recommendation…ask Nic
  • again, not clear if pt deposited in treebase. No proper credit given to original genbank authors. Other reused dataset was apparently provided by coauthor from a previous study.
  • reuse of a reuse = not clear if data obtained from that author or genbank (assumed to be genbank, but given a "n" in depository reference b/c not clearly indicated)….also, this then does not credit the original data authors
  • Good GenBank credit method....(Author Year, Genbank #). Again, PT not explicitly posted to TreeBASE but likely is…perhaps write an IF formula that takes this into account = new field that says "something was posted to treebase and presumably everything was". Code sharing: A Perl script, secondchance (available online, along with the other supplementary materials, at,

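The IF formula idea in the note above (a new field recording that something was posted to TreeBASE and presumably everything was) might look like the following Python sketch. The three input flags and the returned labels are hypothetical names for columns in the coding spreadsheet; they are not taken from the actual sheet.

```python
def treebase_tree_status(trees_produced, trees_cited_in_treebase,
                         matrix_cited_in_treebase):
    """Derive a coding value for phylogenetic trees (PT) and TreeBASE.

    All three arguments are booleans recorded while reading the paper.
    """
    if not trees_produced:
        return "n/a"
    if trees_cited_in_treebase:
        return "posted"
    if matrix_cited_in_treebase:
        # Only the matrix is cited explicitly; trees are presumed to go with it.
        return "presumed posted"
    return "produced, not shared"
```

Keeping "presumed posted" separate from "posted" leaves room to verify those cases later against TreeBASE itself.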
  • Original data authors not credited…would have been easy to add a references column to Table 1. They also post their MrBayes (input) file internally = easy for reuse. Again, couldn't verify this because of non-accessible supplementary data (post-2006?) on SysBio.
  • Authors very upfront...first paragraph of methods = we published all our data...well, almost; they forgot the trees. This seems like method metadata: "In addition, digital images representing character states are available at" If they had uploaded all images, it would be a dataset. However, this is an exciting instance of MorphoBank!
  • another tree simulation
  • another tree simulation
  • reuse of a reuse = no credit to original data authors. One original author credited….why this one and not others? Did the previous reuse cite all the original data? How was it obtained?
  • Proposed method
  • method tested with empirical example. This seemed like a "points of view" article, but was not labelled as such. When did sysbio start having that article type?
  • These older papers have good appendix tables with the taxa, voucher specimens, and GenBank numbers listed…sometimes with author references. This one should be commended for having both in-text citations of the accession and a nice summary table which indicated newly produced sequences. Were these tables required/recommended in the past? Or a common trend? Why has it fallen out of practice? Are people simply using larger datasets and think this is unmanageable? I think it isn't, b/c they likely have to compose such a list just for their records.
  • Awkward way of saying the data was posted internally…articles like this are probably why they started the standardization of "are available on"
  • Study used GenBank sequences for realignment…I think this is commonly done but not articulated b/c it is such a common part of the process.

American Naturalist articles

  • I thought this was a math model paper with no simulations…until the end, in the appendices text, which mentioned reused datasets. The authors appear to be friends with the original authors of the empirical data, judging from the comments in the acknowledgments section.
  • Short study produced morphometric data on turtle shells. From what I noticed, it may have been a negative result…hence the "notes and comments" article rather than regular article?
  • Another math model. Are AmNat e-articles typically model explorations? Or just the two I encountered first? Also, there was no indication of posting the model to a depository….just extensive appendices full of math.
  • Posted at the top of the article near the other metadata!!!!!: Dryad data: (hyperlinked even!). This is a meta-analysis. Reused data extracted manually from papers/digitization of figures (whole paragraph on acquirement)…cool that they made the matrix available.
  • Model. Hints at empirical data, but apparently just citing a method/equation/theory.
  • Extensive biological, ecological, and experimental data collected. Only one in-text table which was a brief summary of this (mostly statistics). No indication of data deposition.
  • URLs cited in text…is this a common practice rather than putting them in the biblio? Why not both, or combine? I.e. hadcm3 models were proposed by one set of people, but now the GIS coverages are available online and presumably maintained by other people.
  • life history data not shared, but math model is articulated in detail (with rationale, development, etc) in appendix b. unfortunately, this is not considered in the scope of this study but could be in a software/model sharing approach.
  • extensive ecological datasets, but no indication of sharing.
  • Model…simulations mentioned. AmNat does not suggest simulation dataset posting like SysBio did.
  • Proposed math equations, no apparent model simulation.
  • as a general trend, I've noticed papers in AmNat that utilize the interal data repository mainly for extra tables and figures, not for uploading their raw this case, the author upload things like the locality XY coordinates, but not the extensive dataset on trill rates and morphology. Also, unclear how sound recordings were obtained and why author didn't reciprocally make hers available. can sound files/morphology pics be deposited at dryad or other depositories? these are somewhat unconventional files for data reuse, but very useful for their fields. perhaps the museum database that house the accompanying specimens could record the measurements, publications, and files related to the specimen (which from my experience, is not done).
  • in terms of model reuse, amnat encourages upload of GUIs and code, but I don't see this happening. For instance, I think R code could easily be required…even just as a text file. This would record re-use of programs, scripts, etc that were written by other people and make the author aware of the need for attribution (they usually attribute R itself, but not the specific packages used). as i read furthre, this paper did attribute an R package! but they cite it as a paper...did they read the article or find the package first? my guess would be the r package. does the paper they cite openly disclose the url of the package or otherwise encourage its reuse? this was not counted as reuse, but the excerpt was included on the reuse "other" sheet. also, in appendix b, the authors give extensive documentation of the model parameters (esentially metadata). I think this was a good practice, but unfortunately cannot count it as actual data sharing.
  • like previous article, good metadata in supplementary info about model parameters. Again, this is a good practice which I support but I did not count it as data sharing, but did store the relevant excerpt on the "internal" ss
  • This study utilized biological material from other individuals (not authors)….how could/should this be cited and credited? They do not refer to the contributions with a paper citation, though that may be appropriate. In the future, is it possible that in addition to datasets receiving a full bibliographic citation, so could biological material? This would help attribute museums/taxonomists (i.e. could help a museum prove that they are utilized) and could give a collector comparable credit to a published paper (i.e. tenure "credit" for the time spent sharing data). This is the same issue as with dataset citation, just a little more complex when you add in non-digital data, but it is still data that could be reused and needs to be clearer about where it was obtained (from my experience obtaining museum loans). The info about the biological material was recorded under other reuse, but did not "count" as a reuse.
  • No direct attribution to gene sequences (may have been only used for branch lengths, but I'm pretty sure they generated an entire phylogeny/topology from the sequence data).
  • again, detailed metadata on methods, but no sharing of the actual data. Also, they seemed to have obtained a lot of phylogenetic and biological data from another source without saying where they obtained it…turned out it was from themselves (a previous study). would be interesting to know if in their previous paper, they deposited the tree and sequences (aligned too). as a general note, typically if the reused datasets are mentioned in the introduction, it is in the final paragraph of the intro....or in a paragraph that starts with "in this study..." or "here we examine..", which usually is the last paragraph of the intro
  • at first the authors don't cite the papers from which they extracted data (bio and phylo)…later they refer to the online appendix with this info. Shouldn't the original authors be cited in the main paper for reference/citation tracking?! is this common in meta-analyses (to not cite the original authors of extracted data in the main body of the ppr)? if it's in a table does it get acknowledged by isi as a citation? but what about an appendix table? interesting, appendix a and b have separate "literature cited" sections. again, confusing and do these authors get citation credit?..investigate in isi. also, as a general note, the data stored with the journals (internal) is difficult to reuse...usually a photograph or word file, not a downloadable .txt, .xls, or other format usable for analysis. this is one obvious advantage of the depositories which store the data in multiple usable formats.
  • This article would be a good example of model datasharing....they reuse a math equation and seem to post their model (I think they do...and it's not just the software used, but the actual GUI from what I can tell; on the download site for it, it also looks like they have updated it for wider use (i.e. different platforms)). odd use of acknowledgements to provide url for the uploaded shared data. direct quotes: METHODS: The model used here is derived from Guillaume and Perrin (2006) and is implemented in Nemo 2.0.6 (Guillaume and Rougemont 2006)…..ACKNOWLEDGEMENTS: The simulation software is available at Simulations were performed on Westgrid (
  • There seems to be a large number of articles in AmNat that are mathematical models…this should be taken into account in analysis…perhaps by sometimes analyzing without or blocking by study type (yet to be coded….see yellow notes on article ss). The amnat policy is newer according to nic's research, but there is very low sharing of math/coding/software/GUI (or just the simulation iterations at the baseline!) which would be the best mode of reuse/sharing for these types of papers.
  • another model, another huge batch of iterations, and no data sharing


  • another model…so many in this journal!
  • yet another model. Maybe get an estimation from the snapshot about the proportion of models vs. experiments/observational per issue and extrapolate that to per year. This article also has a separate citation list for the appendix. Extensive math/derivation/method metadata in the appendix
  • and yet again. Does amNat publish the pictures of animals in its empty space to compensate for lack of actual naturalist/ecology studies these days? This like the previous article was missing a methods section, but had an extensive methods appendix….is this amnat's typical procedure for model/math papers?
  • supplementary info = methods metadata
  • Almost data sharing, but hard copy of files (almost like a museum specimen): "Sound recordings associated with this research have been deposited in the MVZ sound collections." Since not digital, I didn't count this the same way I do for biological specimens.
  • good dryad example
  • article frequently uses the term "data set" in reference to data collected from the different localities. The authors obviously acknowledge the idea of a data set. Also, Valerie should be careful not to rely on "data set" as a keyword in her searches. I counted their in-text XY coordinates as shared data....but this should be downweighted by the "how cited" factor since it was cited in the paper (as indicated by F for figure/table)....should also downweight "self" citations (i.e. author reusing their previous data)
  • software citation: ESRI extension: the Animal Movements Extension (Hooge and Eichenlaub 1997). General question: when did amnat start putting the "online enhancements" section at the top of a pdf article? When did this include dryad? - search amnat and dryad in ISI. no luck in initial search on Scirus or isi.
  • partial dataset posted….a summary of % survival by year. I assumed it is useful data for population analysis and age/stage. However, I think they had more extensive abundance estimates and XY coordinates from tracking that were not posted. Also, the "reused" data was reused from the author's previous study...good, but still not transparent.
  • this article was a retort to another article that attacked a previous paper of the authors. Perhaps "throw out" this article? It is only a one page editorial type, but it qualified as an article according to ISI.
  • as typical in amnat, detailed and extensive methods appendices, but no raw data. In terms of data reuse, the authors reuse their previous review/metaanalysis…thereby citing themselves, but not the original creators of the data. They also don't specify what parameters they obtained, though this is implied by later paragraphs. I think they should have attributed the original authors. they did not share the extensive dataset from their experimental mesocosms (body size, density, etc in relation to predator (fish) density).
  • large empirical and simulation datasets not posted. they are associated with a previous study by the author, which may affect this. in data produced, there is a weird data matrix situation: THIS WOULD BE A USEFUL MATRIX: "Using this field data, we constructed a six-stage demographic matrix model with 13 nonzero matrix entries (for more details, see Knight 2004)." SOME KIND OF MATRIX IS IN A TABLE IN THE RESULTS, BUT I THINK IT IS A SIMULATED RESULT. in general, isi must have started requiring funding data in 2009, b/c few of the 2008 articles have it retrieved even though it is there (in contrast, 2009 articles that didn't have it retrieved usually didn't have it in the article).

  • GIS data not cited well….not even clear what quadrats they are using, its format (raster, etc), or how it was obtained (url, self, govt, etc)

  • short model paper…intent: "Our aim, therefore, is not to directly attack the conventional signaling hypothesis of song-type matching but rather to offer an alternative model based on reliable signaling and wait to see whether future experiments support the unique predictions of our model."
  • long, math model paper about evolution. Could they, and others, have used empirical data at least for validation? Is this prevented by problems utilizing depositories or lack of deposited data? Also, in final analysis, can decide to discount "simulation iterations" as undeposited (produced) data. This was mostly included because of the sysbio policy; it at least shouldn't "count against" amnat articles when assessing if it meets journal policy
  • Author makes an interesting comment about the difficulty of obtaining data for fossil calibration: "Unfortunately, most compilations of the fossil record do not provide measures of its richness (for an exception, see the growing Paleobiology Database, but simply provide the age of the oldest first occurrence of each taxon. Even if the primary literature is consulted, it is notoriously difficult to extract what is known about the entire fossil record of a group. *Accordingly, most articles that use the fossil record to calibrate molecular phylogenies only provide estimates of the age of the oldest fossil of each lineage (with a notable exception being Springer 1995)."
  • Unclear if there were iterative simulation outputs. Seems to have just been a few simulations and the results/parameters of which are discussed one by one in the paper.
  • XY coordinates posted and author explicitly states that they are. But why not the biological data set? I've seen purposeful inclusion of XY coordinates in multiple amnat articles now, is this part of their policy?
  • Authors may have posted their code/software: "Calculations were made using purpose-written software, available on the online edition of the American Naturalist." I didn't have internet access at the time of reading to verify this. Since I haven't been tracking this, I didn't count this as data sharing, but if a usable GUI is posted online, it could be considered as such in the future and a good example of posting source code.
  • Total number of simulation iterations (pseudo-data sets) was not indicated.
  • boring model ppr. Unclear how they parameterized their model….it was about a very specific ecosystem and plant, yet it was unclear how things were calibrated
  • EXTENSIVE ecological/biological dataset of both observational and experimental measurements….undeposited as expected.
  • Possible morphological dataset, but may have just been visual taxonomic comparisons. Table 1 alludes to analysis of the behavioral videos, but this is not clearly indicated in the methods. I recorded the excerpts in the Undeposited sheet, but did not count it as data produced since it seemed to be used for qualitative comparisons.
  • Simple study, but had a mortality/mass dataset about eggs from a common garden experiment.
  • Life history data and matrix produced but not deposited.
  • appendix data included nearly all raw data from the extracted literature dataset, except genbank accession/info. however, it did not appropriately credit the original authors. Odd amalgamation of dataset reuse. Literature dataset + genbank + previous study by the author. Genbank was handwavingly acknowledged, but no allusion to accession numbers..upon looking at the bibliography, it is apparent that the author of the gene papers was a coauthor on this paper. I still think accession should have been given for retrieval purposes. this seems common though, citing self (racking up your "impact factor" for tenure) but not really making your data readily accessible (or citing other authors whose data you used...i.e. the extracted lit data). the extracted literature data was also not appropriately credited....with instructions to the reader to contact the author if you desired the literature...what about crediting the original authors?! do journals have bibliographic limitations that would prevent citing the 45 papers used herein for determining hybrid viability? is that why amnat has separate lit cited sections for the appendices? does isi count these as citations as well? does amnat likewise? could an original author trace the connection between their paper and this one through isi? how else would a cited author find out the paper was utilized (for tenure purposes)?
  • biological/ecological data collected and associated with treatment factors…no apparent reuse or sharing
  • math model…no apparent validation/parameterization/empirical basis
  • bio/eco data collected but not shared. C++ model could be shared.
  • this paper was a mess. datasets were cited haphazardly...some as reviews, one as a database from a publication, some as papers, and others as just "the literature". with no indication (i.e. no methods), the authors all of a sudden say "the data we analyzed". what data?! It didn't have a methods section which it greatly needed. Can an editor not enforce a subheading?! What then can we expect for dataset citation enforcement by content and copy editors? Maybe since this is a "notes and comments" article, amnat considers it permissible...but why then have a results and discussion section. i suppose they could have deposited data, but I wouldn't want to see the dataset (probably would have bad metadata), it's not clear what compiled dataset they would share, so I indicated "no data produced"
  • this seems to be a case of data reuse for model validation/parameterization, but for some reason this is not mentioned until the results section so it is unclear how the data were obtained and used in the modelling process. model simulations were run for many scenarios, the iteration data is not shared. the authors do, however, share their model code: "The algorithm, in S-PLUS or R, is available at".
  • in 2007, it seems that authors could opt for online or paper (before lit cited) appendices...did they later move to all online or is it up to how much the author wants to pay? This model used empirical data to inform the parameterization of the model, but in disparate ways, most of which involve unpublished data by the authors or statements about the biology from other sources.
  • model, no apparent simulations or empirical validation
  • short ppr, no true methods section. Small morphological dataset could have been posted (binary and quantitative data)
  • There was an apparent reuse, but it seems to be referring to the initial discovery of the lizard population because at least some of the sampling occurred in 2004, which is after the publication date. quote: "We studied color frequency of male Lacerta vivipara at five sites of the Ossau valley of the French Pyrenees (Heulin et al. 1997)": Heulin, B., K. Osenegg, J. Leconte, and D. Michel. 1997. Demography of a bimodal reproductive species of lizard (Lacerta vivipara): survival and density characteristics of oviparous populations. Herpetologica 53:432–444
  • Results tables are shared, but not raw data. The intext citation sounds as though the raw data was shared, but the appendix is full of statistical output tables.
  • mathematical model. Interesting that many papers like this state "we have demonstrated that factor X causes result Y" when it is a math model
  • open source software credit: "I then calculated the Horn-Morisita distance measure using the vegan-R package (Oksanen et al. 2006)." Authors reuse their own data (multiple reuses), but there is no indication of sharing those past or the current compiled data set. a few XY coordinates given, but not of individual transects: "Reference remnants were three large tracts of forest in the vicinity of St. Arnaud (36 40 S, 143 20 E; 25 kha), Dunolly (36 51 S, 143 18 E; 16 kha), and Rushworth (36 38 S,145 02 E; 41 kha). Seventy-one transects were randomly distributed across the three reference areas." In general, I only count intext XY if greater than 10 localities
  • model with supporting empirical experiment - binary and density data could have been shared
  • somewhat of a software share: "MATLAB routines are available from the authors upon request" would author have posted if available/aware? Author repeats this in the discussion: "However, contrasts can be constructed by numerous software packages, and the LSVOR regression is relatively simple to program (MATLAB routine available from the authors)." and seems to be advocating his code as a good alternative to more expensive software
  • interesting blend of major ecological theories…neutrality, island biogeo, and niche! no clear indication of simulation iterations, seems to just be a few tests of various scenarios. In general, amnat seems to encourage lengthy appendices for derivation/model theory/computation explanations. This is a step towards open model sharing. Next step = code sharing, next step = GUI sharing, next step = platform independent GUI sharing.
  • FINALLY! Some good data reuse with accompanying sharing….independent posting of a usable excel file of compiled literature data. Props to the authors. Unfortunately, they don't cite the original authors directly in the paper (they did screen 100 original pubs)…check the appendix citations, and AGAIN, figure out how that works for citation tracking.
  • the appendix tables related to the dataset are actually statistical outputs. The way they cite the data seems to imply that the raw data is available...this is confusing! They re-use a dataset from "pianka and colleagues" and yet again, cite a paper by the author but not those by additional colleagues.
  • no datasharing or reuse. Is there a repository for more agricultural ecology? It seems to me that should be stored separately, but possibly linked to other ecological depositories. Just seems like different types of researchers with different interests (theoretical vs. applied) would be soliciting the data
  • good morphological dataset on snails to study ecological interaction with beetles. No apparent data reuse or sharing.
  • self proclaimed metaanalysis study (literature extraction). Full (extensive!) extracted dataset available online, but not clearly cited intext. Authors only credited in appendix citations (pdf associated with excel)
  • math model on resource limitation, no simulations just illustration of various scenarios
  • full data set shared in a zipped excel file. Contains XY coordinates, allele type, and biological information. Unclear if allele sequences were used, if so, they were re-used from the schmid 2006 study (one of the coauthors). What motivates an author to make their article open access and with the complete data set available?
  • 1995 dataset alluded to, but not used for any comparisons or model fitting…perhaps they were just referring to a continuation of methods. Bio/eco dataset produced (biological info abt separate species compared to make eco conclusions)
  • some kind of genetic analysis involved, but no indication of sequencing or use of previously generated sequences

  • possible incidence of reuse...model populated with life history parameters, but I think these were static parameters, not a full dataset: quote: "When possible, parameters were estimated from the empirical study described above and from other observations of Echinacea’s life history (Wagenius 2000)." model code is shared: "We developed a stochastic, spatially explicit, and individual-based computer simulation model of plant population dynamics in a habitat that becomes fragmented (see appendix in the online edition of the American Naturalist)." snapshot of code saved in zotero.

  • GIS layers and XY data could easily be stored in a zipped file internally, on dryad, dataone, etc….why is this not done? With mongolia stuff, I was able to dig through website archives to find many layers (i.e. the csu ice study)….where could that data be stored?

  • possible example of software sharing/reuse: "Below, we give a brief introduction to the digital life system, Avida. Additional information is provided in appendixes A and B, including a schematic of a digital organism and a glossary of terms. A more detailed description of the system is available elsewhere (Wilke and Adami 2002; Lenski et al. 2003; Ofria and Wilke 2004), and documentation is available online ("

  • math model….no apparent simulation iterations beyond testing a variety of parameters
  • pre- and post- treatment dataset. Unclear if the pre- dataset is collected within the context of the current study or previous work by one of the authors (duToit 1990..which seems to have been a more observational study)

  • another instance of appendix citations only for extracted literature data (i continue to count this as a "0" for citation credit)...but at least the compiled data set is shared (though the text is unclear that it is deposited...only indicates that the references are provided, not the actual data....but the online appendix clearly has the data tables)! reference made to "phylogenetically corrected analysis" but unclear indication of phylogeny used or sequences it was built from: "using the topology of the actinopterygiian supertree (Mank et al. 2005)." from the citation, it is unclear if this was the source for the original phylogeny. general problem with amnat: appendices (supplemental info) only accessible from the full text page...i think it should be accessible from the abstract/entry and search pages

  • data was collected to validate a model, but still could have been shared
  • model simulation data
  • gene sequences mentioned in one section of the methods, but then the genbank accessions are provided in another (unrelated) section. Reused sequences are not credited by author or accession in the main text, and only by accession in the appendix. Genbank is not explicitly indicated in main body of paper for the reused data (it is for the shared). the appendix tables do state "genbank", but I counted this as a non-explicit reference because of the main body wording
  • the dataset is "reused" in the sense that it is an old dataset, but it is an unpublished one. It is unclear if these authors also collected the original data. XY coordinates are not given, even though mapped in figure 1 and clearly available in the methods descriptions.
  • no apparent reuse, sharing, or production (model)
  • extensive biological dataset about female preference and male traits. Data in raw (video/sound) form could be made available, but for this investigation, more importantly, the coded data could have been shared
  • super short paper of a simple observation. Probably discount in analysis and reconsider use of all "natural history notes". This article is the first instance I've seen of a video posted as supplementary data, which makes it seem as though behavioral videos and sound recordings could be posted.
  • extracted data not credited in paper or appendices…author states that the sources can be obtained from the author by correspondence. compiled dataset available online in excel and ascii. Modified phylo supertree could have been made available (not even given as a figure, perhaps b/c it was an intermediate step in the analysis)
  • biological dataset produced (traits, survival).
  • ecological dataset produced, not shared
  • biological/behavioral dataset produced but not shared
  • mice-parasite experiment, dataset includes measurements of parasite load of mice. Dataset not shared. Note, when I classify the produced dataset as bio/eco…that typically means biological parameters of one species were measured, but the treatments were of different ecological interactions with another species (as opposed to eco/bio where many species have multiple parameters measured and eco which is community based)
  • math model with no apparent simulated iterations.
  • reuse of published (literature) datasets, one of which is indicated to be an appendix which may or may not be a downloaded file (excel/ascii). Unclear how the data was obtained….perhaps directly from the original paper
  • unpublished dataset of the coauthors' is utilized for the analysis, but they still articulate how the data was collected
  • reused dataset is one previously published by the author. Used herein to validate the proposed model. Dataset should have been shared as it is unlikely that it was shared in the previous paper, but since it was not generated in this paper, I did not count this as an instance of data produced. actually, on reading the citation, the dataset was not yet published and is "forthcoming"....does this mean it is accepted for publication, submitted, or just in preparation?
  • each study of the extracted dataset is described in detail. The compiled data set is not provided, though some summary tables are (about range of the data). Most of the datasets cited have at least one of the coauthors involved in the previous publication. according to the acknowledgements, the others were likely obtained by correspondence but this is not explicit in the paper. to me, sharing a compiled dataset is analogous to sharing a gene alignment...something another scientist could do, but it greatly expedites processing, especially for metaanalyses
  • the authors "reanalyze published data", but they actually perform the measurements themselves rather than extraction from previous literature. they don't make reference to how this compares to the original data, so I'm not sure why they termed it a reanalysis. in general, model papers are either model fitting (aic) to empirical data, models with simulated iterations, or math models with solutions.
  • this is a critique of allometric metaanalyses that have problems with consistency in data collection…it would be a much more meaningful study if the author reanalyzed the data and proposed ways for standardization or at least illustrated (rather than pontificated) about the flaws.
  • math model of stable states in mimicry
  • extensive field, experimental, and morphological dataset, unshared
  • dataset not shared
  • study measures something analogous to growth rate (I think, called "titer"). This is used to validate a model
  • five tradeoff scenarios analyzed in the model, but these were parameter inputs, not iterations.
  • application of seed bank model to salmon
  • longterm dataset, but not necessarily reuse of it, though other publications by the same authors utilize it. Data likely not shared b/c they have planned publications which will also utilize it.
  • bio dataset. Again, the way they referred to the data in the results seemed as though the raw data was available in the supplementary data, but it was actually the anova (stats) outputs
  • The data set is not posted, but the authors do mention in the acknowledgements that: "The collated data sets are available from the authors." This was counted as data produced, not shared, b/c presumably many authors could be contacted for the data…hence why the primary author email is required.
  • the article says "in this note", but this is not an official "notes and comments" article
  • this paper refers to "empirical data" in the introduction, so I expected an analysis of the empirical data….but it actually laid out suggested study designs for testing their proposed theory. I would have liked to see them test the data they expressed qualms over and reuse/test it themselves
  • I'm not familiar with this type of study, but it seems to have produced growth rate data according to different factors which could have been shared.
  • ess model that predicts future morphology
  • no profound comments for this one
  • model scenarios w/out empirical validation
  • model of various scenarios, comments in discussion state that "other" studies utilize computer simulations whereas this one does not
  • example of GIS reuse. Poor citation practices with intext citation of url, but no bibliographic citation. Which is particularly odd considering that they cite program R with a full citation in the biblio. Does R specify that this should be done when it is used? Do the respective GIS sites do this as well (USGS, etc)? are any of these datasets also accessible through daac? they also produced a GIS dataset that could have been shared (plant areas).
  • the use of "grids" and acknowledgement for help on "GIS analyses" indicate that there are potentially reusable GIS data, as well as the biological data concerning mate preferences/characteristics
  • evolutionary ecology model
  • model with simulated data and allusion to empirical data, but these were comparative case studies, not analyzed data
  • comparative mathematical models
  • behavioral outcome experiment….should have produced a binary dataset about response and treatment type
  • data extracted from database. Digital "sampling" and species range delineation performed in GIS…could have been shared
  • math model with solved solutions. Brief reference to empirical data, but as a conceptual case study
  • dataset is not shared, perhaps because of use in other unpublished studies
  • math model based on lotka volterra….a brief comparison to empirical data in the discussion, but not used for validation/parameterization

  • authors refer to importance of empirical data in parameterization of their model, but this is not explicitly discussed anywhere other than the introduction: "Our analysis is based on empirical estimates of key parameters of the replication process of poliovirus. This is an important aspect of our analysis, as these empirical parameters of the molecular biology of poliovirus replication have been determined by extensive experimentation in a number of laboratories."

  • yet another model..I feel like there are more in this 2005 set than other years
  • this was a literature survey, basically analogous to what I am doing. I consider it a "sample" of existing data, so therefore an incidence of reuse and a dataset produced. Over 1000 articles were analyzed.
  • number of simulation iterations never explicitly stated. interesting, an article I just analyzed had this same author as a coauthor. And I see the "P. abrams" author cited or acknowledged frequently
  • extracted literature dataset shared in its entirety. Authors of extracted literature only credited in appendix. Phylogeny given internally, but not posted to treebase.
  • this article should be thrown out. It was an editorial
  • math model. I often check math papers with words like "data set", "dataset", "example", "empirical", etc. to make sure I didn't miss a reuse or sharing
  • bacteria paper…not as common in amnat, used for competition dynamics. In general, is it a problem that the 2000 papers were submitted in 1998 and accepted in 1999? I think probably not because we have the same lag time in the 2010, 2005 articles
  • in general, a lot more single authors in this time period (not huge groups)
  • early treebase reuse, unfortunately no accession. Extracted literature dataset did not have full indication of original authors, only two….a complete list is supposed to be available from the authors. author explicitly said that he could be contacted to provide the dataset, so I counted this as "sharing".

  • no credit to original authors of extracted literature dataset. weather dataset reuse too. authors state: "These data were collected in a wide variety of ways, presenting a formidable barrier to comparisons across studies." but seems like they came up with good ways of normalizing qualitative and quantitative data into the same dataset

  • model building from empirical data
  • unclear if simulation datasets generated (or just given parameters changed at relevant increments). general evidence idea: increase of data reuse in amnat over time as evidence that dryad would be utilized b/c of increased interest in data sharing.
  • only in the appendix does it say how many simulation iterations were produced. statement for methods: "since there is not currently a way to track ecological and environmental data reuse through handle numbers as in other disciplines (cite heather), we manually extracted information from articles." If a similar study is done, or an expansion of this dataset that is not interested in documenting the percentage of reuse, various search terms are suggested to narrow down candidate articles: "empirical", "dataset" or "data set", depository names, "hdl", "Ac*" if this isn't too broad, and other terms from valerie. terms like "data set" would still keep the sample set broad to capture non-depository reuses.
  • math model
  • math model…how are notes and comments different from regular articles then?
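The search-term idea in the methods note above could be sketched roughly as follows. This is only a minimal illustration, not the procedure actually used for these notes (the extraction here was manual). The function names, the pattern labels, and the short depository list (genbank, treebase, dryad, taken from names that appear in these notes) are assumptions, and the "Ac*" term is left out because the note itself worries it may be too broad.

```python
import re

# Search terms from the methods note above. The depository list is an
# assumption: just the depositories named in these notes, not a full list.
SEARCH_PATTERNS = {
    "empirical": r"\bempirical\b",
    "dataset": r"\bdata\s?sets?\b",  # matches "dataset", "data set", plurals
    "depository": r"\b(?:genbank|treebase|dryad)\b",
    "handle": r"\bhdl\b",
}


def matched_terms(text):
    """Return the sorted labels of the search terms found in the text."""
    return sorted(
        label
        for label, pattern in SEARCH_PATTERNS.items()
        if re.search(pattern, text, flags=re.IGNORECASE)
    )


def candidate_articles(articles):
    """Keep only articles (dict of id -> full text) matching any term."""
    hits = {}
    for article_id, text in articles.items():
        terms = matched_terms(text)
        if terms:
            hits[article_id] = terms
    return hits
```

A filter like this would only narrow down candidates; each hit would still need to be read to decide whether it is a real reuse or sharing instance, since a math paper can say "data set" without using one.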

Molecular Ecology articles

  • Molecular Eco has an entire "supplementary files" section at the end of the paper. But they give the caveat: "Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries
  • This article cites genbank at least 4 ways: as Genbank, as supplementary table of combined genbank info, as url, as accession ….they use/submit 3 types of genbank data and cite it differently each time. all citations including the Accession number instance
  • good full documentation of genbank (i.e. all accession numbers recorded in a table b/c many were used: ID NUMBER ACCESSION REFERENCE HORSE DEPICTION & BREED AUTHOR PUBLICATION, table is saved in zotero as an attachment under the article)...nice resource
  • again, cites genbank two different ways - first as paper for acquired primers, then as genbank (no accession) for additionally analyzed genes..but then in the results, they cite specific accession numbers.
  • is biological material reusable data? As a taxonomist, I think so, but it can't be deposited online. no genbank accession.
  • Another example of genbank mentioned vaguely throughout the mat and methods, but not formally "cited" until results…but in this case it was data sharing so it was understandable. Also another example of "could have been in treebase"
  • genes produced and used, but not cited as datasets. The used dataset may be posted on genbank, might be something to check
  • clearly refers to a breeding success dataset ("data files") but they are not included as tables, supplementary info, or deposited…I called this a "text" indication under Data Produced
  • GIS mentioned and then later clearly stated as in supplemental info (w/ coordinates), alignment and phylo not clearly indicated as shared data until the results section in which alignment was included in supplementary but neither align or phylo figures (trees) were posted to treebase
  • Reuse: mentions genbank generally first, only mentions original authors in table caption, mentions one of many accession numbers in text, remainder in supplementary table. So, to be "generous" I classified this as an "Au" citation even though it was more a "D" because of vagueness in some situations.
  • they did pcr, but for aflp banding to genotype, not sequence generation
  • many datasets (and types) generated, but none shared. The sequences seemed to be of more than sufficient length for genbank submission. long term study, but not reuse: "This study continues a protracted investigation of willow evolutionary ecology in eastern North America."
  • yikes!: "Sampled individuals were transported alive to the laboratory and sacrificed by freezing."

  • more than 10 gis coordinates given = counted as shared dataset (especially b/c seeing the map, I would have counted it against them if they hadn't given coordinates). It's a weird coordinate system, but seemed to have high resolution for location.
  • genetic sequences with some kind of pairwise analysis that may have had an additional dataset that I couldn't discern
  • I thought they sequenced genes, but they actually just detected the presence of certain gene regions. They did provide the raw binary data as an appendix at the end of the pdf
  • sequences are said to be deposited with an accession that looks identical to genbank, but genbank is never formally mentioned
  • xy coordinates could be obtained from author upon request = counted as sharing. Genetic material sampled in some way, but not sequenced…so, a genetic data matrix
  • sequences aligned, but not posted to treebase. No apparent PT produced…but if so, not deposited
  • a relatively rare instance of gene unposted to genbank.
  • weird….two times in a row that GS were not posted to genbank. Does genbank only accept sequences of a certain size? I'm pretty sure they accept anything (i.e. primers included). 4 xy coordinates were given in text…like I said, I count this as a dataset if above ten (otherwise I consider it incidental)
  • the first study that explains BLAST! Yes! Genbank's matching search. They didn't necessarily reuse the data except to determine what species they had, but still provided the accession numbers (b/c they resequenced these essentially and wanted to avoid duplicates...good practice). again, primers spelled out in text…is this policy? treebase should have been used for tree indication of alignments produced even though I assume they were. GIS reuse from government sources, not well cited
  • odd…yet another unposted GS…when did molec eco's data policy become more stringent regarding gene deposits? Sharing of biological material: "Ceratitis capitata embryonic DNA extracts were kindly provided by Kostas Bourtzis (University of Ioannina, Agrinio, Greece)". I'm not counting primers anymore, but this one cited one paper for two of the ten primers.
  • were sampling locations shared? Table S1 Information on the sampling location and genetic analyses in this study…Fig. 1 Phylogeny of the Acanthodiaptomus pacificus mtCOI sequences using a Bayesian phylogenetic approach, and the geographical distribution of the three mtCOI lineages in Japan shown by circles decollated with the lineage-specific colour in the Bayesian tree.
  • what is different about 2009 that so many gene sequences are not posted? Has nic had luck with tracing history of data policies? Remember that the true date of the paper to correlate with the policy is the SUBMISSION date. ironically, the paper cites an article by whitlock (but not one of the data sharing ones). code sharing: The source code is included in the Supporting Information.

  • also fewer PT producing studies in this year's batch. Are aligned sequences or other matrices also necessary for data input in other phylo analysis (i.e. this one was inbreeding coefficient)?
  • yet another undeposited GS
  • xy coordinates given in appendix. GS not deposited
  • GS annotation reused to annotate sequences…can these be posted to genbank?
  • flow chart (fig 2) outlines the methodology and says "obtain or design gis layers". The authors did not indicate whether theirs were designed or obtained…I'm pretty sure they were obtained (at least the satellite images, and because of the offhand reference to the source of an unused raster), therefore I counted it as an uncited reuse
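The rule of thumb I used for coordinates in the notes above (more than ten given = a shared dataset, fewer = incidental, available from the author on request = sharing) could be written down roughly like this. It is just a sketch of the coding rule as I described it; the function name, the labels, and the "none" case for papers that give no coordinates at all are assumptions added for illustration.

```python
def classify_coordinates(n_given, on_request=False, threshold=10):
    """Classify how a paper shares its sampling coordinates.

    n_given: number of coordinates printed in the text, tables or appendix.
    on_request: True if the authors say coordinates are available on request.
    threshold: strictly more than this many coordinates counts as a dataset.
    """
    if on_request or n_given > threshold:
        return "shared"
    if n_given > 0:
        # A handful of coordinates in the text is treated as incidental.
        return "incidental"
    # Assumption: no coordinates and no offer to share them.
    return "none"
```

For example, the paper with 4 xy coordinates in the text would come out as "incidental", while the one offering coordinates on request would come out as "shared".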

Ecology Articles

  • Ecology does not seem to have internal depository nor encourage external deposition (dryad would probably be best). Seems odd since they hosted early data sharing conferences. DOI not displayed on Ecology article pdfs. This: "To test our conceptual model of the trophic relationships and identify which arthropod functional groups linked bird predation to a tree growth response, we constructed a structural equation model (Amos 16.0.1, available online)6" sounded like model development and posting, but was actually software use.
  • this was a short paper ("report"). It reused data with handwaving acknowledgement in the text about how great the database was yet no citation anywhere, not even a url. Perhaps assumed everyone already knows about nawqa (I do, but I'm a stream ecologist) or that they'll google it and find the right thing. is dataone looking into caching databases? that would be useful for things like nawqa and little databases produced by various studies. also, her criteria for selecting only some sites produced a condensed data set, analogous to aligned genes, which was not then shared.
  • Though the article was looking at a new GPS filtering method, it did collect actual gps locations of seals which could have been shared. Also, a simulated data set was produced.
  • food web dataset (partially eco, partially bio)
  • extensive observational and experimental ecology dataset
  • in general: ecology has an appendix section after the references, could be easily used as a section for posting supplementary information (this paper gives an Ecology archives accession number for the additional results appendix)
  • dataset = seasonal biology of plant and life history (pop) data matrix. small possibility that this may be a raw dataset: Flower and fruit production in Orchis purpurea during eight consecutive years in six locations (Ecological Archives E091-011-A1).

  • "reused" dataset was used primarily for sample site selection, but was obtained from a database
  • not clear AT ALL in the methods that the data was reused…this was only clear from the acknowledgements. The caribou data at least had an associated paper (with one of the coauthors of this paper as lead author on it, but apparently not over the data collection entirely given the acknowledgements). the elk data was not cited at all. the moose data may or may not have been collected by the current authors, but it is difficult to tell since they didn't properly indicate the other datasets
  • another extensive, unshared dataset in ecology. Most papers I've read in ecology reuse a small, minor dataset but collect most of their own data and don't share it
  • again, minor datasets reused, or shared…not those for the crux of the paper. authors state that the data set is "compiled" but it is all their own data and they give complete methods…they just collected it for multiple species in different places. this dataset: "APPENDIX A Growth habit and floral characteristics and pollinator visitation frequency for 13 species of Gesneria and Rhytidophyllum from Puerto Rico and the Dominican Republic (Ecological Archives E091-014-A1)." was posted, but was not really used in analysis and was the authors' from a previous paper, it was used more for qualitative description of the species studied

  • online appendix has a condensed dataset, but was not clearly referenced in text nor is it the original complete data
  • no apparent reuse or sharing
  • reuse of a previous literature extraction by the lead author. GIS/earth data also used
  • GIS reuse, but unclear who accessed and who is the true original author (in text citations unclear)
  • simple dataset of food consumed, could be shared in association with experimental factors (number of wings clipped)
  • biological and soil dataset about plants…unshared.
  • two datasets analyzed. One is older (1984), but had no indication of reuse and was from a biological station where the author still lists their primary address...this one was assumed to be previously unpublished and therefore a candidate for sharing (produced). The other dataset was "reanalyzed" but with no indication of its origin…nor any mention of how the data was originally collected or currently obtained. authors mention the importance of long term datasets: "In order to test hypotheses about changes in ecosystems induced by man, including climatic change, ecologists sample portions of the landscape repeatedly across time. Selected portions of the landscape are set aside for long-term studies; these may be transects or surfaces of different sizes. Research of this kind is conducted, for example, at the 26 sites belonging to the Long Term Ecological Research (LTER) network in the United States (available online)."
  • code sharing/reuse: The diversity index was calculated in the Excel add-on "diversity" (available online).8

  • could species abundances be one category of "ecology" data type? It's biological data about each species in the context of the community ecology
  • I thought the XY coordinates for the sites were shared, but the indicated appendix was just a map of the sampling sites. In general, since there is such a low data reuse incidence (esp of major datasets), it would be interesting to see how some ecology meta-analyses reuse data. I'm guessing primarily literature extraction