From OpenWetWare

This DataONE OpenWetWare site contains informal notes for several research projects funded through DataONE. DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.


Article Notes

This is a running list of the comments I make on individual articles after extracting data from them. Please excuse typos and spelling errors; these are quick notes I make following analysis. I thought they might be interesting to the group/community b/c they contain my off-the-cuff observations and insights. I will try to post these at the end of each day. The notes (and the accompanying DOI) can be found at: Fields

Systematic Biology articles

  • this study basically researched data reuse in genetics and whether accurate phylogenetic trees could be made from amalgamated genbank data of differing quality. Very interesting; unfortunately, they didn't reuse or share data since it was a simulation. SysBio says explicitly in their data sharing section that they can help make arrangements to deposit large datasets from simulations, so the simulation iterations could have been made available. Also, this study cites the program R well....if someone later decides to look at software reuse, they could just look at R alone in terms of how custom packages/scripts are credited and how the software itself is cited, b/c it varies widely. I would say this one did a good job with an in-text and bibliographic citation. Direct quotes: REUSE: "An increasing number of phylogenomic studies are published for data sets including more than 100 genes (e.g., Lerat et al. 2003; Rokas et al. 2003; Driskell et al. 2004; Philippe, Lartillot, et al. 2005; Fitzpatrick et al. 2006; Nishihara et al. 2007; Wildman et al. 2007; Dunn et al. 2008; Zou et al. 2008). Two opposite views have been proposed as to how to incorporate the growing amount of data to infer evolutionary relationships." SOFTWARE: "Significant differences in accuracy values between different datasets were assessed using a Pearson’s chi-squared test in the R Stats Package (R Development Core Team 2007)."
  • I was going to validate what was in the supplementary data, but it is VERY hard to find on the sysbio site. The link they provide within articles just goes to the main page. The usual smorgasbord of ways of citing....genbank accessions recorded in supplementary tables, but treebase with the url and accession no. listed right in the text. Also, I counted the acquirement as "self" because the papers the genes were pulled from were by the authors of this study...hence no genbank accession or reference for the reused data (though they likely are deposited there). This study also looked at how effective/accurate it is to piece together data from multiple sources. "When reconstructing phylogeny, some authors combine independent sources of data irrespective of conflict under the assumption that the combined analysis will maximize explanatory power of the phylogeny (“total evidence,” Kluge 1989)...Combined analysis is likely to lead to decreased, rather than increased, support for clades identified by one gene and contradicted by another (Bull et al. 1993; Lecointre and Deleporte 2005). In a worst-case scenario, combination may even result in the inference of spurious relationships supported by neither data partition individually (McDade 1992; Bull et al. 1993)."
  • Study repeatedly refers to "unpublished data" from a single source. However, this "data" is not cited in the biblio (even as personal correspondence with the author). It is unclear if this is actually published data or just a proposed theory from a friend. Quote: "Their method is similar to another recently proposed technique (Meng and Kubatko 2009) in which methods for assessing the evidence for hybridization in the presence of incomplete lineage sorting using a sample of gene tree topologies were developed in both likelihood and Bayesian frameworks. Nakhleh L. (unpublished data) provides a unifying framework for the methods of Than et al. (2007) and Meng and Kubatko (2009).....A general expression for the likelihood of an arbitrary hybrid species tree has recently been given in Nakhleh L. (unpublished data) for models of this general form, although gene tree densities of the form in equation (2) are not explicitly discussed......The likelihood of any hybrid species phylogeny specified this way can be computed as described in the previous section, using an appropriate likelihood analogous to that shown for the example tree in equation (3) (see also Nakhleh L., unpublished data)."
  • The last few sysbio articles have been simulations. Is this common in sysbio? Are these all in the same issue? Could they have used recycled data to test their models? Why would an author not do this? Especially in this study, it seemed like they were using real sequences (references to "empirical data" stated clearly as opposed to "simulated data"), but they never said what species. It may not be "fair" to count unshared simulation iterations as unshared..but sysbio explicitly states in their data policy that these can be stored, which seems to imply that they should be.
  • This paper uses the supplementary data depository as a dump for the figures they couldn't use in the paper. Supp Table 1, however, does include accession numbers but never mentions genbank. Also, I don't like how sysbio takes you to their entry page for the supplementary data. It implies that it is easy to find since they include it as a separate header in the paper, but you have to search for the article and, depending on your browser, you may be able to see the link for supplementary data. I found it, but it was cumbersome..the url provided should be for the specific paper or indicate how to search for it, b/c there is no entry link on the main page for supplementary data. I like Molecular Ecology's approach better...a specific list of supplementary data at the end of the paper so you can better assess if you want to go dig it up. Tables 3 and 4 were not as useful as I had hoped....I thought they might be gene alignment matrices. P.S. I know the editor of this paper (Marshal Hedin)…need to get the cave specimens back from him!
  • odd..paper had no abstract and nonconventional headings, but no indication that it was a special article format. References to genes don't clearly state reuse or even the taxa (awkward wording), but do cite authors. Alignments and trees not posted in supplementary data nor even within the paper (minimal figures).
  • This is what I was hoping to see with the model simulations…a simulated data set and then validation. This one is especially good b/c rather than attacking another method point-blank, it uses the same datasets used by the method it's attacking for a one-to-one comparison....that's good data reuse practice and simply good science. Also, to cite Treebase, they cite the original paper introducing it (novel from what I've seen) rather than the url (more typical, I think, in the spirit that people could then go there and use it). Also, simulation results are included in the supplementary data but not the iterated data sets they stemmed from. I counted this as data produced but not shared.
  • Good example of genbank and treebase reuse and sharing, but accession/authors not clearly credited for genbank (maybe in a supplementary table?). Both alignments and trees went to treebase and are posted as supplementary data!!
  • Data could have been posted to treebase, rather than just internally. Need to figure out a way to code this and find out if editors are instructed to police the sysbio policy on this.
  • unclear if trees were also posted at treebase (could verify later), so this was treated as data produced but not shared. Also genbank/genes was cited in all kinds of crazy ways…most references were to the supplementary table which included all the accession nos., and others were vaguely to the authors or the depository alone w/out an associated accession or explicit statement of reuse.
  • XY data produced and recorded in supplementary table, but not mentioned in the paper as full locality info. The authors reused their previous data, but at least gave the genbank accession. In my opinion, the accession no. should have also been in the appendix table, which contained other detailed info about each specimen.

  • the proposed method requires XY coordinates, but these are not indicated in the data reuse, likely because the reused data was originally produced by the author. code sharing instance: "The above routine was implemented in the Java software PhyloMapper and is available at"

  • unclear if the authors obtained sequence data from genbank or sequenced the genes themselves. I think it was obtained, judging by frequent references to "BLAST", which as far as I know is the genbank search engine. Therefore, the sequences could maybe be attributed as a data reuse in the genbank table. But these aren't agtc sequences, so where are they "supposed" to be found/posted? UPDATE: I could not find the indicated supplementary data; no links on the full text, etc. Also, BLAST is also run by NIH and has similar search abilities to Genbank, but for proteins. Shouldn't the citation recommendations be the same?
  • no genbank accession given directly in text, but in tables. Accessions given for sequences generated from this study, but not for the reused primers. As is typical, gene alignments and phylo trees not posted to treebase or otherwise

  • this article does not have any reuse/sharing, but is about coping with large phylogenetic datasets (presumably compiled from multiple studies). "Datasets used to reconstruct phylogenies are becoming increasingly complex. Not only is the size of trees increasing, but the number and heterogeneity of loci being used to infer phylogenies is also increasing steadily"

  • I really dislike the sysbio requirement to include after reference to an appendix....primarily b/c it does not lead to the article specifically, nor is there any link to supplementary data! as with 10.1080/10635150802555933, I could not find the referenced supplementary data in the full text or article search (as I am usually able to do for other articles)! does the treebase accession have the gene alignments (see treebase_sharing sheet)? Is appendix one raw data (dispersal distances)? Accession numbers given in table, but genbank and accession not explicitly referenced in text (would this still come up in a full text search?)

  • it was very difficult to discern that this article reused and shared data. the acknowledgements were most clear on this and caused me to retrace the word "dataset" throughout the article. other than the acknowledgements, it is unclear how datasets were obtained, and they are definitely not cited but instead coded by the authors for their own tracking purposes. the only credit the original data authors get is in the ackno section, and this likely does not include all authors. datasets posted, but the url of both the dataset and the software no longer works = this should have been in a depository!!! actually, they ended up working when accessed from the full text...but the pdf links/typing the http address does not work for some reason. instance of software sharing: The consensus and best-scoring ML trees can be viewed and browsed online by using a customized version of the PHY.FI display engine, http://cgi-www. (Fredslund, 2006). The CIPRES RAxML Web server has been set up in an analogous way (see Availability:˜stamatak/index- Dateien/software/RAxML-VI-HPC-4.0.0.tar.gz Web Servers:

  • refers to reused data as "datasets" and also reuses accompanying phylo trees. No indication of how they were obtained, even in the ackno.
  • again, unable to find referenced appendices. Assumed that genbank numbers were given if this was mentioned in text (i.e. w/ the reuse, but not w/ the sharing). It is not clear if the genes were posted to genbank or at all.
  • the dataset may have been shared in the previous study by the author, as indicated by the statement "see original publication for data set". I realized with this and the previous article that I've been classifying phylo trees present as figures in text as "produced", not shared, data. I will continue to do this, but indicate if a figure was present in case I want to change how this was done. I am inclined to leave it as produced b/c it should have gone to treebase, but I will probably also classify internally shared data by what depository it should have gone to

  • databases, but not data authors, attributed. Authors use the term "BLAST" search, which I thought was exclusive to genbank but perhaps has become colloquial for any sequence search. I was unable to track the treebase accession number…it appears invalid. I could finally find it by using the article information, but it was difficult to distinguish the current paper from other work by the author. furthermore, the provided accession number was incorrect....there is not even a possible SM prefix....S is the study id prefix, and M is the matrix id prefix. This paper was about how to cope with large datasets, presumably compiled through sharing and reuse: "One of the goals of phylogenomics, as we envision it, is to generate data sets that result in topologies that are robust to the addition of new data and stable to changes in assumptions of analyses. ...Second, there are several groups generating Expressed Sequence Tag (EST) libraries for diverse butterfly taxa; e.g., Pieris rapae (http://www., Heliconius melpomene (Jiggins et al., 2005), Bicyclus anynana (Beldade et al., 2006), and Melitaea cinxia (Vera et al., 2008). EST libraries provide DNA sequence of the mRNA of expressed genes...In this paper, we report a new method to search through genome databases for exons of suitable size (500 to 600 bp), comparing these exons to EST databases for related taxa of interest, and finally develop primers potentially universal across the taxa of interest."

  • conceptual theory illustration
  • genbank directly indicated for reused data, but they do not explicitly say that their sequences were deposited in genbank; instead this is implied by consecutive accession numbers in Table 1 = "no" for depository referenced in genbank sharing
  • good example of reuse with two exceptions: 1. the self-reused data was apparently not reposted, nor were its accession numbers indicated. 2. the accession numbers were not directly given, but referred to in another publication. So the author benefitted from accession numbers, but did not give any in this paper. but the article does clearly post both GA and PT to treebase; however, it does have another weird sn prefix number, which makes it difficult to actually retrieve the data
  • authors do not indicate in their "empirical data" section how they obtained the data; then a few paragraphs later, they state that it was "kindly provided by M. Brandley", presumably with sequences already aligned.
  • according to table 1, the authors did reuse some sequences….this was not clearly specified. The only indication is that the range of submitted genbank accession numbers does not include about 20 of the sequences used, especially those for cyt b.
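The TreeBASE notes above mention accession numbers with prefixes that should not exist (S being the study id prefix and M the matrix id prefix). A minimal Python sketch of how a cited id could be sanity-checked before searching for it. The function name is hypothetical, and the "one letter followed by digits" shape is an assumption; only the two prefixes come from the notes.

```python
import re

# Assumption: an id is a single prefix letter followed by digits.
# Only the S (study) and M (matrix) prefixes are taken from the notes.
TREEBASE_ID = re.compile(r"^(S|M)(\d+)$")

def classify_treebase_id(raw):
    """Return ("study" | "matrix", number), or None if the prefix is not recognized."""
    match = TREEBASE_ID.match(raw.strip().upper())
    if match is None:
        return None
    kind = "study" if match.group(1) == "S" else "matrix"
    return kind, int(match.group(2))
```

An id with a combined prefix such as "SM" comes back as `None`, which flags it for a manual search by article title or author instead.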

  • unclear if the test data was empirical or simulated. if empirical, no comments were made about reuse/credit. if simulations (or if empirical), should have been shared. instance of potential code sharing: The R code required to implement, in conjunction with PAUP*, the arcsine- ILD test, is available from or on request from the senior author and will be published elsewhere.

  • reuse is from a previous paper by the same author, but one that pulled from multiple articles, which are indicated to be properly cited in the previous paper. but in table 1, some things are listed with a reference of "This study", yet the study didn't mention any internal sequencing, though I think the authors introduce new datasets in text....which makes me wonder why they didn't also put them in the references column? regardless, they are all present in the biblio. again, I couldn't find the supplementary data online. gave the authors the benefit of the doubt that the compiled data set was made available since they explicitly say "inputs" are available. also, I think PT were produced, b/c the end (or intermediate) product of any phylo analysis is a tree. How does treebase manage different trees for the same taxa?
  • somewhat awkward wording referring to genbank accession numbers. No indication of treebase. Also, simulated data (i.e. simulated gene alignments?) were not published….remember, I'm back in sysbio = simulation inputs are encouraged.

  • questionable if this paper "produced" data. It doesn't sound like sequences were aligned in any way. A compiled dataset could have been provided (i.e. input files), but they do document their reuse sources well. Blast seems to be a surrogate term for search/compare: "and then to BLAST those sequences against one another to assess sequence similarity."
  • unclear where the reused data came from; it is a reuse of a reuse...."data set previously examined by...". this again gives credit to the current/re-analyzing author rather than the original dataset. a weird pt produced, but a tree nonetheless that could have been posted to treebase (I think). code sharing: "A program that implements our model is available from Htm"
  • data shared on personal website could not be accessed, nor could I locate supplementary data on sysbio = should have been in a depository!!!! But, good effort at data sharing, and I assume it was raw data b/c well articulated. unclear if the shared dataset also included the condensed datasets: "Gene resampling was performed on the data set over 8 fungal species and 106 genes by randomly selecting 50 gene sets of size 3, 50 gene sets of size 5, and 50 gene sets of size 10…."
  • the reuse seems to be from a previously reused dataset one of the authors had access to in a previous study. intermediate trees may have been produced, but they were for method testing only. article didn't produce GA b/c testing non-alignment methods: "In nonphylogenetic contexts, alignment-free methods are employed in tasks as diverse as sequence classification, database search, and detection of regulatory sequences; the literature on these applications is small but is growing at an increasing rate....We conducted a large-scale comparison (in a phylogenetic context) of 10 alignment-free methods, among them one new approach that does not calculate distances and a faster variant of the pattern-based approach.."

  • the simulated datasets produced were various extractions from the reused dataset, usually varying the number of taxa. general: like amnat, appendices are commonly used for more detailed explanation of methods. Also, for valerie: possible keywords to indicate a dataset reuse: "dataset", "data set", "empirical". I usually search for these terms, "simula" (for simulation), "iteration", and the depository names after reading through a paper to make sure I didn't miss anything
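The keyword check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used for these notes: the function name and the context width are made up, and the two depository names in the list (genbank, treebase) are simply the ones that come up most often in these notes.

```python
import re

# Terms from the note above, plus two depository names as examples.
SEARCH_TERMS = [
    "dataset", "data set", "empirical", "simula", "iteration",
    "genbank", "treebase",
]

def find_reuse_hints(text, terms=SEARCH_TERMS, context=40):
    """Map each term found in text to short snippets around every match."""
    hits = {}
    for term in terms:
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        snippets = []
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            snippets.append(text[start:end].replace("\n", " "))
        if snippets:
            hits[term] = snippets
    return hits
```

Matching on the stem "simula" rather than a whole word catches "simulated", "simulation" and "simulations" in one pass, which is why the search is a plain substring match rather than a word-boundary match.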

  • code sharing: Hybridization and genetic drift were simulated using a program written by the authors (available at http://lamar.colostate.edu/~reevesp/Hybridize.html)

  • interesting, the paper refers to the taxon using a genbank accession number. A little confusing, but I understand that it could point interested parties to relevant gene sequences. Does genbank recommend doing this? A few authors have used almost the exact sentence: "data matrix and trees deposited on treebase with accession ##". I like it...clear, simple, gives all the necessary info. they also usually do it after a quick sentence about genbank accession at the end of the sequencing/collection section of the paper. It's so good that it usually tempts me to skip the rest of the article
  • math model…much less common in sysbio than amnat. In general, the 2007 sysbio had fewer "points of view" than usual and almost no models (except this one) that weren't also empirically tested in some way. Does sysbio tend to decline papers that don't have empirical testing?
  • it was difficult to discern if this paper used empirical or simulated data. There was no discussion of taxa used. In the final phylogeny, there were binomial taxa names. In this case, it is very unclear how/what sequences they obtained….they only reference the db and not even their search criteria beyond the algorithm used. though they cited one of the db authors, I counted this as no biblio citation b/c they don't cite the primary data. maybe I didn't understand it, but in any case, they did not attribute the original sequence authors at all. authors do not share the data, but state that it and the code are available upon request to the authors. then why not post it?
  • good use of accession and author citation. a little unclear about which taxa they sampled and which they obtained from genbank…a table could have helped. They also do not specify if they measured the morphological dataset or obtained it from others….in the table they state that this info was extracted from the literature...ok, they end up explaining it a page after the table...obnoxious. another instance of a taxon being cited with a genbank accession. treebase out....I don't think I've had many instances of treebase reuse, but a good handful of sharing.

  • though this paper did not have any reuse or sharing, the first sentence cites treebase as an indication of interest in phylo trees: "The increased interest in tree reconstruction in recent years (e.g., "Tree of Life" project: http://tolweb.org/tree/phylogeny.html; phylogenetic database TreeBASE: http://www.treebase/index.html) prompts further methodological developments"

  • math model article

  • simulated dataset produced, but mentioned offhand in the results section. code sharing: The tree space was generated and evaluated by an ocaml (Chailloux et al., 2000) program whose source is available upon request. All of these tree shape statistics can be calculated for any trees in Newick format using the simple command-line software simmons available at http://

  • interesting, for primers, I've now seen two papers that cite the full primer sequence in text/table. Does genbank not accept these as submissions? I think other papers did cite primer accessions. Does this count as data sharing? I'm counting internal (table) XY coordinates as sharing....I mean, the supplementary table is just as much a difficult-to-use pdf rather than .xls or other formatted data. also,
  • another instance of a general db credit, but nothing for the original authors of the data. the common problem continues that data is supposedly posted to sysbio but unretrievable…luckily they also posted some of it to treebase. Authors not explicit that all trees were posted to treebase, but they were all posted internally, so I figure the likelihood is good. most studies only give the matrix #. I didn't count a PT under treebase, but also didn't count it against them as data produced. in general, there doesn't seem to be supplementary data in the 2006 papers....this is one of the few that cites it (but again, it couldn't be found)
  • upon extracting the cited dataset, it looks like it was a reuse of a reuse
  • simulation paper
  • table with genbank accessions a little confusing…the authors don't clearly indicate if they were retrieved or produced..then they say that they sequenced things themselves, but not all the genbank numbers are sequential, so I think some were reused. primers are again referenced in a table with full sequence and author citation. why wasn't this done for reused gene sequences (at least the author part)? trees were not clearly deposited in treebase.
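Several notes above guess at reuse from accession numbers that are not sequential: a block of consecutive numbers suggests a fresh submission, and the stragglers were probably pulled from genbank. A minimal Python sketch of that heuristic follows. The function names are hypothetical, the accession shape (one or two letters followed by five to eight digits) is an assumption, and the result is a hint for manual checking, not proof of reuse.

```python
import re

# Assumption: accessions look like 1-2 letters followed by 5-8 digits.
ACCESSION = re.compile(r"^([A-Z]{1,2})(\d{5,8})$")

def split_accession(acc):
    """Split an accession into (letter prefix, integer part)."""
    match = ACCESSION.match(acc.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized accession format: {acc!r}")
    return match.group(1), int(match.group(2))

def consecutive_runs(accessions):
    """Group accessions into runs of consecutive numbers sharing a prefix."""
    parsed = sorted({split_accession(a) for a in accessions})
    runs = []
    for prefix, number in parsed:
        if runs and runs[-1][-1] == (prefix, number - 1):
            runs[-1].append((prefix, number))
        else:
            runs.append([(prefix, number)])
    return runs

def outside_longest_run(accessions):
    """Return accessions that fall outside the longest consecutive run."""
    runs = consecutive_runs(accessions)
    if not runs:
        return []
    longest = set(max(runs, key=len))
    return [a for a in accessions if split_accession(a) not in longest]
```

Accessions carrying a version suffix (a trailing ".1", for instance) would need the suffix stripped first; as written, the sketch raises a `ValueError` for them.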

  • code sharing: http://www.systematicbiology.org.

  • this article continues the trend of referring to taxa with an accession, but this time without even saying genbank! Reuse of a reuse = accession not given in this paper (even though the authors had easy access to it!). Would be interesting to know how they credited data authors in their earlier articles. as such, biblio citation was counted as 0 b/c original authors were not clearly cited.
  • data authors not credited with a reference, but accession numbers are given in a nice appendix table. However, the new sequences are indistinguishable from the reused ones in the table, and the text does not give a range of accessions for the new/shared data
  • when did the sysbio policy on simulation data start? Nothing I've come across obeys this recommendation…ask Nic
  • again, not clear if pt deposited in treebase. No proper credit given to original genbank authors. Other reused dataset was apparently provided by a coauthor from a previous study.
  • reuse of a reuse = not clear if data obtained from that author or genbank (assumed to be genbank, but given an "n" in depository reference b/c not clearly indicated)….also, this then does not credit the original data authors

  • good genbank credit method....(Author Year, Genbank #). again, pt not explicitly posted to treebase but likely is…perhaps write an if formula that takes this into account = new field that says "something was posted to treebase and presumably everything was". code sharing: A Perl script, secondchance (available online, along with the other supplementary materials, at,

  • original data authors not credited…would have been easy to add a references column to table 1. they also post their mrbayes (input) file internally = easy for reuse. Again, couldn't verify this because of non-accessible supplementary data (post-2006?) on sysbio
  • Authors very upfront...first paragraph of methods = we published all our data...well almost, they forgot the trees. This seems like method metadata: "In addition, digital images representing character states are available at" If they had uploaded all images, it would be a dataset. However, this is an exciting incidence of morphobank!
  • another tree simulation
  • another tree simulation
  • reuse of a reuse = no credit to original data authors. One original author credited….why this one and not others? Did the previous reuse cite all the original data? How was it obtained?
  • proposed method
  • method tested with empirical example. This seemed like a "points of view" article, but was not labelled as such. When did sysbio start having that article type?
  • these older papers have good appendix tables with the taxa, voucher specimens, and genbank numbers listed…sometimes with author references. this one should be commended for having both in-text citations of the accession and a nice summary table which indicated newly produced sequences. Were these tables required/recommended in the past? Or a common trend? Why has it fallen out of practice? Are people simply using larger datasets and think this is unmanageable? I think it isn't, b/c they likely have to compose such a list just for their records
  • awkward way of saying the data was posted internally…articles like this are probably why they started the standardization of "are available on"
  • study used genbank sequences for realignment…I think this is commonly done but not articulated b/c it is such a common part of the process

  • blanket statement of genbank was luckily followed by a table with accessions. code sharing: "'LILD,' a UNIX-compatible program that quantifies local incongruence and tests its significance using this strategy is available from the authors at www.columbia.edu/~jt!21/LILD. An interactive on-line version is available at the same site."

  • gene alignment shared internally, pt not shared
  • clear that data was reused, but unclear how it was acquired and who was the original author…it is clear, however, that they did not use the whiting (byu) dataset and instead used a "smaller dataset", presumably by the mentioned authors, but this is not clearly indicated
  • they deposit ga and pt internally…why not to treebase? Have there been sysbio migration problems with this early data?
  • pictures of many inferred trna structures; how would such a raw dataset be disseminated? I assumed in a data matrix. like in gcb, a weird occurrence of the same author publishing twice in a single issue. gs listed in other pub from same issue. do genbank accessions require locality info (author states that detailed locality info is associated with the accession)...this would make sense for supporting biogeo studies. In general, I notice that when I download from sysbio, it records that I did so…does sysbio (or isi for that matter) record downloads of papers…does this factor into impact factor?
  • looking over the given accession numbers, it is possible that some sequences were reused, but this is not clearly indicated in the text. "nexus" is another good search term for finding deposited ga or pt
  • example of a good accession paper with both author citations and accession numbers. (very long too…so other people should have no excuse!) it takes up multiple pages…I wonder if the page charges affect the author's decision to give such a table…if so, they should be waived for data sharing
  • point of view articles sometimes are more like regular articles, but this one, as expected, was a commentary on methods without validation

  • point of view: interesting commentary on the independence of gene alignments and resulting phylo analyses: "Alignment (making hypotheses of primary homology) and tree searches (testing hypotheses of primary homology) are logically independent steps in phylogenetic analysis." provides further justification for counting these as separate datasets even though one typically implies the other

  • in theory, the authors could have shared the original pictures as well (but these were not even clearly mentioned in the methods…maybe they were disposed of automatically by the morphometric software). Also, it is unclear how they obtained phylogenies and combined them...did they just look at terminal taxa or rerun alignment inputs?
  • the compiled datasets appear to be shared in their entirety in the appendix. I'm still not sure how to categorize non-sequence genetic data, in this case karyotypes… I keep calling it biological.
  • the overall database is cited with an author reference, but not the individual datasets….they might in their shared dataset but this would not provide a feedback for original data author credit. They have a personal website for datasharing which I generally frown upon, but it's also not totally clear where their data should go other than a general depository like dryad
  • simulation study that actually states the number of simulations…imagine that!
  • on a personal note, this is similar to my simulated species for validating my mongolia models. They also employed maxent
  • one of a few treebase reuses…doesn't indicate original authors, but cites the original paper announcing treebase. this is the first study I've seen that shares the simulated data sets….state in discussion section that these were tracked but not used (didn't count against people as data produced)
  • weird, the paper gives genbank format accession numbers but never refers to genbank in the paper. Also, is there a specific gis/xy depository? Probably daac for gis, but what about just species localities? Instance of code sharing: "This code is open source and available at". this whole paper shares all data, so why not also be open access? maybe because the data sharing is more in the spirit of scientific progress than total transparency

  • good genbank table (table 1)….ends up that the "previous studies" were all by the same authors, so really a "self" reuse, but at least they gave accession no. also, they clearly put ga in treebase, but did not indicate phylo
  • todd is a coauthor on this paper. Code sharing: "The TAO was developed and is being maintained according to OBO Foundry principles (Smith et al. 2007). In keeping with these principles, TAO is open (under CC0license;, available to users, follows a community syntax (OBO), and employs a versioning system (Concurrent Versions System) to preserve previous versions of the ontology.....TAO was made publicly available in September 2007 within the OBO CVS repository ( and from the National Center for Biomedical Ontology (NCBO) Bioportal (" another instance where when I went to treebase, the study number provided in the published paper did not yield any results. ….alert treebase/community to this in discussion..perhaps test it for all my treebase studies. Also, another instance where the author cites a previous review paper, but not the original data authors (except in a non-linked, non-ISI tracked appendix!...actually the appendix just had accession, not author credits). yet again, could not find the treebase deposition using the provided study number, even when using the "legacy study id". Also, could not find this one using title or author.

  • this was mostly a methodology proposal. I think they used or produced some kind of morphological data matrix, but this was difficult to discern….they seem more like case studies that illustrate the problem
  • since these are conference proceedings, the papers so far in this issue have been lit reviews
  • again, a review of case studies to illustrate a concept. This one did not explicitly mention the symposium in the acknowledgements.
  • oh good…this whole issue isn't special review papers! alignments interestingly posted on a personal website…why not treebase?
  • acknowledgements state: "J. Alroy made unpublished data available", but this was for parameterization of the simulations

  • only some of the original authors are cited if the sequences were more of a special case
  • authors cite their in press paper rather than fully articulating methods and crediting original data. Would be interesting to see the other paper (which is surely published by now)
  • genbank never directly referenced, but accession numbers clearly of that format. "reused" sequences are from author's previous work
  • do authors assume that pt shouldn't be posted since the figures are in the text? I somewhat understand the lack of treebase in 2000 because it was relatively new, but why hasn't it (at least qualitatively) seemed to increase much by 2010? Is it difficult to post? not enforced like genbank? etc. also, this was the third paper in a row that posted the alignment to a personal website...what explains this practice? maybe a paper they all read did it or sysbio advised it at the time (or the particular editor did).
  • yet another wiens paper. Given the number of self citations in the biblio, this is his favorite journal to post in (nearly the only one that he does publish in). Acknowledgements give shout out to jack sites

American Naturalist articles

  • I thought this was a math model paper with no simulations…until the end in the appendices text which mentioned reused datasets. The authors appear to be friends with the original authors of the empirical data from the comments in the acknowledgement section.
  • Short study produced morphometric data on turtle shells. From what I noticed, it may have been a negative result…hence the "notes and comments" article rather than regular article?
  • Another math model. Are AmNat e-articles typically model explorations? Or just the two I encountered first? Also, there was no indication of posting the model to a depository….just extensive appendices full of math.
  • Posted at the top of the article near the other metadata!!!!!: Dryad data: (hyperlinked even!).. This is a meta-analysis. Reuse data extracted manually from paper/digitization of figures (whole paragraph on acquirement)….cool that they made the matrix available.
  • Model. Hints at empirical data, but apparently just citing a method/equation/theory.
  • Extensive biological, ecological, and experimental data collected. Only one in-text table which was a brief summary of this (mostly statistics). No indication of data deposition.
  • URLs cited intext…is this a common practice rather than putting in biblio…why not both or combine? I.e. hadcm3 models were proposed by one set of people, but now the GIS coverages are available online and presumably maintained by other people.
  • life history data not shared, but math model is articulated in detail (with rationale, development, etc) in appendix b. unfortunately, this is not considered in the scope of this study but could be in a software/model sharing approach.
  • extensive ecological datasets, but no indication of sharing.
  • model…simulations mentioned. Amnat does not suggest simulation dataset posting like sysbio did.
  • Proposed math equations, no apparent model simulation.
  • as a general trend, I've noticed papers in AmNat that utilize the internal data repository mainly for extra tables and figures, not for uploading their raw data. in this case, the author uploaded things like the locality XY coordinates, but not the extensive dataset on trill rates and morphology. Also, unclear how sound recordings were obtained and why author didn't reciprocally make hers available. can sound files/morphology pics be deposited at dryad or other depositories? these are somewhat unconventional files for data reuse, but very useful for their fields. perhaps the museum database that houses the accompanying specimens could record the measurements, publications, and files related to the specimen (which from my experience, is not done).
  • in terms of model reuse, amnat encourages upload of GUIs and code, but I don't see this happening. For instance, I think R code could easily be required…even just as a text file. This would record re-use of programs, scripts, etc that were written by other people and make the author aware of the need for attribution (they usually attribute R itself, but not the specific packages used). as i read further, this paper did attribute an R package! but they cite it as a paper...did they read the article or find the package first? my guess would be the r package. does the paper they cite openly disclose the url of the package or otherwise encourage its reuse? this was not counted as reuse, but the excerpt was included on the reuse "other" sheet. also, in appendix b, the authors give extensive documentation of the model parameters (essentially metadata). I think this was a good practice, but unfortunately cannot count it as actual data sharing.
  • like previous article, good metadata in supplementary info about model parameters. Again, this is a good practice which I support but I did not count it as data sharing, but did store the relevant excerpt on the "internal" ss
  • This study utilized biological material from other individuals (not authors)….how could/should this be cited and credited? They do not refer to the contributions with a paper citation, though that may be appropriate. In the future, is it possible that in addition to datasets receiving a full bibliographic citation, so could biological material? This would help attribute museums/taxonomists (i.e. could help a museum prove that they are utilized) and could give a collector comparable credit to a published paper (i.e. tenure "credit" for the time spent sharing data). This is the same issue with dataset citation, just a little more complex when you add in non-digital data, but still data that could be reused and needs to be more clear on where it was obtained (from my experience obtaining museum loans). The info about the biological material was recorded under other reuse, but did not "count" as a reuse.
  • No direct attribution to gene sequences (may have been only used for branch lengths, but I'm pretty sure they generated an entire phylogeny/topology from the sequence data).
  • again, detailed metadata on methods, but no sharing of the actual data. Also, they seemed to have obtained a lot of phylogenetic and biological data from another source without saying where they obtained it…turned out it was from themselves (a previous study). would be interesting to know if in their previous paper, they deposited the tree and sequences (aligned too). as a general note, typically if the reused datasets are mentioned in the introduction, it is in the final paragraph of the intro....or in a paragraph that starts with "in this study..." or "here we examine..", which usually is the last paragraph of the intro
  • at first the authors don't cite the papers from which they extracted data (bio and phylo)…later they refer to the online appendix with this info. Shouldn't the original authors be cited in the main paper for reference/citation tracking?! is this common in meta-analyses (to not cite the original authors of extracted data in the main body of the ppr)? if it's in a table does it get acknowledged by isi as a citation? but what about an appendix table? interesting, appendix a and b have separate "literature cited" sections. again, confusing and do these authors get citation credit?..investigate in isi. also, as a general note, the data stored with the journals (internal) is difficult to reuse...usually a photograph or word file, not a downloadable .txt, .xls, or other format usable for analysis. this is one obvious advantage of the depositories which store the data in multiple usable formats.
  • This article would be a good example of model datasharing....they reuse a math equation and seem to post their model (I think they do...and it's not just the software used, but the actual GUI from what I can tell; from the download site for it, it also looks like they have updated it for wider use (i.e. different platforms)). odd use of acknowledgements to provide url for the uploaded shared data. direct quotes: METHODS: The model used here is derived from Guillaume and Perrin (2006) and is implemented in Nemo 2.0.6 (Guillaume and Rougemont 2006)…..ACKNOWLEDGEMENTS: The simulation software is available at Simulations were performed on Westgrid (
  • There seems to be a large number of articles in AmNat that are mathematical models…this should be taken into account in analysis…perhaps by sometimes analyzing without or blocking by study type (yet to be coded….see yellow notes on article ss). The amnat policy is newer according to nic's research, but there is very low sharing of math/coding/software/GUI (or just the simulation iterations at the baseline!) which would be the best mode of reuse/sharing for these types of papers.
  • another model, another huge batch of iterations, and no data sharing
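
The "analyze without or block by study type" idea in the notes above could be sketched roughly as follows. This is only an illustration under loud assumptions: the records, the study type labels, and the shared/not-shared flag are made-up placeholders, not values from the article spreadsheet, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical placeholder records: (study_type, data_shared).
# These are NOT real coding results from the article spreadsheet.
articles = [
    ("model", False),
    ("model", False),
    ("model", True),
    ("empirical", True),
    ("empirical", False),
    ("meta-analysis", True),
]

def sharing_rate(records):
    """Fraction of records whose data_shared flag is True (None if empty)."""
    records = list(records)
    if not records:
        return None
    return sum(1 for _, shared in records if shared) / len(records)

def rate_by_type(records):
    """Sharing rate computed separately within each study type (blocking)."""
    groups = defaultdict(list)
    for study_type, shared in records:
        groups[study_type].append((study_type, shared))
    return {t: sharing_rate(g) for t, g in groups.items()}

# Three views of the same records: pooled, with models dropped, and blocked.
overall = sharing_rate(articles)
without_models = sharing_rate(r for r in articles if r[0] != "model")
blocked = rate_by_type(articles)

print("overall:", overall)
print("excluding models:", without_models)
print("by study type:", blocked)
```

Comparing the pooled rate with the per-type rates would show whether a journal's low sharing rate mostly reflects how many model papers it publishes.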


  • another model…so many in this journal!
  • yet another model. Maybe get an estimation from the snapshot about the proportion of models vs. experiments/observational per issue and extrapolate that to per year. this article also has a separate citation list for the appendix. Extensive math/derivation/method metadata in the appendix
  • and yet again. Does amNat publish the pictures of animals in its empty space to compensate for lack of actual naturalist/ecology studies these days? This like the previous article was missing a methods section, but had an extensive methods appendix….is this amnat's typical procedure for model/math papers?
  • supplementary info = methods metadata
  • Almost data sharing, but hard copy of files (almost like a museum specimen): "Sound recordings associated with this research have been deposited in the MVZ sound collections." Since not digital, I didn't count this the same way I do for biological specimens.
  • good dryad example
  • article frequently uses the term "data set" in reference to data collected from the different localities. The authors obviously acknowledge the idea of a data set. Also, Valerie should be careful not to rely on "data set" as a keyword in her searches. I counted their in-text XY coordinates as shared data....but this should be downweighted by the "how cited" factor since it was cited in the paper (as indicated by F for figure/table)....should also downweight "self" citations (i.e. author reusing their previous data)
  • software citation: ESRI extension: the Animal Movements Extension (Hooge and Eichenlaub 1997). General question: when did amnat start putting the "online enhancements" section at the top of a pdf article? When did this include dryad? - search amnat and dryad in ISI. no luck in initial search on scirus or isi.
  • partial dataset posted….a summary of % survival by year. I assumed it is useful data for population analysis and age/stage. However, I think they had more extensive abundance estimates and XY coordinates from tracking that were not posted. Also, the "reused" data was reused from the author's previous study...good, but still not transparent.
  • this article was a retort to another article that attacked a previous paper of the authors. Perhaps "throw out" this article? It is only a one page editorial type, but it qualified as an article according to ISI.
  • as typical in amnat, detailed and extensive methods appendices, but no raw data. In terms of data reuse, the authors reuse their previous review/meta-analysis…thereby citing themselves, but not the original creators of the data. They also don't specify what parameters they obtained, though this is implied by later paragraphs. I think they should have attributed the original authors. they did not share the extensive dataset from their experimental mesocosms (body size, density, etc in relation to predator (fish) density).
  • large empirical and simulation datasets not posted. they are associated with a previous study by the author which may affect this. in data produced, there is a weird data matrix situation: THIS WOULD BE A USEFUL MATRIX: "Using this field data, we constructed a six-stage demographic matrix model with 13 nonzero matrix entries (for more details, see Knight 2004)." SOME KIND OF MATRIX IS IN A TABLE IN THE RESULTS, BUT I THINK IT IS A SIMULATED RESULT. in general, isi must have started requiring funding data in 2009, b/c few of the 2008 articles have it retrieved even though it is there (in contrast 2009 articles that didn't have it retrieved usually didn't have it in the article).

  • GIS data not cited well….not even clear what quadrats they are using, its format (raster, etc), or how it was obtained (url, self, govt, etc)

  • short model paper…intent: "Our aim, therefore, is not to directly attack the conventional signaling hypothesis of song-type matching but rather to offer an alternative model based on reliable signaling and wait to see whether future experiments support the unique predictions of our model."
  • long, math model paper about evolution. Could they, and others, have used empirical data at least for validation? Is this prevented by problems utilizing depositories or lack of deposited data? Also, in final analysis, can decide to discount "simulation iterations" as undeposited (produced) data...this was mostly included because of the sysbio policy, so it at least shouldn't "count against" amnat articles when assessing if they meet journal policy
  • Author makes an interesting comment about the difficulty of obtaining data for fossil calibration: "Unfortunately, most compilations of the fossil record do not provide measures of its richness (for an exception, see the growing Paleobiology Database, but simply provide the age of the oldest first occurrence of each taxon. Even if the primary literature is consulted, it is notoriously difficult to extract what is known about the entire fossil record of a group. *Accordingly, most articles that use the fossil record to calibrate molecular phylogenies only provide estimates of the age of the oldest fossil of each lineage (with a notable exception being Springer 1995)."
  • Unclear if there were iterative simulation outputs. Seems to have just been a few simulations, the results/parameters of which are discussed one by one in the paper.
  • XY coordinates posted and author explicitly states that they are. But why not the biological data set? I've seen purposeful inclusion of XY coordinates in multiple amnat articles now, is this part of their policy?
  • Authors may have posted their code/software: "Calculations were made using purpose-written software, available on the online edition of the American Naturalist." I didn't have internet access at the time of reading to verify this. Since I haven't been tracking this, I didn't count this as data sharing, but if a usable GUI is posted online, it could be considered as such in the future and a good example of posting source code.
  • Total number of simulation iterations (pseudo-data sets) was not indicated.
  • boring model ppr. Unclear how they parameterized their model….it was about a very specific ecosystem and plant, yet it was unclear how things were calibrated
  • EXTENSIVE ecological/biological dataset of both observational and experimental measurements….undeposited as expected.
  • Possible morphological dataset, but may have just been visual taxonomic comparisons. Table 1 alludes to analysis of the behavioral videos, but this is not clearly indicated in the methods. I recorded the excerpts in the Undeposited sheet, but did not count it as data produced since it seemed to be used for qualitative comparisons.
  • Simple study, but had a mortality/mass dataset about eggs from a common garden experiment.
  • Life history data and matrix produced but not deposited.
  • appendix data included nearly all raw data from the extracted literature dataset, except genbank accession/info. however, it did not appropriately credit the original authors. Odd amalgamation of dataset reuse. Literature dataset + genbank + previous study by the author. Genbank was handwavingly acknowledged, but no allusion to accession numbers..upon looking at the bibliography, it is apparent that the author of the gene papers was a coauthor on this paper. I still think accession should have been given for retrieval purposes. this seems common though, citing self (racking up your "impact factor" for tenure) but not really making your data readily accessible (or citing other authors whose data you used...i.e. the extracted lit data). the extracted literature data was also not appropriately credited....with instructions to the reader to contact the author if you desired the literature...what about crediting the original authors?! do journals have bibliographic limitations that would prevent citing the 45 papers used herein for determining hybrid viability? is that why amnat has separate lit cited sections for the appendices? does isi count these as citations as well? does amnat likewise? could an original author trace the connection between their paper and this one through isi? how else would a cited author find out the paper was utilized (for tenure purposes)?
  • biological/ecological data collected and associated with treatment factors…no apparent reuse or sharing
  • math model…no apparent validation/parameterization/empirical basis
  • bio/eco data collected but not shared. C++ model could be shared.
  • this paper was a mess. datasets were cited haphazardly...some as reviews, one as a database from a publication, some as papers, and others as just "the literature". with no indication (i.e. no methods), the authors all of a sudden say "the data we analyzed". what data?! It didn't have a methods section which it greatly needed. Can an editor not enforce a subheading?! What then can we expect for dataset citation enforcement by content and copy editors? Maybe since this is a "notes and comments" article, amnat considers it permissible...but why then have a results and discussion section. i suppose they could have deposited data, but I wouldn't want to see the dataset (probably would have bad metadata), it's not clear what compiled dataset they would share, so I indicated "no data produced"
  • this seems to be a case of data reuse for model validation/parameterization, but for some reason this is not mentioned until the results section so it is unclear how the data were obtained and used in the modelling process. model simulations were run for many scenarios, the iteration data is not shared. the authors do, however, share their model code: "The algorithm, in S-PLUS or R, is available at".
  • in 2007, it seems that authors could opt for online or paper (before lit cited) appendices...did they later move to all online or is it up to how much the author wants to pay? This model used empirical data to inform the parameterization of the model, but in disparate ways, most of which involve unpublished data by the authors or statements about the biology from other sources.
  • model, no apparent simulations or empirical validation
  • short ppr, no true methods section. Small morphological dataset could have been posted (binary and quantitative data)
  • There was an apparent reuse, but it seems to be referring to the initial discovery of the lizard population because at least some of the sampling occurred in 2004, which is after the publication date. quote: "We studied color frequency of male Lacerta vivipara at five sites of the Ossau valley of the French Pyrenees (Heulin et al. 1997): Heulin, B., K. Osenegg, J. Leconte, and D. Michel. 1997. Demography of a bimodal reproductive species of lizard (Lacerta vivipara): survival and density characteristics of oviparous populations. Herpetologica 53:432–444"
  • Results tables are shared, but not raw data. The intext citation sounds as though the raw data was shared, but the appendix is full of statistical output tables.
  • mathematical model. Interesting that many papers like this state "we have demonstrated that factor X causes result Y" when it is a math model
  • open source software credit: "I then calculated the Horn-Morisita distance measure using the vegan-R package (Oksanen et al. 2006)." Authors reuse their own data (multiple reuses), but there is no indication of sharing those past or the current compiled data set. a few XY coordinates given, but not of individual transects: "Reference remnants were three large tracts of forest in the vicinity of St. Arnaud (36 40 S, 143 20 E; 25 kha), Dunolly (36 51 S, 143 18 E; 16 kha), and Rushworth (36 38 S,145 02 E; 41 kha). Seventy-one transects were randomly distributed across the three reference areas." In general, I only count intext XY if greater than 10 localities
  • model with supporting empirical experiment - binary and density data could have been shared
  • somewhat of a software share: "MATLAB routines are available from the authors upon request" would author have posted if available/aware? Author repeats this in the discussion: "However, contrasts can be constructed by numerous software packages, and the LSVOR regression is relatively simple to program (MATLAB routine available from the authors)." and seems to be advocating his code as a good alternative to more expensive software
  • interesting blend of major ecological theories…neutrality, island biogeo, and niche! no clear indication of simulation iterations, seems to just be a few tests of various scenarios. In general, amnat seems to encourage lengthy appendices for derivation/model theory/computation explanations. This is a step towards open model sharing. Next step = code sharing, next step = GUI sharing, next step = platform independent GUI sharing.
  • FINALLY! Some good data reuse with accompanying sharing….independent posting of a usable excel file of compiled literature data. Props to the authors. Unfortunately, they don't cite the original authors directly in the paper (they did screen 100 original pubs)…see the appendix citations, and AGAIN, figure out how that works for citation tracking.
  • the appendix tables related to the dataset are actually statistical outputs. The way they cite the data seems to imply that the raw data is available...this is confusing! They re-use a dataset from "pianka and colleagues" and yet again, cite a paper by the author but not those by additional colleagues.
  • no datasharing or reuse. Is there a repository for more agricultural ecology? It seems to me that should be stored separately, but possibly linked to other ecological depositories. Just seems like different types of researchers with different interests (theoretical vs. applied) would be soliciting the data
  • good morphological dataset on snails to study ecological interaction with beetles. No apparent data reuse or sharing.
  • self proclaimed meta-analysis study (literature extraction). Full (extensive!) extracted dataset available online, but not clearly cited intext. Authors only credited in appendix citations (pdf associated with excel)
  • math model on resource limitation, no simulations just illustration of various scenarios
  • full data set shared in a zipped excel file. Contains XY coordinates, allele type, and biological information. Unclear if allele sequences were used, if so, they were re-used from the schmid 2006 study (one of the coauthors). What motivates an author to make their article open access and with the complete data set available?
  • 1995 dataset alluded to, but not used for any comparisons or model fitting…perhaps they were just referring to a continuation of methods. Bio/eco dataset produced (biological info abt separate species compared to make eco conclusions)
  • some kind of genetic analysis involved, but no indication of sequencing or use of previously generated sequences
  • possible incidence of reuse...model populated with life history parameters, but I think these were static parameters, not a full dataset: quote: "When possible, parameters were estimated from the empirical study described above and from other observations of Echinacea’s life history (Wagenius 2000)." model code is shared: "We developed a stochastic, spatially explicit, and individual-based computer simulation model of plant population dynamics in a habitat that becomes fragmented (see appendix in the online edition of the American Naturalist)." snapshot of code saved in zotero.

  • GIS layers and XY data could easily be stored in a zipped file internally, on dryad, dataone, etc….why is this not done? With mongolia stuff, I was able to dig through website archives to find many layers (i.e. the csu ice study)….where could that data be stored?
  • possible example of software sharing/reuse: "Below, we give a brief introduction to the digital life system, Avida. Additional information is provided in appendixes A and B, including a schematic of a digital organism and a glossary of terms. A more detailed description of the system is available elsewhere (Wilke and Adami 2002; Lenski et al. 2003; Ofria and Wilke 2004), and documentation is available online ("

  • math model….no apparent simulation iterations beyond testing a variety of parameters
  • pre- and post- treatment dataset. Unclear if the pre- dataset is collected within the context of the current study or previous work by one of the authors (duToit 1990..which seems to have been a more observational study)
  • another instance of appendix citations only for extracted literature data (i continue to count this as a "0" for citation credit)...but at least the compiled data set is shared (though the text is unclear that it is deposited...only indicates that the references are provided, not the actual data....but the online appendix clearly has the data tables)! reference made to "phylogenetically corrected analysis" but unclear indication of phylogeny used or sequences it was built from: "using the topology of the actinopterygian supertree (Mank et al. 2005)." from the citation, it is unclear if this was the source for the original phylogeny. general problem with amnat: appendices (supplemental info) only accessible from the full text page...i think it should be accessible from the abstract/entry and search pages

  • data was collected to validate a model, but still could have been shared
  • model simulation data
  • gene sequences mentioned in one section of the methods, but then the genbank accession are provided in another (unrelated) section. Reused sequences are not credited by author or accession in the main text, and only by accession in the appendix. Genbank is not explicitly indicated in main body of paper for the reused data (it is for the shared). the appendix tables do state "genbank", but I counted this as a non-explicit reference because of the main body wording
  • the dataset is "reused" in the sense that it is an old dataset, but it is an unpublished one. It is unclear if these authors also collected the original data. XY coordinates are not given, even though mapped in figure 1 and clearly available in the methods descriptions.
  • no apparent reuse, sharing, or production (model)
  • extensive biological dataset about female preference and male traits. Data in raw (video/sound) form could be made available, but for this investigation, more importantly, the coded data could have been shared
  • super short paper of a simple observation. Probably discount in analysis and reconsider use of all "natural history notes". This article is the first instance I've seen of a video posted as supplementary data, which makes it seem as though behavioral videos and sound recordings could be posted.
  • extracted data not credited in paper or appendices…author states that the sources can be obtained from the author by correspondence. compiled dataset available online in excel and ascii. Modified phylo supertree could have been made available (not even given as a figure, perhaps b/c it was an intermediate step in the analysis)
  • biological dataset produced (traits, survival).
  • ecological dataset produced, not shared
  • biological/behavioral dataset produced but not shared
  • mice-parasite experiment, dataset includes measurements of parasite load of mice. Dataset not shared. Note, when I classify the produced dataset as bio/eco…that typically means biological parameters of one species were measured, but the treatments were of different ecological interactions with another species (as opposed to eco/bio where many species have multiple parameters measured and eco which is community based)
  • math model with no apparent simulated iterations.
  • reuse of published (literature) datasets, one of which is indicated to be an appendix which may or may not be a downloaded file (excel/ascii). Unclear how the data was obtained….perhaps directly from the original paper
  • unpublished dataset of the coauthors' is utilized for the analysis, but they still articulate how the data was collected
  • reused dataset is one previously published by the author. Used herein to validate the proposed model. Dataset should have been shared as it is unlikely that it was shared in the previous paper, but since it was not generated in this paper, I did not count this as an instance of data produced. actually, on reading the citation, the dataset was not yet published and is "forthcoming"....does this mean it is accepted for publication, submitted, or just in preparation?
  • each study of the extracted dataset is described in detail. The compiled data set is not provided, though some summary tables are (about range of the data). Most of the datasets cited have at least one of the coauthors involved in the previous publication. according to the acknowledgements, the others were likely obtained by correspondence but this is not explicit in the paper. to me, sharing a compiled dataset is analogous to sharing a gene alignment...something another scientist could do, but greatly expedites processing, especially for meta-analyses
  • the authors "reanalyze published data", but they actually perform the measurements themselves rather than extraction from previous literature. they don't make reference to how this compares to the original data, so I'm not sure why they termed it a reanalysis. in general, model papers are either model fitting (aic) to empirical data, models with simulated iterations, or math models with solutions.
  • this is a critique of allometric metaanalyses that have problems with consistency in data collection…it would be a much more meaningful study if the author reanalyzed the data and proposed ways for standardization or at least illustrated (rather than pontificated) about the flaws.
  • math model of stable states in mimicry
  • extensive field, experimental, and morphological dataset, unshared
  • dataset not shared
  • study measures something analogous to growth rate (I think, called "titer"). This is used to validate a model
  • five tradeoff scenarios analyzed in the model, but there were parameter inputs, not iterations.
  • application of seed bank model to salmon
  • longterm dataset, but not necessarily reuse of it, though other publications by the same authors utilize it. Data likely not shared b/c they have planned publications which will also utilize it.
  • bio dataset. Again, the way they referred to the data in the results seemed as though the raw data was available in the supplementary data, but it was actually the anova (stats) outputs
  • The data set is not posted, but the authors do mention in the acknowledgements that: "The collated data sets are available from the authors." This was counted as data produced, not shared, b/c presumably many authors could be contacted for the data…hence why the primary author email is required.
  • the article says "in this note", but this is not an official "notes and comments" article
  • this paper refers to "empirical data" in the introduction, so I expected an analysis of the empirical data….but it actually laid out suggested study designs for testing their proposed theory. I would have liked to see them test the data they expressed qualms over and reuse/test it themselves
  • I'm not familiar with this type of study, but it seems to have produced growth rate data according to different factors which could have been shared.
  • ess model that predicts future morphology
  • no profound comments for this one
  • model scenarios w/out empirical validation
  • model of various scenarios, comments in discussion state that "other" studies utilize computer simulations whereas this one does not
  • example of GIS reuse. Poor citation practices with intext citation of url, but no bibliographic citation, which is particularly odd considering that they cite program R with a full citation in the biblio. Does R specify that this should be done when it is used? Do the respective GIS sites do this as well (USGS, etc)? are any of these datasets also accessible through daac? they also produced a GIS dataset that could have been shared (plant areas).
  • the use of "grids" and acknowledgement for help on "GIS analyses" indicate that there are potentially reusable GIS data, as well as the biological data concerning mate preferences/characteristics
  • evolutionary ecology model
  • model with simulated data and allusion to empirical data, but these were comparative case studies, not analyzed data
  • comparative mathematical models
  • behavioral outcome experiment….should have produced a binary dataset about response and treatment type
  • data extracted from database. Digital "sampling" and species range delineation performed in GIS…could have been shared
  • math model with solved solutions. Brief reference to empirical data, but as a conceptual case study
  • dataset is not shared, perhaps because of use in other unpublished studies
  • math model based on lotka volterra….a brief comparison to empirical data in the discussion, but not used for validation/parameterization
  • authors refer to importance of empirical data in parameterization of their model, but this is not explicitly discussed anywhere other than the introduction: "Our analysis is based on empirical estimates of key parameters of the replication process of poliovirus. This is an important aspect of our analysis, as these empirical parameters of the molecular biology of poliovirus replication have been determined by extensive experimentation in a number of laboratories."
  • yet another model..I feel like there are more in this 2005 set than other years
  • this was a literature survey, basically analogous to what I am doing. I consider it a "sample" of existing data, so therefore an incidence of reuse and a dataset produced. Over 1000 articles were analyzed.
  • number of simulation iterations never explicitly stated. interesting, an article I just analyzed had this same author as a coauthor. And I see the "P. abrams" author cited or acknowledged frequently
  • extracted literature dataset shared in its entirety. Authors of extracted literature only credited in appendix. Phylogeny given internally, but not posted to treebase.
  • this article should be thrown out. It was an editorial
  • math model. I often check math papers with words like "data set", "dataset", "example", "empirical", etc. to make sure I didn't miss a reuse or sharing
  • bacteria paper…not as common in amnat, used for competition dynamics. In general, is it a problem that the 2000 papers were submitted in 1998 and accepted in 1999? I think probably not because we have the same lag time in the 2010, 2005 articles
  • in general, a lot more single authors in this time period (not huge groups)
  • early treebase reuse, unfortunately no accession. Extracted literature dataset did not have full indication of original authors, only two….a complete list is supposed to be available from the authors. author explicitly said that he could be contacted to provide the dataset, so I counted this as "sharing".
  • no credit to original authors of extracted literature dataset. weather dataset reuse too. authors state: "These data were collected in a wide variety of ways, presenting a formidable barrier to comparisons across studies." but seems like they came up with good ways of normalizing qualitative and quantitative data into the same dataset
  • model building from empirical data
  • unclear if simulation datasets generated (or just given parameters changed at relevant increments). general evidence idea: increase of data reuse in amnat over time as evidence that dryad would be utilized b/c of increased interest in data sharing.
  • only in the appendix does it say how many simulation iterations were produced. statement for methods: "since there is not currently a way to track ecological and environmental data reuse through handling numbers as in other disciplines (cite heather), we manually extracted information from articles. If a similar study is done or an expansion of this dataset that is not interested in documenting the percentage of reuse, various search terms are suggested to narrow down candidate articles: "empirical", "dataset" or "data set", depository names, "hdl", "Ac*" if this isn't too broad, and other terms from valerie. terms like "data set" would still keep the sample set broad to capture non-depository reuses."
  • math model
  • math model….how is notes and comments different than regular articles then?
  • authors give very detailed methods on their search process for the literature extraction dataset. Good sharing. Extracted datasets seem to have a greater frequency of sharing…from my perspective, because they are difficult to obtain and the authors acknowledge that and don't want others to go through that. whereas with genbank, it's just so easy to do a batch download of species that it's also easy to neglect giving credit and giving the information needed to reobtain the same sequence
  • simulation data may have been produced, but unclear if multiple iterations were done…from table 1, I think only a few certain parameters were tested.
  • no url given for accessed databases. Also, how should I categorize an intext citation of an organization? Perhaps as the db?
  • math model. Uses the word empirical a lot, but mostly to justify parameter decisions
  • math model with empirical validation, but dataset not shared
  • a self reuse in which the author also gratuitously states how interesting, useful, and novel their previous study is.
  • a historical impt paper, one of the many post mt. st. helens investigations
  • pop bio/movement study which should have a simple binary dataset plus demographics
  • math model without apparent simulations or empirical validation
  • this would be an interesting article to test true data availability...the authors state that the site species list can be obtained by request to one of the authors. it would be interesting to try the email in the paper or otherwise track him down to get it and see if a 2000 dataset could be realistically obtained. haha: "Later, the plot was baited with Pecan Sandies (Keebler). These shortbread cookies contain fats, carbohydrates, and proteins and are excellent ant bait (S. Cover, personal communication). The cookies can be crumbled for uniform coverage on a plot, and they contrast well against dark soil. The bait allowed us to locate cryptic nests and to sample ants foraging on the plots, but nesting off the plots."
  • I'm still unclear what the notes and comments section is…they are often math model papers just like in the regular articles, occasionally shorter but often comparable length. At the least, they aren't obligated to include an abstract or acknowledgements
  • math model without empirical validation or apparent case studies
  • comment on dealing with varied data sources: "Because of issues with data availability, the temporal spans of the biological data and environmental data overlap only partially or not at all. However, any differences from the longterm averages will be small compared to the continental variation in these variables, which is the pattern of interest."
  • number of biological measurements, including blood chemistry which would perhaps be a different data type in future studies like this (or recoding of this data)….in general, the biology category is too broad, but difficult to breakdown since contains so many data types.
  • authors comment on the smallness of the dataset, but it still could have been shared at least for method/result testing
  • math model without empirical validation (though they do use the word empirical a lot, I think referring to the numerical results that "prove" their theory)
  • the authors share the data matrices. They could have potentially shared the very raw original data, but this was from different papers (albeit by the same authors), and the matrices were the unit of analysis in this study so far as I could tell
  • this paper was basically just a review. statement to make in discussion: part of the percent reuse is influenced by typical type of articles produced in that journal…i.e. sysbio articles nearly require data reuse, whereas many amnat articles are models or conceptual reviews. This also shifts in time (2000 vs. 2010)
  • in general, the majority of bio/eco datasets are not shared
  • the appendix referenced with the plant measurements refers to methodology, not raw data. They share extensive site descriptions and a species list, but not the raw data that was actually used in analysis. The shadow of sharing, but not the real thing.
  • math model refers to case studies, but does not employ their raw data for validation purposes
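
The search-term screening suggested in the notes above ("empirical", "dataset" or "data set", depository names, "hdl", "Ac*") could be sketched roughly as follows. This is only a minimal sketch and not a procedure that was actually used for these notes: the function names are hypothetical, the depository list (genbank, treebase, dryad) is an assumption drawn from names mentioned elsewhere in the notes, and the "Ac*" wildcard is interpreted here as any word starting with "ac".

```python
import re

# Search terms taken from the notes. The depository names and the regex
# interpretation of each term are assumptions made for illustration.
SEARCH_PATTERNS = {
    "empirical": r"\bempirical\b",
    "dataset": r"\bdata\s?sets?\b",            # matches "dataset" and "data set"
    "depository": r"\b(genbank|treebase|dryad)\b",
    "handle": r"\bhdl\b",
    "accession": r"\bac\w*",                   # the "Ac*" wildcard; likely too broad
}


def matched_terms(full_text):
    """Return the sorted names of the search patterns found in an article's text."""
    text = full_text.lower()
    return sorted(
        name
        for name, pattern in SEARCH_PATTERNS.items()
        if re.search(pattern, text)
    )


def is_candidate(full_text):
    """Flag an article for manual review if any search pattern matches."""
    return bool(matched_terms(full_text))
```

As the notes anticipate, the "Ac*" pattern also matches words such as "acknowledgements", so in practice it would probably need to be dropped or narrowed. Any flagged article would still need to be read manually to confirm a reuse or sharing incidence.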

Molecular Ecology articles

  • Molecular Eco has entire "supplementary files" section at the end of the paper. But they give the caveat: "Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries
  • This article cites genbank at least 4 ways: as Genbank, as supplementary table of combined genbank info, as url, as accession ….they use/submit 3 types of genbank data and cite it differently each time. all citations including the Accession number instance
  • good full documentation of genbank (i.e. all accession numbers recorded in a table b/c many were used: ID NUMBER ACCESSION REFERENCE HORSE DEPICTION & BREED AUTHOR PUBLICATION, table is saved in zotero as an attachment under the article)...nice resource
  • again, cites genbank two different ways - first as paper for acquired primers, then as genbank (no accession) for additionally analyzed genes..but then in the results, they cite specific accession numbers.
  • is biological material reusable data? As a taxonomist, I think so, but it can't be deposited on line. no genbank accession.
  • Another example of genbank mentioned vaguely throughout the mat and methods, but not formally "cited" until results…but in this case it was data sharing so it was understandable. Also another example of "could have been in treebase"
  • genes produced and used, but not cited as datasets. The used dataset may be posted on genbank, might be something to check
  • clearly refers to a breeding success dataset ("data files") but they are not included as tables, supplementary info, or deposited…I called this a "text" indication under Data Produced
  • GIS mentioned and then later clearly stated as in supplemental info (w/ coordinates), alignment and phylo not clearly indicated as shared data until the results section in which alignment was included in supplementary but neither align or phylo figures (trees) were posted to treebase
  • Reuse: mentions genbank generally first, only mentions original authors in table caption, mentions one of many accession numbers in text, remainder in supplementary table. So, to be "generous" I classified this as an "Au" citation even though it was more a "D" because of vagueness in some situations.
  • they did pcr, but for aflp banding to genotype, not sequence generation
  • many datasets (and types) generated, but none shared. The sequences seemed to be of more than sufficient length for genbank submission. long term study, but not reuse: "This study continues a protracted investigation of willow evolutionary ecology in eastern North America."
  • yikes!: "Sampled individuals were transported alive to the laboratory and sacrificed by freezing."
  • more than 10 gis coordinates given = counted as shared dataset (especially b/c seeing the map, I would have counted it against them if they hadn't given coordinates). It's a weird coordinate system, but seemed to have high resolution for location.
  • genetic sequences with some kind of pairwise analysis that may have had an additional dataset that I couldn't discern
  • I thought they sequenced genes, but they actually just detected the presence of certain gene regions. They did provide the raw binary data as an appendix at the end of the pdf
  • sequences are said to be deposited with an accession that looks identical to genbank, but genbank is never formally mentioned
  • xy coordinates could be obtained from author upon request= counted as sharing. Genetic material sampled in someway, but not sequenced….so, a genetic data matrix
  • sequences aligned, but not posted to treebase. No apparent PT produced…but if so, not deposited
  • a relatively rare instance of gene unposted to genbank.
  • weird….two times in a row that GS were not posted to genbank. Does genbank only accept sequences of a certain size? I'm pretty sure they accept anything (i.e. primers included). 4 xy coordinates were given in text…like I said, I count this as a dataset if above ten (otherwise I consider it incidental)
  • the first study that explains BLAST! Yes! Genbank's matching search. They didn't necessarily reuse the data except to determine what species they had, but still provided the accession numbers (b/c they resequenced these essentially and wanted to avoid duplicates....good practice). again, primers spelled out in text…is this policy? treebase should have been used for tree indication of alignments produced even though I assume they were. GIS reuse from government sources, not well cited
  • odd…yet another unposted GS…when did molec eco's data policy become more stringent regarding gene deposits? Sharing of biological material: "Ceratitis capitata embryonic DNA extracts were kindly provided by Kostas Bourtzis (University of Ioannina, Agrinio, Greece)". I'm not counting primers anymore, but this one cited one paper for two of the ten primers.
  • were sampling locations shared? Table S1 Information on the sampling location and genetic analyses in this study,….Fig. 1 Phylogeny of the Acanthodiaptomus pacificus mtCOI sequences using a Bayesian phylogenetic approach, and the geographical distribution of the three mtCOI lineages in Japan shown by circles decollated with the lineage-specific colour in the Bayesian tree.
  • what is different about 2009 that so many gene sequences are not posted? Has nic had luck with tracing history of data policies? Remember that the true date of the paper to correlate with the policy is the SUBMISSION date. ironically, the paper cites an article by whitlock (but not one of the data sharing ones). code sharing: The source code is included in the Supporting Information.
  • also fewer PT producing studies in this year's batch. Are aligned sequences or other matrices also necessary for data input in other phylo analysis (i.e. this one was inbreeding coefficient)?
  • yet another undeposited GS
  • xy coordinates given in appendix. Gs not deposited
  • GS annotation reused to annotate sequences…can these be posted to genbank?
  • flow chart (fig 2) outlines the methodology and says "obtain or design gis layers". The authors did not indicate whether theirs were designed or obtained….i'm pretty sure they were obtained (at least the satellite images and because of the offhand reference to the source of an unused raster), therefore I counted it as an uncited reuse
  • xy coordinates not given even though detailed locality descriptions are….gps may have been less common and more cumbersome back then, so this may have been the best possible. The authors say they maybe redownloaded some of their stuff from genbank, but they never give the accession for the 400+ cytb sequence they produced. clearly they were aware of genbank.
  • cytb sequences shared. Again, localities shared but not xy
  • sscp is another type of gel measurement. They also fully sequence some sections and posted to genbank.
  • they reference the microsat dataset, but I'm led to believe they also have parentage estimates but they never come out and say it. This dataset is being continually analyzed by the authors, so it would be interesting to know if and in what paper they share the full dataset
  • microsat + some biological data (which was obtained apparently from an article but it may have been a reference to a method…difficult to discern)

  • the authors make note of a similar sequence already posted to genbank, but they also post all theirs which confirms my suspicion that genbank should be set up in some way to hold multiple sequences of the same region for the same species = even local level sequences could be posted. this produced a haplotype network rather than a tree….where then should the gene alignment go?
  • authors give credit to sequence used for alignment even though many other papers do not
  • reuse primarily for identifying plausible gene regions. Produced sequences not shared….may have been more of a visual analysis of gels than sequence analysis. Map figure states that another table contains "precise" geographic info…but no coordinates were given in the referenced table
  • does treebase accept haplotype alignments and outputs? If not, where should they go (at least the alignments, the outputs could be considered results)?
  • population study utilizing gene sequences that were not shared. More commonplace in tree studies to use and share with genbank.
  • allele diversity study, still should have produced sequences
  • again, pop/allele diversity study, but I still think gs could have been posted to genbank or at least internally.
  • why don't pop level papers post gs? It's possible judging from the "marked genetic structure" paper in ME 2010 issue 1.
  • I've never noticed this phrase at the bottom of pages: "This article is a US Government work and is in the public domain in the USA." it wasn't funded by a usa or govt source, but by new zealand
  • authors were very thorough in sequence annotation and verification procedures. There may have been additional alignments or sequences not posted, but I'm pretty sure they all were, even though it was difficult for me to discern
  • this gs study understandably did not produce sequences because the dna was analyzed in an assay that detected presence of specific sequence types
  • I now think that aflp is a type of assay….i encountered this once before and am pretty sure I also called that one a "bio (genetic)" datatype
  • another AFLP
  • another aflp, but this one also utilized genbank sequences…it was a hodge-podge of various genetic analyses
  • maybe go back and reclassify bio (genetic) and similar categorization as "og" representing "other genetic info"
  • allele genotyping rather than full sequencing as far as I could tell.
  • microsatellite analysis. Xy coordinates given in text table
  • rflp (blot) genetic analysis. Map with many specific localities plotted….text made it sound like appendix might have coordinates, but it did not. In general, should be careful about overanalyzing the absence of gs sharing in earlier papers since technology was vastly different and straight sequencing was rarely performed in favor of other techniques.
  • actual sequences used in this study and appropriately deposited to genbank
  • authors mention reuse of their own sequences in text, but then other accession numbers outside of the range of the produced sequences appear in the appendix table. I searched these species in the full text and found no additional reference to them. Yet another citation of self but not credit for original data authors. they do credit specimen donation in the acknowledgements, but still unclear how these undocumented sequences were obtained. they do have an in press article that perhaps they are simultaneously using the data for. alignments were probably produced as a necessary step in analysis, but are not mentioned anywhere
  • genbank accession given in table….all by the same author, so it's good they gave the accession and better explains why the text in the methods was vague about sequences that were produced or obtained (didn't really distinguish)
  • this was more of an ecological study which happened to use a molecular technique to sex individuals, rather than morphology
  • map indicates possibility of xy information, but not clear if greater than 10 sites….this probably demonstrates that produced xy datasets may be underdocumented as I had to rely on explicit text referring to them
  • allelic frequency study..again, could genbank accept these and if not, where should they go? Now-a-days, are they too outdated of a technique to really matter?
  • though this paper employed multiple genetic methods, I'm pretty sure they sequenced two regions of sufficient length for posting to genbank.
  • rflp genetic analysis…with pictures of representative blots. Could the blots be saved as the original images plus the numerical scoring/coding?
  • full rapd gel photos given….i would think this would be typical data sharing for these types of analyses. There may be some kind of quantitative matrix that could accompany it, but this paper didn't employ that in analysis anyways
  • could potentially omit problematic opinion, editorial, etc articles by page length
  • dataset was shared in appendix at the end, but the appendix was never referred to in text…counted as data sharing but with "N" ratings.
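
The coordinate-counting rule used in the notes above (more than ten xy coordinates counts as a shared dataset, fewer is incidental, and coordinates available from the author on request were counted as sharing) could be written down as a small helper so it is applied consistently. This is a sketch under those stated rules only; the function name and the category labels are hypothetical and were not part of the original coding scheme.

```python
# Threshold from the notes: more than ten coordinates counts as a dataset.
MIN_COORDINATES_FOR_DATASET = 10


def classify_coordinates(n_coordinates, available_on_request=False):
    """Classify xy/GIS coordinates reported in an article.

    Labels are illustrative placeholders, not the codes used in the study.
    """
    # Coordinates obtainable from the author were counted as sharing.
    if available_on_request or n_coordinates > MIN_COORDINATES_FOR_DATASET:
        return "shared dataset"
    # A handful of coordinates given in the text is treated as incidental.
    if n_coordinates > 0:
        return "incidental"
    return "none"
```

For example, the paper with 4 xy coordinates given in text would come out as "incidental" under this rule. Cases where a map implies coordinates but none are listed would still need a manual judgment.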

Ecology Articles

  • Ecology does not seem to have internal depository nor encourage external deposition (dryad would probably be best). Seems odd since they hosted early data sharing conferences. DOI not displayed on Ecology article pdfs. This: "To test our conceptual model of the trophic relationships and identify which arthropod functional groups linked bird predation to a tree growth response, we constructed a structural equation model (Amos 16.0.1, available online)6" sounded like model development and posting, but was actually software use.
  • this was a short paper ("report"). It reused data with handwaving acknowledgement in the text about how great the database was yet no citation anywhere, not even a url. Perhaps assumed everyone already knows about nawqa (I do, but I'm a stream ecologist) or that they'll google it and find the right thing. is dataone looking into caching databases? that would be useful for things like nawqa and little databases produced by various studies. also, her criteria for selecting only some sites produced a condensed data set, analogous to aligned genes, which was not then shared.
  • Though the article was looking at a new GPS filtering method, it did collect actual gps locations of seals which could have been shared. Also, a simulated data set was produced.
  • food web dataset (partially eco, partially bio)
  • extensive observational and experimental ecology dataset
  • in general: ecology has an appendix section after the references, which could easily be used as a section for posting supplementary information (this paper gives an Ecology archives accession number for the additional results appendix)
  • dataset = seasonal biology of plant and life history (pop) data matrix. small possibility that this may be a raw dataset: Flower and fruit production in Orchis purpurea during eight consecutive years in six locations (Ecological Archives E091-011-A1).
  • "reused" dataset was used primarily for sample site selection, but was obtained from a database
  • not clear AT ALL in the methods that the data was reused…this was only clear from the acknowledgements. The caribou data at least had an associated paper (with one of the coauthors of this paper as lead author on it, but apparently not over the data collection entirely given the acknowledgements). the elk data was not cited at all. the moose data may or may not have been collected by the current authors, but it is difficult to tell since they didn't properly indicate the other datasets
  • another extensive, unshared dataset in ecology. Most papers I've read in ecology reuse a small, minor dataset but collect most of their own data and don't share it
  • again, minor datasets reused, or shared….not those for the crux of the paper. authors state that the data set is "compiled" but it is all their own data and they give complete methods….they just collected it for multiple species in different places. this dataset: "APPENDIX A Growth habit and floral characteristics and pollinator visitation frequency for 13 species of Gesneria and Rhytidophyllum from Puerto Rico and the Dominican Republic (Ecological Archives E091-014-A1)." was posted, but was not really used in analysis and was the authors' from a previous paper, it was used more for qualitative description of the species studied
  • online appendix has a condensed dataset, but was not clearly referenced in text nor is it the original complete data
  • no apparent reuse or sharing
  • reuse of a previous literature extraction by the lead author. Gis/earth data also used
  • GIS reuse, but unclear who accessed and who is the true original author (in text citations unclear)
  • simple dataset of food consumed, could be shared in association with experimental factors (number of wings clipped)
  • biological and soil dataset about plants…unshared.
  • two datasets analyzed. One is older (1984), but had no indication of reuse and was from a biological station where the author still lists their primary address...this one was assumed to be previously unpublished and therefore a candidate for sharing (produced). The other dataset was "reanalyzed" but with no indication of its origin….nor any mention of how the data was originally collected or currently obtained. authors mention the importance of long term datasets: "In order to test hypotheses about changes in ecosystems induced by man, including climatic change, ecologists sample portions of the landscape repeatedly across time. Selected portions of the landscape are set aside for long-term studies; these may be transects or surfaces of different sizes. Research of this kind is conducted, for example, at the 26 sites belonging to the Long Term Ecological Research (LTER) network in the United States (available online)."
  • code sharing/reuse: The diversity index was calculated in the Excel add-on "diversity" (available online).8
  • could species abundances be one category of "ecology" data type? It's biological data about each species in the context of the community ecology
  • I thought the XY coordinates for the sites were shared, but the indicated appendix was just a map of the sampling sites. In general, since there is such a low data reuse incidence (esp of major datasets), it would be interesting to see how some ecology metaanalyses reuse data, I'm guessing primarily literature extraction
  • dataset may not have been shared because author states that it is an ongoing study with data still being added (and presumably with another publication in the works)
  • gis datasets used and produced. xy information shared.
  • an entertaining read….ant gladiator experiments
  • thought exercise with case studies, but no data reuse
  • simple morphological dataset
  • math model based on a case study, but mostly on descriptive observations about it
  • detailed observations of pollination encounters and results. In general, when did ecological archives start?
  • two main related datasets collected, neither shared. Classified as bio/eco b/c measurements only taken on plants or insects separately (not community level) and other species used as a factor in analysis (i.e. they are two separate organismic level datasets not collected simultaneously)
  • interesting experimental treatment of manually simulating herbivory. Relatively simple measurements taken about plant size, no. flowers, and weight/no. of seeds. In general, it would be interesting to know if any of these datasets were retroactively shared in dryad or ecological archives and if the practice of doing so is encouraged.
  • classified as bio/eco because independent datasets taken for each of the involved taxa and only compared in correlation, not in a community assessment. is this a special issue on herbivory? All the articles so far seem to be on the same topic
  • ecological interaction was the factor/treatment, not measured = bio dataset
  • mix of eco and ea datasets. Yet again, not shared or with any reuse incidences…were there available depositories? Did policies advocate data sharing through correspondence (like the nature policy)? Were there editorials from this time period about making data available?
  • soil dataset seemed to be more extensive than plant community
  • bdef study with lots of tilman citations
  • woo hoo! A reuse! table 1 is a good summary of the data, but is means, not individual measurements. So it might be useful to an extraction data collection, but not for reanalysis
  • gut contents were considered part of the bio data since they were isotope analyzed rather than community analyzed
  • acknowledgements said this was a meta-analysis, but I was disappointed to find out it was a meta-analysis of multiple experiments by the same person. It might be interesting to just do an analysis of reuse in meta-analysis studies, which would likely increase incidences of reuse (but maybe not sharing)
  • I'm learning all kinds of words for "killed" in reference to how people preserve their specimens
  • authors use their primary data to reconstruct past conditions
  • seems like a pretty detailed dataset that could be used for multiple inquiries pertaining to community composition, chemistry, and individual taxa
  • type classified as bio b/c all measures taken for species individually and not when in competition with each other
  • reuses landsat, but only refers to it by name without accompanying url. Does at least give date ranges of utilized sets
  • highly varied datasets collected, all seemingly unrelated from reading the methods! Definitely very detailed data that could be reused for many different things

  • this paper utilizes ecological archives = they were available in 2000. code sharing: Supplementary software for computation of parameter estimates, confidence intervals, and goodness-of-fit tests is available in ESA’s Electronic Data Archive: Ecological Archives E081-001.

  • model with no empirical validation
  • even though this is a notes paper, it was not really any different in length or content than the previous article I assessed
  • this was more focused on the technique rather than the ecological results, but a bio/eco dataset was produced nonetheless
  • "database" cited, meaning accumulation of data by the same author. No credit to original authors in biblio, though appendix has list. Counted as no biblio and not indicated in all reuse categories. But on the bright side, they shared the compiled dataset in its entirety, accessed through ecological archives
  • scored as eco/bio because corresponding measurements made on the two species to evaluate an ecological interaction

Global Change Biology Articles

  • having a weird problem with gcb 2000 pdfs that copies the text in cryptic characters = only limited full text excerpts or brief explanations given
  • same author as previous article (and nearly the same topic, probably from the same dataset)
  • did any of the earth/environmental depositories (daac, etc) exist at this time?
  • all raw data provided, perhaps could have provided some processing iterations….but I was unable to discern what these would be, possibly GIS layers or scripts. Also, a little unclear if the data they link to is reuse or sharing…I think it was their processed GIS dataset based on its location at the end of the methods.
  • again, a second paper by the same authors in a single issue…strange. They even reuse their data from the simultaneously published paper….all collected data was basically reused data…from what I heard at NABS, this is typical of ocean studies.
  • have a funding category for "other gov't" to lump usda, nasa, etc that would otherwise be "other" orphans
  • most 2010 were on global warming, this is the first one from 2000. for funding "other" categories, could text search for "foundation", "department", "institute", "national", "fellowship", "federal" etc
  • HUGE dataset of many different experiments on the plant and its herbivore predators….another interesting study for dataone might be to survey authors (again) about if they would provide their dataset to inquiring scientists and/or post it retroactively...this could move beyond a survey and see if the authors would either respond to an email requesting data or post their data when receiving an email instructing them to do so.
  • isotopic study…I classified the produced data as bio and earth, but they were both "chemical" if that ends up being a category
  • another bog study
  • decomposition study…classified as an earth dataset rather than biological because looking at decomp rates more than characters of a specific species.
  • flux study, but no reference to ornl or daac
  • unsure what to count the photographs as…perhaps GIS?
  • model at least used test data…but it was all by the same author as this paper. Additional data acquired from the "boreal" research station with no credit given to original data authors, but same author clearly was associated with that project as well
  • effects of co2 on plants….organism measurements of chemical content, gas exchange, etc
  • decent sized dataset.
  • model parameterized with climate prediction, which may have been a GIS layer but not indicated and therefore classified as earth
  • gis processing to form a new layer (which is stored at a url provided clearly in the paper…even if not in the methods). The url is for landsat and comparable data….the project was presumably government affiliated to have their data posted here. I think I tracked down the dataset, but it didn't have any citation information about this paper.
  • these papers are some of the earlier work on climate change
  • only data credit given in acknowledgements. Unclear on what the wine data set was or what it was used for because it was not specified in methods and seemed to be used for more descriptive comparisons in results
  • I haven't been counting soil/flux etc data as biological data because even though the studies often look at a particular plant, the focus is on the rates of gas exchange and other abiotics.
  • they use soi in some way (as a factor?) but don't state how that data was reused. I only know it was likely reused from a previous paper.
  • soil analysis
  • extracted literature dataset with authors credited fully in table and biblio + full dataset shared in text table. Authors are from oak ridge.
  • phylo tree compiled presumably from reused data (either trees or gs), but not well described (just says a tree from molecular data). the final tree is shared. the extracted literature dataset is shared, but not the fad (bird migration) data....what causes an author to share one of their datasets and not another? in general, gcb has a supporting information section similar to molecular ecology…I think all journals should implement something like this, but also have separate sections for supplemental info and actual raw data.
  • would be interesting (esp for data validation) if they shared their extrapolated dataset
  • study only uses reused data (and some calculations based on it). The compiled dataset should have been shared. I saved the supplementary data in zotero to have tableS1 because it was interesting…it gave additional information on the data sources (well, a little), but still had no reference to original authors or compilation of the data
  • GS collected, but primarily for local level analysis and parental determination. Still could have been shared even if not at genbank. Should I indicate these gs in some different way?
  • unclear (or I read too fast) whether the species occurrences were analyzed at the community or species level
  • a variety of datasets collected in different ways over time, I didn't examine the details of how each dataset was employed
  • data not shared
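The funding note above (text-searching funder names for words like "foundation" or "department" to break up the "other" category) could be sketched roughly as follows. This is only a minimal illustration, not the method used for the notebook's spreadsheet: the function name `matched_keywords` is hypothetical, and the keyword list is copied from the note.

```python
import re

# Keywords copied from the note; extend as needed.
KEYWORDS = ["foundation", "department", "institute",
            "national", "fellowship", "federal"]

def matched_keywords(funder_text, keywords=KEYWORDS):
    """Return the keywords that occur as whole words (case-insensitive)."""
    found = []
    for kw in keywords:
        # \b keeps "national" from matching inside "International".
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, funder_text, flags=re.IGNORECASE):
            found.append(kw)
    return found

if __name__ == "__main__":
    for name in ["US Department of Agriculture", "Acme Corp"]:
        print(name, "->", matched_keywords(name))
```

A funder string that matches no keyword would stay in "other"; one that matches could be reviewed by hand before being moved to a category such as "other gov't".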

  • reuse of meteorological data to fill in gaps. flux and bio data not shared. ornl flux towers used to validate data, but not as raw data for their study: "Energy balance measurements indicated that sensible and latent heat fluxes accounted for 79–101% of the sum of net radiation and soil heat fluxes (Table 1; data available at"

  • the gcb papers are a little difficult for me to categorize. Also, since I've set up my binary data coding, I am more tempted to list each individual dataset (rather than data type as I did in the past)….I'm trying to be careful. If someone were to do a similar study it would be interesting to tabulate the number of datasets used or shared (i.e. each sequence).
  • how would I be able to tell if the sequence should have been posted to genbank? What if the squirrel was already posted, but they were looking at more local haplotype patterns? But since they sampled 9 microsats and cytb, I expect that at least something was new and they could have at least made their dataset available for further analysis even if sequences already existed on genbank.

  • authors accessed and cited data to make a statement about a trend, but it was not used in data analysis: "This warming off the Tasmanian east coast has exceeded warming observed off the Tasmanian west coast [Japan Meteorological Agency, data accessed through the NOAA PMEL Live Access Server ( home)]."

  • the calculations employed used parameters from previous studies (a per capita estimate), but not a full or raw data set, so this factored into data produced but not data reused. The authors used US census data but cited it in a roundabout way that made the reuse difficult to discern
  • where should earth data be shared? Daac?
  • small xy coordinate dataset shared explicitly in text with a clear reference that it was provided.
  • various earth and biological measurements taken. Some of the wording (such as "database…") was unclear, but I think it is because the authors are foreign.
  • reuse of flux data…ornl given some credit in acknowl. Compiled data set (fewer sites and processed) could have been shared
  • various soil measurements taken with biological info as factors, which is generally more common in gcb (and vice versa in ecology…bio measurements, abiotic factors)

  • weather data used for site description was obtained from another source and didn't appear to be a major factor in analysis (or was just a factor and not associated with each raw data point) "ACKNO: The meteorological data were kindly provided by Prof. Dr. Thomas Foken and Dr. Johannes Luers (Dept. of Micrometeorology, University of Bayreuth)"

  • meteorological data obtained from a us weather station (not apparently online, but through correspondence of some sort). Large variety of flux and biological (photosyn) datasets collected, but not shared
  • first meteoro set I've seen that had a clear author, but still unclear how the actual data were obtained.
  • earth (climate) data retrieved from a large number of sources, only one with a semi-clear indication of how it was acquired…a url in the biblio

  • authors refer to ornl and noaa data in the introduction, but do not ultimately use them, instead collecting their own local level data: "Coastal plain forests remain one of the few undercharacterized ecosystems in the otherwise dense Ameriflux network of eddy covariance sites (Hargrove & Hoffman, 2005;….The contrasting precipitation regime during these years, including a moderate El Niño in 2006–2007 (National Weather Service Climate Prediction Center, php), provided insight into the impact of extreme variations in precipitation on ecosystem C exchange."

  • reused meteoro data sentence tucked in the middle of the paragraph and almost missed. If this was a more rigorous study, I would employ methods of extraction literature studies where two people read over the article, score it and reconcile discrepancies. but alas.
  • no apparent sharing except for one misleading reference to the supplementary data, which as expected was actually means/outputs
  • data classified as bio/eco because measurements taken at bio (organism) level but included interaction with a pathogen