DataONE:Notebook/Summer 2010/2010/07/22

Email:Scoring and Stats questions_Dataone
22 messages

Sarah Walker Judson 	 Wed, Jul 21, 2010 at 5:15 PM To: Heather Piwowar

Heather -

I'm running into some hurdles with my data analysis. First, I want to confirm my scoring categories with you, and second, as I got deeper into the statistics, I realized that most of my experience has been with continuous, not categorical, data... so I need your opinion on some things. And just to warn you, this is a rather lengthy email. I can send my Excel and R files if it's easier, but those are messy at best. On second thought, I'll post them to OWW (Excel here, R here).

So, first, the scoring categories: For each, I have a binomial (YN) and ordinal (scored) version. I'm still not sure which is best for each aspect. I'd like your opinion on binomial vs. ordinal and the scoring levels I've proposed. I'm more inclined to scoring because it gives a more detailed picture of what is happening in the data, but I also worry that I've made too many categories. At this point, these are just for reuse, I'm planning on coding Sharing tomorrow in a similar manner.

1. Resolvability (Could the dataset be retrieved from the information provided?)

ResolvableYN:
* 1 = Y = Depository and Accession
* 0 = N = lacking one or the other or both

ResolvableScore:
* 0 = no Depository, Accession, or Author (Justification: you know they used data but not exactly how it was obtained = probably couldn't find it again... i.e. "data was obtained from the published literature")
* 1 = Author only (Justification: you could track down the original paper, which might contain an accession or extractable info)
* 2 = Depository or Database only (Justification: you might be able to look up the same species/taxon and find the information per the criteria in the methods)
* 3 = Accession only (Justification: accession number given but depository not specified = you would probably be able to infer which depository it came from based on format, just as I was usually able to tell that they were GenBank sequences by the format even though GenBank was never mentioned anywhere in the paper)
* 4 = Depository and Author (Justification: although no accession is given, many depositories also have a search option for the author/title of the original article which connects to the data)
* 5 = Depository and Accession (Justification: "best" resolvability... unique ID and known depository = should find the exact data that was used)

2. Attribution (Is proper credit given to original data authors?)

AttributionYN:
* 1 = Y = Author and Accession (biblio assumed)
* 0 = N = lacking one or the other or both

AttributionScore (*I have two alternatives for this: one that doesn't worry about "self" citations and counts them the same as others (i.e. combines 6 & 7 and 4 & 5), and another that throws out all the "self" citations (and cuts my already small sample size by a lot!)):
* 0 = no author/biblio or accession
* 1 = self citation, other reuse (Justification: author refers to a previous review paper of theirs, but not the original data authors... this assumes that the original data authors are attributed in the previous paper)
* 2 = organization or URL only (Justification: data collectors/project acknowledged, but not specific individuals or relevant publications)
* 3 = accession only (Justification: data acknowledged)
* 4 = author/biblio, but self
* 5 = author/biblio only, not self (Justification: original data author acknowledged, and this is the currently accepted mode of attribution)
* 6 = author + accession, but self
* 7 = author + accession, not self (Justification: attribution to author and data... and this is the mode of attribution we hope for)

3. Ideal (previously "Good") citation score

Ideal_CitationYN (*this came out the same as my Knoxville calculation of author+depository+accession):
* 1 = Y = Resolvable + Attribution (adding the two previous yes and no categories)
* 0 = N = lacking one or the other or both

Ideal_CitationScoreSimple:
* 0 = not resolvable or attributed
* 1 = attributed ("Yes" in AttributionYN)
* 2 = resolvable ("Yes" in ResolvableYN)
* 3 = attributed and resolvable

Ideal_CitationScoreGoodGradient:
* 0 = none
* 1 = depository only
* 2 = author only
* 3 = accession only
* 4 = depository and author
* 5 = depository and accession
* 6 = author and accession
* 7 = depository, author, and accession
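For what it's worth, both the Y/N and the scored versions can be derived mechanically from the component flags. A base-R sketch with hypothetical column names (dep/acc/aut are stand-ins for whatever the spreadsheet actually calls these, and undefined combinations such as accession+author without a depository are lumped with "accession only" here):

```r
# Hypothetical flags: does the citation give a depository, accession, author?
cite <- data.frame(
  dep = c(TRUE, FALSE, TRUE,  FALSE),
  acc = c(TRUE, TRUE,  FALSE, FALSE),
  aut = c(TRUE, FALSE, TRUE,  FALSE)
)

# Binomial version: resolvable only with both depository and accession
cite$ResolvableYN <- as.integer(cite$dep & cite$acc)

# Ordinal version, following the 0-5 levels above
score <- function(dep, acc, aut) {
  if (dep & acc)      5        # depository and accession
  else if (dep & aut) 4        # depository and author
  else if (acc)       3        # accession only
  else if (dep)       2        # depository only
  else if (aut)       1        # author only
  else                0        # nothing
}
cite$ResolvableScore <- mapply(score, cite$dep, cite$acc, cite$aut)
cite$ResolvableScore  # 5 3 4 0
```

Deriving both codings from the same flags keeps the binomial and ordinal versions guaranteed consistent with each other.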

And now, the stats. While in Knoxville, we talked about doing an ordinal regression. However, as I tried it out and read up on it, I think it's the wrong match for this data. I think it is primarily used for survey analysis where you are looking for an interaction of two variables both ranked on an ordinal scale (e.g. do people who "strongly agree" with a political issue also classify themselves as "strongly conservative"?). Case in point, this R example. Maybe I'm reading the literature wrong, but I don't think this is the right fit... let me know what you think (and if you have a good resource!)

Instead, I think I should be using either chi-squared or a linear model for categorical data (i.e. a binomial or Poisson distribution). I've run both and am a little stuck on interpretation, but can figure that out. I wanted to get your opinion on using chi-squared vs. a Poisson GLM. The main pros for chi-squared = easy and p-value output; the main con = low resolution (i.e. it says something is up, but not what). The main pros for Poisson = higher resolution (i.e. it specifies which factors are most significant/influential) and the ability to look at multiple factors at once (then arriving at the "best" model of combined explanatory factors based on AIC); the main con for Poisson = I'm not super up to speed on the interpretation.

Here is a summary of some of my R analysis (I can send you the R history or TinnR file if you want) on the Resolvability aspect to give you an idea of the outputs.
Tables:

 > b=table(ResolvableYN,DatasetType);b
             DatasetType
 ResolvableYN Bio Ea Eco GA GIS GO GS PA PT XY
            0  22 35  10  5  10  5 33 19  8  5
            1   0  0   0  0   0  0 17  0  1  0
 > bb=table(ResolvableYN,BroaderDatatypes);bb
             BroaderDatatypes
 ResolvableYN EA Eco  G  O PT  S
            0 35  10 43 41  8 15
            1  0   0 17  0  1  0
 > cc=table(ResolvableScore,DatasetType);cc
                DatasetType
 ResolvableScore Bio Ea Eco GA GIS GO GS PA PT XY
               0   6 15   3  0   4  0  3  4  1  3
               1  13 10   6  2   2  4 12 10  5  0
               2   1  8   0  2   2  1  6  3  0  0
               3   0  0   0  0   0  0  4  0  0  0
               4   2  2   1  1   2  0  8  2  2  2
               5   0  0   0  0   0  0 17  0  1  0
 > ccc=table(ResolvableScore,BroaderDatatypes);ccc
                BroaderDatatypes
 ResolvableScore EA Eco  G  O PT  S
               0 15   3  3 10  1  7
               1 10   6 18 23  5  2
               2  8   0  9  4  0  2
               3  0   0  4  0  0  0
               4  2   1  9  4  2  4
               5  0   0 17  0  1  0


Chi-squared:

 > chisq.test(table(ResolvableScore,DatasetType))
         Pearson's Chi-squared test
 data:  table(ResolvableScore, DatasetType)
 X-squared = 98.1825, df = 45, p-value = 7.922e-06
 Warning message:
 In chisq.test(table(ResolvableScore, DatasetType)) :
   Chi-squared approximation may be incorrect


Linear Model - Poisson (alternative = binomial or zero-inflated for ResolvableYN):

 > poisson = glm(ResolvableScore~DatasetType,data=a,family=poisson)
 > summary(poisson)
 Call:
 glm(formula = ResolvableScore ~ DatasetType, family = poisson,
     data = a)
 Deviance Residuals:
      Min        1Q    Median        3Q       Max
 -2.47386  -1.37229  -0.04478   0.60369   2.29458
 Coefficients:
                Estimate Std. Error z value Pr(>|z|)
 (Intercept)     0.04445    0.20851   0.213   0.8312
 DatasetTypeEa  -0.07344    0.26998  -0.272   0.7856
 DatasetTypeEco -0.04445    0.37878  -0.117   0.9066
 DatasetTypeGA   0.64870    0.37879   1.713   0.0868 .
 DatasetTypeGIS  0.29202    0.33898   0.861   0.3890
 DatasetTypeGO   0.13787    0.45842   0.301   0.7636
 DatasetTypeGS   1.07396    0.22364   4.802 1.57e-06 ***
 DatasetTypePA   0.18916    0.29180   0.648   0.5168
 DatasetTypePT   0.64870    0.31470   2.061   0.0393 *
 DatasetTypeXY   0.42555    0.41042   1.037   0.2998
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 (Dispersion parameter for poisson family taken to be 1)
     Null deviance: 283.03  on 169  degrees of freedom
 Residual deviance: 212.36  on 160  degrees of freedom
 AIC: 566.94
 Number of Fisher Scoring iterations: 5


Ordered Logit Model (library MASS, function polr):

 polr(formula = as.ordered(ResolvableScore) ~ DatasetType, data = a)
 Coefficients:
                     Value Std. Error    t value
 DatasetTypeEa  -0.1696435  0.4978595 -0.3407457
 DatasetTypeEco -0.1249959  0.6788473 -0.1841296
 DatasetTypeGA   1.3932769  0.8160008  1.7074454
 DatasetTypeGIS  0.2656731  0.7333286  0.3622839
 DatasetTypeGO   0.6023778  0.8112429  0.7425369
 DatasetTypeGS   2.3786340  0.4917395  4.8371833
 DatasetTypePA   0.3831136  0.5557081  0.6894152
 DatasetTypePT   1.0288666  0.7284316  1.4124410
 DatasetTypeXY  -0.2736481  1.1134912 -0.2457569
 Intercepts:
     Value   Std. Error t value
 0|1 -0.6708 0.3879    -1.7294
 1|2  1.2789 0.4015     3.1855
 2|3  2.0552 0.4237     4.8511
 3|4  2.2097 0.4285     5.1573
 4|5  3.3495 0.4775     7.0144
 Residual Deviance: 483.3118
 AIC: 511.3118

Sincerely, Sarah Walker Judson

P.S. I'm planning to post this to OWW, just thought email would be the best mode of communication for these questions at this point.

Sarah Walker Judson 	 Wed, Jul 21, 2010 at 7:23 PM To: Heather Piwowar

And here's a bunch of pivot tables that help show the trends.... helps put the stats/scoring, etc. in perspective.

Sincerely, Sarah Walker Judson [Quoted text hidden]

DataSets_Pivot.xls 1296K

Heather Piwowar 	 Wed, Jul 21, 2010 at 9:22 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson, Todd Vision

Sarah,

I wasn't quite sure where to respond on OWW, so feel free to copy my email comments over. Also, CCing Todd because he knows his stats and I want to make sure I send you down the right path.

Thanks for the full summary... it made it very easy to understand your question and think about the alternatives.

I hear you on the ordered having more information in it, so let's try to find some statistics to take advantage of that. Your levels make sense. You do have finely grained levels... this may be a problem if it makes the data too sparse for good estimates. Your crosstabs suggest it is pretty sparse across your covariates of interest. I'd go with it as is for now (I saw a paper that argues for maintaining lots of levels even in small datasets), but keep the fact that it is sparse in mind... an obvious fix is to collapse some levels if it seems necessary.

Stats. I'm not very familiar with Poisson in this context either. I could see how chi squared could be applied, but agree that it seems to leave a lot of the power on the table. And its output isn't very informative, as you said.

So let's go back and think about ordinal logistic regression again. I think that may still be quite appropriate. Here's the best document that I could find... this group does great R writeups. http://www.ats.ucla.edu/stat/r/dae/ologit.htm

What do you think? I think that your levels are definitely analogous to a Likert scale, or soft drink sizes.

This example helps with gut feel as well: http://www.uoregon.edu/~aarong/teaching/G4075_Outline/node27.html

The two approaches in R seem to be MASS::polr and Design::lrm

I'd probably go with the latter because that is what the cool tutorial above uses :)

What do you think? After reading these refs do you still feel like it isn't appropriate? If so, let's talk it through.... I'm avail on chat tomorrow most of the day (multitasking with a remote meeting) so feel free to initiate chat whenever. Heather

On Wed, Jul 21, 2010 at 5:15 PM, Sarah Walker Judson  wrote: [Quoted text hidden]

Sarah Walker Judson 	 Thu, Jul 22, 2010 at 12:12 PM To: hpiwowar@gmail.com Cc: Todd Vision

Heather -

Thank you very much for the prompt help!

The UCLA link was very helpful and interesting....cool stuff. I ran my data following the tutorial. I didn't have any problems running it, but my data clearly violates a number of the assumptions:

1. Small cells/empty cells: because of the number of categories, I had many zeros or small values in my crosstabs. They warn against this, saying the model either won't run at all or will be unstable... I'm not clear what they mean by "unstable". Mine ran, but I don't know if we can trust the results. (See attached "OrdinalLogisticOutput" for results.)

2. Proportional odds assumption: My data did not hold up to either the parallel slopes or plot tests of this (see attached .txt and .jpg).

3. Sample size. Their example used 400, plus it was mostly binary, so the samples weren't splayed out among many categories. Mine is 170 (for just the 2000/2010 comparison... I have about 100 more if I pool the 2000/2010 "snapshots" and the Time Series) and distributed among a lot of potential categories.

In general, I'm still concerned about the nature of my data in this analysis. They give two examples at the top that match my data, but the one they actually work through has more progressive/scalar categories (i.e. your parents had no education; the next "logical" step is that they did get an education). Mine, on the other hand, is A, B, and C, which have no relation to each other... i.e. journal 1, journal 2, and journal 3, or datatype A, B, and C. I don't know if I'm articulating that well, but from their example, I think my type of data would work; I'm just unclear how I would interpret the results. Especially the coefficients... for example, the UCLA example says "So for pared [parent education level], we would say that for a one unit increase in pared (i.e., going from 0 to 1), we expect a 1.05 increase in the expected value of apply [likelihood of applying to grad school] on the log odds scale, given all of the other variables in the model are held constant." I don't know how you would interpret this from journal to journal or datatype to datatype.

Also, I would get the books they recommend to figure out the best approach/interpretation, but I'm operating out of the world's smallest library (seriously, smaller than my apartment) and don't know other ways to obtain the books besides the limited previews on google books. Even more so, I'm almost positive it would take me over a week to get them short of a road trip to LA or thereabouts.

I'm on gchat (just invisible) if you want to hash things out now. I decided to email so I could articulate and ponder over my thoughts better. Thanks again for your help now and throughout this project!

Sincerely, Sarah Walker Judson [Quoted text hidden]

3 attachments: Test_ProporitionalOddsAssumption.jpeg 52K, ParallelSlopeTable.txt 5K, OrdinalLogisticOutput.txt 2K

Heather Piwowar 	 Thu, Jul 22, 2010 at 12:36 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson Cc: Todd Vision

Hi Sarah,

Nice job on the fast analysis and thoughtful interpretation (or interpretation attempts, as the case may be).

I will read and think and respond in the next few hours.

fwiw I do have "Frank E. Harrell, Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis. Springer, New York, 2001. [335-337]" at home (it is a great book in many ways! recommended) and would be happy to zoom any relevant pages to you.... will see if that is helpful.

More soon, Heather

[Quoted text hidden]

Heather Piwowar 	 Thu, Jul 22, 2010 at 2:14 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson Cc: Todd Vision

Sarah,

I'm going to write in email too, to help organize my thoughts.

These responses are going to sound like I'm arguing for the ordinal regression. I'm not, per se... just trying to see fully whether it can work before we go to something else.

1. Small cells/empty cells. Agreed, a potential problem. I think it would be a potential problem for most statistical techniques, because it is hard to estimate from little information. That said, there are some algorithms that are designed to deal well with this, like Fisher's exact test in place of chi-squared.

I think by "unstable" they mean very sensitive to individual datapoints. One way to test this is to do a loop wherein you exclude a datapoint, recompute, see if it changed anything drastically. I don't think this makes sense to do first, but we can plan to do it at the end if we are worried about potential instability.
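The leave-one-out check described above could look something like this (a sketch with made-up stand-in data and MASS::polr; `score` and `type` are hypothetical names, not the actual variables):

```r
library(MASS)  # for polr()

set.seed(1)
# Toy stand-in data: an ordinal score 0-2 vs. a 3-level factor
a <- data.frame(
  score = factor(sample(0:2, 120, replace = TRUE), ordered = TRUE),
  type  = factor(sample(c("G", "Eco", "PT"), 120, replace = TRUE))
)

full <- polr(score ~ type, data = a)

# Refit once per observation, leaving that observation out,
# and collect the coefficients from each fit
coefs <- sapply(seq_len(nrow(a)), function(i) {
  coef(polr(score ~ type, data = a[-i, ]))
})

# How far does dropping a single observation move each coefficient?
# Large jumps would be a sign of the "instability" the tutorial warns about.
apply(coefs, 1, function(x) max(abs(x - coef(full))))
```

With sparse cells, a single dropped row can move a coefficient a lot; that's exactly what this loop would expose.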

2. Proportional odds assumption

I'm going to do a bit more reading here. There are related algorithms for non-proportional ordinal regressions, though no obvious best choices in R....

Another idea might be to collapse the levels into 3 or 4 and see if they are more proportional then (since the chance of having 6 things happen to be proportional is lower than for 3 things, and more sensitive to outliers...).

3. Sample size. I'd add in the extra 100. Also, there is no suggestion that their sample size was a minimum.... That said, I agree, we are trying to estimate a lot of parameters based on not very much data. Rules of thumb are always tricky, and they depend on estimates of effect size, which of course we don't know yet. That said, a rule of thumb is to have 30 datapoints for every multivariate coefficient you are trying to estimate. 6 levels takes up 6, 6*30=180, and that is before estimating anything for your covariates.

So maybe another argument to collapse your levels down to 3 for now???
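If the levels do get collapsed, one base-R way is a single cut() call (a sketch on toy values; the 0 / 1-2 / 3-5 grouping is just one plausible choice, not a recommendation from the thread):

```r
ResolvableScore <- c(0, 1, 5, 3, 2, 4, 0, 5)  # toy values

# Map 0 -> "none", 1-2 -> "partial", 3-5 -> "good"
collapsed <- cut(ResolvableScore,
                 breaks = c(-Inf, 0, 2, Inf),
                 labels = c("none", "partial", "good"),
                 ordered_result = TRUE)
table(collapsed)
```

Keeping the original 0-5 column untouched and deriving the collapsed version from it makes it cheap to try several groupings.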

4. Factor/categorical variables. Yup, your journals and subdisciplines are factors. I don't believe this will cause a problem. I would model them with dummy variables (one variable for each of your journals and subdisciplines, binary 0/1). Of course that is a lot of covariates, but I think that is the only way to have interpretable results.

A bit on dummy variables here: http://www.psychstat.missouristate.edu/multibook/mlt08m.html

I know that the Design library often does smart things with factor variables, too.... so before you create dummy variables you could try redefining your journal variable as a factor, feed that in, and see what it does.... "If you have constructed those variables as factors, the regression functions in R will interpret them correctly, i.e. as though the dummies were in there. " as per here.
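A quick way to see the dummy coding R builds from a factor is model.matrix() on a toy variable (base-R sketch; the journal names here are placeholders):

```r
journal <- factor(c("AmNat", "SysBio", "ME", "AmNat"))

# model.matrix expands a factor into 0/1 dummy columns,
# dropping one level (the first alphabetically, here AmNat)
# as the reference category
model.matrix(~ journal)
```

This is the same expansion the regression functions do internally when a covariate is a factor, which is why hand-built dummy variables are usually unnecessary.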

ok, I'm going to end my stream of consciousness there, do a bit more reading, then find you for an interactive chat. Heather

[Quoted text hidden]

Heather Piwowar <hpiwowar@gmail.com>	 Thu, Jul 22, 2010 at 2:24 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Cc: Todd Vision <tjv@bio.unc.edu>

Sarah,

This link may be of interest too. It is the parallel tutorial to the "ordered" regression one.... this is for regressing against categories/factors where there is no order. Not what you are doing and so it probably loses a lot of power, but it definitely doesn't have a proportional odds assumption!

could be informative to give it a try on your data for fun if easy, pretending that your levels were all distinct unrelated labels?

http://www.ats.ucla.edu/stat/R/dae/mlogit.htm Heather

[Quoted text hidden]

Sarah Walker Judson <walker.sarah.3@gmail.com>	 Thu, Jul 22, 2010 at 3:49 PM To: hpiwowar@gmail.com

Sorry, one question I forgot (but it isn't urgent since I have a lot to chew on): should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?

To reiterate (it's kind of a combination of resolvability and attribution):

Ideal (previously "Good") citation score

Ideal_CitationYN (*this came out the same as my Knoxville calculation of author+depository+accession):
* 1 = Y = Resolvable + Attribution (adding the two previous yes and no categories)
* 0 = N = lacking one or the other or both

Ideal_CitationScoreSimple:
* 0 = not resolvable or attributed
* 1 = attributed ("Yes" in AttributionYN)
* 2 = resolvable ("Yes" in ResolvableYN)
* 3 = attributed and resolvable

Ideal_CitationScoreGoodGradient:
* 0 = none
* 1 = depository only
* 2 = author only
* 3 = accession only
* 4 = depository and author
* 5 = depository and accession
* 6 = author and accession
* 7 = depository, author, and accession

To adapt to an ordinal scale, it could either be:

Ideal_CitationScoreSimple:
* 0 = not resolvable or attributed
* 1 = attributed ("Yes" in AttributionYN) or resolvable ("Yes" in ResolvableYN)
* 2 = attributed and resolvable

Ideal_CitationScoreGoodGradient:
* 0 = none
* 1 = depository only or author only or accession only
* 2 = (depository and author) or (depository and accession) or (author and accession)
* 3 = depository, author, and accession

Thoughts?

This might be another "out of the scope of this project" or it might be redundant of resolvability and attribution or it might be essential...I dunno. As I rethink, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable).....alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?

Again, no rush...I've got plenty to work on.

Thanks!!!

Sincerely, Sarah Walker Judson [Quoted text hidden]

Sarah Walker Judson <walker.sarah.3@gmail.com>	 Thu, Jul 22, 2010 at 4:09 PM To: hpiwowar@gmail.com

Also,

You mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009, for SysBio and AmNat only) data to get a bigger sample size. The former was collected sequentially and the latter randomly. I'm a bit worried this affects assumptions about data collection, but don't know if this is as strict of an assumption in this arena as in biology. I was thinking of running both separately and then pooled, then choosing one (probably the 2000/2010) as the focus for reporting and stating whether or not the other sets produced similar results.

thoughts?

thanks.

Sincerely, Sarah Walker Judson [Quoted text hidden]

Sarah Walker Judson <walker.sarah.3@gmail.com>	 Thu, Jul 22, 2010 at 6:31 PM To: hpiwowar@gmail.com

Heather -

I've got notes from my reattempts this afternoon, so expect a lengthy email following this (but possibly not until tomorrow morning).

Main success: getting results I understand and generating meaningful questions (I feel like I'm on the verge of having something to report) Main problem: can't get all the factors to run at once.

So, that's what I'm hoping for help on... I'm planning on writing a more detailed email about successes and further questions, but for now I'll just barf my code into this email b/c maybe you've run into this problem before. The input file is also attached. My notes on the error are at the bottom as part of the code. My apologies for the mess... mostly, my husband's complaining that I'm still on the computer rather than eating dinner, so I'll send the long version later.

 a=read.csv("ReuseDatasetsSnap.csv")
 attach(a)
 names(a)
 str(a)
 xtabs(~ Journal+ResolvableScoreRevised)
 xtabs(~ YearCode+ResolvableScoreRevised)
 xtabs(~ DepositoryAbbrv+ResolvableScoreRevised)
 xtabs(~ DepositoryAbbrvOtherSpecified+ResolvableScoreRevised)
 xtabs(~ TypeOfDataset+ResolvableScoreRevised)
 xtabs(~ BroaderDatatypes+ResolvableScoreRevised)
 library(Design)
 ddist4 <- datadist(Journal+YearCode+DepositoryAbbrv+BroaderDatatypes)  # can't get this to run with all the factors at once, or even two at a time
 options(datadist='ddist4')
 ologit4 <- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass)
 print(ologit4)
 sf <- function(y) c('Y>=0'=qlogis(mean(y >= 0)), 'Y>=1'=qlogis(mean(y >= 1)), 'Y>=2'=qlogis(mean(y >= 2)))
 s <- summary(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, fun=sf)
 s
 # Stop! modify to match output before running --> which, xlim
 plot(s, which=1:3, pch=1:3, xlab='logit', main=' ', xlim=c(-2.3,1.7))
(modify to match output before running --> #Y repeats)

 Error in datadist(Journal + YearCode + DepositoryAbbrv + BroaderDatatypes) :
   fewer than 2 non-missing observations for Journal + YearCode + DepositoryAbbrv + BroaderDatatypes
 In addition: Warning messages:
 1: In Ops.factor(Journal, YearCode) : + not meaningful for factors
 2: In Ops.factor(Journal + YearCode, DepositoryAbbrv) : + not meaningful for factors
 3: In Ops.factor(Journal + YearCode + DepositoryAbbrv, BroaderDatatypes) : + not meaningful for factors

* I got this error before when I was running it non-factor, but it cleared up when I either ran fewer variables at once or coded to dummy variables (1, 2, 3, 4, etc.) instead of letters (ea, eco, bio, etc.)
* Internet searches primarily turn up code that I don't understand, or a few discussion forums that don't make sense to me.
* Search terms used: "datadist" & "not meaningful for factors"; "datadist" & "fewer than 2 non-missing"
* I don't get this problem when running each factor separately. I ran most separately to practice interpretation. The main problem is that ME (journal) is correlated with GenBank (depository) and gene (datatype) = each comes out significantly "better" when run as a separate model (factor by factor)... this is where a multiple-factor model (which isn't working) would come in handy, to (maybe) tease these apart (i.e. is publishing in ME, reusing from GenBank, or using a gene what determines resolvability/attribution?)
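For what it's worth, the warnings point at a likely culprit: inside datadist() the + is being evaluated as arithmetic on factors, which yields all NAs (hence "fewer than 2 non-missing observations"). A base-R illustration of the mechanism, with the probable fix shown in comments (a hedged guess, not tested against this dataset):

```r
f1 <- factor(c("a", "b", "a"))
f2 <- factor(c("x", "y", "y"))

# '+' on factors triggers "not meaningful for factors" and returns NA
summed <- suppressWarnings(f1 + f2)
summed  # all NA -> "fewer than 2 non-missing observations"

# Probable fix: give datadist() the variables as separate arguments
# (or the whole data frame), and keep '+' only inside the lrm formula:
#   ddist4 <- datadist(Journal, YearCode, DepositoryAbbrv, BroaderDatatypes)
#   # or simply: ddist4 <- datadist(a)
#   options(datadist = 'ddist4')
#   ologit4 <- lrm(ResolvableScoreRevised ~ Journal + YearCode +
#                  DepositoryAbbrv + BroaderDatatypes,
#                  data = a, na.action = na.pass)
```

That would also explain why the error disappeared with numeric dummy codes: + is meaningful for numbers, so datadist() got real values instead of NAs.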

Sincerely, Sarah Walker Judson [Quoted text hidden]

ReuseDatasetsSnap.csv 259K

Heather Piwowar <hpiwowar@gmail.com>	 Thu, Jul 22, 2010 at 8:08 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Hi Sarah,

Hmmmm interesting. Possibly related to zeros in your crosstabs? I may have some ideas. No time now but will dig in tomorrow morning.

Heather

[Quoted text hidden]

Heather Piwowar <hpiwowar@gmail.com>	 Fri, Jul 23, 2010 at 7:17 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Sarah,

Good question. I'd say yes, go for it. I agree, it helps to flesh out the story. I like the first option better... it's a linear combination of the other two, then, isn't it? Heather

[Quoted text hidden]

Heather Piwowar <hpiwowar@gmail.com>	 Fri, Jul 23, 2010 at 7:22 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com>

So based on my gut and what Todd said, I think combine them. Include a binary variable for snapshotYN... we hope that that one is not significant, but it will help catch things if it is. And you have a variable for year already, right?

I'm not worried about interpretability.

All of this said, once we get the stats working in general, it is probably worth an email to the data_citation list summarizing your approach and prelim interpretations, so they can give feedback if anything is out of whack methodologically. Heather

[Quoted text hidden]

Heather Piwowar <hpiwowar@gmail.com>	 Fri, Jul 23, 2010 at 7:23 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Sarah, I'm busy till 10am but will have time after that to dig in. Heather

[Quoted text hidden]

Sarah Walker Judson <walker.sarah.3@gmail.com>	 Fri, Jul 23, 2010 at 10:33 AM To: hpiwowar@gmail.com

Heather -

Thanks so much again for your help! Here's the promised "long version"... sorry it took me a while to get it to you. I apologize for the length and don't necessarily expect lengthy responses to each portion; writing this out is helping me think through it all and hopefully helping you become more acquainted with my data. I indicate the most important questions with a double asterisk.

First, attached (PrelimOutputs.txt) are some preliminary results with my interp written under each output. Even though I know we want to be running multiple factors at once (I think) I found this to be a useful exercise to familiarize myself with the statistic and running it in R. It started to reveal some support for trends I was expecting to see (i.e. gene sequences are more resolvable and attributed), so that's promising.


 * Like I said before, the main problem I'm having is running all the factors together. I don't understand the error I'm getting and can't find much help on it (at least not that I can understand). This would be especially helpful for distinguishing if publishing in Molecular Ecology, using a gene sequence, or utilizing Genbank is most influential in having a resolvable/attributable data citation. But at the same time, these are all correlated so it might just be more of a mess because of multicollinearity.

I have some more specific questions in general and about the attribution/resolvability scoring.

In general: - You mentioned "subdiscipline" as a factor yesterday. Were you referring to what I call "data type" or the discipline of the journal? Concerning the latter, many of the journals are classified (according to ISI) under usually two of our three major disciplines, e.g. American Naturalist is classified as Ecology and EvoBio, GCB is classified as Environmental Science and Ecology. Few have just one. I coded this for now as binary for each discipline, but given the existing problems with multiple factors, this might be too much to add. Also, I tend to think most of the journals belong to one discipline more strongly than the other... i.e. I would say AmNat is Ecology, SysBio is EvoBio, and GCB is Environ Sci, etc. This would also reduce the number of factors for this category. Thoughts?


 * - By testing so many factors and character states, aren't we pretty prone to Type 1 error? How do we "prevent" this? Does running factors separately vs. combined help at all?

- For some papers I have multiple datasets per paper. During data collection, I had them all pooled and separated by commas to indicate nuances. Primarily, I only split an article into multiple datasets if they were different datatypes OR if one dataset was a self reuse and the other was acquired via another mechanism. There are about 5-10 incidences where a dataset was split even though they were the same datatype because one was attributed/resolvable and the other was not (i.e. they were acquired in different ways). Will this lead to independence problems? (P.S. I have some preliminary sentences about this for the methods if this doesn't make sense, let me know if you need it).


 * - For some of my factors, I have both a "broad" and a "specified" classification. I'm more inclined to the broad for stats, but always hate to toss resolution. Right now I'm most inclined to keep datatypes broad and depository specific. Here are the classifications for comparison.

Datatypes - Specified (*how data was collected) Bio = organismic, living Paleo = organismic, fossil Eco = community (multi-species) GS = gene sequence GA = gene alignment GO = other gene (blots, protein) Ea = earth (soil, weather, etc) GIS = layers XY = coordinates PT = phylogenetic tree

Datatypes - Broad (*What i am currently using) G = gene (Gs, Ga, Go) O = organismic (living and fossil = Bio and Paleo) S= spatial (GIS, XY) Eco = community (multi-species) Ea = earth (soil, weather, etc) PT = phylogenetic tree

Dataypes - Broader still (haven't attempted) Ecology = organismic (Eco, Bio, Paleo) Environ Sci= spatial & earth EvoBio = gene (PT, GS, GA, GO)

Depository - Specified (*currently using)
G = Genbank
T = treebase
U = url or database (non-depository)
E = extracted literature
O = other (correspondence, not indicated)

Depository - Broad (*results similar to above)
G = Genbank
T = treebase
O = other (url, extracted, correspondence, not indicated)

Depository - Binary (haven't attempted)
D = Depository (i.e. people can both deposit and extract data = genbank, treebase)
O = other

Resolvability:

- I'm having a little problem that will probably require recoding: I only counted a depository reference if it was in the body of the text, not in supplementary appendices or even a table caption. I think later in data collection I started counting a depository reference if it was in a table caption, but still not if it was in a supplementary caption. I want your opinion on how this should be coded in the resolvability categories:
0 = "no information, can't find it" = none of the below
1 = "could find it with extra research" = depository or author or accession ONLY
2 = "could find it just with info provided in the paper" = depository and (author and/or accession)
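For what it's worth, the proposed 3-level recode can be expressed as a small function over three booleans. This is a Python sketch, not the actual recoding script; mapping combinations the email doesn't spell out (e.g. author+accession without a depository) to level 1 is my assumption.

```python
def resolvability_level(depository, author, accession):
    """Collapse the three citation components into the proposed 0/1/2 scale.

    Rules per the email:
      2 = depository and (author and/or accession)
      1 = any single component alone
      0 = nothing
    Combinations not spelled out in the email (e.g. author + accession,
    no depository) fall through to level 1 here -- an assumption.
    """
    if depository and (author or accession):
        return 2   # "could find it just with info provided in the paper"
    if depository or author or accession:
        return 1   # "could find it with extra research"
    return 0       # "no information, can't find it"
```

A dataset with Genbank named in a table caption plus accessions would then be `resolvability_level(True, False, True)`, i.e. a 2.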


I think a table with Genbank mentioned in the table caption and accessions given therein should be a "2". However, I think Genbank mentioned in the header of an appendix followed by accessions (i.e. the same table as before, but in the supplementary information) should be counted as a "1", because you would have to track down the supplementary information, which in the case of Sysbio and other articles is difficult. Again, this assumes Genbank was never mentioned in the body of the paper and the authors said something like "additional information about sequences is provided in appendix a". That gives the reader no guarantee that when they dig up the appendix it will actually have accession numbers; it may just describe which taxa each sequence came from, or a museum voucher number for the specimen. So that's my bias; I just want to see if it's justified in your mind. It will require some recoding no matter what.


Attribution: Another quick question about scoring which, like the above, requires lengthy text to explain. Here are my final scoring categories and explanations:

0 = "the data is not attributed" - no author or accession ("no author" also includes a self citation (i.e. of a previous review paper) with other reuse, i.e. the original data authors are not attributed at all)

1 = "the data is indirectly attributed" - accession only or author only ("author only" also covers a self citation (i.e. of a previous review paper) with other reuse). This still includes self reuses of previous data; in the discussion, I would then report what % self reuse occurs as a caveat/modifier on this information.

2 = "the data is directly attributed" - author and accession, regardless of self. I think if the author reused their own data and gave the accession number, that's great (it happened so rarely that I appreciated it when it did...it seemed less like personal aggrandizement to rack up a citation and more like open data sharing: "hey, you can use this data too" rather than "please go read my other publications to see if you want my data, and maybe you can dig up how to get it in those other papers, because I don't feel like explaining it here").

So, my explanations probably show my bias. I think it's ok to include self reuses, partly because the sample size is small already and partly because some people legitimately reuse their own data. However, I don't think it should count when they cite themselves but really used other data (what I call self citation/other reuse, meaning they refer to their previous collection of data and vaguely state that it was from external sources, but give no credit to the original data authors in the current paper; they might in the previous paper, but I don't think we can assume that). So, again, do my biases/categories seem justified? Should we just throw out self reuses altogether, as you've been doing? Also, as I mentioned above (in "independence"), self reuses and other reuses from the same article were separated for analysis.

Well, hopefully you survived all that. Thanks again for your diligence and continued help!!

Sincerely, Sarah Walker Judson [Quoted text hidden]

PrelimOutputs.txt 21K Sarah Walker Judson <walker.sarah.3@gmail.com>	 Fri, Jul 23, 2010 at 10:59 AM To: hpiwowar@gmail.com One crazy idea about the multiple factor problem. It worked when I ran everything together as dummy variables (not binary like the link you sent, but 0,1,2,3, etc)...that was how I ran it before our chat yesterday. I could numerically code/rank the journals, datatypes, etc according to their coefficients when run separately, then run all the factors together to maybe get at which is the most influential (journal vs. datatype vs. year). I dunno if that even works at all, but it's the only plausible work around I can think of given I don't know a ton about this method. It's probably totally unconventional, but I thought I might as well mention it.

Sincerely, Sarah Walker Judson

P.S. I'm on gchat most of the day (i.e until 6pm), but will be invisible as usual. [Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com>	 Fri, Jul 23, 2010 at 1:10 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Hi Sarah. I'm going to go have lunch and then come back and chat.

Question: have you tried datadist with commas rather than +s?

so: ddist4<- datadist(Journal,YearCode,DepositoryAbbrv,BroaderDatatypes)

These lines seem to run successfully:

ddist4 <- datadist(Journal, YearCode, DepositoryAbbrv, BroaderDatatypes)  # can't get this to run with all the factors at once, or even two at a time
options(datadist='ddist4')
ologit4 <- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass)
print(ologit4)

ok, more chatting later,
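For reference, lrm's cumulative-logit output can be read off by hand: each y>=k intercept plus a journal coefficient gives the log-odds, and the logistic transform turns that into P(score >= k). A Python sketch (not part of the analysis; the numbers are the "journal straight" coefficients from the preliminary output, and Journal=ME is used purely as an example):

```python
import math

def logistic(x):
    """Inverse logit: map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# y>=k intercepts from the preliminary lrm output (illustration only).
intercepts = {1: 1.5626, 2: -0.4607, 3: -1.2100, 4: -1.3597, 5: -2.4611}
coef_ME = 1.7319  # Journal=ME coefficient, relative to the baseline journal

# P(ResolvableScore >= k) for a dataset in ME:
probs = {k: logistic(a + coef_ME) for k, a in intercepts.items()}
for k in sorted(probs):
    print(f"P(ResolvableScore >= {k} | ME) = {probs[k]:.3f}")
```

Because the intercepts decrease with k, the cumulative probabilities decrease too, which is exactly the proportional-odds structure lrm assumes.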

Heather

Sarah Walker Judson <walker.sarah.3@gmail.com>	 Fri, Jul 23, 2010 at 6:27 PM To: hpiwowar@gmail.com, Todd Vision <tjv@bio.unc.edu> Heather -

Recombined datasets I recombined any dataset that was the same article but different citation practice (i.e. self and other, or different sources). I made some binary and factor codes to indicate this if we deem it relevant or important to test for that artifact. I have the min and max score for attribution and resolvability for each and they can be relatively easily modified in later updates.

Now, I'm wondering, should min, max, or average score be used? I've ruled out average to keep things ordinal, but I'm not sure of the (dis)advantages to min and max. Could you enlighten me?

Dealing with Self reuses I recombined datasets that were self (or different practices) into a single dataset to avoid independence problems. Now, the main problem I'm running into is how to use the Self reuses. First, the way I've coded it, and then the resulting options

Coding:
0 = no self (all use of someone else's data)
1 = self citation, other reuse (citing a previous study/review, but not mentioning the original data authors)
2 = some self reuse, some other (some of the data was recycled from a personal previous study, but the gaps were filled in by other sources...common with gene sequences: authors use their own data and then fill in outgroups or missing taxa from a BLAST search)
3 = all self reuse (recycling data from a previous study to look at a new question)

Options:
1. Keep self reuses in the dataset but use the above coding as a factor (maybe we should do this to help us determine whether to throw them out or not).
2. Throw out some level of self reuses...but the question is, which level? I think a score of 1 counts as a bad reuse of someone else's data: using it but not crediting them. A score of 2 is also debatable for inclusion: we could keep the "non-self" portion and toss the self portion (i.e. then we don't have to deal with min and max scores; basically go back to my split datasets, but just use the non-self). A score of 3 should definitely be tossed, since it is just recycling old data (but I stand by my point that option 1 should at least be considered, because I think authors citing their own data are sloppier than when citing someone else's). I'm inclined to toss 3's only (but analyze them separately!) because I think scores of 2 and 1 still hold meaningful information about using other people's data. In terms of sample size: cutting 3's only, we lose 36 from the whole dataset (originally 270, now 245 with recombined datasets); cutting 2's = another 14; cutting 1's = another 9.
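The sample-size arithmetic of each cut, using the counts quoted above (a quick Python sketch; the rule labels are mine):

```python
# Counts from the email: 245 recombined datasets; 36 have self code 3,
# then 14 have code 2, then 9 have code 1.
total = 245
cuts = {
    "drop 3's (all self reuse)": 36,
    "also drop 2's (mixed self/other)": 14,
    "also drop 1's (self cite, other reuse)": 9,
}

remaining = total
for rule, n in cuts.items():
    remaining -= n
    print(f"{rule}: n = {remaining}")
```

So the three exclusion policies leave n = 209, 195, and 186 respectively, which makes the cost of each rule easy to weigh against its interpretive payoff.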

Snapshot vs. Time Series vs. Pooled At a quick glance (not yet considering the elimination of self reuses or classification changes), the results come out pretty much the same however they are run. For resolvability, 2010 always comes out significantly different from 2000; Journals are not significant; Datatypes are not significant (again, without the change in categories yet); and Depositories are significant (Genbank and treebase the same, other significantly different = "worse" resolvability and attribution).

Depository and Datatype reclassifications I haven't finished this yet and am hoping to hash it out this weekend or Monday morning. I'm leaning towards the following classification, but haven't pondered it long and hard yet. I'm open to arguments or new ideas on the matter.

DATATYPE: Reuse: "Genera" of data
raw gene: GS, GO
processed gene: GA, PT
earth: GIS, Ea
species: bio, eco, paleo, xy

Sharing: where the data "should" go (but this could be a separate measure...like "data should go to depository X, Y, or Z" and then "data did go to depository X, Y, or Z")
gene: GS (genbank)
processed gene: GA, PT (treebase)
earth: GIS, Ea (daac - does daac handle both of these? if not, split)
species: bio, eco, xy (dryad)
fossil: pa (paleodb)
other gene: GO (actually, I don't think any of these were shared)
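If it helps, the "where the data should go" mapping above can be written as a simple lookup table. This is a Python sketch only; the depository spellings, and pointing both GIS and Ea at DAAC (which the email leaves as an open question), are my assumptions.

```python
# Proposed datatype -> expected depository mapping (sketch, per the email).
# GIS/Ea -> "DAAC" is a placeholder assumption pending the DAAC question.
EXPECTED_DEPOSITORY = {
    "GS": "GenBank",
    "GA": "TreeBASE", "PT": "TreeBASE",
    "GIS": "DAAC", "Ea": "DAAC",
    "Bio": "Dryad", "Eco": "Dryad", "XY": "Dryad",
    "Pa": "PaleoDB",
    "GO": None,  # other gene data: none observed shared in the sample
}

def expected_depository(datatype):
    """Return the expected depository for a datatype code, or None."""
    return EXPECTED_DEPOSITORY.get(datatype)
```

A "did go where it should" measure would then just compare this lookup against the observed depository for each dataset.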

I haven't come to a conclusion about Depository yet; the main debate is Depository YN vs. Depository plus the Other categories vs. Depository types (specifying genbank and treebase) plus the other categories. I lean towards the middle option: answering the question about resolvability/attribution according to whether the data comes from any depository, from a url/db (non-depository), extracted from the literature, other (correspondence), or not indicated. That's where I'm leaning on that.

Well, these aren't complete thoughts, but I figured I should spit out all of the above so if you had time this weekend, but not Monday I could still get your input.

Thanks again for all your help. Will keep you updated on my progress. Hoping to have a draft Mon/Tues, but am struggling to put words to paper before the stats/methods we've discussed are more solid.

Sincerely, Sarah Walker Judson [Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com>	 Mon, Jul 26, 2010 at 10:09 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Cc: Todd Vision <tjv@bio.unc.edu> Sarah,

Sounds like good progress hashing out the issues. Some thoughts below.

I'm on chat all day... a lot of my focus will be on helping Nic with multivariate R stuff today. That said, if there is something that you need help thinking through and it is holding you back, then send me a chat or email and we'll work through it. Heather

Now, I'm wondering, should min, max, or average score be used? I've ruled out average to keep things ordinal, but I'm not sure of the (dis)advantages to min and max. Could you

I don't know that one is better than the other, but it changes interpretation, for example from "when all of an author's genbank citations include an author name" to "when at least one of the genbank citations include an author name." So I think it just depends on which you'd rather know something about.

Dealing with Self reuses I think I like option 1 for now. It keeps in the most data and lets us tell a story about "citing data" no matter where the data came from, etc. That said, I could believe that working through it may provoke a different opinion. Maybe tomorrow I can step through your R analysis with you and we could go over some of these questions with the data in front of us?

Snapshot vs. Time Series vs. Pooled For pooled, which I'm going to call "combined" because I'm familiar with pooled analysis being something specific and different, you are including all rows from snapshot, all rows from time series, a column YN for whether snapshot or time series and a numeric column for the year, is that right? in that scenario is the year significant? is the snapshotYN significant? curious :)

Heather

Sarah Walker Judson <walker.sarah.3@gmail.com>	 Mon, Jul 26, 2010 at 6:08 PM To: hpiwowar@gmail.com Cc: Todd Vision <tjv@bio.unc.edu> Heather -

Here (attached) are my thoughts on possibilities for factor classifications of data type and depository. I organized my thoughts in an Excel table and thought that would also be the easiest way to send it. I give my rationale for each and highlight the ones I'm leaning towards. Let me know what you think by email or proposing a chat time tomorrow (I'm open anytime). Thanks!

Sincerely, Sarah Walker Judson [Quoted text hidden]

DataAndDepositoryClassification.xls 10K Heather Piwowar <hpiwowar@gmail.com>	 Tue, Jul 27, 2010 at 7:26 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Cc: Todd Vision <tjv@bio.unc.edu> Great, Sarah, let's chat about it today. I have a few meetings and I don't know how long they'll run, so I'll initiate a chat when I am free. Maybe 10 or 10:30 pacific, or else after our group meeting. Heather

[Quoted text hidden] Sarah Walker Judson <walker.sarah.3@gmail.com>	 Tue, Jul 27, 2010 at 9:16 AM To: hpiwowar@gmail.com Sounds good. I'm available at either time.

Sincerely, Sarah Walker Judson [Quoted text hidden]

Email: Advice on data collections in stats
3 messages Heather Piwowar <hpiwowar@gmail.com>	 Thu, Jul 22, 2010 at 8:07 PM Reply-To: hpiwowar@gmail.com To: Todd Vision <tjv@bio.unc.edu>, Sarah Walker Judson <walker.sarah.3@gmail.com> Todd,

Could do with some stats advice.

Sarah collected data in two different ways: randomly and consecutively. My guess is that she can concatenate these for her main analysis... maybe with a binary variable indicating the type of data collection to hopefully catch artifacts.

That said, I'm a bit unsure and I don't want to lead her down the wrong path.

What do you think? Heather

-- Forwarded message -- From: Sarah Walker Judson <walker.sarah.3@gmail.com> Date: Thu, Jul 22, 2010 at 4:09 PM Subject: Re: Scoring and Stats questions_Dataone To: hpiwowar@gmail.com

Also,

you mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009 for sysbio and amnat only) to get a bigger sample size. the former was collected sequentially and the latter randomly. i'm a bit worried this affects assumptions about data collection, but don't know if this is as strict of an assumption in this arena as in biology. i was thinking of running both separately, then pooling, and then choosing one as the focus (probably the 2000/2010) for reporting, stating whether or not the other sets produced similar results.

thoughts?

thanks.

Sincerely, Sarah Walker Judson

On Thu, Jul 22, 2010 at 3:49 PM, Sarah Walker Judson <walker.sarah.3@gmail.com> wrote: Sorry, one question I forgot (but isn't urgent since I have a lot to chew on): should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?

To reiterate (it's kind of a combination of resolvability and attribution): Ideal (previously "Good") citation score
Ideal_CitationYN* This came out the same as my Knoxville calculation of author+depository+accession
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both

Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession

To adapt to an ordinal scale, it could either be: Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN") or resolvable ("Yes" in "ResolvableYN")
2=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only or author only or accession only
2=(depository and author) or (depository and accession) or (author and accession)
3=depository, author, and accession
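Both versions of the gradient can be computed mechanically from the three binary components. A Python sketch (not from the analysis code; note the collapsed ordinal version is simply a count of components present):

```python
# Ideal_CitationScoreGoodGradient as an explicit lookup table, keyed by
# (depository, author, accession) booleans, per the scale in the email.
GRADIENT = {
    (False, False, False): 0,
    (True,  False, False): 1,  # depository only
    (False, True,  False): 2,  # author only
    (False, False, True ): 3,  # accession only
    (True,  True,  False): 4,  # depository and author
    (True,  False, True ): 5,  # depository and accession
    (False, True,  True ): 6,  # author and accession
    (True,  True,  True ): 7,  # depository, author, and accession
}

def gradient_score(depository, author, accession):
    """8-level gradient score for one dataset."""
    return GRADIENT[(depository, author, accession)]

def collapsed_score(depository, author, accession):
    """Ordinal 0-3 version: how many of the three components are present."""
    return sum([depository, author, accession])
```

The lookup makes explicit that the 8-level version imposes an ordering among the single- and double-component combinations, while the collapsed version deliberately does not.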

Thoughts?

This might be another "out of the scope of this project" thing, or it might be redundant with resolvability and attribution, or it might be essential...I dunno. As I rethink it, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable?). Alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?

Again, no rush...I've got plenty to work on.

Thanks!!!

Sincerely, Sarah Walker Judson

Todd Vision <tjv@bio.unc.edu>	 Fri, Jul 23, 2010 at 4:41 AM To: "hpiwowar@gmail.com" <hpiwowar@gmail.com> Cc: Sarah Walker Judson <walker.sarah.3@gmail.com> I haven't been following the discussion closely enough to be sure, but a general approach would be to combine, test for heteregeneity and, in its absence, accept the stats on the combined sample. But the population of study for the two sets sounds sufficiently different (ie the identity of journals, as opposed to the random/sequential distinction) that a combined analysis would be difficult to interpret. I'm not sure you would not want to use them to ask the same questions.

Todd [Quoted text hidden] --

Todd Vision

Associate Professor Department of Biology University of North Carolina at Chapel Hill

Associate Director for Informatics National Evolutionary Synthesis Center http://www.nescent.org

Heather Piwowar <hpiwowar@gmail.com>	 Fri, Jul 23, 2010 at 7:14 AM Reply-To: hpiwowar@gmail.com To: Todd Vision <tjv@bio.unc.edu> Cc: Sarah Walker Judson <walker.sarah.3@gmail.com> Thanks Todd, that's helpful. Heather

Chat Transcript July 22
2:39 PM Heather: Hi Sarah! You around? Sarah: yep...trying to work on what you sent Heather: cool. I don't have any definitive answers, just ideas. Sarah: ok 2:40 PM Heather: what do you think? based on what you've seen, do you think continuing down this path a bit more makes sense? or turn to the chi-sq or poisson or something else? Sarah: yeah, i think it will work, it's just a steeper learning curve for me than i expected 2:41 PM my main question before proceeding on, is: what exactly are we trying to test with this? Heather: yeah, agreed. it feels like good stuff to know Sarah: i.e. mainly trying to say which factor is most influential? Heather: but always a bit hard to be learning it when you actually need to use it, yesterday. Sarah: or just trying to get a measure of significance to stamp on the results? 2:42 PM to me, the percentage table breakdowns really show what's going on Heather: right good question. so what the percentage table doesn't give us is a multivariate look that tries to hold potential confounders constant 2:43 PM Sarah: agreed Heather: so we are trying to figure out which factors are important which ones aren't Sarah: still agreed Heather: yeah, I think that is mostly it :) maybe also 2:44 PM an estimate of prevalence and percentages in some factors, independently of confounders  for example  journal impact factors vary a lot by subdiscipline 2:45 PM and there is lots of prior evidence (mine, nic's, etc) that high impact factors  correlate with stronger policies, and probably more sharing Sarah: i wasn't considering throwing in high impact factors b/c the journals were already selected by that criteria and there are only 6 journals, so i didn't think that would be an informative variable 2:46 PM Heather: so it would be ideal if we could decouple impact factor from rates of sharing  yeah, I hear you. true, you only have 6 journals to work from. 
Sarah: i would use it more as explanatory in the discussion to maybe say why one journal was better or worse....or running impact factor as a variable if journal was a significant predictor Heather: right. 2:47 PM so then let me change my example Sarah: ok, sorry to get off track on details Heather: to be figuring out the relative rates of sharing in any given journal Sarah: so say, journal vs. datatype Heather: right assuming the mix of datatypes was the same 2:48 PM (which obviously it isn't) so I'd say the goals are Sarah: yeah, it's highly coorelated with journal Heather: right (which of course will make it hard for the stats to tease it out but that's life) so I'd say the goals are a) which factors are important 2:49 PM b) what are the relative levels of sharing, independent of other variables that sync with what you think? Sarah: could you explain what you mean by b)? just percentages by journal/datatype, etc? Heather: yeah. let's see, my head needs to get more into your data. 2:50 PM so, for example, we could want to say that  when data is sequence data, the odds that it will be shared 2:51 PM Sarah: are better than if it's ecological Heather: at a really high level of best-practice  are 1.5 times more than if it were ecological.  yes, exactly.  so you would have to choose a "baseline" 2:52 PM (or I think when you define the variable as a factor, it chooses a baseline for you? I forget) Sarah: hmmm...i'm not clear Heather: .... independent of whatever journal it is published in 2:53 PM Sarah: (and on a technical note, I'm having trouble defining factors in Design) Heather: ok.... oh, I think that is easy. can you just say as.factor(x) or factor(x), or does that not work? 2:54 PM Sarah: as.factor hasn't been working, in the documentation it says: "In addition to model.frame.default being replaced by a modified version, [. and [.factor are replaced by versions which carry along the label attribute of a variable." Heather: hmmmm. apparently not easy.  
I can look in my code. Sarah: basically, the only way I could get my data to behave in Design was to substitute numbers  but that makes things act like ordinal scales 2:55 PM Heather: yeah, and that is probably making the results strange Sarah: let me dig up the output differences real quick  journal straight:

             Coef      S.E.     Wald Z   P
y>=1         1.5626    0.4291    3.64    0.0003
y>=2        -0.4607    0.4010   -1.15    0.2506
y>=3        -1.2100    0.4127   -2.93    0.0034
y>=4        -1.3597    0.4164   -3.27    0.0011
y>=5        -2.4611    0.4617   -5.33    0.0000
Journal=EC  -1.5524    0.6321   -2.46    0.0140
Journal=GCB -1.3413    0.5323   -2.52    0.0117
Journal=ME   1.7319    0.5661    3.06    0.0022
Journal=PB  -0.6605    0.5230   -1.26    0.2066
Journal=SB   0.7495    0.4725    1.59    0.1127

journal coded as a dummy number (1,2,3,4,5 & 6):

             Coef      S.E.     Wald Z   P
y>=1         2.84154   0.42461   6.69    0.0000
y>=2         0.90291   0.36971   2.44    0.0146
y>=3         0.18469   0.36705   0.50    0.6148
y>=4         0.04250   0.36783   0.12    0.9080
y>=5        -1.01183   0.39524  -2.56    0.0105
JournalCode -0.55736   0.09414  -5.92    0.0000

2:56 PM Heather: oh, leaving the journal straight, do you mean you have 5 different binary variables? Sarah: no, i don't but i think Design is interpreting it as such my columns look like: Journal ME GBC ME SB  etc 2:57 PM or Journal sorry, JournalCode Heather: right. so I'm guessing then maybe Design is interpreting it as a factor already? Sarah: 1 2 1  3 Heather: try str(yourdataframe) and see what datatype R thinks the Journal column is? 2:58 PM Sarah: yep it's coming through as a factor Heather: ok, good. confusing for you, but good. Sarah: but, that type of output doesn't make a lick of sense to me Heather: ok.
2:59 PM Sarah: well, i guess it just will give a ton of covariates like you were saying Heather: right Sarah: which seems like you would have type1 error problems Heather: one for each journal in the list in the results that looks like this Journal=EC -1.5524 0.6321 -2.46 0.0140 Journal=GCB -1.3413 0.5323 -2.52 0.0117 Journal=ME 1.7319 0.5661 3.06 0.0022 Journal=PB -0.6605 0.5230 -1.26 0.2066 Journal=SB 0.7495 0.4725 1.59 0.1127 does that include all of your journals, or is it missing one? 3:00 PM Sarah: ummm...it's missing Amnat Heather: right Sarah: I might have just not copied it Heather: so I think that means it used Amnat as the base good question... go have a look.... Sarah: oh Heather: I'm thinking it might not be there it is kind of like having a state column 3:01 PM well nevermind that analogy was just going to make things worse. Sarah: it's in the table (there was a chance it didn't have reuse) Heather: can you see, is it actually missing Amnat? Sarah: it's in the crosstab Heather: in the regression results? but it is in the input data? 3:02 PM Sarah: yep, it's in the raw table and the crosstab Heather: yup. 
Sarah:
ResolvableScore
Journal   0   1   2   3   4   5
AN        3   9   4   0   4   0
EC       10   4   2   0   2   0
GCB      13  13   3   0   1   0
ME        0   6   2   2   4   8
PB        9  13   4   0   4   0
SB        4  19   8   2   7  10
Heather: so that means that it is using it as the base case those results mean, 3:03 PM or rather this line Journal=PB -0.6605 0.5230 -1.26 0.2066 means that 3:04 PM "whether the journal was amnat or PB made no difference in how we'd predict the citation-quality score" (or whatever it was that regression was regressing on) whereas Journal=EC -1.5524 0.6321 -2.46 0.0140 means 3:05 PM "being in the journal EC made a difference to the citation-quality score, compared to being in the journal Amnat, p=0.01" and if you want to see how big a difference Sarah: ok...that makes more sense Heather: we have to look at the coefficients and decode them but they would tell us something like 3:06 PM Sarah: gut would say EC (ecology) is worse than Amnat and ME (molecular eco) is better so, yeah, EC is neg and ME positive Heather: "being in the journal EC made a dataset 1.4 times more likely to have a quality score 1 level higher than an equivalent study published in amnat" or something like that oh, whichever, I didn't try to make my guess very realistic :) Sarah: makes sense for the others based on what i know about the data 3:07 PM Heather: and I'd have to reread the ordered tutorial to make 100% sure I'm getting my summary blurb right Sarah: so then, can i force which journal (or factor) is the base? Heather: about "1 level higher" etc because I'm not very used to this ordinal stuff but that is the general idea Sarah: i.e. my impression of which is "worst" or "best" Heather: good question  maybe  I think so Sarah: or, should I let the stats remove my bias? Heather: if you do levels(dataframe$Journal) what does it say?
Sarah: or determine "worst" from the pivot tables Heather: urm, I think mathematically it doesn't matter 3:08 PM Sarah: just for interp Heather: so there is an advantage to picking one that is easy to interp  exactly  I wonder how it picks it now?  might be the level with the most N  which would probably be a good call regardless Sarah: that code didn't work i'm getting an error "dataframe not found"...oh sorry, i need to insert my data object there whoops 3:09 PM just a sec yeah, so AN is the first, but they are arranged alphabetically, not in order of encounter in the raw table Heather: interesting. so I'm guessing it might use levels[0] Sarah: i mean, it may correlated with sample size, but i don't think so Heather: as the base? 3:10 PM Sarah: i'm not familiar with levels, sorry Heather: ok Sarah: so i can't make an intelligent stab at that Heather: no problem so a factor is a vector Sarah: but, i can figure it out to spare you the time, or just interp the way it comes out it makes a lot more sense now Heather: well, hrm no levels is like the "codebook" that is uses to code factors 3:11 PM try ?levels and for what it is worth it isn't the most intuitive part of R to me either so I'd maybe skip trying to force that for now let it pick what it wants to pick and later when/if we decide this is the way we want to go and you see an 3:12 PM opportunity to really improve the interpretation by forcing it, figure it out then... anyway, your call, but that's what I'd do. so right now to interp your results 3:13 PM you'll have to see what is left out of the results output, or check out levels for each of your categorical variables or some combo does that make sense? or enough sense? 3:14 PM Sarah: yep... 
a lot more sense than before Heather: cool Sarah: i thought the output with all the journal types listed out was the wrong way Heather: so for what it is worth Sarah: because the examples use all binary coding rather than A, B, C, etc Heather: your y output variable could be coded as a factor as well if you wanted to 3:15 PM Sarah: well, i tried to order it in a way that was somewhat "bad to god" Heather: and then you can do a multinomial regression on that unordered-factor-category y variable, Sarah: *good Heather: like in the last tutorial I sent. right! and mostly that is a great idea the only reason you could, maybe, treat it like a factor instead is to get around the "proportional odds" stuff 3:16 PM Sarah: ok...i don't know how that will come out with this new way Heather: by seeing how it behaves if you just remove all semblance of order. Sarah: ok. i'll try this again and maybe give that a shot Heather: right, I don't know either. And I'm not necessarily really recommending it..... except maybe..... kind of like how we always use two-sided p-values 3:17 PM we think we know which way the interaction will go and so we could, in theory, use a one-sided p-value but maybe we are wrong and we should use stat tests that reflect that Sarah: hmmm...i'm not following 3:18 PM Heather: let me back up for a minute and ask a question to make sure I'm on the right page, because I forget for your "best practice" levels, does everything that meets the criteria to be in level 3 also meet the criteria to be in level 2? 3:19 PM Sarah: no ResolvableScore

0=no Depository or Accession or Author (Justification: you know they used data but not exactly how it was obtained = probably couldn't find it again…i.e. "data was obtained from the published literature")
1=Author Only (Justification: you could track down the original paper which might contain an accession or extractable info)
2=Depository or Database Only (Justification: You might be able to look up the same species/taxon and find the information per the criteria in the methods)
3=Accession Only (Justification: Accession number given but depository not specified = you would probably be able to infer which depository it came from based on format, just as I was usually able to tell that they were genbank sequences by the format even though genbank was never mentioned anywhere in the paper)
4=Depository and Author (Justification: Although no accession given, many depositories also have a search option for the author/title of the original article which connects to the data)
5=Depository and Accession (Justification: "Best" resolvability….unique id and known depository = should find exact data that was used)
3:20 PM oh, sorry i copied and pasted before i realized that was the long version Heather: right, pulled it up too Sarah: but, could make it so it was Heather: so, by treating that as an ordered variable, we are making some assumptions that may not be true 3:21 PM Sarah: yes, like that my ranking is reflective of true difficulty of finding a dataset Heather: if we think about other ordered variables, people who think something is "very good" also think it is at least "good" right :) Sarah: yeah....but, i'm also grappling with the problem here that we have a perception of a good practice, but most of the data doesn't meet that 3:22 PM Heather: I'd say, perhaps we could improve a few things at once Sarah: i.e.
we'd like to see depository and accession mentioned Heather: by collapsing your variables Sarah: but most just give authors  the ordered version i see is: Heather: into interpretable levels Sarah: author only  depository and author Heather: yeah, but I'd even try using other lingo for a minute Sarah: depository and author and accession 3:23 PM Heather: so "no information, can't find it" Sarah: but, i have very few of the latter Heather: "could find it with extra research"  "could find it just with info provided in the paper"  or something like that Sarah: ok...  but i'm talking about those same things just by the criteria i'm defining them 3:24 PM Heather: then you have a codebook to know what criteria you use to apply those labels yeah but there aren't 6 that make sense to talk about when you stop talking about their criteria, do you know what I mean? Sarah: "no information, can't find it" = none of the below "could find it with extra research" = depository or author ONLY 3:25 PM Heather: in some ways, the people reading the paper don't care if a citation includes the author and one of depository or... Sarah: "could find it just with info provided in the paper" = depository and (author or accession) Heather: they care...
Heather: can I FIND it :) or am I attributed, or whatever
Heather: yup
Sarah: sorry, i'll put those in order
Sarah: "no information, can't find it" = none of the below
Sarah: "could find it with extra research" = depository or author ONLY
Heather: and I think that will help with the ordered interpretation
Sarah: "could find it just with info provided in the paper" = depository and (author or accession)
Heather: and reducing the number of levels
3:26 PM
Heather: (which will help with N in cells and maybe proportionality)
Sarah: and then use percentages to just state the paltry number of papers that give the accession number
Heather: yup
Sarah: rather than holding accession as the holy grail
Heather: and ditto on attribution
Heather: make it what matters
3:27 PM
Heather: so "the author is not attributed"
Heather: "the author is indirectly attributed"
Heather: "the author is directly attributed"
Heather: (and maybe this means you need another endpoint for the depository attribution?)
3:28 PM
Heather: anyway... I wouldn't spend oodles of time reworking things into this framework
Sarah: i don't think it will be bad
Heather: because maybe it won't be practical or Todd won't like the direction or whatever..... but that's what my gut tells me.
Sarah: i still like my original categories for display tables, but you're right about the meaning for stats
Sarah: maybe that will also keep todd happy
3:29 PM
Heather: yes agreed! good point. :) And I don't want to put words in Todd's mouth, I don't know what he will think....
Sarah: no, i think we all think accession number (direct data attribution) is the holy grail
Sarah: but, that's just not a reality in this data
Heather: yeah. so then can you define a midpoint or two between that and nothing
3:30 PM
Sarah: one quick question on the attribution scale,
Heather: yup?
Sarah: would accession number (without an author name) be direct or indirect? i say indirect, but it hurts when we want to show accession as the epitome
Heather: yeah, I'd say that too.
Sarah: of a good data citation
3:31 PM
Heather: yeah, but you know what?
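The three-level collapse proposed in this exchange could be coded roughly as follows. This is a Python sketch, not the actual analysis code (which was in R); the function and argument names are invented, and the placement of "accession only" at level 1 is an assumption, since the chat's collapse rule does not mention that case explicitly.

```python
def resolvability_level(has_author: bool, has_depository: bool, has_accession: bool) -> int:
    """Collapse the 0-5 resolvability score into three ordered levels:
       2 = could find it just with info in the paper (depository and (author or accession))
       1 = could find it with extra research (depository or author only;
           accession-only is placed here as an assumption)
       0 = no information, can't find it
    """
    if has_depository and (has_author or has_accession):
        return 2
    if has_depository or has_author or has_accession:
        return 1
    return 0
```

Fewer levels also helps with the small-N-per-cell and proportional-odds concerns Heather raises.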
Heather: when you put it that way, accession number isn't actually the epitome of everything
Sarah: yeah
Heather: that is what genbank mostly does right?
Sarah: yep, exactly
Heather: and it comes under fire in terms of people not getting direct attribution
Sarah: but it's not standard in the literature by any means
Heather: so if your data reflects that, probably all the better
Sarah: a lot of people say "i searched genbank and used sequences by author a, b, and c"
3:32 PM
Heather: do they really? I wouldn't have expected that. I've mostly seen "and used accession number A, B, C"
Sarah: ok, so can i run the attribution categories by you real quick?
Heather: yes, I think I've got 10 more mins
Sarah: "the author is not attributed" - no author or accession
Sarah: "the author is indirectly attributed" - accession only
Sarah: "the author is directly attributed" - author and accession
3:33 PM
Sarah: wait...that excludes author only
Sarah: "the author is not attributed" - no author or accession
Sarah: "the author is indirectly attributed" - accession only or author only
Sarah: "the author is directly attributed" - author and accession
Sarah: hm....but "author only" is direct
Heather: do you need "accession" on directly attributed? right.
Sarah: "the author is not attributed" - no author or accession
Sarah: "the author is indirectly attributed" - accession only
Sarah: "the author is directly attributed" - author and accession, or author only
Heather: seems strange, but in terms of attribution per se, accession not needed
Sarah: but, then that's not ordered
3:34 PM
Sarah: directly does not necessarily include the indirect
Heather: good point
Heather: thinking
3:35 PM
Sarah: one addition: correspondence (i.e. data set was obtained from my buddy so-and-so)
Sarah: "the author is not attributed" - no author or accession
Sarah: "the author is indirectly attributed" - accession only or correspondence only
Sarah: "the author is directly attributed" - author and accession, or author only
Heather: well hrm
Heather: I'm not quite sure what to think.
Sarah: we could change it to "data directly attributed"
3:36 PM
Sarah: "the data is not attributed" - no author or accession
Sarah: "the data is indirectly attributed" - accession only or author only
Sarah: "the data is directly attributed" - author and accession
Sarah: or call it "data authorship"
Heather: yeah, that works I think, doesn't it?
3:37 PM
Sarah: that's more what we're interested in too....is the DATA being cited?
Sarah: still brings in the problem of author attribution as the current mode of tracking data
Heather: yes, exactly
Heather: nice
Heather: ok, I have to run.
Sarah: ok. thanks soooo much!
Heather: I'm guessing we aren't out of the woods yet but making progress
Sarah: i'm not used to categorical stats and that helped a bunch
Heather: great
3:38 PM
Sarah: ok, sure, will send through email or whatever
Sarah: and usually when you get an email from me, i'm available on chat for the next little while
Sarah: thanks!
Heather: ok, good to know. I think I'll be AWOL tonight, but avail tomorrow. bye!
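The "data attribution" scheme the chat settles on could be sketched like this (Python illustration; the real coding was done by hand and analyzed in R; the treatment of correspondence-only as indirect follows Sarah's earlier proposal, which the chat left not fully resolved):

```python
def data_attribution(has_author: bool, has_accession: bool,
                     correspondence_only: bool = False) -> int:
    """Three ordered levels from the chat's final 'data attribution' scheme:
       2 = data directly attributed (author and accession)
       1 = data indirectly attributed (accession only or author only;
           correspondence-only placed here per Sarah's earlier proposal)
       0 = data not attributed (no author or accession)
    """
    if has_author and has_accession:
        return 2
    if has_author or has_accession or correspondence_only:
        return 1
    return 0
```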

Chat Transcript July 23
1:06 PM
Heather: Have you tried datadist with commas rather than +s
Heather: so ddist4 <- datadist(Journal, YearCode, DepositoryAbbrv, BroaderDatatypes)
(7 minutes)
1:14 PM
Sarah: nope, was running them as pluses per the tutorial. will try commas while you are at lunch
1:15 PM
Sarah: golden. now it works
(29 minutes)
1:44 PM
Heather: sweet! I'm back and ready for chatting whenever you are.
Sarah: i'm good
1:45 PM
Sarah: i just reran the data altogether and was attempting interpretation
Heather: great
Heather: chat now then?
Heather: in no particular order, maybe we start with pvalues and the fact that we are looking at lots of them
1:46 PM
Heather: so I think it definitely does mean that we need to use a threshold lower than 0.05 for each particular coefficient
Sarah: ok, so account for type 1 error by lowering the alpha level
Heather: yup
Sarah: and just be straightforward about that in the methods
Heather: not quite sure the best practice, will read up
Heather: yes
1:47 PM
Heather: I don't think that putting them in individually vs via factors makes a difference (unless they are doing something really smart for factors in the regression code, which they might be)
1:48 PM
Heather: hey this reminds me... try anova(ologit4)
Heather: does it do that in the tutorial? I think that it collapses all of the factors into overall p-values....
Sarah: it might have at the end, but i was getting lost after some of the diagnostics and coefficient problems yesterday
1:49 PM
Heather: I learned that in the Harrell (sp?) book doing my thesis
Heather: relevant I think
Heather: so let's keep it in mind as an alternate view of results
Sarah: here's what i got:
Sarah: Wald Statistics    Response: ResolvableScoreRevised
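One simple way to implement "a threshold lower than 0.05 for each particular coefficient", as discussed above, is a Bonferroni correction: divide the family-wise alpha by the number of tests. A minimal sketch (the count of 13 is illustrative, borrowed from the total d.f. in the Wald table below; the analysis itself was in R, where `p.adjust` does this):

```python
# Bonferroni-adjusted per-test threshold: alpha_family / number_of_tests.
n_tests = 13          # illustrative: total d.f. across model coefficients
alpha_family = 0.05
alpha_per_test = alpha_family / n_tests
print(alpha_per_test)  # roughly 0.0038 -- each coefficient must beat this
```

Bonferroni is conservative; less strict alternatives (e.g. Holm) exist, which is presumably part of the "read up on best practice".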

Factor             Chi-Square  d.f.  P
Journal                  8.18     5  0.1464
YearCode                 7.03     1  0.0080
DepositoryAbbrv         32.43     2  <.0001
BroaderDatatypes         6.29     5  0.2788
TOTAL                   57.27    13  <.0001

Heather: right
Sarah: and that's what i was seeing
Sarah: kind of like a chi sq for each factor
1:50 PM
Heather: whereas just print(ologit4) breaks it down
Heather: yup
Heather: so I think a way that people interpret this is that they do the anova first to determine if the factor has an effect at some p = 0.05/number of variables or something
1:51 PM
Heather: and then for factors with an effect, they look at the individual coefficients and interpret them
Heather: I mean the individual levels within the significant factors
Sarah: makes sense
Heather: cool
Heather: regardless, it is all best reported in the spirit of hypothesis-generating
1:52 PM
Heather: exploratory, etc
Sarah: when i ran things individually, datatypes and journal were significant factors, but all together not so much
Heather: interesting potentials for interpretation there
Sarah: mostly, anything genetic related was coming out as significantly different
Heather: yup, makes sense
1:53 PM
Heather: so I have a few ideas about your classifications
Sarah: but then this makes it look like the depository is the most influential factor in determining that
Heather: yeah
Sarah: ok, shoot
Heather: so one idea about depository is to combine G and T, and keep E, O, and U all separate
Heather: reason is that G and T are the same "kind": centralized, best practice, etc
1:54 PM
Heather: the others are all different kinds
Heather: what do you think?
Heather: (whereas, foreshadowing to typeofdataset, I think I'd argue that GS and PT stay as their own individual types)
Sarah: yeah agreed,
1:55 PM
Sarah: when i wrote that email this morning i thought, well that was obvious, but hadn't thought of that before
Heather: I think G+T, E, O, U all have the N to stand alone
Heather: doesn't hurt that G and T have similar distributions in ResolvableScoreRevised either, phewph
Sarah: so....keep other and "not indicated" separate as well?
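The per-factor p-values in the Wald table above can be sanity-checked from the chi-square statistics. For a 1-d.f. chi-square, the upper-tail p-value is `erfc(sqrt(x/2))`; a stdlib-only check of the YearCode row (the original table comes from R's `anova()` on a Design-library fit):

```python
from math import erfc, sqrt

# Upper-tail p-value of a chi-square statistic with 1 degree of freedom,
# equivalent to twice the upper normal tail of sqrt(x).
chi_sq_yearcode = 7.03  # YearCode row of the Wald table, d.f. = 1
p_yearcode = erfc(sqrt(chi_sq_yearcode / 2))
print(p_yearcode)  # close to the table's 0.0080
```

Rows with more degrees of freedom need the general chi-square survival function (e.g. `scipy.stats.chi2.sf` in Python, or `pchisq(..., lower.tail = FALSE)` in R).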
Heather: yeah, I would
Heather: I think they are different and have enough N, it looks like
1:56 PM
Sarah: agreed, b/c not indicated should have less resolve/attribution
Heather: right. and "other" is interesting in and of itself
1:57 PM
Heather: so with that in mind, let's think about typeofdataset if you are ready
Sarah: ok
Heather: so I think that GS should stand alone
Heather: mostly due to our (perhaps unstated) hypothesis that genbank is its own universe
Heather: or rather that data-that-can-go-into-genbank is its own universe
1:58 PM
Heather: ditto data-that-can-go-into-treebase
Sarah: yeah, but what about instead of using journal to define discipline, the datatype defines that?
Sarah: because otherwise the factor is redundant of depository
Sarah: i guess maybe not, since depository now lumps genbank and treebase
1:59 PM
Heather: hmmm. so maybe let's hold off thinking about discipline for now
Sarah: but gs could count for both
Sarah: yeah, i'm making more of a mess by analyzing it too much
Heather: because I think that is fuzzier than dataset
Sarah: so....maybe: data that should go to a, b, and c
Sarah: so define the data by its destination?
Heather: ok, you've lost me a bit
Heather: so this variable
Sarah: then bio and eco should both go to dryad
Sarah: oh, sorry
Heather: TypeOfDataset
2:00 PM
Sarah: yep
Heather: has 10 different values
Heather: and one question is whether to keep all 10 distinct in the stats analysis, or how to group them
Heather: to that question, for what it is worth, to me the answer is a bit different than any of your proposals
Heather: it would be something along these lines:
2:01 PM
Heather: GS, GA+GO, GIS+XY, PT, EA, Bio+PA(+Eco probably)
Heather: I can also see the value of keeping all 10 levels
Sarah: but, ga should be deposited in treebase
Heather: and mostly just shy away from that due to desire to keep number of vars down
Sarah: so, with pt
Heather: oh! my mistake. PT also goes in treebase, is that right? yeah, with pt then
2:02 PM
Sarah: pt w/ an associated ga
Heather: definitely informing/biasing our groupings by our hypotheses, but I think that is ok
Heather: yup.
Heather: (you know better than I do about the details here....)
Sarah: yeah, that's why i thought to maybe lump all genetic info
Heather: what do you think?
Heather: yeah, but does all genetic info go into genbank?
2:03 PM
Heather: if so, then yes, lump
Sarah: no
Heather: yeah, then I'd keep the genbankable stuff distinct
Sarah: the problem is too, that all spatial data (gis, xy) don't have a happy place to go together, nor any depository at all
Heather: yeah
Sarah: so, the main problem is, are we lumping by supposed depository or by type of data
Heather: then maybe ideally they would stand alone
Sarah: or use of the data?
2:04 PM
Heather: the goal would be to lump by type of data
Sarah: i.e. ga and pt are needed for a reanalysis of a tree, but you could also use gs for that, which many do
Heather: I think by type rather than use
Sarah: so....type of data i would do...
Sarah: raw gene: GS, GO; processed gene: GA, PT
2:05 PM
Sarah: spatial: GIS, XY; earth species: bio, eco, paleo
Heather: looks good to me, though I'll push back and ask whether we keep GS and GO separate
Heather: admittedly GO doesn't have much N
Sarah: yes
Heather: but it would be really helpful to have a clear picture of just GS
Sarah: it's the older stuff
2:06 PM
Sarah: like blots and arrays, pre- fast sequencing
Heather: yeah. for my information, would that include microarrays? stuff that would go into GEO, ArrayExpress?
Sarah: hmmm...i don't think so; in what i saw, i never saw those databases
Heather: ok. just curious. I think there is some in this general domain... I saw some at the Evo conference
2:07 PM
Heather: but not much, and maybe not in the specific journals you looked at
Heather: cool, thanks, helps me sync it up with the bit of the field that I know.
Heather: ok, so want to recap proposed groupings?
Sarah: i think some people are moving that way to look at gene expression rather than just raw genes
Heather: for TypeofDataset
Heather: yeah... though the data is pretty messy. relative, analog, etc.
2:08 PM
Heather: I think it is starting to fade out again in some areas. anyway.
Sarah: ok...one other question...
2:09 PM
Sarah: gis is more "earth" spatial data and xy is more "species" (i.e. occurrences), and they would be found/posted in totally different places
Heather: yeah
Sarah: mostly i know that from my past bio work
Heather: so ideally maybe we'd keep them separate
Heather: but their N is just really small
2:10 PM
Heather: it makes me think in that case if we just go up a level of abstraction
Sarah: but, is it better to lump them into earth and species, or lump them together as spatial
Heather: oh I see. yeah, so I have no idea. I could see arguments either way
2:11 PM
Heather: what do you think?
Sarah: i think that i would like them posted together b/c of my experience in that field, but most people that use that data use either GIS or GIS+XY, usually not XY alone
2:12 PM
Sarah: so, GIS probably is distinct
Heather: ok. so what would that make your overall groupings look like?
Sarah: but, on the other hand, most people usually cite xy and gis in this discipline since it's more for biology purposes
Sarah: hmmm... still undecided
2:13 PM
Sarah: raw gene: GS, GO; processed gene: GA, PT; spatial: GIS, XY; earth species: bio, eco, paleo
Sarah: sorry, i was editing that and accidentally released it
Sarah: raw gene: GS, GO; processed gene: GA, PT; earth: GIS, EA; species: bio, eco, paleo, XY
Sarah: b/c in these studies, xy was only given for species occurrences
2:14 PM
Sarah: though in other arenas it could potentially be given for other info
Heather: hrm wait
Sarah: there may be some gis instances that were non-earth, but most were climate
Heather: I think I'd keep GS separate, right?
Sarah: oh sorry, i was overly concerned with the earth/spatial problem
Heather: yup
Sarah: yeah, i guess, but is GO too small alone?
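The recoding being debated above amounts to mapping the raw factor levels to coarser groups before refitting. A Python sketch of one of the candidate schemes (this is only one of several proposals in the chat, not the final choice; the analysis itself would do this in R, e.g. with named level vectors):

```python
# Hypothetical recoding tables using the abbreviations from the chat.
# G = GenBank, T = TreeBASE, E/O/U = other depository codes.
depository_groups = {
    "G": "G+T", "T": "G+T",        # both centralized, best-practice depositories
    "E": "E", "O": "O", "U": "U",  # each kept as its own level
}

# One of Sarah's proposed TypeOfDataset groupings (Heather argued GS might
# instead stand alone; that decision was left open).
datatype_groups = {
    "GS": "raw gene", "GO": "raw gene",
    "GA": "processed gene", "PT": "processed gene",
    "GIS": "spatial", "XY": "spatial",
    "Bio": "species", "Eco": "species", "PA": "species",
}

recoded = [datatype_groups[d] for d in ["GS", "GA", "XY"]]
print(recoded)  # ['raw gene', 'processed gene', 'spatial']
```

Collapsing levels this way spends fewer degrees of freedom and raises the N in each cell, at the cost of the finer distinctions.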
2:15 PM
Heather: yeah, it kind of is, but I think it is better than the alternative of lumping it in with GS
Heather: so a rule of thumb here could be
Sarah: i'm not 100% sold b/c it's raw info that could be used in the same way as gs
2:16 PM
Heather: I know, but we're focusing on datatype not use
Heather: I think a rule of thumb could be keep them all separate unless there is a strong reason to combine them?
Heather: so maybe obvious reason for Bio+PA? or GA+PT?
2:17 PM
Sarah: bio: both organismic
Heather: or ?
Sarah: both go to dryad
Sarah: ga-pt: both a step above gs, both go to treebase
Heather: yeah, though in theory dryad takes everything that doesn't have a home somewhere else, so it kind of means misc :)
Sarah: gs and go are the "same" in that sense, but go doesn't have a depository home
2:18 PM
Sarah: unless you count pdb, which i have one instance of
Sarah: paleo could also go to paleodb
Heather: ok
Heather: how about this
Heather: start with them all separately
Heather: ?
Heather: I think if we have to spend degrees of freedom,
2:19 PM
Heather: spending them on typeofdataset is a good place
Heather: because we really do think it is a relevant variable
Sarah: ok. we could also look at them by "typical depository" and "discipline/data type"
Sarah: so, where they should go and what they are
2:20 PM
Heather: ok. if that sort of distinction is clear to you, that might be a useful thing to try.
Heather: I'm rather out of my depth here
Heather: my main contribution is that I think there should be some variable
Heather: that uniquely identifies "this dataset could have gone to genbank" or treebase
2:21 PM
Heather: other than that, I defer to your experience
Sarah: but, only problem is that right now we're talking about reuse, not sharing
Sarah: so is it important where the data should have come from?
Sarah: so, i agree for sharing but think it is different for reuse
Heather: sorry, I was assuming you'd use the same framework for sharing too
2:22 PM
Sarah: i think they might need to be tweaked for each, b/c sharing and reusing are different things
Heather: would all GS have come from genbank, or might some have come from elsewhere?
Heather: clearly not all PT necessarily comes from treebase, does it. could be communication etc.
Sarah: no, some gs come from your buddy or your previous study
Sarah: pt can be "extracted literature"
Heather: right. ok, gotcha.
Sarah: because you can reconstruct just the terminal nodes
2:23 PM
Heather: sorry for being slow in thinking about it that way.
Sarah: i think if you share, your data SHOULD go to a specific place
Heather: yeah, hrm.
Sarah: but if you're reusing, you might have different sources
Heather: right
Sarah: that are more fruitful than a blast search
Sarah: i.e. the paper where the guy complained about treebase
Heather: yeah, so you know what? In that case I see your logic in keeping Genbank and Treebase separate in the Depositories category
2:24 PM
Sarah: and instances where people use unpublished data
Heather: I thought about Sharing for so long, phd worth, I often default that way instead of reuse. retraining.
2:25 PM
Sarah: well, i think depository could be y/n in reuse, b/c in theory depositories should have recommendations about how to reuse their data
Sarah: unfortunately, of which dryad is the only example
Sarah: oh, but in that same vein, we could not lump them, to "see" if they have different policies (or unspoken traditions)
Heather: yeah. hrm. well, take all of my comments and apply them when you think about how to group these things for sharing :)
Sarah: sounds good
Sarah: i agree with most of what you said in the context of sharing
2:26 PM
Heather: and for reuse, I'm not sure. I could see arguments either way.
Heather: I guess it depends on our hypotheses, eh?
Sarah: i'm planning on coding/scoring sharing today anyways and can give you the side by side comparison once it's done
Sarah: so, reuse
2:27 PM
Sarah: i think datatype is more discipline (genre? type?) driven than depository
Heather: there is an advantage to keeping the sharing and reuse analyses as parallel as we can
Sarah: how so? if they have different questions, then...different scoring, right?
Heather: do you think that reuse citation behaviour is more discipline driven than depository or datatype?
Sarah: sorry, by discipline i meant datatype
Sarah: just trying to call it type
2:28 PM
Sarah: rather than where it "should" have come from
Heather: we want to code the things that we think drive reuse citation behaviour, right?
Heather: right, agreed, where it "should" have come from is kind of moot
Heather: where it did come from is relevant
Heather: the journal is relevant
2:29 PM
Sarah: but where it came from is taken care of by depository
Sarah: so we should try to make datatype its own variable
Heather: right. sorry, I was taking a step back to go forward.
Heather: I think maybe I want to punt on this conversation and leave it to you to propose something :)
Sarah: ok, i need to think it over again
2:30 PM
Sarah: right now, i'm more inclined to stick with datatype (i.e. raw gene, processed gene, earth, etc) or data "genre"
Heather: you raised lots of good Qs in your email and want to make sure we get to those too
Sarah: ok
Heather: so the idea of multiple datasets in one paper and how to handle that
2:31 PM
Heather: is complicated, and it matters. so....
Heather: one thing that would help is if we defined what each of your datapoints actually means, to make sure we are consistent.
Sarah: ideally, looking back, i would have enumerated every dataset in the paper and kept them all separate
2:32 PM
Sarah: but i was thinking on an article level, not dataset
Heather: yeah. that's ok, that happens, learning. so what is a clean solution now.
Sarah: i only dealt with it when things were obviously different
Heather: where things were "how they cited/shared" them, right?
2:33 PM
Heather: when the endpoint was obviously different?
Sarah: yes, or the datatype
Heather: ok, so you kept all datatypes for each paper distinct?
2:34 PM
Sarah: yes
Heather: in that case, could you imagine one column for each datatype in each paper?
Heather: agreed, it means the rows aren't independent
Sarah: yes, i was approaching it that way previously
Heather: but I don't think that will be a huge effect, and we can call it out as a limitation
Sarah: but, became tricky in my mind for analysis
Heather: because of independence, or otherwise?
2:35 PM
Sarah: the factor i invented, but haven't used, to deal with it was "multiple datatypes", so I counted if the article has multiple types
Sarah: no, I didn't know how to deal with so many y/n columns
Sarah: i.e. so one article could just use gs, another would use gs and ga, etc
2:36 PM
Sarah: it wasn't making sense to me
Sarah: but maybe you know more, is that a better way to run it?
Heather: ok, I'm a little confused.
Sarah: also, there are resolvability and attribution differences between datasets in the same paper
Heather: yeah, right, that is its own disaster
2:37 PM
Heather: so I think one way to deal with that is to code the "minimum" attribution practice used for any given dataset
Sarah: so, that means if i kept it at an article level, i would have to have "GS" y/n and "resolvable gs" y/n and "attributed gs" y/n for each datatype
Sarah: you mean any article?
Heather: so if they did it 3 different ways, capture it (for the stats analysis) as the worst one. (for some definition of worse)
2:38 PM
Sarah: but then, can we still look at if "is gs best at resolving"?
Heather: yeah, maybe. I at least mean it for dataset
Heather: right, so I think we don't want to do it for article. but I think we do want to do it for dataset.
Sarah: b/c the low score, say for earth
2:39 PM
Heather: you know what? rather than minimum, it should be maximum.
Sarah: ok, i thought you were saying how to do it
Heather: what is the best way they attributed this thing?
Sarah: with article
Sarah: so, you're talking about an article with two datatypes that are the same but cited differently
Sarah: right?
Heather: to start with I'm actually talking about the even easier case
Sarah: sorry,
Heather: that I think you mentioned
2:40 PM
Sarah: datasets that are the same datatype but cited differently
Heather: an article with a dataset that it attributes two different ways
Sarah: well, so let's say two gs to make it more clear...b/c each gs is a single dataset
Heather: ok, now one step up
Heather: maybe it has two gs datasets and it attributes one fully and another one poorly
2:41 PM
Heather: a question is what to do in that case? right?
Sarah: yep
Heather: right.
Sarah: but, add another problem: one is self reuse and the other is someone else's dataset
Sarah: so the self reuse is almost guaranteed to be sloppier (i.e. no accession)
Heather: yeah. hrm
2:42 PM
Sarah: that's the main reason i separated them
Heather: well what do you think?
Sarah: and then i have 5-10 that are the situation you proposed
Heather: we can't really have these be two different datapoints only when they do things differently
Sarah: so i think if we stipulate self and datatype separations, those are nothing to sweat over
Sarah: agreed, those 5-10 that are separated for differences are what worry me
Sarah: but it's also only 5-10
2:43 PM
Sarah: datapoints
Sarah: should i give you a specific example of one?
Heather: yeah. so in that case I think I'd do something drastic and nonideal
Heather: no it is ok, I think I understand
Heather: I think in those cases I'd make sure they are each only one row in the analysis
2:44 PM
Sarah: and then give a max or min score?
Heather: and pick either the best or the worst or the first or a random sample or something
Heather: yeah. some standard approach for what you chose. call it out in the methods.
Sarah: since it's only a few, do we consider just getting rid of them altogether?
Heather: where "sample" I mean score... a sample of the practices they used
Heather: no, don't do that
2:45 PM
Heather: that would bias things. we want them in, we just don't know exactly how :)
Sarah: ok, so let me make sure i'm clear
Heather: I think code their worst practice in each dataset type. what do you think?
Sarah: self and datatype differences are ok to split
Sarah: into two
2:46 PM
Sarah: but differences in citation practice only
Sarah: should be kept together
Sarah: and given a lump score
Heather: actually I think you need to keep self together too
Sarah: of some sort
Sarah: how come?
2:47 PM
Heather: I think a row needs to be "one row for every datatype in each of these papers"
Heather: you can call out the self behaviour in tables and discussion,
2:48 PM
Sarah: but, what about a self reuse + an external reuse, where the author says "i used my data from a previous study (Author year) and some other sequences" (with no reference to author, depository, etc)
Heather: but in the stats it would put extra weight on the practices of authors who use their own work, for example....
Sarah: or vice versa, where they give the accession for others but not themselves
2:49 PM
Sarah: i guess partly i was also wondering if we would/should eliminate self reuses like you've been doing
Heather: I think you should treat that example the same way as "I used my buddy's data X and more data from all the people over there" in the stats
2:50 PM
Heather: yeah, ok good question. that would solve this problem, wouldn't it. at least removing them from the stats analysis
Heather: another thought is that then they could be run separately, in a different analysis, one just on "self citations"
2:51 PM
Heather: right now I'd go with that as a straw man. removing them from this analysis, defining this analysis as reuses of other people's data.
2:52 PM
Heather: probably really reduces the N, eh?
Sarah: yeah
Sarah: one idea is coding self reuses as a factor, but coding them 0, .5, and 1, so no self, some self, and all self
2:53 PM
Heather: yeah, hrm. you know what?
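The "one row per datatype per paper, coded as the worst practice" rule discussed above could be sketched like this (Python illustration with invented data; the chat actually went back and forth between worst/minimum and best/maximum, so the aggregation function is the judgment call to state in the methods):

```python
from collections import defaultdict

# Invented example rows: (paper_id, datatype, attribution_score).
# paper1 has two GS datasets cited differently, plus one PT dataset.
rows = [
    ("paper1", "GS", 2),
    ("paper1", "GS", 0),
    ("paper1", "PT", 1),
]

# Collapse to one row per (paper, datatype), keeping the minimum (worst)
# score; swap min for max to code the best practice instead.
collapsed = defaultdict(list)
for paper, dtype, score in rows:
    collapsed[(paper, dtype)].append(score)
worst = {key: min(scores) for key, scores in collapsed.items()}
print(worst)  # {('paper1', 'GS'): 0, ('paper1', 'PT'): 1}
```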
Heather: I think we mostly care about how people attribute other people's data.
Heather: ok, well, maybe we leave this question there too. further, but not fully resolved. try a few things, esp leaving them out, and we can talk about it again.
Heather: I've got to run in 30 mins, and want to make sure we cover everything at least a little, if that is ok?
Sarah: ok. one comment though: i think resolvability is lower with self, but attribution is higher
2:54 PM
Sarah: so i wanted to show that, but maybe that's more of a discussion tidbit
Sarah: yep,
Heather: ok. so I think those are good points to know and make. so yes, let's show them somehow, I agree
Sarah: we can move on
Sarah: sorry for taking up so much time!
Heather: nah it is fine!
2:55 PM
Heather: so one thing I wanted to perhaps tease out of resolvability is the idea of where the attribution is made
Heather: maybe this is a third type of score?
Sarah: most of it's in the methods
Heather: 0 = no citation
Heather: 1 = a citation somewhere
Heather: 2 = findable through full text search of paper (so includes biblio but not suppl tables)
Heather: 3 = findable in biblio only (so could be found in scopus, isi web of science)
Heather: (what about captions only?)
Sarah: i don't think it's all that interesting
Heather: yeah, so I think it could be
Sarah: i didn't consider biblio only
Heather: the reason is that if it is in the methods, for people to find it they have to be able to search full text
2:56 PM
Heather: hard in this day and age
Sarah: i did consider captions only...but do those come up in full text or not....i've been meaning to ask that for awhile
Heather: biblio is soooooo much easier, and similar to article citations
Heather: and suppl info is the hardest of all
Heather: I think captions do come up in full text searches, but suppl info doesn't
2:57 PM
Heather: did you capture info if citation was in the biblio?
Heather: (by biblio only I didn't mean that info was only in the biblio...
Heather: rather that, by looking in just the biblio, you could find a relevant citation... as with articles)
Sarah: yeah, that's why i was considering a supplemental caption to be "needs research" for resolvability
Sarah: i did consider y/n biblio
2:58 PM
Sarah: so - reference somewhere, findable in full text, findable in biblio also
Heather: yes, that would be very useful
Sarah: for?
Heather: for making points about how hard these things are to find given our current bibliometrics tools
Sarah: ok
Heather: people can use scopus or ISI to trace citation histories for articles now
2:59 PM
Sarah: i don't have any data attributions in the biblio save a few urls
Heather: and it would be sweet if they could use those or similar tools to trace dataset citation histories
Heather: but we're really far away from that right now
Heather: yeah. exactly. I think that data would be useful for Dryad, for example.
Sarah: i didn't see a single doi or accession in the biblio
Sarah: so i was thinking i would just state that, not analyze it
Heather: they are proposing that dois should go into biblios
3:00 PM
Heather: yeah. so maybe drop level 3 if it really never happens :)
Heather: but making the distinction between full text and suppl info is also helpful
Heather: because full text at least you have some hope finding it with google scholar.
Heather: suppl info, give up.
3:01 PM
Sarah: i thought my current resolvability ratings did that
Sarah: i.e.
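Heather's proposed "where is the attribution" score could be coded as below. This is a Python sketch of her 0-3 proposal as stated above; the function name and boolean inputs are invented, and per the discussion the top level may be dropped if biblio-level data citations essentially never occur:

```python
def discoverability(in_biblio: bool, in_full_text: bool, cited_somewhere: bool) -> int:
    """Heather's proposed scale for where the data citation appears:
       3 = findable in the biblio (traceable via Scopus / ISI Web of Science)
       2 = findable through full-text search (body text, captions)
       1 = a citation somewhere (e.g. only in supplementary info)
       0 = no citation anywhere
    """
    if in_biblio:
        return 3
    if in_full_text:
        return 2
    if cited_somewhere:
        return 1
    return 0
```

The ordering encodes minability: biblio entries are machine-traceable today, full text is searchable with effort, and supplementary info is effectively invisible to indexing tools.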
Sarah: not resolvable = no author, accession, or depository anywhere
Heather: right, so I think maybe I was imagining that resolvability wouldn't be WHERE was the attribution, but rather what was in the attribution
Sarah: resolvable with extra research = author, accession, or depository only, OR just mentioned in supplementary
3:02 PM
Heather: sorry, I dove into the last bit first
Sarah: and fully resolvable =
Heather: so if you had the paper, had the suppl info, could you pinpoint the dataset
Sarah: author, accession, and depository found in the full text (including tables)
Heather: how resolvable is it if you have everything.
Sarah: no,
Heather: the new scale is more along the lines of how "discoverable is the reuse citation" or something
Sarah: i'm assuming supplementary doesn't always include everything
3:03 PM
Sarah: partly b/c i didn't always look at it unless it wasn't clear in the full text
Sarah: so, it's resolvable with work because i would have to decide if it's worth it to track down the supplemental information, and risk that it might be an output table, not the raw sequences or whatever
3:04 PM
Sarah: i don't understand how discoverable is that different
Heather: hrm, so I'm proposing that we remove the "where" from the resolvability scale
Sarah: or how attribution and resolvability haven't covered it
Sarah: what's the rationale for keeping it separate?
Sarah: one thing i can think of
Heather: here's how I think about them:
3:05 PM
Sarah: is that some appendices have a separate lit cited that i'm not sure isi tracks
Sarah: but, this was only a few incidents, so more anecdotal in my mind
Heather: nope, ISI doesn't track it there
3:06 PM
Heather: so I'm imagining these scales are driven by our intended uses and goals of citations
Heather: one is that people get credit
Heather: so the attribution scale is measuring to what degree authors get direct credit for their shared data
3:07 PM
Heather: another goal is that people can find the dataset that you said you used
Heather: the ease with which they can do that would be tracked by the resolvability scale
Heather: a third goal is that people can monitor how datasets are reused over time, across a field
Heather: for that they need to be able to find instances of dataset reuse citations
3:08 PM
Heather: and for that they need to be in places that are easily mined
Heather: so the new scale, "discoverability" or something, would measure how possible that is
Heather: does that help? make sense?
Sarah: but, the problem is that i don't think i have any incidents of that, so discoverability is essentially the same as attribution
3:09 PM
Heather: so pretty much everything was in the methods? like 95%+ would you say?
Sarah: yes
Heather: and not ever in the biblio?
Sarah: but not necessarily in the biblio
Sarah: well, the author was in the biblio but not the accession
Sarah: so that's why it's the same as attribution
3:10 PM
Sarah: and the author wasn't always in the biblio
Sarah: i.e. they said "we went to genbank and got a bunch of data"
Heather: so sometimes did they say the author name in the text but didn't cite the author's paper?
Sarah: no, unless correspondence
3:11 PM
Sarah: they would say "data obtained from so and so" and then just mention them in acknowledgment sometimes
Sarah: but this is like 5 cases
Heather: yes. so that would be attributed but not discoverable
Heather: yeah, I hear you.
Heather: and how many times was the attribution just in suppl materials?
Sarah: 5-10 not a lot 3:12 PM but enough to make me upset i mean, b/c i'm looking for attribution and wanting to see it Heather: yeah. hrm. well, you are right then, may not be enough to do trends on. though too bad. future work :) I would though take thoughts of "where" out of resolvability I think 3:13 PM to me that talks about how hard it is to pinpoint the dataset, given that you have the reusing paper and its suppl info Sarah: so, if i can find it in supplemental, it's still ok?  or a "2"  if it has all the info Heather: I think so. I know you didn't mine suppl data specifically 3:14 PM Sarah: ok, do you have other things we should discuss, or should i ask an interpretation question regarding stats? Heather: though if I understand you correctly, that distinction where you can find it only in suppl doesn't happen very often Sarah: probably another 5-10 3:15 PM Heather: ok, and how is that different than what we were talking about before?  in those 5-10 the accession is in the suppl but the author name is in the body? 3:16 PM Sarah: usually, or they just have a blanket statement "we got a bunch of data, see appendix" 3:17 PM Heather: yeah. so that would still count as resolvable. though for what it is worth, it would rank really low on my nice-to-have discoverability scale. ok, yes, shoot on your questions Sarah: ok. so i posed this briefly earlier mostly the discrepancies between the factors run separately vs. all together 3:18 PM Heather: yeah. so I think we need to read up more on what factors run all together is actually doing. 
Sarah: mostly, years weren't significantly different before and now they are and datatype and journal were significant for molecular ecology and genes but now they aren't and only genbank is significant as a depository 3:19 PM Heather: I think the Hmisc/Design libraries do some smart things with factors behind the scenes, but I don't know what Sarah: also, in general, the factors still have one state used as the base and not displayed Heather: so I'm not surprised that it changes the results yes. that makes sense, right? Sarah: i think so they are still compared to each other not to everything 3:20 PM back to multi vs. single factor we're striving for multi factor in the long run right? just want to verify Heather: when you mean multifactor, what do you mean? Sarah: i mean all the factors run together Heather: multivariate? lots of different covariates all at once? where some of those covariates are binary, some are continuous, some are factors? Sarah: and do the separate models (factor by factor) have any meaning? 3:21 PM all are factors but that might change Heather: right Sarah: if depository becomes binary Heather: if you look at the one factor at a time, so just have y ~ onefactor rather than y~ onefactor + twofactor you are doing univariate analysis 3:22 PM and it is a good idea to do that people often do it first for y~ onefactor then y~twofactor and look at each then they put them together y~ onefactor + twofactor Sarah: which is what i did by default Heather: and often things that were significant in univariate analysis all of a sudden aren't significant any more 3:23 PM Sarah: yep i expected that Heather: right ok so ask your question again :) Sarah: so, do the univariate results still have meaning? Heather: yes. 
Sarah: i'm just not used to dealing with so many covariates Heather: though they need to be interpreted in the proper context 3:24 PM so, for example, let's think about Nic's data Sarah: my past experience has been with data that is best to look at in one way or the other but not both sorry to interrupt, proceed about nic's Heather: lots of studies including his have found that high impact journals tend to have strong data sharing policies 3:25 PM let's say he has three variables, policy strength, impact factor, subdiscipline  it is also well known that impact factor and subdiscipline are correlated  so in univariate analysis, policy strength ~ IF is significant  policy strength ~ subdiscipline is sig 3:26 PM but maybe in multivariate analysis policy ~ IF + subd only IF is sig and suddenly subd isn't any more because really all the signal that was in subd is captured in IF Sarah: ok so, that's what we're seeing here 3:27 PM univariate = Heather: yes, probably Sarah: depository: sig, journal: sig, datatype: sig but in multivariate just depository = sig so in my overall discussion of the results, 3:28 PM i could say "depository influences resolvability significantly..." and then go on to say "ME is sig when considering journals alone and GS is sig when considering datatypes alone" that was super rough and probably doesn't make that much sense 3:29 PM Heather: hrm, yes, I'm not 100% sure Sarah: trying to give the multivariate as the overarching, and the univariate as specific nuances Heather: yes. 
or, the univariate to inform what variables you put in your multivariate in some cases 3:30 PM if there is no signal in univariate, usually won't be a signal in multivariate Sarah: i'm planning on using the remainder of my day to do some writeup, so maybe that will help to see if i've got my ducks in a row so to speak with interpretation well, that's what's weird about year Heather: whereas signal in univariate often goes away in multivariate because of confounding/collinearity/etc Sarah: no signal in univariate but signal in multivariate Heather: yeah. cool. yeah, that is weird. it can happen though Sarah: but, i can see that from a model building perspective Heather: one tool for situations like that is to stratify 3:31 PM and do some tables or graphs just with year one way then year the other way Sarah: that if i'm looking at all my data and want the best predictive linear model, some factors are better when considering all the factors Heather: and see if things go in the same direction Sarah: what do you mean? or you can point me to an example later if you're short on time Heather: yes, will send you an example 3:32 PM Sarah: ok, anything else we should hash out now? i'm planning on sending you either a draft and/or revised scoring tonight and then whichever of the two i don't get to today on monday Heather: great. send to todd too just to keep him in the loop Sarah: i'll send the draft to everyone Heather: ok, good, will continue on monday great 3:33 PM Sarah: haven't gotten any feedback from todd...is that expected? Heather: yeah Sarah: should i be more direct? Heather: I think he' Sarah: and, what do you think of his comment on pooling data Heather: s reading and will jump in if/when he feels necessary but otherwise will lurk Sarah: he said to keep the 2000/2010 for all journals and the 2005-2009 for sys/amnat separate they're not all that different when i run them pooled 3:34 PM but, i'm not recalling specifics Heather: really? 
so I interpreted his comments the other way. I think he said: yes, go ahead and concatenate (note that pool actually means something else) Sarah: "But the population of study for the two sets sounds sufficiently different (ie the identity of journals, as opposed to the random/sequential distinction) that a combined analysis would be difficult to interpret. I'm not sure you would not want to use them to ask the same questions." Heather: and add a variable to each row that said what the data collection method was Sarah: yeah, that was at the beginning to tease out an effect 3:35 PM Heather: I have to run, but I disagree with his difficult to interpret comment to me he said that there is no methodological reason not to, no great big red flag to peer-reviewers 3:36 PM Sarah: ok, i'll ponder Heather: so I think you should do it Sarah: for now i'm just running 2000/2010 Heather: because I think it is easy to interpret Sarah: and checking for discrepancies how so? Heather: you just have to make "year" be a real year or number of years ago or something Sarah: or shoot me an email about it later b/c i don't feel resolved i'll ponder on that Heather: ok! bye for now will talk more later Sarah: and let you know how different the data is ok. thanks again Heather: asynchronously, and with a weekend break ;)
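The univariate-then-multivariate workflow discussed in this chat can be sketched in R. This is only an illustrative sketch: the data frame and column names (`reuse`, `ResolvableScore`, `Journal`, `Depository`, `DataType`) are hypothetical stand-ins, and `lrm()` is the ordinal logistic regression function from the Design library used elsewhere in this analysis.

```r
library(Design)  # Frank Harrell's Design library (successor package: "rms"); provides lrm()

# Univariate: one factor at a time, each with its full level detail
fit.journal    <- lrm(ResolvableScore ~ Journal,    data = reuse)
fit.depository <- lrm(ResolvableScore ~ Depository, data = reuse)
fit.datatype   <- lrm(ResolvableScore ~ DataType,   data = reuse)

# Multivariate: the factors together. Correlated covariates
# (e.g. Molecular Ecology ~ GenBank ~ gene sequences) share signal,
# so a term significant on its own can lose significance here.
fit.all <- lrm(ResolvableScore ~ Journal + Depository + DataType, data = reuse)
anova(fit.all)
```

This mirrors Heather's `y ~ onefactor`, then `y ~ twofactor`, then `y ~ onefactor + twofactor` progression.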

Chat Transcript July 27 (Morning)
Heather: Hi Sarah Sarah: hi Heather: Now a good time to chat? Sarah: yep Heather: How's it going? Sarah: good thanks and yourself? Heather: Anything in particular you'd like to brainstorm about? Sarah: the analysis is going good too Heather: Good thanks. Lots of stats with Nic.... 9:56 AM Sarah: just want your opinion on the various categories to make sure i'm justifying them right in my mind Heather: ok got your spreadsheet looking at it Sarah: then i'll have a stats "party" today and will probably have a few more questions when trying to interpret Heather: which in particular do you want to think through? great Sarah: um...many factors vs. few 9:57 AM i.e. where we spend degrees of freedom and where we don't Heather: (I love the sound of a stats party, I gotta say) right Sarah: also, i need help identifying which categorizations answer what specific question Heather: so what is your idea about combining datasets at this point, so that we know how many datapoints we are working with? ok Sarah: things that i can talk myself through but make more sense while brainstorming with someone else Heather: yes, I hear you. 9:58 AM Someone I worked with would ask other people to "be his cat" = the real value isn't actually in what they say, just that they are listening and forcing him to say it aloud Sarah: yeah, usually my husband is, but we mostly talk about his work in the evenings these days anyways, Heather: exactly. ok Sarah: so, i think dividing the "data types" into the "genres" makes more sense because 9:59 AM then we're looking at distinct information from depository that gets rid of a directly confounding variable and allows us to ask the data things from a different angle rather than being repetitive Heather: ok, so let me make sure I understand you are thinking of having a genre covariate 10:00 AM and also a depository-defined covariate? 
Sarah: well, depository would be the other variable no matter....that's the one in the 2nd set of columns Heather: ok, gotcha Sarah: but, i'm saying define data type as "genre" and define depository as depository 10:01 AM Heather: ok, sounds reasonable. and you have about 5 genre classifications, is that right? Sarah: yes Heather: ok Sarah: more driven by the inherent data type i.e. analysis unit 10:02 AM what we call "scales" in biology Heather: ok Sarah: meaning, levels of biological variation the other alternative is the data-defined discipline Heather: and do you think these genres might also have some commonalities in terms of culture, research culture? 10:03 AM Sarah: hmm? referring to discipline? Heather: while admittedly being still very diverse within each genre Sarah: or referring to "genre"? 10:04 AM oh, does genre have commonalities to what? Heather: I was referring to genre, asking if that might also represent a classification of research culture. so, do you think that people who do organismic studies 10:05 AM might have similar thoughts/experiences about data sharing, compared to people who do gene-type studies obviously there is a lot of diversity within, I'm just trying to dig out what our interpretation of genre would be Sarah: yes 10:06 AM and they would be using and transmitting/sharing them for relatively similar purposes Heather: ok, good Sarah: phylo is the only weird one with that it almost should be lumped with gene Heather: ok 10:07 AM Sarah: that's why i thought maybe "data-defined discipline" might be good they could be lumped b/c most phylo trees are the result of genetic analysis but, on the flip side, genes could be used for many other analyses and phylo cannot Heather: right. 10:08 AM hmm. well there aren't many PT, right? 
Sarah: not reuse, but shared that's the other issue Heather: so there is a downside to having them all by their lonesome oh, true Sarah: if we're trying to keep the categories the same for reuse and shared, i would think phylo separate 10:09 AM Heather: ok. Sarah: partly b/c people are good about posting genes but not phylo Heather: so Sarah I think that mostly I don't have much good insight into these groupings, and it is probably easy to overthink them. Sarah: partly b/c they are a different "genre" of data that is reused in a different way Heather: I'd base it on what you want to be able to say. and then just pick something and run with it. 10:10 AM because I don't know there is a "best" answer, and/or I don't know how we'll know. Sarah: ok. Heather: one cut-to-the-chase approach is to drive to degrees of freedom and datapoints how many degrees of freedom does your straw-man analysis have? how many datapoints are you looking at? 10:11 AM Sarah: both perspectives interest me (genre and data-defined discipline) in terms of the question, but they are about equal in pros and cons...genre is slightly more relevant um....270 data points 5-8 factors Heather: that is for reuse? Sarah: yes just reuse Heather: for combining the snapshot and timeseries, or not? Sarah: and 2-5 character states for each factor combined Heather: ok. 10:12 AM Sarah: so 25 degrees of freedom low end, 40 high end Heather: so a rule of thumb in regression is to have 30 datapoints per degree of freedom obviously rules of thumb are just that and it totally depends on the size of the effect you are studying, etc Sarah: ok, that's a good reference point though Heather: but it does help to ground it exactly at least 10 datapoints per df, at least. 
10:13 AM Sarah: b/c i'm also still deciding if we should look at funder, journal discipline, etc i haven't felt as driven to investigate those Heather: so I think that means you need to stay away from your high end Sarah: and they might need to get the cut for more important questions Heather: I'd say you don't have enough journals to get at journal discipline the funder effect is probably too weak Sarah: agreed Heather: so I'd let those go for now, in terms of stats 10:14 AM useful to have extracted! but not for main analysis Sarah: that's what I was feeling, we just had talked so much about it that i felt somewhat obligated to at least run it and see but, at the same time didn't think my data would support anything conclusive Heather: yeah. well, you can, but not in main (or first) run. nice-to-haves exploratory analysis yeah so I think given that one always thinks of more things to spend df on later (interactions!?!? arg) 10:15 AM I'd really drive to sticking with rough-grained levels Sarah: so 3 factor states over 5-8 because then we could delve in deeper later Heather: right Sarah: or discuss the details in percents/tables/figures Heather: right 10:16 AM Sarah: so, if i went with "data defined discipline" (i.e. classifying data types by our overarching disciplines), and Evolution was significant, then I could dive in with a univariate analysis or table of the Genre classifications 10:17 AM to say what's making that difference Heather: yes, true, as exploratory analysis. post-hoc analysis, etc. Sarah: ok Heather: yes, I think that is your best bet. Otherwise it is tempting to be overly ambitious, and 270 datapoints (while a lot in some ways!) isn't a lot in other ways. 10:18 AM Sarah: yep agreed...and more fun in some ways for analysis Heather: good. Sarah: I like the big picture + fill in details where it matters approach Heather: about how many sharing datapoints? more I would guess? great Sarah: yes for sure i haven't parsed it all out yet Heather: ok, good. 
Sarah: that's on the ticket for today 10:19 AM Heather: do you feel ok with a plan for how to deal with self citations in reuse? Sarah: since I've figured it out for reuse, it should be a cinch yes i haven't run the stats to test for artifacts yet, but that's on the to-do today as well Heather: good 10:20 AM Nic has been experimenting with posting R code and results on his OWW pages I think you've seen it? Sarah: oh yeah, sorry I've been bad about that Heather: no, that's ok, alas I understand just wanted to point you there because he's done lots of the learning Sarah: at the end of the day, i always intend to ok 10:21 AM Heather: so might as well learn from him about how to link/embed gists, post figures, etc. Sarah: ok Heather: yeah, it definitely is a habit that is hard to start Sarah: it seems like his code also codes the data into binary/factors, etc, whereas i've been doing most of that in excel where i'm more comfortable manipulating the data but I'll try and post some prelim results and figures today 10:22 AM Heather: even if your (my) heart is in the right place... just takes some doing when it isn't in your natural workflow. yes, I think either way works there. I pointed him at how to do it with R because wanted to give some easy concrete examples about how R works. Also can help decrease typos and bugs in some ways 10:23 AM but advantages to teasing it out in Excel too. Makes your dataset a bit more transparent and stand-alone. ok, can I be a cat in other ways? Sarah: yeah...on the depository 10:24 AM again, the question is how much to collapse i think for our purposes, just depository YN answers our major question of whether depositories are being used in the ways we anticipate so, "are depositories the better way for sharing and reusing data?" 10:25 AM Heather: yes. 
though definitely an advantage to keep genbank data separable, because I think we'd also like insight into whether genbank is a special case 10:26 AM Sarah: but could that be another secondary univariate analysis or table/figure? especially with reuse, there are only like 4 treebase reuses and the rest of the depository ones are genbank the story is a little different in sharing 10:27 AM Heather: yeah. in sharing I'd err on calling it out explicitly in the main analysis though I could probably be talked out of it so that would be my inclination. do with it what you will. 10:28 AM Sarah: ok, i'll ponder more, but it's good to know your opinion regardless, i should keep "not indicated" as a separate category? right? b/c not indicated is a lot different than retrieval or sharing from a url Heather: hmmmm I think yes. because it isn't yes, and it isn't no. 10:29 AM Sarah: yeah...and typically it's associated with other "sloppy" practices Heather: yeah. and it is interesting in and of itself. Sarah: ok. i think that's all i have for right now Heather: ok! Sarah: this has been helpful for me to rationalize things "out loud" 10:30 AM Heather: in the call today I think I'll just ask each of you to summarize where you are and what you see as your next steps and if there is anything you need from other interns/mentors good! I'm glad anything else worth covering? Sarah: ok...i'll try and do another run through my analysis so i can voice questions/results better. is today a call or chat? 10:31 AM Heather: I'd ideally like to do some "end of internship" conversations, but I'm afraid I don't have much insight into how it works from here on out. we're going to try a chat with everyone. wish us luck Sarah: ok. i'm totally open to post-internship chats Heather: ok cool Sarah: especially if we're essentially passing the datasets off to you Heather: ok, talk to you later Sarah! Sarah: ok. thanks again. 10:32 AM Heather: yeah, right! and I don't know that you are? I don't know. 
???. Sarah: well, regardless, i'm not planning on just dropping this the official day the internship ends Heather: cool, I'm glad. ok, talk to you later. 10:33 AM Sarah: ok. talk to you at noon.
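The degrees-of-freedom budgeting discussed above can be made concrete with a little arithmetic: a factor with k levels costs k - 1 regression df for its main effect, and the 10-30 datapoints-per-df rule of thumb then bounds how many factors 270 datapoints can support. The level counts below are illustrative, not the actual coding.

```r
# Illustrative level counts per factor (assumed, not the final coding)
levels.per.factor <- c(Year = 2, Journal = 4, Depository = 3, DataType = 5)

# Each factor with k levels spends k - 1 df on its main effect
model.df <- sum(levels.per.factor - 1)
model.df         # 10 df

# Datapoints available per df; the rule of thumb wants at least 10-30
270 / model.df   # 27 datapoints per df -- within the rule of thumb
```

With the high-end coding (~40 df), the same 270 datapoints would give under 7 points per df, which is why staying coarse-grained is the safer budget.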

Chat Transcript July 27 (Afternoon)
1:15 PM Sarah: i'm good whenever you are Heather: Me too. Ok, so topics to cover are the univariate/multivariate approach anything else? Sarah: yep. 1:16 PM so, i ran univariate on most of the factors and states somewhat by accident when i was trying to figure out ordinal regression Heather: ok :) Sarah: then we talked about how things that were significant there didn't turn up sig in the multivariate 1:17 PM Heather: right, true Sarah: and how the multivar was better for distinguishing things like journal vs. depository vs. data type which overlap in some ways (i.e. Molecular Ecology = genbank = gene sequence) Heather: right 1:18 PM so I think there may be a middle ground that is worth considering  so our hypotheses (and prelim observations) about your data suggest that there are a few vars that are indeed important  year  if it is genbank  are those the main ones? 1:19 PM if it is sysbio? Sarah: yeah, so  year  journal  depository  datatype  and possibly add open access to the mix Heather: well, actually before you run with that  and flesh it out to the whole columns with all of their possible variants 1:20 PM a different approach is to focus primarily on the univariate analysis but make it very selectively multivariate by including year and is.genbank and is.sysbio or something Sarah: hmm? that's the r coding for it all, right Heather: ok. so the idea would be 1:21 PM for each of your factors Sarah: so, asking about a specific question Heather: do a univariate analysis with all of the gory level details of that factor and your dependent variable which I understand you did Sarah: i.e. sysbio vs. genbank vs. year Heather: I think so. I mean have one variable for the factor 1:22 PM plus another binary variable for is it genbank? another for is it sysbio? and a third for year Sarah: but why is that better than year vs. journal vs. depository? 
Heather: so attempting to keep it "univariate" but bowing to the fact that you have some dominant trends (at least we think you do) I'm not sure what you mean by vs in that sentence? Sarah: ok, i meant +, not vs. sorry 1:23 PM Heather: ok. yes plus right Sarah: major typing problems Heather: so journal uses up extra degrees of freedom, relative to just "is it sysbio" 1:24 PM So I'm not necessarily recommending this, but an approach would be look at each factor in turn with all of its gory level detail and just enough multivariate stuff to admit that there were major changes by year that would probably overwhelm more delicate trends (and maybe genbank/sysbio/a few other things) 1:25 PM and then see how all of those gory levels show up in univariate analysis Sarah: ok. i still see that as a supplementary analysis to the bigger multivar Heather: and just pull the interesting things into the multivariate. Sarah: i.e. picking out trends i've noticed or that the multivar allude to and then going into more detail Heather: yeah, so I think the univariate->multivariate treats the multivariate as the exploratory or supplementary analysis Sarah: well, i guess that's what i mean 1:26 PM multivar to explore and establish big picture trends Heather: yeah. so for what it is worth, I think the univariate-> multivariate path is more common Sarah: then univar (more detailed factor states and specific trends like is.genbank + is.sysbio) oh, i thought you told me the reverse in our previous discussion Heather: I did. 
Sarah: both ways make sense 1:27 PM Heather: well just a sec let me make sure I don't make it more confusing what is most normal is to do univariate first and if anything just isn't interesting in univariate, then don't look at it any more if things are interesting in univariate, then pull them together in a multivariate 1:28 PM Sarah: i see what you're saying Heather: that is the normal approach Sarah: from what i remember, though, datatype and journal and depository were interesting in univar Heather: apologies that I didn't make that clear in the first place Sarah: that's why i continued to include them but, year wasn't but it ended up being sig in multivar so should I not include year then? Heather: I sort of ran with the multivariate, and it no doubt would have been useful for me to step back and ask more questions about univariate first. Sarah: in multivar i mean 1:29 PM Heather: well, one question I have Sarah: b/c we think it's probably a meaningful factor Heather: is about whether what we know from what you've done before was done with the combined datasets or just snapshot or just timeseries Sarah: just snapshot i think 1:30 PM Heather: I think before we make decisions on what we've seen before it will make sense to post the code and results and stuff (in part to help me keep it straight) Sarah: yeah sure Heather: and in part to help make it clear what data we used to run what Sarah: apologies again Heather: no no apologies Sarah: i'll structure it with annotation too 1:31 PM Heather: if you only knew how many analyses I've run, meaning to post the code and not getting to it..... 
Sarah: which has been my main hold up i mean, it's messy right now but close to being intelligible Heather: I think everything you've done has been really useful and will be useful again Sarah: yeah, but i'd like a clean version that actually demonstrates the analysis not just me playing around Heather: but probably a good chance to go through systematically 1:32 PM Sarah: exactly Heather: right, yeah exactly I think we should do that before making decisions on what we've seen before so part of the systematically is probably to look at everything in a univariate way before a multivariate way 1:33 PM Sarah: yeah...this is what i'm picturing Heather: and for that you could either do strictly univariate Sarah: 1. univariate with the more detailed factor states Heather: or decide that you'll slip in a few dominating variables Sarah: 2. multivariate with the relevant factors and their more general/broad factor states 1:34 PM 3. interaction between reuse and sharing Heather: yup sounds good Sarah: 4. univariate of dominant suspected trends (i.e. sysbio + genbank + 2010) or whatever Heather: ok one comment on 2 Sarah: i.e. the ones we think are driving the trends Heather: 2. Sarah: ok Heather: I think what Todd was suggesting is that you use the results of 1. 1:35 PM to inform whether you use general/broad trends or specific trends for each factor individually Sarah: ok Heather: so if there is a factor that seems to correlate strongly with the output in univariate analysis then throw all the levels into the multivariate analysis! 
Sarah: so maybe 1 = exploratory 1a = multivar with broad factors Heather: that suggests it is a good way to spend degrees of freedom 1:36 PM Sarah: 1b = univariate of each individually with specific factors then 2 = Heather: for other factors that look more boring, just include them broadly (or not at all) Sarah: multivar with relevant factors by boring, do you mean things like open access, funder, etc that we weren't planning to explore initially Heather: by boring in that case I meant 1:37 PM its levels didn't correlate with the outcome very strongly at all Sarah: hmm? its = ? 1:38 PM Heather: ok, so for each factor you have (including OA and funder too, if you want) do a detailed univariate analysis of detailed levels 1:39 PM if the output variable (reuse patterns etc) correlates with the detailed levels in an interesting way then plan to a) include that factor in multivariate analysis and b) keep a fine resolution of its levels if the factor didn't correlate with the output variable at all, perhaps don't plan to include it in the multivariate analysis at all 1:40 PM Sarah: but, if it doesn't correlate at the specific level, should i also run it at the general level? or is that just fishing/mining? Heather: if the factor did correlate with the output level, but only as a general trend or only with very broad resolution, then maybe a) include it in the multivariate analysis, but b) collapse its levels to something more broad than you looked at in the univariate analysis 1:41 PM if it doesn't correlate at the specific level at all, or even much, in univariate analysis then looking at it in a general way won't make it correlate either now if there is a trend in the specific levels, but it isn't significant 1:42 PM then combining multiple subtrends may make it become significant... but yeah I wouldn't do that on purpose, for that goal 1:43 PM except maybe as a sidenote, saying "looked at separately x and y were weakly related to x. 
1:44 PM in post-hoc analysis, combining them into one category did demonstrate that the general concept had a statistically significant association" or something like that am I just confusing you more? Sarah: no, i think i'm following i spent the morning recoding my data, so i'll dive back into analysis with all this in mind 1:45 PM Heather: it is probably easiest to talk this through in concrete terms ok Sarah: should i also consider keeping factors and factor states (specific vs. broad) the same for reuse and sharing? Heather: sorry again, I feel like I led you down a multivariate-first path that was not the best choice. or not obviously the best choice, anyway Sarah: i.e. if something is sig for reuse but not sharing Heather: and in stats it is always worth doing the obvious first unless there is a reason not to :) 1:46 PM Sarah: should it still be included for both Heather: yeah, good question. So at Sarah: to keep things analogous Heather: this point I'd say let the data tell you. keep the fine grain resolution the same for both (as it is)  and include that resolution in the univariate analysis  and then see what it shows 1:47 PM Sarah: ok.....but in terms of keeping things analogous, does that supersede factor/resolution decisions?  or not? 1:48 PM Heather: good question. I think it depends. I think that one might be better to answer in a concrete situation than in theory. 1:49 PM Sarah: ok. well, i'll run things today and post the code later.  i like to keep the outputs in the code as annotations  does that bother you? Heather: ok. Sarah: in terms of readability? Heather: what do you mean as annotations? oh I see Sarah: with an "#" in front Heather: as comments? Sarah: yes Heather: no, doesn't bother me. 
1:50 PM if you put the code as a gist I think it will even colourcode them for you differently might make it easier to read Sarah: ok, i prefer that so i don't have to rerun things but it makes reading through the R file a little different Heather: when embedded on OWW Sarah: yeah, i don't know how to make things pretty in R as well as i should yep do you/todd/etc want an email besides the RSS? Heather: that's ok. I'm not an R expert either. 1:51 PM No, the RSS is fine Sarah: ok. well, i think i'm good to go thanks again for the long conversations and help Heather: No problem! I wish I always knew the best answers :) Ok, talk to you later! 1:52 PM Sarah: thanks. bye!
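The "selectively multivariate" idea from this chat — keep one factor at full level detail, but fold the suspected dominant effects in as cheap one-df binary terms — might look like this in R. The data frame, column names, and level labels here are assumed for illustration, as is the `YearNum` numeric year variable.

```r
library(Design)  # for lrm(); Design's successor package is "rms"

# 1-df binary indicators for the suspected dominant effects
# (level labels "GenBank" / "Systematic Biology" are assumed)
reuse$is.genbank <- as.numeric(reuse$Depository == "GenBank")
reuse$is.sysbio  <- as.numeric(reuse$Journal == "Systematic Biology")

# The factor of interest at full resolution, plus the dominant
# covariates at one df each instead of a whole factor's worth
fit <- lrm(ResolvableScore ~ DataType + is.genbank + is.sysbio + YearNum,
           data = reuse)
anova(fit)
```

The point of the indicators is the df budget: `Journal` with five levels costs four df, while `is.sysbio` costs one.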

Chat Transcript July 29 Morning
12:50 PM Heather: Hi Sarah 12:51 PM had a quick look at your code and I have one suggestion at this point.... Have you thought about making your year variable be a number rather than a factor? 12:55 PM Sarah: yep 12:56 PM that occurred to me as i ran the code i coded it that way at some point, just need to dig it out of the excel Heather: yup I think it can be easy to get out of R too 12:57 PM Sarah: i'm not as versatile in R do you know a simple code off the top of your head to convert a factor to a number 12:58 PM Heather: I think as.numeric 12:59 PM but I'm not sure that's actually what you want in this case since you don't have examples of all the years Sarah: yeah Heather: the factor numbers don't quite line up with the years Sarah: but i could change the numbers on some 1:00 PM i know R had problems with the big 2000s for some reason when i first ran it so i was just using the last number (0-10) Heather: maybe this: 2010 - as.numeric(substr(YearCode, 2, 5)) 1:01 PM btw in case you are worried that the year is actually nonlinear (and modeling it as an integer like this assumes it is linear) one of the nice functions in the Design library is rcs 1:02 PM you use it like this: lrm(ResolvableScoreRevised_Max ~ Journal+rcs(YearNum, 3)) 1:03 PM and it calculates a coefficient for each of 3 subintervals, allowing nonlinearity the nice thing is in the anova they all collapse down again anyway, not sure you want to go there, just wanted to let you know it exists 1:05 PM btw I like your qlogis function stuff, I hadn't seen that before. Makes for useful summary plots. Sarah: it was all in that tutorial you sent Heather: cool. 1:06 PM I think I'll hold off on digging further into your code now, since it would be kind of an undirected dig Sarah: ok. and I'm currently compiling a concatenated excel table on the univariates that makes it easier to read anyways Heather: I'll wait till you give it a bit of structure, probably by pulling it into a (ROUGH) pub framework ok Sarah: i'll post that with my updated code later today 1:07 PM yeah...i'm working towards the writeup Heather: another idea on that is just to save all of the univariate plots as pngs and embed them in an OWW page Sarah: still floating a bit though with the unresolved analysis Heather: maybe on your diary page for today or something Sarah: i've got them saved as jpg, so definitely could Heather: that way they are all there in a row? Sarah: and i can track nic's code for how he did it yes to the previous question Heather: yeah, I'd play with embedding your jpgs on an OWW page. that would help me anyway :) Sarah: so a table for pvalues, one for coefficients, etc 1:08 PM then can see everything side by side Heather: yeah, something like that. I also really like confidence intervals Sarah: around the coefficients? 
Heather: (and have some custom code for adding them to summary plots, but it doesn't seem to easily work on the fancy qlogis function version) 1:09 PM Sarah: suggested code? Heather: yeah, around the coefficients so the easy stuff that I've been doing with Nic has been using glm, not in the Design library 1:10 PM Sarah: i need to do some glm for my binomial stuff, so i can look in nic's code for that Heather: confint(ologit4) Sarah: ok great Heather: or for Nic's stuff exp(confint(ologit4))  but confint doesn't seem to do the same magic with Design as glm 1:11 PM though there must be a way Sarah: and if you could send me your custom code for plotting it, I could work on adapting it to Design Heather: I don't have it at my fingertips Sarah: ok. that's a starting point though Heather: yeah, you know what? Let's call that out of scope because it is really ugly code to work with and I don't want you falling down that rathole Design is great, but its R-code is NOT well commented! (or variable names well chosen, or....) Sarah: whatever, but i can make a note of it in my code at least 1:12 PM Heather: that said, yup, hold on here it is  http://gist.github.com/485585 1:13 PM I basically just typed plot.summary.formula.response into R, copied it, and added stuff to add the CIs then call it like this s = summary(y ~ x) where I would normally say plot(s) now I say plot.summary.formula.response.CIs(s) 1:14 PM Sarah: ok. i'll at least give it a try and keep the note with all the info in the code 1:15 PM Heather: yeah. though like I said, don't spend time on it. I'll probably dig into it sometime in the next month or so to get it to work, because the ability to superimpose plots of y>=1 and y>=2 is sweet, and I think I'll want it in the future :) 1:16 PM Sarah: ok. sounds good. Heather: great. if you run into problems with the gist or embedding, ask me or nic. 1:17 PM talk more later today or tomorrow, then Sarah: sounds good. thanks Heather: yup, no prob. bye!
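Two of the R details from this chat, spelled out. First, the factor-to-number pitfall Heather flags: `as.numeric()` on a factor returns the internal level codes, not the labels, so convert through `as.character()` first. Second, the confidence-interval calls; the `glm` model below uses hypothetical variable names (`SharedYN`, `YearNum`, `shared`) for illustration.

```r
# as.numeric() on a factor gives level codes, not the labels
yr <- factor(c("2000", "2005", "2010"))
as.numeric(yr)                 # 1 2 3  -- internal level codes
as.numeric(as.character(yr))   # 2000 2005 2010 -- the actual years

# Confidence intervals around glm coefficients, as discussed
# (hypothetical binomial model for the sharing YN outcome)
fit <- glm(SharedYN ~ YearNum, family = binomial, data = shared)
confint(fit)        # CIs on the log-odds scale
exp(confint(fit))   # the same CIs expressed as odds ratios
```

This is why Heather's `2010 - as.numeric(substr(YearCode, 2, 5))` works on a code like "Y2005" but plain `as.numeric(YearCode)` would not.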