# DataONE:Notebook/Summer 2010/2010/07/22

## Email:Scoring and Stats questions_Dataone

22 messages

Sarah Walker Judson <walker.sarah.3@gmail.com>, Wed, Jul 21, 2010 at 5:15 PM
To: Heather Piwowar <hpiwowar@gmail.com>

Heather -

I'm running into some hurdles with my data analysis. First, I want to confirm my scoring categories with you, and second, as I got deeper into the statistics, I realized that most of my experience has been with continuous, not categorical data...so I need your opinion on some things. And just to warn you, this is a rather lengthy email. I can send my Excel and R files if it's easier, but those are messy at best. On second thought, I'll post them to OWW (Excel here, R here)

So, first, the scoring categories: For each, I have a binomial (YN) and ordinal (scored) version. I'm still not sure which is best for each aspect. I'd like your opinion on binomial vs. ordinal and the scoring levels I've proposed. I'm more inclined to scoring because it gives a more detailed picture of what is happening in the data, but I also worry that I've made too many categories. At this point, these are just for reuse, I'm planning on coding Sharing tomorrow in a similar manner.

1. Resolvability (Could the dataset be retrieved from the information provided?)

ResolvableYN
1=Y=Depository and Accession
0=N=lacking one or the other or both


ResolvableScore
0=no Depository or Accession or Author (Justification: you know they used data but not exactly how it was obtained = probably couldn't find it again…i.e. "data was obtained from the published literature")

1=Author Only (Justification: you could track down the original paper which might contain an accession or extractable info)
2=Depository or Database Only (Justification: You might be able to look up the same species/taxon and find the information per the criteria in the methods)
3=Accession Only (Justification: Accession number given but depository not specified = you would probably be able to infer which depository it came from based on format, just as I was usually able to tell that they were GenBank sequences by the format even though GenBank was never mentioned anywhere in the paper)
4=Depository and Author (Justification: Although no accession given, many depositories also have a search option for the author/title of the original article which connects to the data)
5=Depository and Accession (Justification: "Best" resolvability….unique id and known depository = should find exact data that was used)


2. Attribution

AttributionYN
1=Y=author and Accession (biblio assumed)
0=N=lacking one or the other or both



AttributionScore
*I have two alternatives for this: one that doesn't worry about "self" citations and counts them the same as others (i.e. combines 6&7 and 4&5), and another that throws out all the "self" citations (and cuts my already small sample size by a lot!)

0=no author/biblio or accession
1=self citation, other reuse (Justification: Author refers to a previous review paper of theirs, but not the original data authors…this assumes that the original data authors are attributed in the previous paper)
2=organization or URL only (Justification: data collectors/project acknowledged, but not specific individuals or relevant publications)
3=accession only (Justification: data acknowledged)
4=author/biblio, but self
5=author/biblio only, not self (Justification: original data author acknowledged and this is the currently accepted mode of attribution)
6=author + accession, but self
7=author + accession, not self (Justification: attribution to author and data…and this is the mode of attribution we hope for)


3. Ideal (previously "Good") citation score

Ideal_CitationYN
*This came out the same as my Knoxville calculation of author+depository+accession
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both

Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession

And now, the stats. While in Knoxville, we talked about doing an ordinal regression. However, as I tried it out and read up on it, I think it's the wrong match for this data. I think it is primarily used for survey analysis where you are looking for an interaction of two variables both ranked on an ordinal scale (i.e. do people that "strongly agree" with a political issue also classify themselves as "strongly conservative"). Case in point, this R example. Maybe I'm reading the literature wrong, but I don't think this is the right fit...let me know what you think (and if you have a good resource!)

Instead, I think I should be using either chi-squared or a linear model for categorical data (i.e. binomial or Poisson distribution). I've run both and am a little stuck on interpretation, but can figure that out. I wanted to get your opinion on using chisq vs. a Poisson glm. The main pros for chi = easy and p-value output; the main con = low resolution (i.e. it says something is up but not what). The main pros for Poisson = higher resolution (i.e. specifies what factors are most significant/influential) and the ability to look at multiple factors at once (then arriving at the "best" model of combined explanatory factors based on AIC); the main con for Poisson = I'm not super up to speed on the interpretation.
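For what it's worth, the Pearson X² statistic being discussed is simple enough to compute by hand. This is a stdlib-only Python sketch on a made-up 2x2 table (the numbers are illustrative, not the real counts):

```python
# Pearson X^2 for a contingency table, computed from scratch:
# X^2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total.

def chi_squared_stat(table):
    """Pearson X^2 for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 example: resolvable (rows) by two dataset types (columns)
table = [[20, 10],
         [5, 15]]
print(round(chi_squared_stat(table), 3))
```

This is what `chisq.test` does internally before comparing the statistic to a chi-squared distribution with (rows-1)(cols-1) degrees of freedom.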

Here is a summary of some of my R analysis (I can send you the R history or TinnR file if you want) on the Resolvability aspect to give you an idea of the outputs.

1. Tables

```r
> b=table(ResolvableYN,DatasetType);b
            DatasetType
ResolvableYN Bio Ea Eco GA GIS GO GS PA PT XY
           0  22 35  10  5  10  5 33 19  8  5
           1   0  0   0  0   0  0 17  0  1  0

ResolvableYN EA Eco  G  O PT  S
           0 35  10 43 41  8 15
           1  0   0 17  0  1  0

> cc=table(ResolvableScore,DatasetType);cc
               DatasetType
ResolvableScore Bio Ea Eco GA GIS GO GS PA PT XY
              0   6 15   3  0   4  0  3  4  1  3
              1  13 10   6  2   2  4 12 10  5  0
              2   1  8   0  2   2  1  6  3  0  0
              3   0  0   0  0   0  0  4  0  0  0
              4   2  2   1  1   2  0  8  2  2  2
              5   0  0   0  0   0  0 17  0  1  0

ResolvableScore EA Eco  G  O PT  S
              0 15   3  3 10  1  7
              1 10   6 18 23  5  2
              2  8   0  9  4  0  2
              3  0   0  4  0  0  0
              4  2   1  9  4  2  4
              5  0   0 17  0  1  0
```

(The second crosstab in each pair appears to use the broader datatype groupings.)
2. Chi-Squared

```r
> chisq.test(table(ResolvableScore,DatasetType))

        Pearson's Chi-squared test

data:  table(ResolvableScore, DatasetType)
X-squared = 98.1825, df = 45, p-value = 7.922e-06

Warning message:
In chisq.test(table(ResolvableScore, DatasetType)) :
  Chi-squared approximation may be incorrect
```
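That warning is R's standard complaint when the table is sparse: the chi-squared approximation is usually considered unreliable once many expected cell counts fall below about 5, which happens easily with crosstabs like the ones above. A stdlib Python sketch of the check, on made-up counts:

```python
# Compute expected counts under independence and flag small cells, which is
# roughly the condition behind R's "approximation may be incorrect" warning.
# Table values are made up for illustration.

def expected_counts(table):
    """Expected counts under independence for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[rt * ct / grand for ct in col_totals] for rt in row_totals]

sparse = [[6, 0, 3],
          [13, 2, 12],
          [0, 0, 4]]

exp = expected_counts(sparse)
n_small = sum(1 for row in exp for e in row if e < 5)
print(n_small, "of", sum(len(r) for r in exp), "cells have expected count < 5")
```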
3. Linear Model-Poisson (alternative = binomial or zero inflated for "ResolvableYN")

```r
> poisson = glm(ResolvableScore~DatasetType,data=a,family=poisson)
> summary(poisson)

Call:
glm(formula = ResolvableScore ~ DatasetType, family = poisson, data = a)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.47386  -1.37229  -0.04478   0.60369   2.29458

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     0.04445    0.20851   0.213   0.8312
DatasetTypeEa  -0.07344    0.26998  -0.272   0.7856
DatasetTypeEco -0.04445    0.37878  -0.117   0.9066
DatasetTypeGA   0.64870    0.37879   1.713   0.0868 .
DatasetTypeGIS  0.29202    0.33898   0.861   0.3890
DatasetTypeGO   0.13787    0.45842   0.301   0.7636
DatasetTypeGS   1.07396    0.22364   4.802 1.57e-06 ***
DatasetTypePA   0.18916    0.29180   0.648   0.5168
DatasetTypePT   0.64870    0.31470   2.061   0.0393 *
DatasetTypeXY   0.42555    0.41042   1.037   0.2998
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 283.03  on 169  degrees of freedom
Residual deviance: 212.36  on 160  degrees of freedom
AIC: 566.94

Number of Fisher Scoring iterations: 5
```
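One way to make the Poisson coefficients concrete: they are on the log scale, so exp(estimate) is a multiplicative effect on the expected score, relative to the baseline dataset type (Bio here, since it has no coefficient of its own). A quick Python check for the significant GS term:

```python
import math

# exp() of a Poisson GLM coefficient gives a rate ratio against the
# reference level; here for DatasetTypeGS (estimate 1.07396 above).
rate_ratio = math.exp(1.07396)
print(round(rate_ratio, 2))
```

So gene-sequence datasets have roughly triple the expected resolvability score of the baseline type, under this model.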
4. Ordered Logit Model (library MASS, function polr)

```r
polr(formula = as.ordered(ResolvableScore) ~ DatasetType, data = a)

Coefficients:
                    Value Std. Error    t value
DatasetTypeEa  -0.1696435  0.4978595 -0.3407457
DatasetTypeEco -0.1249959  0.6788473 -0.1841296
DatasetTypeGA   1.3932769  0.8160008  1.7074454
DatasetTypeGIS  0.2656731  0.7333286  0.3622839
DatasetTypeGO   0.6023778  0.8112429  0.7425369
DatasetTypeGS   2.3786340  0.4917395  4.8371833
DatasetTypePA   0.3831136  0.5557081  0.6894152
DatasetTypePT   1.0288666  0.7284316  1.4124410
DatasetTypeXY  -0.2736481  1.1134912 -0.2457569

Intercepts:
    Value   Std. Error t value
0|1 -0.6708  0.3879    -1.7294
1|2  1.2789  0.4015     3.1855
2|3  2.0552  0.4237     4.8511
3|4  2.2097  0.4285     5.1573
4|5  3.3495  0.4775     7.0144

Residual Deviance: 483.3118
AIC: 511.3118
```
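A sketch of how to read those intercepts, assuming the standard polr parameterization (logit P(Y <= j) = zeta_j - eta, with eta = 0 at the reference level): each cutpoint converts directly into a baseline cumulative probability.

```python
import math

# Convert the 0|1 cutpoint (-0.6708 above) into the baseline probability of
# a resolvability score of 0, i.e. P(score <= 0) at the reference level.

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

p_le_0 = logistic(-0.6708)
print(round(p_le_0, 3))
```

Roughly a third of baseline-type datasets are predicted to score 0, which can be eyeballed against the crosstabs as a sanity check.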

Sincerely, Sarah Walker Judson

P.S. I'm planning to post this to OWW, just thought email would be the best mode of communication for these questions at this point.

Sarah Walker Judson <walker.sarah.3@gmail.com>, Wed, Jul 21, 2010 at 7:23 PM
To: Heather Piwowar <hpiwowar@gmail.com>

And here's a bunch of pivot tables that help show the trends....helps put the stats/scoring, etc. in perspective.

Sincerely, Sarah Walker Judson

DataSets_Pivot.xls 1296K

Heather Piwowar <hpiwowar@gmail.com>, Wed, Jul 21, 2010 at 9:22 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>, Todd Vision <tjv@bio.unc.edu>

Sarah,

I wasn't quite sure where to respond on OWW, so feel free to copy my email comments over. Also, CCing Todd because he knows his stats and I want to make sure I send you down the right path.

Thanks for the full summary... it made it very easy to understand your question and think about the alternatives.

I hear you on the ordered having more information in it, so let's try to find some statistics to take advantage of that. Your levels make sense. You do have finely grained levels... this may be a problem if it makes the data too sparse for good estimates. Your crosstabs suggest it is pretty sparse across your covariates of interest. I'd go with it as is for now (I saw a paper that argues for maintaining lots of levels even in small datasets), but keep the fact that it is sparse in mind... an obvious fix is to collapse some levels if it seems necessary.

Stats. I'm not very familiar with Poisson in this context either. I could see how chi squared could be applied, but agree that it seems to leave a lot of the power on the table. And its output isn't very informative, as you said.

So let's go back and think about ordinal logistic regression again. I think that may still be quite appropriate. Here's the best document that I could find... this group does great R writeups. http://www.ats.ucla.edu/stat/r/dae/ologit.htm

What do you think? I think that your levels are definitely analogous to a Likert scale, or soft drink sizes.

This example helps with gut feel as well: http://www.uoregon.edu/~aarong/teaching/G4075_Outline/node27.html

The two approaches in R seem to be MASS::polr and Hmisc::lrm

I'd probably go with the latter because that is what the cool tutorial above uses :)

What do you think? After reading these refs do you still feel like it isn't appropriate? If so, let's talk it through.... I'm avail on chat tomorrow most of the day (multitasking with a remote meeting) so feel free to initiate chat whenever. Heather


Sarah Walker Judson <walker.sarah.3@gmail.com>, Thu, Jul 22, 2010 at 12:12 PM
To: hpiwowar@gmail.com
Cc: Todd Vision <tjv@bio.unc.edu>

Heather -

Thank you very much for the prompt help!

The UCLA link was very helpful and interesting....cool stuff. I ran my data following the tutorial. I didn't have any problems running it, but my data clearly violates a number of the assumptions:

1. Small cells/empty cells: because of the number of categories, I had many zeros or small values in my crosstabs. They warn against this, saying the model either won't run at all or will be unstable....I'm not clear what they mean by "unstable". Mine ran, but I don't know if we can trust the results. (See attached "OrdinalLogisticOutput" for results.)

2. Proportional odds assumption: My data did not hold up to either the parallel slopes or plot tests of this (see attached .txt and .jpg).

3. Sample size. Their example was with 400, plus it was mostly binary, so the samples weren't splayed out among many categories. Mine is 170 (for just the 2000/2010 comparison....I have about 100 more if I pool the 2000/2010 "snapshots" and the Time Series) and distributed among a lot of potential categories.

In general, I'm still concerned about the nature of my data in this analysis. They give two examples at the top that match my data, but then the one they use as an example is with more progressive/scalar categories (i.e. your parents had no education, the next "logical" step is that they did get an education). Mine on the other hand is A, B, and C which have no relation to each other....i.e. journal 1, journal 2, and journal 3 or datatype A, B, and C. I don't know if I'm articulating that well, but from their example, I think my type of data would work, but I'm unclear how I would interpret the results. Especially the coefficients....for example, the UCLA example says "So for pared [parent education level], we would say that for a one unit increase in pared (i.e., going from 0 to 1), we expect a 1.05 increase in the expect value of apply [likelihood of applying to grad school] on the log odds scale, given all of the other variables in the model are held constant." I don't know how you would interpret this from journal to journal or datatype to datatype.
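For what it's worth on the interpretation question raised above: the "1.05 increase on the log odds scale" from the UCLA writeup exponentiates to an odds ratio, and for a factor covariate like journal or datatype the same exp(coefficient) reads as "odds relative to the reference level" rather than "per one-unit increase". A quick check of the number:

```python
import math

# exp() turns a log-odds coefficient into an odds ratio; here for the
# UCLA example's pared coefficient of 1.05.
odds_ratio = math.exp(1.05)
print(round(odds_ratio, 2))
```

So "journal B has coefficient 1.05" would read as "the odds of a higher score are about 2.9 times those of the reference journal", with no ordering among journals implied.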

Also, I would get the books they recommend to figure out the best approach/interpretation, but I'm operating out of the world's smallest library (seriously, smaller than my apartment) and don't know other ways to obtain the books besides the limited previews on google books. Even more so, I'm almost positive it would take me over a week to get them short of a road trip to LA or thereabouts.

I'm on gchat (just invisible) if you want to hash things out now. I decided to email so I could articulate and ponder over my thoughts better. Thanks again for your help now and throughout this project!

Sincerely, Sarah Walker Judson

3 attachments: Test_ProporitionalOddsAssumption.jpeg 52K, ParallelSlopeTable.txt 5K, OrdinalLogisticOutput.txt 2K

Heather Piwowar <hpiwowar@gmail.com>, Thu, Jul 22, 2010 at 12:36 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Cc: Todd Vision <tjv@bio.unc.edu>

Hi Sarah,

Nice job on the fast analysis and thoughtful interpretation (or interpretation attempts, as the case may be).

I will read and think and respond in the next few hours.

fwiw I do have "Frank E. Harrell, Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis. Springer, New York, 2001. [335-337]" at home (it is a great book in many ways! recommended) and would be happy to zoom any relevant pages to you.... will see if that is helpful.

More soon, Heather

Heather Piwowar <hpiwowar@gmail.com>, Thu, Jul 22, 2010 at 2:14 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Cc: Todd Vision <tjv@bio.unc.edu>

Sarah,

I'm going to write in email too, to help organize my thoughts.

These responses are going to sound like I'm arguing for the ordinal regression. I'm not, per se.... just trying to fully see if it can work before we go to something else.

1. Small cells/empty cells Agreed, a potential problem. I think it would be a potential problem for most statistical techniques, because it's hard to estimate from little information. That said, there are some algorithms that are designed to deal well with this, like Fisher's exact test in place of chi-squared.

I think by "unstable" they mean very sensitive to individual datapoints. One way to test this is to do a loop wherein you exclude a datapoint, recompute, see if it changed anything drastically. I don't think this makes sense to do first, but we can plan to do it at the end if we are worried about potential instability.
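The leave-one-out loop Heather describes can be sketched in a few lines. This is a Python illustration using the sample mean as a stand-in for the model estimate (a real run would refit the ordinal regression on each subset; the data values are made up):

```python
# Leave-one-out stability check: drop each datapoint in turn, recompute the
# estimate, and flag points whose removal shifts it drastically.

data = [0, 1, 1, 2, 5, 1, 0, 2]

full_estimate = sum(data) / len(data)
drops = []
for i in range(len(data)):
    subset = data[:i] + data[i + 1:]
    drops.append(sum(subset) / len(subset))

# Flag datapoints whose removal shifts the estimate by more than 20%
influential = [i for i, est in enumerate(drops)
               if abs(est - full_estimate) / full_estimate > 0.20]
print(influential)
```

A stable fit would flag nothing; here the outlying value 5 gets flagged, which is the kind of sensitivity "unstable" presumably refers to.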

2. Proportional odds assumption

I'm going to do a bit more reading here. There are related algorithms for non-proportional ordinal regressions, though no obvious best choices in R....

another idea might be to collapse the levels into 3 or 4 and see if they are more proportional then (since the chance of having 6 things happen to be proportional is lower than 3 things, more sensitive to outliers...)

3. Sample size I'd add in the extra 100. Also, there is no suggestion that their sample size was a minimum.... That said, I agree, we are trying to estimate a lot of parameters based on not very much data. Rules of thumb are always tricky, and they depend on estimates of effect size, which of course we don't know yet. That said, a rule of thumb is to have 30 datapoints for every multivariate coefficient you are trying to estimate. 6 levels takes up 6, 6*30=180, and that is before estimating anything for your covariates.

So maybe another argument to collapse your levels down to 3 for now???

4. Factor/categorical variables. Yup, your journals and subdisciplines are factors. I don't believe this will cause a problem. I would model them with dummy variables (one variable for each of your journals and subdisciplines, binary 0/1). Of course that is a lot of covariates, but I think that is the only way to have interpretable results.

A bit on dummy variables here: http://www.psychstat.missouristate.edu/multibook/mlt08m.html

I know that the Design library often does smart things with factor variables, too.... so before you create dummy variables you could try redefining your journal variable as a factor, feed that in, and see what it does.... "If you have constructed those variables as factors, the regression functions in R will interpret them correctly, i.e. as though the dummies were in there. " as per here.
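The dummy-variable idea can be sketched in a few lines (Python for illustration; the journal names are examples, and R's factor handling does the equivalent automatically):

```python
# Minimal one-hot (dummy-variable) encoding of a factor such as Journal.
# One level is dropped as the reference/baseline, mirroring R's treatment
# contrasts, so k levels become k-1 binary columns.

def one_hot(values):
    levels = sorted(set(values))
    reference = levels[0]  # first level acts as the baseline, as in R
    encoded = [{lvl: int(v == lvl) for lvl in levels if lvl != reference}
               for v in values]
    return reference, encoded

journals = ["AmNat", "SysBio", "ME", "AmNat"]
ref, rows = one_hot(journals)
print(ref)
print(rows[2])
```

Every coefficient on a dummy column is then read as "this level versus the reference level", which is how the DatasetType coefficients in the earlier outputs work.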

ok, I'm going to end my stream of consciousness there, do a bit more reading, then find you for an interactive chat. Heather

Heather Piwowar <hpiwowar@gmail.com>, Thu, Jul 22, 2010 at 2:24 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>
Cc: Todd Vision <tjv@bio.unc.edu>

Sarah,

This link may be of interest too. It is the parallel tutorial to the "ordered" regression one.... this is for regressing against categories/factors where there is no order. Not what you are doing and so it probably loses a lot of power, but it definitely doesn't have a proportional odds assumption!

could be informative to give it a try on your data for fun if easy, pretending that your levels were all distinct unrelated labels?

Sarah Walker Judson <walker.sarah.3@gmail.com>, Thu, Jul 22, 2010 at 3:49 PM
To: hpiwowar@gmail.com

Sorry, one question I forgot (but isn't urgent since I have a lot to chew on): should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?

To reiterate (it's kind of a combination of resolvability and attribution):

Ideal (previously "Good") citation score

Ideal_CitationYN
*This came out the same as my Knoxville calculation of author+depository+accession
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both

Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession

To adapt to an ordinal scale, it could either be:

Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN") or resolvable ("Yes" in "ResolvableYN")
2=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only or author only or accession only
2=(depository and author) or (depository and accession) or (author and accession)
3=depository, author, and accession
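The collapsed gradient above amounts to counting how many of the three citation elements are present. A one-line sketch of that recoding:

```python
# Collapsed Ideal_CitationScoreGoodGradient: 0 = none of the elements,
# 1 = any one, 2 = any two, 3 = all three (depository, author, accession).

def collapsed_score(depository, author, accession):
    return int(depository) + int(author) + int(accession)

print(collapsed_score(True, False, True))
```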

Thoughts?

This might be another "out of the scope of this project" item, or it might be redundant with resolvability and attribution, or it might be essential...I dunno. As I rethink, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable).....alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?

Again, no rush...I've got plenty to work on.

Thanks!!!

Sincerely,


Sarah Walker Judson

Sarah Walker Judson <walker.sarah.3@gmail.com>, Thu, Jul 22, 2010 at 4:09 PM
To: hpiwowar@gmail.com

Also,

You mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009 for SysBio and AmNat only) data to get a bigger sample size. The former was collected sequentially and the latter randomly. I'm a bit worried this affects assumptions about data collection, but I don't know if this is as strict of an assumption in this arena as in biology. I was thinking of running both separately and then pooled, then choosing one (probably the 2000/2010) as the focus for reporting and stating whether or not the other sets produced similar results.

thoughts?

thanks.

Sincerely, Sarah Walker Judson

Sarah Walker Judson <walker.sarah.3@gmail.com>, Thu, Jul 22, 2010 at 6:31 PM
To: hpiwowar@gmail.com

Heather -

I've got notes from my reattempts this afternoon, so expect a lengthy email following this (but possibly not until tomorrow morning).

Main success: getting results I understand and generating meaningful questions (I feel like I'm on the verge of having something to report) Main problem: can't get all the factors to run at once.

So, that's what i'm hoping for help on....I'm planning on writing a more detailed email about successes and further questions, but for now I'll just barf my code into this email b/c maybe you've run into this problem before. The input file is also attached. My notes on the error are at the bottom as part of the code. My apologies for the mess...mostly, my husband's complaining that I'm still on the computer rather than eating dinner, so I'll send the long version later.

```r
a=read.csv("ReuseDatasetsSnap.csv")
attach(a)
names(a)
str(a)
xtabs(~ Journal+ResolvableScoreRevised)
xtabs(~ YearCode+ResolvableScoreRevised)
xtabs(~ DepositoryAbbrv+ResolvableScoreRevised)
xtabs(~ DepositoryAbbrvOtherSpecified+ResolvableScoreRevised)
xtabs(~ TypeOfDataset+ResolvableScoreRevised)
xtabs(~ BroaderDatatypes+ResolvableScoreRevised)
library(Design)
# can't get this to run with all the factors at once, or even two at a time
ddist4 <- datadist(Journal+YearCode+DepositoryAbbrv+BroaderDatatypes)
options(datadist='ddist4')
ologit4 <- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes,
               data=a, na.action=na.pass)
print(ologit4)
```

```r
# modify to match output before running --> Y repeats
sf <- function(y)
  c('Y>=0'=qlogis(mean(y >= 0)),
    'Y>=1'=qlogis(mean(y >= 1)),
    'Y>=2'=qlogis(mean(y >= 2)))

s <- summary(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, fun=sf)
s
text(Stop!)
# modify to match output before running --> which, xlim
plot(s, which=1:3, pch=1:3, xlab='logit', main=' ', xlim=c(-2.3,1.7))
```

The errors/warnings I get:

```
fewer than 2 non-missing observations for Journal + YearCode + DepositoryAbbrv + BroaderDatatypes
1: In Ops.factor(Journal, YearCode) : + not meaningful for factors
2: In Ops.factor(Journal + YearCode, DepositoryAbbrv) : + not meaningful for factors
3: In Ops.factor(Journal + YearCode + DepositoryAbbrv, BroaderDatatypes) : + not meaningful for factors
```

My notes on the error:

1. I got this error before when I was running it non-factor, but it cleared up when I either ran fewer variables at once or coded to dummy variables (1, 2, 3, 4, etc.) instead of letters (ea, eco, bio, etc.)
2. Internet searches primarily turn up code that I don't understand or a few discussion forums that don't make sense to me.
3. Search terms used: "datadist" & "not meaningful for factors"; "datadist" & "fewer than 2 non-missing"
4. I don't get this problem when running each factor separately. I ran most separately to practice interpretation. The main problem is that ME (journal) is correlated with GenBank (depository) and gene (datatype) = each comes out significantly "better" when run as a separate model (factor by factor)... this is where a multiple factor model (which isn't working) would come in handy, to (maybe) tease these apart (i.e. is publishing in ME, reusing from GenBank, or using a gene what determines resolvability/attribution).

Sincerely, Sarah Walker Judson

ReuseDatasetsSnap.csv 259K

Heather Piwowar <hpiwowar@gmail.com>, Thu, Jul 22, 2010 at 8:08 PM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Hi Sarah,

Hmmmm interesting. Possibly related to zeros in your crosstabs? I may have some ideas. No time now but will dig in tomorrow morning.

Heather

Heather Piwowar <hpiwowar@gmail.com>, Fri, Jul 23, 2010 at 7:17 AM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Sarah,

Good question. I'd say yes, go for it. I agree, it helps to flesh out the story. I like the first option better... a linear combination of the other two, then, isn't it?

Heather

Heather Piwowar <hpiwowar@gmail.com>, Fri, Jul 23, 2010 at 7:22 AM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>

So based on my gut and what Todd said, I think combine them. Include a binary variable for snapshotYN... we hope that that one is not significant, but it will help catch things if it is. And you have a variable for year already, right?

All of this said, once we get the stats working in general, it is probably worth an email to the data_citation list summarizing your approach and prelim interpretations, so they can give feedback if anything is out of whack methodologically.

Heather

Heather Piwowar <hpiwowar@gmail.com>, Fri, Jul 23, 2010 at 7:23 AM
Reply-To: hpiwowar@gmail.com
To: Sarah Walker Judson <walker.sarah.3@gmail.com>

Sarah, I'm busy till 10am but will have time after that to dig in. Heather

Sarah Walker Judson <walker.sarah.3@gmail.com>, Fri, Jul 23, 2010 at 10:33 AM
To: hpiwowar@gmail.com

Heather -

Thanks so much again for your help! Here's the promised "long version"....sorry it took me a while to get it to you. I apologize for the length and don't necessarily expect lengthy responses to each portion; writing this out is helping me think through it all and hopefully helping you become more acquainted with my data. I indicate the most important questions with a double asterisk (**).

First, attached (PrelimOutputs.txt) are some preliminary results with my interp written under each output. Even though I know we want to be running multiple factors at once (I think) I found this to be a useful exercise to familiarize myself with the statistic and running it in R. It started to reveal some support for trends I was expecting to see (i.e. gene sequences are more resolvable and attributed), so that's promising.

** Like I said before, the main problem I'm having is running all the factors together. I don't understand the error I'm getting and can't find much help on it (at least not that I can understand). This would be especially helpful for distinguishing whether publishing in Molecular Ecology, using a gene sequence, or utilizing GenBank is most influential in having a resolvable/attributable data citation. But at the same time, these are all correlated, so it might just be more of a mess because of multicollinearity.

I have some more specific questions in general and about the attribution/resolvability scoring.

In general:
- You mentioned "subdiscipline" as a factor yesterday. Were you referring to what I call "data type" or the discipline of the journal? Concerning the latter, many of the journals are classified (according to ISI) under two of our three major disciplines, e.g. American Naturalist is classified as Ecology and EvoBio, GCB is classified as Environmental Science and Ecology. Few have just one. I coded this for now as binary for each discipline, but given the existing problems with multiple factors, this might be too much to add. Also, I tend to think most of the journals belong to one discipline more strongly than the other...i.e. I would say AmNat is Ecology, SysBio is EvoBio, and GCB is Environ Sci, etc. This would also reduce the number of factors for this category. Thoughts?

** - By testing so many factors and character states, aren't we pretty prone to Type 1 error? How do we "prevent" this? Does running factors separately vs. combined help at all?

- For some papers I have multiple datasets per paper. During data collection, I had them all pooled and separated by commas to indicate nuances. Primarily, I only split an article into multiple datasets if they were different datatypes OR if one dataset was a self reuse and the other was acquired via another mechanism. There are about 5-10 incidences where a dataset was split even though they were the same datatype because one was attributed/resolvable and the other was not (i.e. they were acquired in different ways). Will this lead to independence problems? (P.S. I have some preliminary sentences about this for the methods if this doesn't make sense, let me know if you need it).

** - For some of my factors, I have both a "broad" and a "specified" classification. I'm more inclined to the broad for stats, but always hate to toss resolution. Right now I'm most inclined to keep datatypes broad and depositories specific. Here are the classifications for comparison.

Datatypes - Specified (*how data was collected)
Bio = organismic, living
Paleo = organismic, fossil
Eco = community (multi-species)
GS = gene sequence
GA = gene alignment
GO = other gene (blots, protein)
Ea = earth (soil, weather, etc)
GIS = layers
XY = coordinates
PT = phylogenetic tree

Datatypes - Broad (*what I am currently using)
G = gene (GS, GA, GO)
O = organismic (living and fossil = Bio and Paleo)
S = spatial (GIS, XY)
Eco = community (multi-species)
Ea = earth (soil, weather, etc)
PT = phylogenetic tree

Datatypes - Broader still (haven't attempted)
Ecology = organismic (Eco, Bio, Paleo)
Environ Sci = spatial & earth
EvoBio = gene (PT, GS, GA, GO)

Depository - Specified (*currently using)
G = GenBank
T = TreeBASE
U = url or database (non-depository)
E = extracted literature
O = other (correspondence, not indicated)

Depository - Broad (*results similar to above)
G = GenBank
T = TreeBASE
O = other (url, extracted, correspondence, not indicated)

Depository - Binary (haven't attempted)
D = depository (i.e. people can both deposit and extract data = GenBank, TreeBASE)
O = other

Resolvability:

- I'm having a little problem that will probably require recoding: I only counted a depository reference if it was in the body of the text, not in supplementary appendices or even a table caption. I think I started counting a depository reference later in data collection if it was in the table caption, but still not if it was in the supplementary caption. I want to get your opinion on how this should be coded in the resolvability categories:
0 = "no information, can't find it" = none of the below
1 = "could find it with extra research" = depository or author or accession ONLY
2 = "could find it just with info provided in the paper" = depository and (author and/or accession)

** I think a table with GenBank mentioned in the table caption and accessions given therein should be a "2". However, I think GenBank mentioned in the header of an appendix followed by accessions (i.e. the same table as previous but in supplementary information) should be counted as a "1", because you would have to track down the supplementary information, which in the case of SysBio and other articles is difficult. Again, this is considering that GenBank was never mentioned in the body of the paper, but the authors said something like "additional information about sequences is provided in appendix a". This gives the reader no guarantee that when they dig up the appendix it will actually have accession numbers, as it may just describe which taxa each sequence was from or a museum voucher number for the specimen. So that's my bias, just want to see if it's justified in your mind. It will require some recoding no matter what.

a little bit of a problem with f/ac


Attribution: Another quick question about scoring which like the above requires lengthy text to explain. Here are my final scoring categories and explanations:

0 = "the data is not attributed" - no author or accession (no author also covers a self citation (i.e. a previous review paper) with other reuse, i.e. the original data authors are not attributed at all)

1 = "the data is indirectly attributed" - accession only or author only (author only also = a self citation (i.e. previous review paper) but other reuse...i.e. original data authors not attributed at all) - this still includes self reuses of previous data. In the discussion, I would then talk about what % self reuse occurs as a caveat/modifier about this information

2 = "the data is directly attributed" - author and accession, regardless of self. I think if the author reused their own data and gave the accession number, that's great (it happened so rarely that I appreciated it when it did....it seemed less like personal aggrandizement to rack up a citation and more like open data sharing...a "hey, you can use this data too" rather than "please go read my other publications to see if you want my data, and maybe you can dig up how to get it in those other papers, because i don't feel like explaining it here").

So, my explanations probably show my bias. I think it's ok to include self reuses, partly because the sample size is small already and partly because some people legitimately reuse their own data. However, I don't think it should count when they cite themselves but really used other data (what I call self citation/other reuse....meaning they refer to their previous collection of data and vaguely state that it was from external sources, but give no credit to the original data authors in the current paper; they might in the previous paper, but I don't think we can assume that). So, again, do my biases/categories seem justified? Should we just throw out self reuses altogether, as you've been doing? Also, I should note that, as I mentioned above (in "independence"), self reuses and other reuses from the same article were separated for analysis.
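(A minimal sketch of the attribution scale as a hypothetical function; self-reuses are deliberately not excluded here, per the rationale above, and would instead be tracked as a separate flag:)

```python
def attribution_score(has_author, has_accession):
    """Proposed attribution scale:
    0 = not attributed (neither original author nor accession),
    1 = indirectly attributed (author only OR accession only),
    2 = directly attributed (author AND accession), regardless of self-reuse."""
    if has_author and has_accession:
        return 2
    if has_author or has_accession:
        return 1
    return 0
```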

Well, hopefully you survived all that. Thanks again for your diligence and continued help!!

Sincerely, Sarah Walker Judson [Quoted text hidden]

PrelimOutputs.txt 21K Sarah Walker Judson <walker.sarah.3@gmail.com> Fri, Jul 23, 2010 at 10:59 AM To: hpiwowar@gmail.com One crazy idea about the multiple factor problem. It worked when I ran everything together as dummy variables (not binary like the link you sent, but 0,1,2,3, etc)...that was how I ran it before our chat yesterday. I could numerically code/rank the journals, datatypes, etc according to their coefficients when run separately, then run all the factors together to maybe get at which is the most influential (journal vs. datatype vs. year). I dunno if that even works at all, but it's the only plausible work around I can think of given I don't know a ton about this method. It's probably totally unconventional, but I thought I might as well mention it.

Sincerely, Sarah Walker Judson

P.S. I'm on gchat most of the day (i.e until 6pm), but will be invisible as usual. [Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 1:10 PM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Hi Sarah. I'm going to go have lunch and then come back and chat.

Question: have you tried datadist with commas rather than +s?

These lines seem to run successfully:

ddist4 <- datadist(Journal, YearCode, DepositoryAbbrv, BroaderDatatypes) # can't get this to run with all the factors at once, or even two at a time
options(datadist='ddist4')
ologit4 <- lrm(ResolvableScoreRevised ~ Journal+YearCode+DepositoryAbbrv+BroaderDatatypes, data=a, na.action=na.pass)
print(ologit4)

ok, more chatting later,

Heather

Sarah Walker Judson <walker.sarah.3@gmail.com> Fri, Jul 23, 2010 at 6:27 PM To: hpiwowar@gmail.com, Todd Vision <tjv@bio.unc.edu> Heather -

Recombined datasets I recombined any dataset that was the same article but different citation practice (i.e. self and other, or different sources). I made some binary and factor codes to indicate this if we deem it relevant or important to test for that artifact. I have the min and max score for attribution and resolvability for each and they can be relatively easily modified in later updates.

Now, I'm wondering, should min, max, or average score be used? I've ruled out average to keep things ordinal, but I'm not sure of the (dis)advantages to min and max. Could you enlighten me?

Dealing with Self reuses I recombined datasets that were self (or different practices) into a single dataset to avoid independence problems. Now, the main problem I'm running into is how to use the self reuses. First, the way I've coded it, and then the resulting options:

Coding:
0 = no self (all use of someone else's data)
1 = self citation, other reuse (citing a previous study/review, but not mentioning the original data authors)
2 = some self reuse, some other (some of the data was recycled from a personal previous study, but the gaps were filled in by other sources....common with gene sequences...authors use their own data and then fill in outgroups or missing taxa from a blast search)
3 = all self reuse (recycling data from a previous study to look at a new question)

Options:
1. Keep self reuses in the dataset but use the above coding as a factor (maybe we should do this to help us determine whether to throw them out or not).
2. Throw out some level of self reuses....but the question is, which level? I think a score of 1 counts as a bad reuse of someone else's data...using it but not crediting them. A score of 2 is also debatable for inclusion...we could keep the "non-self" portion and toss the self portion (i.e. then we don't have to deal with min and max scores...basically go back to my split ones, but just use the non-self). A score of 3 should definitely be tossed...just recycling old data (but I stand by my point that option 1 should at least be considered b/c I think authors citing their own data are sloppier than when citing someone else's). I'm inclined to toss 3's only (but analyze them separately!) because I think scores of 2 and 1 still hold meaningful information about using other people's data. In terms of sample size, cutting 3's only, we lose 36 from the whole dataset (originally 270, now 245 with recombined datasets); cutting 2's = another 14; cutting 1's = another 9.
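(A sketch of option 2 with the proposed cutoff, tossing only the 3's but keeping them for a separate analysis; the records and field names are invented for illustration:)

```python
# Invented per-article records: self_code follows the 0-3 coding above.
records = [
    {"article": "a1", "self_code": 0},  # all other reuse
    {"article": "a2", "self_code": 1},  # self citation, other reuse
    {"article": "a3", "self_code": 2},  # mixed self and other
    {"article": "a4", "self_code": 3},  # all self reuse
]

main_analysis = [r for r in records if r["self_code"] < 3]   # keep 0, 1, 2
self_only = [r for r in records if r["self_code"] == 3]      # analyzed separately
```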

Snapshot vs. Time Series vs. Pooled At some quick glances (not considering the elimination of self reuses or classification changes yet), the results are coming out pretty much the same when run any way. For resolvability, 2010 always comes out significantly different than 2000, Journals not sig, Datatypes not sig (again, without change in categories yet), and Depositories sig (Genbank and treebase the same, other sig diff = "worse" resolvability and attribution).

Depository and Datatype reclassifications I haven't finished this yet and am hoping to hash it out this weekend or Monday morning. I'm leaning towards the following classification, but haven't pondered it long and hard yet. I'm open to arguments or new ideas on the matter.

DATATYPE:
Reuse: "genera" of data
raw gene: GS, GO
processed gene: GA, PT
earth: GIS, Ea
species: bio, eco, paleo, xy

Sharing: where the data "should" go (but this could be a separate measure...like "data should go to depository X, Y, or Z" and then "data did go to depository X, Y, or Z")
gene: GS (genbank)
processed gene: GA, PT (treebase)
earth: GIS, Ea (daac - does daac handle both of these? if not, split)
species: bio, eco, xy (dryad)
fossil: pa (paleodb)
other gene: GO (actually, I don't think any of these were shared)
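(The reuse groupings above, written out as a hypothetical recode table; the code letters come from the email, and the dictionary form is just one way to express the mapping:)

```python
# Proposed "genera" of data for the reuse analysis (codes from the email).
REUSE_GENERA = {
    "GS": "raw gene", "GO": "raw gene",
    "GA": "processed gene", "PT": "processed gene",
    "GIS": "earth", "Ea": "earth",
    "bio": "species", "eco": "species", "paleo": "species", "xy": "species",
}

def broad_datatype(code):
    # anything not in the table falls into a catch-all bucket
    return REUSE_GENERA.get(code, "other")
```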

I haven't come to a conclusion about Depository yet, the main debate was Depository YN vs. Depository and the Other categories vs. Depository types (specify genbank and treebase) and other categories. I think the middle....so answering the question about resolvability/attribution if it comes from any depository, from a url/db (non-depository), extracted literature, other (correspondence), or not indicated. That's where I'm leaning on that.

Well, these aren't complete thoughts, but I figured I should spit out all of the above so if you had time this weekend, but not Monday I could still get your input.

Thanks again for all your help. Will keep you updated on my progress. Hoping to have a draft Mon/Tues, but am struggling to put words to paper before the stats/methods we've discussed are more solid.

Sincerely, Sarah Walker Judson [Quoted text hidden] Heather Piwowar <hpiwowar@gmail.com> Mon, Jul 26, 2010 at 10:09 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Cc: Todd Vision <tjv@bio.unc.edu> Sarah,

Sounds like good progress hashing out the issues. Some thoughts below.

I'm on chat all day... a lot of my focus will be on helping Nic with multivariate R stuff today. That said, if there is something that you need help thinking through and it is holding you back, then send me a chat or email and we'll work through it. Heather

Now, I'm wondering, should min, max, or average score be used? I've ruled out average to keep things ordinal, but I'm not sure of the (dis)advantages to min and max. Could you

I don't know that one is better than the other, but it changes interpretation, for example from "when all of an author's genbank citations include an author name" to "when at least one of the genbank citations include an author name." So I think it just depends on which you'd rather know something about.

Dealing with Self reuses I think I like option 1 for now. Keeps in the most data, lets us tell a story about "citing data" no matter where the data, etc. That said, I could believe that working through it may provoke a different opinion. Maybe tomorrow I can step through your R analysis with you and we could go over some of these questions with the data in front of us?

Snapshot vs. Time Series vs. Pooled For pooled, which I'm going to call "combined" because I'm familiar with pooled analysis being something specific and different, you are including all rows from snapshot, all rows from time series, a column YN for whether snapshot or time series and a numeric column for the year, is that right? in that scenario is the year significant? is the snapshotYN significant? curious :)

Heather
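(The "combined" layout Heather describes can be sketched with invented rows: all snapshot and time-series rows stacked, plus a snapshotYN indicator and a numeric year column:)

```python
# Invented example rows from each collection method.
snapshot = [{"journal": "SB", "year": 2000}, {"journal": "AmNat", "year": 2010}]
time_series = [{"journal": "SB", "year": 2005}, {"journal": "AmNat", "year": 2008}]

# Stack everything, flagging the collection method so its significance
# can be tested alongside year in the same model.
combined = ([dict(r, snapshotYN=1) for r in snapshot] +
            [dict(r, snapshotYN=0) for r in time_series])
```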

Sarah Walker Judson <walker.sarah.3@gmail.com> Mon, Jul 26, 2010 at 6:08 PM To: hpiwowar@gmail.com Cc: Todd Vision <tjv@bio.unc.edu> Heather -

Here (attached) are my thoughts on possibilities for factor classifications of data type and depository. I organized my thoughts in an Excel table and thought that would also be the easiest way to send it. I give my rationale for each and highlight the ones I'm leaning towards. Let me know what you think by email or proposing a chat time tomorrow (I'm open anytime). Thanks!

Sincerely, Sarah Walker Judson [Quoted text hidden]

DataAndDepositoryClassification.xls 10K Heather Piwowar <hpiwowar@gmail.com> Tue, Jul 27, 2010 at 7:26 AM Reply-To: hpiwowar@gmail.com To: Sarah Walker Judson <walker.sarah.3@gmail.com> Cc: Todd Vision <tjv@bio.unc.edu> Great, Sarah, let's chat about it today. I have a few meetings and I don't know how long they'll run, so I'll initiate a chat when I am free. Maybe 10 or 10:30 pacific, or else after our group meeting. Heather

[Quoted text hidden] Sarah Walker Judson <walker.sarah.3@gmail.com> Tue, Jul 27, 2010 at 9:16 AM To: hpiwowar@gmail.com Sounds good. I'm available at either time.

Sincerely, Sarah Walker Judson [Quoted text hidden]

## Email: Advice on data collections in stats

3 messages Heather Piwowar <hpiwowar@gmail.com> Thu, Jul 22, 2010 at 8:07 PM Reply-To: hpiwowar@gmail.com To: Todd Vision <tjv@bio.unc.edu>, Sarah Walker Judson <walker.sarah.3@gmail.com> Todd,

Could do with some stats advice.

Sarah collected data in two different ways: randomly and consecutively. My guess is that she can concatenate these for her main analysis... maybe with a binary variable indicating the type of data collection to hopefully catch artifacts.

That said, I'm a bit unsure and I don't want to lead her down the wrong path.

What do you think? Heather

Forwarded message ----------

From: Sarah Walker Judson <walker.sarah.3@gmail.com> Date: Thu, Jul 22, 2010 at 4:09 PM Subject: Re: Scoring and Stats questions_Dataone To: hpiwowar@gmail.com

Also,

you mentioned that maybe I should pool my "snapshot" (2000/2010) and "time series" (2005-2009, for sysbio and amnat only) data to get a bigger sample size. the former was collected sequentially and the latter randomly. i'm a bit worried this affects assumptions about data collection, but don't know if this is as strict an assumption in this arena as in biology. i was thinking of running both separately and then pooled, then choosing one as the focus (probably the 2000/2010) for reporting and stating whether or not the other sets produced similar results.

thoughts?

thanks.

Sincerely, Sarah Walker Judson

On Thu, Jul 22, 2010 at 3:49 PM, Sarah Walker Judson <walker.sarah.3@gmail.com> wrote: Sorry, one question I forgot (but isn't urgent since I have a lot to chew on): should I even attempt the "ideal citation" score, or just worry about the resolvability and attribution components?

To reiterate (it's kind of a combination of resolvability and attribution):
Ideal (previously "Good") citation score
Ideal_CitationYN* (this came out the same as my Knoxville calculation of author+depository+accession)
1=Y=Resolvable + Attribution (adding the two previous yes and no categories)
0=N=lacking one or the other or both

Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN")
2=resolvable ("Yes" in "ResolvableYN")
3=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only
2=author only
3=accession only
4=depository and author
5=depository and accession
6=author and accession
7=depository, author, and accession

To adapt to an ordinal scale, it could either be:
Ideal_CitationScoreSimple
0=not resolvable or attributed
1=attributed ("Yes" in "AttributionYN") or resolvable ("Yes" in "ResolvableYN")
2=attributed and resolvable

Ideal_CitationScoreGoodGradient
0=none
1=depository only or author only or accession only
2=(depository and author) or (depository and accession) or (author and accession)
3=depository, author, and accession
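(Both ordinal adaptations reduce to counting the elements present; a hypothetical sketch:)

```python
def ideal_citation_simple(attributed, resolvable):
    """Ordinal 0-2: 0 = neither, 1 = attributed or resolvable, 2 = both."""
    return int(attributed) + int(resolvable)

def ideal_citation_gradient(depository, author, accession):
    """Ordinal 0-3: simply the number of citation elements present."""
    return sum([depository, author, accession])
```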

Thoughts?

This might be another "out of the scope of this project" or it might be redundant of resolvability and attribution or it might be essential...I dunno. As I rethink, I think it's probably redundant and not needed, but I originally liked it as an overall metric (i.e. are the citations both resolvable and attributable).....alternatively, is there a way to crosstab and analyze these? Is that maybe the best route?

Again, no rush...I've got plenty to work on.

Thanks!!!

Sincerely,


Sarah Walker Judson

Todd Vision <tjv@bio.unc.edu> Fri, Jul 23, 2010 at 4:41 AM To: "hpiwowar@gmail.com" <hpiwowar@gmail.com> Cc: Sarah Walker Judson <walker.sarah.3@gmail.com> I haven't been following the discussion closely enough to be sure, but a general approach would be to combine, test for heterogeneity and, in its absence, accept the stats on the combined sample. But the population of study for the two sets sounds sufficiently different (i.e. the identity of journals, as opposed to the random/sequential distinction) that a combined analysis would be difficult to interpret. I'm not sure you would not want to use them to ask the same questions.

Todd [Quoted text hidden] --

Todd Vision

Associate Professor Department of Biology University of North Carolina at Chapel Hill

Associate Director for Informatics National Evolutionary Synthesis Center http://www.nescent.org

Heather Piwowar <hpiwowar@gmail.com> Fri, Jul 23, 2010 at 7:14 AM Reply-To: hpiwowar@gmail.com To: Todd Vision <tjv@bio.unc.edu> Cc: Sarah Walker Judson <walker.sarah.3@gmail.com> Thanks Todd, that's helpful. Heather

## Chat Transcript July 22

2:39 PM Heather: Hi Sarah!

 You around?
Sarah: yep...trying to work on what you sent
Heather: cool.
I don't have any definitive answers, just ideas.
Sarah: ok


2:40 PM Heather: what do you think? based on what you've seen, do you think continuing down this path a bit more makes sense?

 or turn to the chi-sq or poisson or something else?
Sarah: yeah, i think it will work, it's just a steeper learning curve for me than i expected


2:41 PM my main question before proceeding on, is: what exactly are we trying to test with this?

Heather: yeah, agreed.
it feels like good stuff to know
Sarah: i.e. mainly trying to say which factor is most influential?
Heather: but always a bit hard to be learning it when you actually need to use it, yesterday.
Sarah: or just trying to get a measure of significance to stamp on the results?


2:42 PM to me, the percentage table breakdowns really show what's going on

Heather: right good question.
so what the percentage table doesn't give us is a multivariate look that tries to hold potential confounders constant


2:43 PM Sarah: agreed

Heather: so we are trying to figure out which factors are important
which ones aren't
Sarah: still agreed
Heather: yeah, I think that is mostly it :)
maybe also


2:44 PM an estimate of prevalence and percentages in some factors, independently of confounders

 for example
journal impact factors vary a lot by subdiscipline


2:45 PM and there is lots of prior evidence (mine, nic's, etc) that high impact factors

 correlate with stronger policies, and probably more sharing
Sarah: i wasn't considering throwing in high impact factors b/c the journals were already selected by that criteria and there are only 6 journals, so i didn't think that would be an informative variable


2:46 PM Heather: so it would be ideal if we could decouple impact factor from rates of sharing

 yeah, I hear you. true, you only have 6 journals to work from.
Sarah: i would use it more as explanatory in the discussion to maybe say why one journal was better or worse....or running impact factor as a variable if journal was a significant predictor
Heather: right.


2:47 PM so then let me change my example

Sarah: ok, sorry to get off track on details
Heather: to be figuring out the relative rates of sharing in any given journal
Sarah: so say, journal vs. datatype
Heather: right
assuming the mix of datatypes was the same


2:48 PM (which obviously it isn't)

 so I'd say the goals are
Sarah: yeah, it's highly correlated with journal
Heather: right
(which of course will make it hard for the stats to tease it out
but that's life)
so I'd say the goals are
a) which factors are important


2:49 PM b) what are the relative levels of sharing, independent of other variables

 that sync with what you think?
Sarah: could you explain what you mean by b)?
just percentages by journal/datatype, etc?
Heather: yeah. let's see, my head needs to get more into your data.


2:50 PM so, for example, we could want to say that

 when data is sequence data, the odds that it will be shared


2:51 PM Sarah: are better than if it's ecological

Heather: at a really high level of best-practice
are 1.5 times more than if it were ecological.
yes, exactly.
so you would have to choose a "baseline"


2:52 PM (or I think when you define the variable as a factor, it chooses a baseline for you? I forget)

Sarah: hmmm...i'm not clear
Heather: .... independent of whatever journal it is published in


2:53 PM Sarah: (and on a technical note, I'm having trouble defining factors in Design)

Heather: ok.... oh, I think that is easy.
can you just say as.factor(x) or factor(x), or does that not work?


2:54 PM Sarah: as.factor hasn't been working, in the documentation it says: "In addition to model.frame.default being replaced by a modified version, [. and [.factor are replaced by versions which carry along the label attribute of a variable."

Heather: hmmmm. apparently not easy.
I can look in my code.
Sarah: basically, the only way i could get my data to behave in Design was to substitute numbers
but that makes things act like ordinal scales


2:55 PM Heather: yeah, and that is probably making the results strange

Sarah: let me dig up the output differences real quick
journal straight:
Coef S.E. Wald Z P


y>=1         1.5626 0.4291  3.64 0.0003
y>=2        -0.4607 0.4010 -1.15 0.2506
y>=3        -1.2100 0.4127 -2.93 0.0034
y>=4        -1.3597 0.4164 -3.27 0.0011
y>=5        -2.4611 0.4617 -5.33 0.0000
Journal=EC  -1.5524 0.6321 -2.46 0.0140
Journal=GCB -1.3413 0.5323 -2.52 0.0117
Journal=ME   1.7319 0.5661  3.06 0.0022
Journal=PB  -0.6605 0.5230 -1.26 0.2066
Journal=SB   0.7495 0.4725  1.59 0.1127

 journal coded as a dummy number (1,2,3,4,5 &6):
Coef S.E. Wald Z P


y>=1         2.84154 0.42461  6.69 0.0000
y>=2         0.90291 0.36971  2.44 0.0146
y>=3         0.18469 0.36705  0.50 0.6148
y>=4         0.04250 0.36783  0.12 0.9080
y>=5        -1.01183 0.39524 -2.56 0.0105
JournalCode -0.55736 0.09414 -5.92 0.0000

2:56 PM Heather: oh, leaving the journal straight, do you mean you have 5 different binary variables?

Sarah: no, i don't but i think Design is interpreting it as such
my columns look like:
Journal
ME
GBC
ME
SB
etc


2:57 PM or

 Journal
sorry,
JournalCode
Heather: right. so I'm guessing then maybe Design is interpreting it as a factor already?
Sarah: 1
2
1
3
Heather: try str(yourdataframe)
and see what datatype R thinks the Journal column is?


2:58 PM Sarah: yep

 it's coming through as a factor
Heather: ok, good.
confusing for you, but good.
Sarah: but, that type of output doesn't make a lick of sense to me
Heather: ok.


2:59 PM Sarah: well, i guess it just will give a ton of covariates like you were saying

Heather: right
Sarah: which seems like you would have type1 error problems
Heather: one for each journal
in the list in the results that looks like this
Journal=EC -1.5524 0.6321 -2.46 0.0140


Journal=GCB -1.3413 0.5323 -2.52 0.0117
Journal=ME   1.7319 0.5661  3.06 0.0022
Journal=PB  -0.6605 0.5230 -1.26 0.2066
Journal=SB   0.7495 0.4725  1.59 0.1127

 does that include all of your journals, or is it missing one?


3:00 PM Sarah: ummm...it's missing Amnat

Heather: right
Sarah: I might have just not copied it
Heather: so I think that means it used Amnat as the base
good question... go have a look....
Sarah: oh
Heather: I'm thinking it might not be there
it is kind of like having a state column


3:01 PM well nevermind that analogy was just going to make things worse.

Sarah: it's in the table (there was a chance it didn't have reuse)
Heather: can you see, is it actually missing Amnat?
Sarah: it's in the crosstab
Heather: in the regression results?
but it is in the input data?


3:02 PM Sarah: yep, it's in the raw table and the crosstab

Heather: yup.
Sarah: ResolvableScore


Journal  0  1  2  3  4  5
AN       3  9  4  0  4  0
EC      10  4  2  0  2  0
GCB     13 13  3  0  1  0
ME       0  6  2  2  4  8
PB       9 13  4  0  4  0
SB       4 19  8  2  7 10

Heather: so that means that it is using it as the base case
those results mean,


3:03 PM or rather this line

 Journal=PB -0.6605 0.5230 -1.26 0.2066
means that


3:04 PM "whether the journal was amnat or PB made no difference in how we'd predict the citation-quality score"

 (or whatever it was that regression was regressing on)
whereas
Journal=EC -1.5524 0.6321 -2.46 0.0140
means


3:05 PM "being in the journal EC made a difference to the citation-quality score, compared to being in the journal Amnat, p=0.01"

 and if you want to see how big a difference
Sarah: ok...that makes more sense
Heather: we have to look at the coefficients and decode them
but they would tell us something like


3:06 PM Sarah: gut would say EC (ecology) is worse than Amnat and ME (molecular eco) is better

 so, yeah, EC is neg and ME positive
Heather: "being in the journal EC made a dataset 1.4 times more likely to have a quality score 1 level higher than an equivalent study published in amnat"
or something liek that
oh, whichever, I didn't try to make my guess very realistic :)
Sarah: makes sense for the others based on what i know about the data


3:07 PM Heather: and I'd have to reread the ordered tutorial to make 100% sure I'm getting my summary blurb right

Sarah: so then, can i force which journal (or factor) is the base?
Heather: about "1 level higher" etc because I'm not very used to this ordinal stuff
but that is the general idea
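(An editor's sketch: the "decode the coefficients" step here is just exponentiation. Under a proportional-odds model, exp(coef) is the odds ratio for scoring higher versus the baseline journal. Using the coefficients from the output pasted earlier; the interpretation is approximate, as the chat cautions:)

```python
import math

# Journal coefficients from the lrm output pasted earlier (baseline = AmNat).
journal_coefs = {"EC": -1.5524, "GCB": -1.3413, "ME": 1.7319, "PB": -0.6605, "SB": 0.7495}

# Under proportional odds, exp(coef) is the odds of a higher resolvability
# score relative to the baseline journal, holding other predictors fixed.
odds_ratios = {j: round(math.exp(c), 2) for j, c in journal_coefs.items()}
# e.g. ME has markedly higher odds than AmNat, EC markedly lower
```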
Sarah: i.e. my impression of which is "worst" or "best"
Heather: good question
maybe
I think so
Sarah: or, should I let the stats remove my bias?
Heather: if you do levels(dataframe\$Journal) what does it say?
Sarah: or determine "worst" from the pivot tables
Heather: urm, I think mathematically it doesn't matter


3:08 PM Sarah: just for interp

Heather: so there is an advantage to picking one that is easy to interp
exactly
I wonder how it picks it now?
might be the level with the most N
which would probably be a good call regardless
Sarah: that code didn't work
i'm getting an error "dataframe not found"...oh sorry, i need to insert my data object there
whoops


3:09 PM just a sec

 yeah, so AN is the first, but they are arranged alphabetically, not in order of encounter in the raw table
Heather: interesting.
so I'm guessing it might use levels()[0]
Sarah: i mean, it may be correlated with sample size, but i don't think so
Heather: as the base?


3:10 PM Sarah: i'm not familiar with levels, sorry

Heather: ok
Sarah: so i can't make an intelligent stab at that
Heather: no problem
so a factor is a vector
Sarah: but, i can figure it out to spare you the time, or just interp the way it comes out
it makes a lot more sense now
Heather: well, hrm no
levels is like the "codebook" that is uses to code factors


3:11 PM try ?levels

 and for what it is worth it isn't the most intuitive part of R to me
either
so I'd maybe skip trying to force that for now
let it pick what it wants to pick
and later when/if we decide this is the way we want to go and you see an


3:12 PM opportunity to really improve the interpretation by forcing it, figure it out then...

 anyway, your call, but that's what I'd do.
so right now to interp your results


3:13 PM you'll have to see what is left out of the results output, or check out levels() for each of your categorical variables

 or some combo
does that make sense?
or enough sense?


3:14 PM Sarah: yep... a lot more sense than before

Heather: cool
Sarah: i thought the output with all the journal types listed out was the wrong way
Heather: so for what it is worth
Sarah: because the examples used all binary coding
rather than A, B, C, etc
Heather: your y output variable could be coded as a factor as well
if you wanted to


3:15 PM Sarah: well, i tried to order it in a way that was somewhat "bad to god"

Heather: and then you can do a multinomial regression on that unordered-factor-category y variable,
Sarah: *good
Heather: like in the last tutorial I sent.
right! and mostly that is a great idea
the only reason you could, maybe, treat it like a factor instead
is to get around the "proportional odds" stuff


3:16 PM Sarah: ok...i don't know how that will come out with this new way

Heather: by seeing how it behaves if you just remove all semblance of order.
Sarah: ok.
i'll try this again and maybe give that a shot
Heather: right, I don't know either. And I'm not necessarily really recommending it.....
except maybe.....
kind of like how we always use two-sided p-values


3:17 PM we think we know which way the interaction will go

 and so we could, in theory, use a one-sided p-value
but maybe we are wrong
and we should use stat tests that reflect that
Sarah: hmmm...i'm not following


3:18 PM Heather: let me back up for a minute and ask a question to make sure

 I'm on the right page, because I forget
for your "best practice" levels, does everything that meets the criteria to be in level 3 also meet the criteria to be in level2?


3:19 PM Sarah: no

 ResolvableScore


0=no Depository or Accession or Author (Justification: you know they used data but not exactly how it was obtained = probably couldn't find it again…i.e. "data was obtained from the published literature")
1=Author Only (Justification: you could track down the original paper which might contain an accession or extractable info)
2=Depository or Database Only (Justification: You might be able to look up the same species/taxon and find the information per the criteria in the methods)
3=Accession Only (Justification: Accession number given but depository not specified = you would probably be able to infer which depository it came from based on format, just as I was usually able to tell that they were genbank sequences by the format even though genbank was never mentioned anywhere in the paper)
4=Depository and Author (Justification: Although no accession given, many depositories also have a search option for the author/title of the original article which connects to the data)
5=Depository and Accession (Justification: "Best" resolvability….unique id and known depository = should find exact data that was used)

3:20 PM oh, sorry i copied and pasted before i realized that was the long version

Heather: right, pulled it up too
Sarah: but, could make it so it was
Heather: so, by treating that as an ordered variable,
we are making some assumptions that may not be true


3:21 PM Sarah: yes, like that my ranking is reflective of true difficulty of finding a dataset

Heather: if we think about other ordered variables, people who think something is "very good" also think it is at least "good"
right :)
Sarah: yeah....but, i'm also grappling with the problem here that we have a perception of a good practice, but most of the data doesn't meet that


3:22 PM Heather: I'd say, perhaps we could improve a few things at once

Sarah: i.e. we'd like to see depository and accession mentioned
Sarah: but most just give authors
author
the ordered version i see is:
Heather: into interpreable levels
Sarah: author only
depository and author
Heather: yeah, but I'd even try using other lingo for a minute
Sarah: depository and author and accession


3:23 PM Heather: so "no information, can't find it"

Sarah: but , i have very few of the later
Heather: "could find it with extra research"
"could find it just with info provided in the paper"
or something like that
Sarah: ok...
but i'm talking about those same things just by the criteria i'm defining them


3:24 PM Heather: then you have a codebook to know what criteria you use to apply those labels

 yeah
but there aren't 6 that make sense to talk about
when you stop talking about their criteria, do you know what I mean?
Sarah: "no information, can't find it" = none of the below
"could find it with extra research" = depository or author ONLY


3:25 PM Heather: in some ways, the people reading the paper don't care if a citation includes the author and one of depository or...

Sarah: "could find it just with info provided in the paper" = depository and (author or accession)
Heather: they care... can I FIND it :)
or am I attributed
or whatever
yup
Sarah: sorry, i'll put those in order
"no information, can't find it" = none of the below
"could find it with extra research" = depository or author ONLY
Heather: and I think that will help with the ordered interpretation
Sarah: "could find it just with info provided in the paper" = depository and (author or accession)
Heather: and reducing the number of levels


3:26 PM (which will help with N in cells and maybe proportionalness)

Sarah: and then use percentages to just state the paltry number of papers that give the accession number
Heather: yup
Sarah: rather than holding accession as the holy grail
make it what matters


3:27 PM so "the author is not attributed"

 "the author is indirectly attributed"
"the author is directly attributed"
(and maybe this means you need another endpoint for the depository attribution?)


3:28 PM anyway... I wouldn't spend oodles of time reworking things into this framework

Sarah: i don't think it will be bad
Heather: because maybe it won't be practical
or Todd won't like the direction or whatever.....
but that's what my gut tells me.
Sarah: i still like my original categories for display tables, but you're right about the meaning for stats
maybe that will also keep todd happy


3:29 PM Heather: yes agreed! good point.

 :) And I don't want to put words in Todd's mouth, I don't know what he will think....
Sarah: no, i think we all think accession number (direct data attribution) is the holy grail
but, that's just not a reality in this data
Heather: yeah. so then can you define a midpoint or two between that and nothing


3:30 PM Sarah: one quick question on the attribution scale,

Heather: yup?
Sarah: would accession number (without an author name) be direct or indirect?
i say indirect
but it hurts when we want to show accession as the epitome
Heather: yeah, I'd say that too.
Sarah: of a good data citation


3:31 PM Heather: yeah, but you know what? when you put it that way, accession number isn't actually the epitome

 of everything
Sarah: yeah
Heather: that is what genbank mostly does right?
Sarah: yep, exactly
Heather: and it comes under fire in terms of people not getting direct attribution
Sarah: but it's not standard in the literature by any means
Heather: so if your data reflects that, probably all the better
Sarah: a lot of people say " i searched genbank and used sequences by author a, b, and c"


3:32 PM Heather: do they really? I wouldn't have expected that.

Sarah: ok, so can i run the attribution categories by you real quick?
Heather: I've mostly seen "and used accession number A , B, C"
yes
I think I've got 10 more mins
Sarah: "the author is not attributed" - no author or accession


"the author is indirectly attributed" - accession only
"the author is directly attributed" - author and accession
3:33 PM wait...that excludes author only

"the author is not attributed" - no author or accession


"the author is indirectly attributed" - accession only or author only
"the author is directly attributed" - author and accession

 hm....but "author only"
is direct
Heather: do you need "accession" on directly attributed?
right.
Sarah: "the author is not attributed" - no author or accession


"the author is indirectly attributed" - accession only
"the author is directly attributed" - author and accession or author only

Heather: seems strange, but in terms of attribution per se, accession not needed
Sarah: but, then that's not ordered


3:34 PM directly does not necessarily include the indirect

Heather: good point
thinking


3:35 PM Sarah: one addition: correspondence (i.e. data set was obtained from my buddy so and so)
"the author is not attributed" - no author or accession
"the author is indirectly attributed" - accession only or correspondence only
"the author is directly attributed" - author and accession or author only

Heather: well hrm I'm not quite sure what to think.
Sarah: we could change it to "data directly attributed"


3:36 PM "the data is not attributed" - no author or accession
"the data is indirectly attributed" - accession only or author only
"the data is directly attributed" - author and accession

 or call it "data authorship"
Heather: yeah, that works I think, doesn't it?


3:37 PM Sarah: that's more what we're interested in too....is the DATA being cited?

 still brings in the problem of author attribution as the current mode of tracking data
Heather: yes, exactly
nice
ok, I have to run.
Sarah: ok. thanks soooo much!
Heather: I'm guessing we aren't out of the woods yet
but making progress
Sarah: i'm not used to categorical stats and that helped a bunch
Heather: great


3:38 PM Sarah: ok, sure, will send through email or whatever and usually when you get an email from me, i'm available on chat for the next little while

 thanks!
Heather: ok, good to know. I think I'll be AWOL tonight, but avail tomorrow.
bye!


## Chat Transcript July 23

1:06 PM Heather:

 Have you tried datadist with commas rather than +s
so


7 minutes
1:14 PM Sarah: nope, was running them as pluses per the tutorial. will try commas while you are at lunch
1:15 PM golden. now it works
29 minutes
1:44 PM Heather: sweet!
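The fix discussed above can be sketched as follows — a minimal sketch using Harrell's Design package (now superseded by rms), with the variable names taken from the Wald table later in this chat:

```r
library(Design)  # Frank Harrell's Design package (successor: rms)

# datadist() takes variables separated by commas, not a +-joined formula
dd <- datadist(Journal, YearCode, DepositoryAbbrv, BroaderDatatypes)
options(datadist = "dd")

# the +s belong in the model formula itself
ologit4 <- lrm(ResolvableScoreRevised ~ Journal + YearCode +
               DepositoryAbbrv + BroaderDatatypes)
```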

 I'm back and ready for chatting whenever you are.
Sarah: i'm good


1:45 PM i just reran the data altogether and was attempting interpretation

Heather: great
chat now then?
and the fact that we are looking at lots of them


1:46 PM so I think it definitely does mean that we need to use a threshold lower than 0.05 for each particular coefficient

Sarah: ok, so account for type 1 by lowering the alpha level
Heather: yup
Sarah: and just be straightforward about that in the methods
Heather: not quite sure the best practice
yes
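The per-coefficient threshold idea is a Bonferroni-style correction; a minimal sketch (the count of 4 factors comes from the models discussed in this chat):

```r
# with 4 factors tested at once, lower the per-factor threshold
# so the family-wise error rate stays near 0.05
n_factors <- 4
alpha_per_factor <- 0.05 / n_factors  # 0.0125
```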


1:47 PM I don't think that putting them in individually vs via factors makes a difference

 (unless they are doing something really smart for factors in the regression code, which they might be)


1:48 PM hey this reminds me...

 try anova(ologit4)
does it do that in the tutorial?
I think that it collapses all of the factors into overall p-values....
Sarah: it might have at the end, but i was getting lost after some of the diagnostics and coefficient problems yesterday


1:49 PM Heather: I learned that in the Harrell (sp?) book doing my thesis

 relevant I think
so let's keep it in mind as an alternate view of results
Sarah: here's what i got:
    Wald Statistics          Response: ResolvableScoreRevised

    Factor             Chi-Square  d.f.  P
    Journal              8.18       5    0.1464
    YearCode             7.03       1    0.0080
    DepositoryAbbrv     32.43       2    <.0001
    BroaderDatatypes     6.29       5    0.2788
    TOTAL               57.27      13    <.0001

Heather: right
Sarah: and that's what i was seeing
kind of like a chi sq for each factor


1:50 PM Heather: whereas just print(ologit4) breaks it down

 yup
so I think a way that people interpret this is that they do the anova() first
to determine if the factor has an effect
at some p=0.05/number of variables or something


1:51 PM and then for factors with an effect, they look at the individual coefficients and interpret them

 I mean the individual levels within the significant factors
Sarah: makes sense
Heather: cool
regardless, it is all best reported in the spirit of hypothesis-generating


1:52 PM exploratory, etc

Sarah: when i ran things individually, datatypes and journal were significant factors, but all together not so much
Heather: interesting potentials for interpretation there
Sarah: mostly, anything genetic related was coming out as significantly different
Heather: yup, makes sense


Sarah: but then this makes it look like the depository is the most influential factor in determining that
Heather: yeah
Sarah: ok, shoot
Heather: so one idea about depository
is to combine G and T, and keep E O and U all separate
reason is that G and T are the same "kind"
centralized, best practice, etc


1:54 PM the others are all different kinds

 what do you think?
(whereas, foreshadowing to typeofdataset, I think I'd argue that GS and PT stay as their own individual types)
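The G+T lumping proposed here could be sketched like this, assuming a data frame `reuse` with the depository codes stored as a factor (both names hypothetical):

```r
# merge GenBank (G) and TreeBASE (T) into one "centralized, best-practice"
# level; keep E, O (other), and U ("not indicated") as their own levels
dep <- as.character(reuse$DepositoryAbbrv)
dep[dep %in% c("G", "T")] <- "G+T"
reuse$DepositoryGrouped <- factor(dep)
```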
Sarah: yeah
agreed,


1:55 PM when i wrote that email this morning i thought, well that was obvious but hadn't thought of that before

Heather: I think G+T, E, O, U all have the N to stand alone
doesn't hurt that G and T have similar distributions in ResolvableScoreRevised either, phew
Sarah: so....keep other and "not indicated" separate as well?
Heather: yeah, I would I think
they are different
and have enough N, it looks like


1:56 PM Sarah: agreed, b/c not indicated should have less resolve/attribution

Heather: right.
and "other" is interesting in and of itself


1:57 PM so with that in mind, let's think about typeofdataset

 if you are ready
Sarah: ok
Heather: so I think that GS should stand alone
mostly due to our (perhaps unstated) hypothesis
that genbank is its own universe
or rather that data-that-can-go-into-genbank is its own universe


1:58 PM ditto data-that-can-go-into-treebase

Sarah: yeah, but what about instead of using journal to define discipline, the datatype defines that?
because otherwise the factor is redundant of depository
i guess maybe not since depository now lumps genbank and treebase


1:59 PM Heather: hmmm. so maybe let's hold off thinking about discipline for now

Sarah: but gs could count for both
yeah, i'm making more of a mess by analyzing it too much
Heather: because I think that is fuzzier than dataset
Sarah: so....maybe: data that should go to a, b, and c
so define the data by its destination?
Heather: ok, you've lost me a bit
so this variable
Sarah: then bio and eco should both go to dryad
oh, sorry
Heather: TypeOfDataset


2:00 PM Sarah: yep

Heather: has 10 different values
and one question is whether to keep all 10 distinct in the stats analysis, or how to group them
to that question, for what it is worth, to me the answer is a bit different than any of your proposals
it would be something along these lines:


2:01 PM GS, GA+GO, GIS+XY, PT, EA, Bio+PA(+Eco probably)

 I can also see the value of keeping all 10 levels
Sarah: but, ga should be deposited in treebase
Heather: and mostly just shy away from that due to desire to keep number of vars down
Sarah: so, with pt
Heather: oh! my mistake. PT also goes in treebase is that right?
yeah, with pt then


2:02 PM Sarah: pt w/ an associated ga

Heather: definitely informing/biasing our groupings by our hypotheses, but I think that is ok
yup. (you know better than I do about the details here....)
Sarah: yeah, thats why i thought to maybe lump all genetic info
Heather: what do you think?
yeah, but does all genetic info go into genbank?


2:03 PM if so, then yes lump

Sarah: no
Heather: yeah, then I'd keep the genbankable stuff distinct
Sarah: the problem is too, that all spatial data (gis, xy) don't have a happy place to go together
nor any depository at all
Heather: yeah
Sarah: so, the main problem is, are we lumping by supposed depository or by type of data
Heather: then maybe ideally they would stand alone
Sarah: or use of the data?


2:04 PM Heather: the goal would be to lump by type of data

Sarah: i.e. ga and pt are needed for a reanalysis of a tree,
but you could also use gs for that
which many do
Heather: I think by type rather than use
Sarah: so....type of data
i would do...
raw gene: GS, GO
processed gene: GA PT


2:05 PM Spatial: GIS, xy

 earth
species: bio, eco, paleo
Heather: looks good to me, though I'll push back and ask
whether we keep GS and GO separate
admittedly GO doesn't have much N
Sarah: yes
Heather: but it would be really helpful to have a clear picture of just GS
Sarah: it's the older stuff


2:06 PM like blots and arrays

 pre- fast sequencing
Heather: yeah.
 for my information, would that include microarrays?
stuff that would go into GEO, Arrayexpress?
Sarah: hmmm...i don't think that's what i saw
i never saw those databases
Heather: ok. just curious
I think there is some in this general domain... I saw some at the Evo conference


2:07 PM but not much, and maybe not in the specific journals you looked at

 cool, thanks, helps me sync it up with the bit of the field that I know.
ok, so want to recap proposed groupings?
Sarah: i think some people are moving that way to look at gene expression rather than just raw genes
Heather: for TypeofDataset
yeah... though the data is pretty messy.
relative, analog, etc.


2:08 PM I think it is starting to fade out again in some areas. anyway.

Sarah: ok...one other question...


2:09 PM gis is more "earth" spatial data and xy is more "species" (i.e. occurrences)

 and they would be found/posted in totally different places
Heather: yeah
Sarah: mostly i know that from my past bio work
Heather: so ideally maybe we'd keep them separate
but their N is just really small


2:10 PM it makes me think in that case if we just go up a level of abstraction

Sarah: but, is it better to lump them into earth and species or lump them together as spatial
Heather: oh I see. yeah, so I have no idea.
I could see arguments either way


2:11 PM what do you think?

Sarah: i think that i would like them posted together b/c of my experience in that field, but most people that use that data use either GIS or GIS+XY, usually not XY alone


2:12 PM so, GIS probably is distinct

Heather: ok.
so what would that make your overall groupings look like?
Sarah: but, on the other hand, most people usually cite xy and gis in this discipline
since it's more for biology purposes
hmmm...
still undecided


2:13 PM raw gene: GS, GO
processed gene: GA PT
Spatial: GIS, xy
earth
species: bio, eco, paleo

 sorry
i was editing that and accidentally released it
raw gene: GS, GO


processed gene: GA PT
earth: GIS, Ea
species: bio, eco, paleo, xy

 b/c in these studies, xy was only given for species occurences


2:14 PM though in other arenas it could potentially be given for other info

Heather: hrm wait
Sarah: there maybe some gis instances that were non-earth, but most were climate
Heather: I think I'd keep GS separate, right?
Sarah: oh sorry,
i was overly concerned with the earth/spatial problem
Heather: yup
Sarah: yeah, i guess, but is GO too small alone?


2:15 PM Heather: yeah, it kind of is, but I think it is better than the alternative of lumping it in with GS

 so a rule of thumb here could be
Sarah: i'm not 100% sold b/c it's raw info that could be used in the same way as gs


2:16 PM Heather: I know but we're focusing on datatype not use

 I think a rule of thumb could be keep them all separate
unless there is a strong reason to combine them?
so maybe obvious reason for Bio+PA ?
or GA+PT?


2:17 PM Sarah: bio - both organismic

Heather: or ?
ga-pt: both a step above gs
both go to treebase
Heather: yeah though in theory dryad takes everything that doesn't have a home somewhere else, so it kind of means misc :)
Sarah: gs and go are the "same" in that sense, but go doesn't have a depository home


2:18 PM unless you count pdb which i have one instance of

 paleo could also go to paleodb
Heather: ok
?
I think if we have to spend degrees of freedom,


2:19 PM spending them on typeofdataset is a good place

 because we really do think it is a relevant variable
Sarah: ok. we could also look at them by "typical depository" and "discipline/data type"
so, where they should go and what they are


2:20 PM Heather: ok. if that sort of distinction is clear to you, that might be a useful thing to try.

 I'm rather out of my depth here
my main contribution is that I think there should be some variable
that uniquely identifies "this dataset could have gone to genbank" or treebase


2:21 PM other than that, I defer to your experience

Sarah: but, only problem is that right now we're talking about reuse, not sharing
so is it important where the data should have come from?
so, i agree for sharing but think it is different for reuse
Heather: sorry, I was assuming you'd use the same framework for sharing too


2:22 PM Sarah: i think they might need to be tweaked for each

 b/c sharing and reusing are different things
Heather: would all GS have come from genbank, or might some have come from elsewhere?
clearly not all PT necessarily comes from treebase, does it. could be communication etc.
Sarah: no, some gs come from your buddy or your previous study
pt can be "extracted literature"
Heather: right. ok, gotcha.
Sarah: because you can reconstruct just the terminal nodes


2:23 PM Heather: sorry for being slow in thinking about it that way.

Sarah: i think if you share, your data SHOULD go to a specific place
Heather: yeah, hrm.
Sarah: but if you're reusing, you might have different sources
Heather: right
Sarah: that are more fruitful than a blast search
i.e. the paper where the guy complained about treebase
Heather: yeah, so you know what? In that case I see your logic in keeping Genbank and Treebase separate in the Depositories category


2:24 PM Sarah: and instances where people use unpublished data

Heather: I thought about Sharing for so long, phd worth, I often default that way instead of reuse. retraining.


2:25 PM Sarah: well, i think depository could be y/n in reuse b/c in theory depositories should have recommendations about how to reuse their data

 unfortunately of which dryad is the only example
oh, but in that same vein, we could not lump them to "see" if they have different policies (or unspoken traditions)
Heather: yeah. hrm. well, take all of my comments and apply them when you think about how to group these things for sharing :)
Sarah: sounds good
i agree with most of what you said in the context of sharing

Heather: and for reuse, I'm not sure. I could see arguments either way.
I guess it depends on our hypotheses, eh?
Sarah: i'm planning on coding/scoring sharing today anyways and can give you the side by side comparison once it's done
so, reuse


2:27 PM i think datatype is more discipline (genera? type?) driven than depository

Heather: there is an advantage to keeping the sharing and reuse analyses as parallel as we can
Sarah: how so?
if they have different questions, then...different scoring right?
Heather: do you think that reuse citation behaviour is more discipline driven than depository or datatype?
Sarah: sorry, by discipline i meant datatype
just trying to call it type


2:28 PM rather than where it "should" have come from

Heather: we want to code the things that we think drive reuse citation behaviour, right?
right, agreed, where it "should" have come from is kind of moot
where it did come from is relevant
the journal is relevant


2:29 PM Sarah: but where it came from is taken care of by depository

so we should try to make datatype its own variable
Heather: right. sorry, I was taking a step back to go forward.
I think maybe I want to punt on this conversation and leave it to you to propose something :)
Sarah: ok, i need to think it over again


2:30 PM right now, i'm more inclined to stick with datatype (i.e. raw gene, processed gene, earth, etc)

 do data "genre"
Heather: you raised lots of good Qs in your email and want to make sure we get to those too
Sarah: ok
Heather: so the idea of multiple datasets in one paper
and how to handle that


2:31 PM is complicated, and it matters. so....

 one thing that would help is if we defined what each of your datapoints actually means, to make sure we are consistent.
Sarah: ideally, looking back, i would have enumerated every dataset in the paper and kept them all separate


2:32 PM but i was thinking on a article level, not dataset

Heather: yeah. that's ok, that happens, learning.
so what is a clean solution now.
Sarah: i only dealt with it when things were obviously different
Heather: where things were "how they cited/shared" them, right?


2:33 PM when the endpoint was obviously different?

Sarah: yes or the datatype
Heather: ok, so you kept all datatypes for each paper distinct?


2:34 PM Sarah: yes

Heather: in that case, could you imagine one column for each datatype in each paper?
agreed, it means the rows aren't independent
Sarah: yes, i was approaching it that way previously
Heather: but I don't think that will be a huge effect, and we can call it out as a limitation
Sarah: but, became tricky in my mind for analysis
Heather: because of independence, or otherwise?


2:35 PM Sarah: the factor i invented, but haven't used, to deal with it was "multiple datatypes", so I counted if the article has multiple types

no, I didn't know how to deal with so many y/n columns
i.e. so one article could just use gs, another would use gs and ga
etc


2:36 PM it wasn't making sense to me

 but maybe you know more, is that a better way to run it?
Heather: ok, I'm a little confused.
Sarah: also, there are resolvability and attribution differences
between datasets in the same paper
Heather: yeah, right, that is its own disaster


2:37 PM so I think one way to deal with that

 is to code the "minimum" attribution practice used for any given dataset
Sarah: so, that means if i kept it an article level, i would have to have "GS' y/n and "resolvable gs" y/n and "attributed gs" y/n for each datatype
you mean any article?
Heather: so if they did it 3 different ways, capture it (for the stats analysis) as the worst one. (for some definition of worse)


2:38 PM Sarah: but then, can we still look at if "is gs best at resolving"?

Heather: yeah, maybe. I at least mean it for dataset
right, so I think we don't want to do it for article. but I think we do want to do it for dataset.
Sarah: b/c the low score, say for earth


2:39 PM Heather: you know what? rather than minimum, it should be maximum.

Sarah: ok, i thought you were saying how to do it
Heather: what is the best way they attributed this thing?
Sarah: with article
so, you're talking about an article with two datatypes that are the same but cited differently
right?
Sarah: sorry,
Heather: that I think you mentioned


2:40 PM Sarah: datasets that are the same datatype but cited differently

Heather: an article with a dataset that it attributes two different ways
Sarah: well, so let's say two gs
to make it more clear...b/c each gs is a single dataset
Heather: ok, now one step up maybe it has two gs datasets and it attributes one fully and another one poorly


2:41 PM a question is what to do in that case?

 right?
Sarah: yep
Heather: right.
one is self reuse and the other is someone else's dataset
so the self reuse is almost guaranteed to be sloppier (i.e. no accession)
Heather: yeah.
hrm


2:42 PM Sarah: that's the main reason i separated them

Heather: well what do you think?
we can
Sarah: and then i have 5-10 that are the situation you proposed
Heather: we can't really have these be two different datapoints only when they do things differently
Sarah: so i think if we stipulate self and datatype separations, those are nothing to sweat over
agreed, those 5-10 that are separated for differences are what worry me
but its also only 5-10


2:43 PM datapoints

 should i give you a specific example of one?
Heather: yeah. so in that case I think I'd do something drastic and nonideal
no it is ok I think I understand
I think in those cases I'd make sure they are each only one row in the analysis


2:44 PM Sarah: and then give a max or min score?

Heather: and pick either the best or the worst or the first or a random or something sample
yeah. some standard approach for what you chose.
call it out in the methods.
Sarah: since it's only a few, do we consider just getting rid of them altogether?
Heather: where "sample" I mean score... sample of the practices they used
no, don't do that


2:45 PM that would bias things.

 we want them in, we just don't know exactly how :)
Sarah: ok, so let me make sure i'm clear
Heather: I think code their worst practice in each dataset type. what do you think?
Sarah: self and datatype differences are ok
to split
into two


2:46 PM but differences in citation practice only

 should be kept together
and given a lump score
Heather: actually I think you need to keep self together too
Sarah: of some sort
how come?


2:47 PM Heather: I think a row needs to be "one row for every datatype in each of these papers"

 you can call out the self behaviour in tables and discussion,
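The one-row-per-datatype-per-paper collapse could be sketched with base R's `aggregate()`, keeping the best (maximum) attribution score when a paper cites the same datatype more than one way (data frame and column names hypothetical):

```r
# collapse duplicate paper x datatype rows, keeping the maximum
# attribution score observed for that datatype in that paper
one_per_type <- aggregate(AttributionScore ~ PaperID + TypeOfDataset,
                          data = reuse, FUN = max)
```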


2:48 PM Sarah: but, what about a self reuse + an external reuse where the author says "i used my data from a previous study (Author year) and some other sequences (with no reference to author, depository, etc)"

Heather: but in the stats it would put extra weight on the practices of authors who use their own work, for example....
Sarah: or vice versa where they give the acession for others but not themselves


2:49 PM i guess partly i was also wondering if we would/should eliminate self reuses like you've been doing

Heather: I think you should treat that example the same way as
"I used my buddy's data X and more data from all the people over there"
in the stats


2:50 PM yeah, ok good question. that would solve this problem, wouldn't it.

 at least removing them from the stats analysis
another thought is that then they could be run separately, in a different analysis, one just on "self citations"


2:51 PM right now I'd go with that as a straw man. removing them from this analysis, defining this analysis as reuses of other people's data.
2:52 PM probably really reduces the N, eh?

Sarah: yeah
one idea is coding self reuses as a factor, but coding them 0, .5, and 1
so no self, some self, and all self


2:53 PM Heather: yeah, hrm. you know what? I think we mostly care about how people attribute other people's data.

ok, well, maybe we leave this question there too. further, but not fully resolved.
try a few things, esp leaving them out, and we can talk about it again.
I've got to run in 30mins and make sure we cover everything at least a little, if that is ok?
Sarah: ok. one comment though, i think resolvability is lower with self, but attribution is higher


2:54 PM so i wanted to show that, but maybe that's more of a discussion tidbit

 yep,
Heather: ok. so I think those are good points to know and make. so yes, let's show them somehow, I agree
Sarah: we can move on
sorry for taking up so much time!
Heather: nah it is fine!


2:55 PM so one thing I wanted to perhaps tease out of resolvability is the idea of where the attribution is made

 maybe this is a third type of score?
Sarah: most of it's in the methods
Heather: 0 = no citation


1 = a citation somewhere
2 = findable through full text search of paper (so includes biblio but not suppl tables)
3 = findable in biblio only (so could be found in scopus, isi web of science)
(what about captions only?)

Sarah: i don't think it's all that interesting
Heather: yeah, so I think it could be
Sarah: i didn't consider biblio only
Heather: the reason is that if it is in the methods, for people to find it they have to be able to search full text


2:56 PM hard in this day and age

Sarah: i did consider captions only...but do those come up in full text or not....i've been meaning to ask that for awhile
Heather: biblio Sooooooo much easier, and similar to article citations
and suppl info is the hardest of all
I think captions do come up in full text searches, but suppl info doesn't


2:57 PM did you capture info if citation was in the biblio?

 (by biblio only I didn't mean that info was only in the biblio... rather that
Sarah: yeah, that's why i was considering a supplemental caption to be "needs research" for resolvability
Heather: by looking in just the biblio you could find a relevant citation.. as with articles)
Sarah: i did consider y/n biblio


2:58 PM so - reference somewhere, findable in full text, findable in biblio also

Heather: yes
that would be very useful
Sarah: for?
Heather: for making points about how hard these things are to find given our current bibliometrics tools
Sarah: ok
Heather: people can use scopus or ISI to trace citation histories for articles now


2:59 PM Sarah: i don't have any data attributions in the biblio save a few urls

Heather: and it would be sweet if they could use those or similar tools to trace dataset citation histories
but we're really far away from that right now
yeah. exactly.
I think that data would be useful for Dryad, for example.
Sarah: i didn't see a single doi or accession in the biblio
so i was thinking i would just state that, not analyze it
Heather: they are proposing that dois should go into biblios


3:00 PM yeah. so maybe drop level 3 if it really never happens :)

 but making the distinction between full text and suppl info is also helpful
because full text at least you have some hope of finding it with google scholar.
suppl info, give up.


3:01 PM Sarah: i thought my current resolvability ratings did that

 i.e.
not resolvable = no author, accession, or depository anywhere
Heather: right, so I think maybe I was imagining that resolvability wouldn't be WHERE was the attribution, but rather what was in the attribution
Sarah: resolvable with extra research = author, accession, or depository only OR just mentioned in supplementary


3:02 PM Heather: sorry, I dove into the last bit first

Sarah: and fully resolvable =
Heather: so if you had the paper, had the suppl info, could you pinpoint the dataset
Sarah: author, accession and depository found in the full text (including tables)
Heather: how resolvable is it if you have everything.
Sarah: no,
Heather: the new scale is more along the lines of how "discoverable is the reuse citation" or something
Sarah: i'm assuming supplementary doesn't always include everything


3:03 PM partly b/c i didn't always look at it

 unless it wasn't clear in the full text
so, it's resolvable with work because i would have to decide if it's worth it to track down the supplemental information
and risk that it might be an output table, not the raw sequences
or whatever


3:04 PM i don't understand how discoverable is that different

Heather: hrm, so I'm proposing that we remove the "where" from the resolvability scale
Sarah: or how attribution and resolvablity haven't covered it
what's the rationale for keeping it separate?
one thing i can think of
Heather: here's how I think about them:


3:05 PM Sarah: is that

 some appendices have a separate lit cited
that i'm not sure isi tracks
but, this was only a few incidents
so more anecdotal in my mind
Heather: nope, ISI doesn't track it there


3:06 PM so I'm imagining these scales are driven by our intended uses and goals of citations

 one is that people get credit
so the attribution scale is measuring to what degree authors get direct credit for their shared data


3:07 PM another goal is that people can find the dataset that you said you used

 the ease with which they can do that would be tracked by the resolvability scale
a third goal is that people can monitor how datasets are reused over time, across a field
for that they need to be able to find instances of dataset reuse citations


3:08 PM and for that they need to be in places that are easily mined

 so the new scale, "discoverability" or something, would measure how possible that is
does that help? make sense?
Sarah: but, the problem is that i don't think i have any incidents of that, so discoverability is essentially the same as attribution


3:09 PM Heather: so pretty much everything was in the methods?

 like 95%+ would you say?
Sarah: yes
Heather: and not ever in the biblio?
Sarah: but not necessarily in the biblio
well, the author was in the biblio but not the accession
so that's why it's the same as attribution


3:10 PM and the author wasn't always in the biblio

 i.e. they said "we went to genbank and got a bunch of data"
Heather: so sometimes did they say the author name in the text but didn't cite the author's paper?
Sarah: no
unless
correspondence


3:11 PM they would say "data obtained from so and so"

 and then just mention them in acknowledgment
sometimes
but this is like 5 cases
Heather: yes. so that would be attributed but not discoverable
yeah, I hear you.
and how many times was the attribution just in suppl materials?
Sarah: 5-10
not a lot


3:12 PM but enough to make me upset

 i mean, b/c i'm looking for attribution and wanting to see it
Heather: yeah. hrm. well, you are right then, may not be enough to do trends on. though too bad. future work :)
I would though take thoughts of "where" out of resolvability I think


3:13 PM to me that talks about how hard it is to pinpoint the dataset, given that you have the reusing paper and its suppl info

Sarah: so, if i can find it in supplemental, it's still ok?
or a "2"
if it has all the info
Heather: I think so. I know you didn't mine suppl data specifically


3:14 PM Sarah: ok, do you have other things we should discuss, or should i ask an interpretation question regarding stats?

Heather: though if I understand you correctly, that distinction where you can find it only in suppl doesn't happen very often
Sarah: probably another 5-10


3:15 PM Heather: ok, and how is that different than what we were talking about before?

 in those 5-10 the accession is in the suppl but the author name is in the body?


3:16 PM Sarah: usually, or they just have a blanket statement "we got a bunch of data, see appendix"
3:17 PM Heather: yeah. so that would still count as resolvable. though for what it is worth, it would rank really low on my nice-to-have discoverability scale.

 ok, yes, shoot on your questions
Sarah: ok. so i posed this briefly earlier
mostly the discrepancies between the factors run separately vs. all together


3:18 PM Heather: yeah. so I think we need to read up more on what factors run all together is actually doing.

Sarah: mostly, years weren't significantly different before and now they are
and datatype and journal were significant for molecular ecology and genes
but now they aren't and only genbank is significant as a depository


3:19 PM Heather: I think the Hmisc/Design libraries do some smart things with factors behind the scenes, but I don't know what

Sarah: also, in general, the factors still have one state used as the base and not displayed
Heather: so I'm not surprised that it changes the results
yes. that makes sense, right?
Sarah: i think
so they are still compared to each other
not to everything


3:20 PM back to multi vs. single factor

 we're striving for multi factor in the long run right?
just want to verify
Heather: when you mean multifactor, what do you mean?
Sarah: i mean all the factors run together
Heather: multivariate? lots of different covariates all at once?
where some of those covariates are binary, some are continuous, some are factors?
Sarah: and do the separate models (factor by factor) have any meaning?


3:21 PM all are factors

 but that might change
Heather: right
Sarah: if depository becomes binary
Heather: if you look at the one factor at a time, so just have y ~ onefactor rather than y~ onefactor + twofactor
you are doing univariate analysis


3:22 PM and it is a good idea to do that

 people often do it first
for y~ onefactor then y~twofactor
and look at each
then they put them together y~ onefactor + twofactor
Sarah: which is what i did by default
Heather: and often things that were significant in univariate analysis
all of a sudden aren't significant any more
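(A minimal R sketch of the univariate-then-multivariate progression Heather describes, assuming a data frame `d` with hypothetical columns `reused`, `journal`, and `datatype`:)

```r
# Univariate: one candidate factor at a time
m1 <- glm(reused ~ journal,  data = d, family = binomial)
m2 <- glm(reused ~ datatype, data = d, family = binomial)
summary(m1); summary(m2)

# Multivariate: the factors that looked interesting, together
m12 <- glm(reused ~ journal + datatype, data = d, family = binomial)
summary(m12)  # significance can shift once correlated covariates share signal
```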


3:23 PM Sarah: yep
i expected that
Heather: right
Sarah: so, do the univariate results still have meaning?
Heather: yes.
Sarah: i'm just not used to dealing with so many covariates
Heather: though they need to be interpreted in the proper context


3:24 PM so, for example, let's think about Nic's data

Sarah: my past experience has been with data that is best to look at in one way or the other but not both
sorry to interrupt, proceed about nic's
Heather: lots of studies including his have found that high impact journals tend to have strong data sharing policies


3:25 PM let's say he has three variables, policy strength, impact factor, subdiscipline

 it is also well known that impact factor and subdiscipline are correlated
so in univariate analysis, policy strength ~ IF is significant
policy strength ~ subdiscipline is significant


3:26 PM but maybe in multivariate analysis policy ~ IF + subd only IF is sig and suddenly subd isn't any more

 because really all the signal that was in subd is captured in IF
Sarah: ok
so, that's what were seeing here


3:27 PM univariate =

Heather: yes, probably
Sarah: depository: sig, journal: sig, datatype: sig
but in multivariate
just depository = sig
so in my overall discussion of the results,


3:28 PM i could say "depository influences resolvability significantly..." and then go on to say "ME is sig when considering journals along and GS is sig when considering datatypes alone"

 that was super rough and probably doesn't make that much sense


3:29 PM Heather: hrm, yes, I'm not 100% sure

Sarah: trying to give the multivariate as the overarching, and the univariate as specific nuances
Heather: yes. or, the univariate to inform what variables you put in your multivariate
in some cases


3:30 PM if there is no signal in univariate, usually won't be a signal in multivariate

Sarah: i'm planning on using the remainder of my day to do some writeup, so maybe that will help to see if i've got my ducks in a row so to speak with interpretation
well, that's what's weird about year
Heather: whereas signal in univariate often goes away in multivariate because of confounding/collinearity/etc
Sarah: no signal in univariate
but signal in multivariate
Heather: yeah. cool. yeah, that is weird. it can happen though
Sarah: but, i can see that from a model building perspective
Heather: one tool for situations like that is to stratify


3:31 PM and do some tables or graphs just with year one way then year the other way

Sarah: that if i'm looking at all my data and want the best predictive linear model, some factors are better when considering all the factors
Heather: and see if things go in the same direction
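(A minimal sketch of the stratification idea Heather suggests, assuming hypothetical columns `year`, `reused`, and `depository` in a data frame `d`:)

```r
# Cross-tabulate the outcome against a factor within each year stratum
for (yr in sort(unique(d$year))) {
  cat("Year:", yr, "\n")
  print(table(d$reused[d$year == yr], d$depository[d$year == yr]))
}
# If the direction of the association flips between strata,
# year may be confounding the univariate result
```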
Sarah: what do you mean?
or you can point me to an example later if you're short on time
Heather: yes, will send you an example


3:32 PM Sarah: ok,

 anything else we should hash out now?
i'm planning on sending you either a draft and/or revised scoring tonight
and then whichever of the two i don't get to today on monday
Heather: great. send to todd too just to keep him in the loop
Sarah: i'll send the draft to everyone
Heather: ok, good, will continue on monday
great


3:33 PM Sarah: haven't gotten any feedback from todd...is that expected?

Heather: yeah
Sarah: should i be more direct?
Heather: I think he's reading and will jump in if/when he feels necessary
but otherwise will lurk
Sarah: and, what do you think of his comment on pooling data
Sarah: he said to keep the 2000/2010 for all journals and the 2005-2009 for sys/amnat separate
they're not all that different when i run them pooled


3:34 PM but, i'm not recalling specifics

Heather: really? so I interpreted his comments the other way.
I think he said: yes, go ahead and concatenate (note that pool actually means something else)
Sarah: "But the population of study for the two sets sounds sufficiently different (ie the identity of journals, as opposed to the random/sequential distinction) that a combined analysis would be difficult to interpret. I'm not sure you would not want to use them to ask the same questions."
Heather: and add a variable to each row that said what the data colleciton method was
Sarah: yeah, that was at the beginning to tease out an effect


3:35 PM Heather: I have to run, but I disagree with his difficult to interpret comment

 to me he said that there is no methodological reason not to, no great big red flag to peer-reviewers


3:36 PM Sarah: ok, i'll ponder

Heather: so I think you should do it
Sarah: for now i'm just running 2000/2010
Heather: because I think it is easy to interpret
Sarah: and checking for discrepancies
how so?
Heather: you just have to make "year" be a real year or number of years ago or something
Sarah: or shoot me an email about it later b/c i don't feel resolved
i'll ponder on that
Heather: ok! bye for now will talk more later
Sarah: and let you know how different the data is
ok. thanks again
Heather: asynchronously, and with a weekend break ;)


## Chat Transcript July 27 (Morning)

Heather: Hi Sarah

Sarah: hi
Heather: Now a good time to chat?
Sarah: yep
Heather: How's it going?
Sarah: good thanks
and yourself?
Heather: Anything in particular you'd like to brainstorm about?
Sarah: the analysis is going good too
Heather: Good thanks. Lots of stats with Nic....


9:56 AM Sarah: just want your opinion on the various categories to make sure i'm justifying them right in my mind

Heather: ok
looking at it
Sarah: then i'll have a stats "party" today and will probably have a few more questions when trying to interpret
Heather: which in particular do you want to think through?
great
Sarah: um...many factors vs. few


9:57 AM i.e. where we spend degrees of freedom

 and where we don't
Heather: (I love the sound of a stats party, I gotta say)
right
Sarah: also, i need help identifying which categorizations answer what specific question
Heather: so what is your idea about combining datasets at this point, so that we know how many datapoints we are working with?
ok
Sarah: things that i can talk myself through but make more sense while brainstorming with someone else
Heather: yes, I hear you.


9:58 AM Someone I worked with would ask other people to "be his cat"

 = the real value isn't actually in what they say, just that they are listening and forcing him to say it aloud
Sarah: yeah, usually my husband is, but we mostly talk about his work in the evenings these days
anyways,
Heather: exactly. ok
Sarah: so, i think dividing the "data types" into the "genres" makes more sense because


9:59 AM then we're looking at distinct information from depository

 that gets rid of a directly confounding variable and allows us to ask the data things from a different angle
rather than being repetitive
Heather: ok, so let me make sure I understand
you are thinking of having a genre covariate


10:00 AM and also a depository-defined covariate?

Sarah: well,
depository would be the other variable no matter....that's the one in the 2nd set of columns
Heather: ok, gotcha
Sarah: but, i'm saying define data type as "genre" and define depository as depository


10:01 AM Heather: ok, sounds reasonable.

 and you have about 5 genre classifications, is that right?
Sarah: yes
Heather: ok
Sarah: more driven by the inherent data type
i.e. analysis unit


10:02 AM what we call "scales"

 in biology
Heather: ok
Sarah: meaning, levels of biological variation
the other alternative is the data-defined discipline
Heather: and do you think these genres might also have some commonalities
in terms of culture, research culture?


10:03 AM Sarah: hmm?

 referring to discipline?
Heather: while admittedly being still very diverse within each genre
Sarah: or referring to "genre"?


10:04 AM oh, does genre have commonalities to what?

Heather: I was referring to genre, asking if that might also
represent a classification of research culture. so, do you think that people who do organismic studies


10:05 AM might have similar thoughts/experiences about data sharing, compared to people who do gene-type studies

 obviously there is a lot of diversity within, I'm just trying to dig out what our interpretation of genre would be
Sarah: yes


10:06 AM and they would be using and transmitting/sharing them for relatively similar purposes

Heather: ok, good
Sarah: phylo is the only weird one with that
it almost should be lumped with gene
Heather: ok


10:07 AM Sarah: that's why i thought maybe "data-defined discipline" might be good

 they could be lumped b/c most phylo trees are the result of genetic analysis
but, on the flip side, genes could be used for many other analyses and phylo cannot
Heather: right.


10:08 AM hmm.

 well there aren't many PT, right?
Sarah: not reuse, but shared
that's the other issue
Heather: so there is a downside to having them all by their lonesome
oh, true
Sarah: if we're trying to keep the categories the same for reuse and shared, i would think phylo separate


10:09 AM Heather: ok.

Sarah: partly b/c people are good about posting genes but not phylo
Heather: so Sarah I think that mostly I don't have much good insight into these groupings, and it is probably easy to overthink them.
Sarah: partly b/c they are a different "genre" of data that is reused in a different way
Heather: I'd base it on what you want to be able to say.
and then just pick something and run with it.


10:10 AM because I don't know there is a "best" answer, and/or I don't know how we'll know.

Sarah: ok.
Heather: one cut-to-the-chase approach is to drive to degrees of freedom and datapoints
how many degrees of freedom does your straw-man analysis have?
how many datapoints are you looking at?


10:11 AM Sarah: both perspectives interest me (genre and data-defined discipline) in terms of the question, but they are about equal in pros and cons...genre is slightly more relevant

 um....270 data points
5-8 factors
Heather: that is for reuse?
Sarah: yes
just reuse
Heather: for combining the snapshot and timeseries, or not?
Sarah: and 2-5 character states for each factor
combined
Heather: ok.


10:12 AM Sarah: so 25 degrees of freedom low end, 40 high end

Heather: so a rule of thumb in regression is to have 30 datapoints per degree of freedom
obviously rules of thumb are just that
and it totally depends on the size of the effect you are studying, etc
Sarah: ok, that's a good reference point though
Heather: but it does help to ground it
exactly
at least 10 datapoints per df, at least.
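(Plugging in the numbers from the conversation, the back-of-the-envelope check looks like this:)

```r
n  <- 270                     # reuse datapoints (combined datasets)
df <- c(low = 25, high = 40)  # estimated model degrees of freedom
n / df                        # datapoints per df: 10.8 (low end), 6.75 (high end)
# the low end just clears the "at least 10 per df" floor;
# the high end does not, hence the advice to stay away from it
```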


10:13 AM Sarah: b/c i'm also still deciding if we should look at funder, journal discipline, etc

 i haven't felt as driven to investigate those
Heather: so I think that means you need to stay away from your high end
Sarah: and they might need to get the cut for more important questions
Heather: I'd say you don't have enough journals to get at journal discipline
the funder effect is probably too weak
Sarah: agreed
Heather: so I'd let those go for now, in terms of stats


10:14 AM useful to have extracted! but not for main analysis

Sarah: that's what I was feeling, we just had talked so much about it that i felt somewhat obligated to at least run it and see
but, at the same time didn't think my data would support anything conclusive
Heather: yeah. well, you can, but not in main (or first) run. nice-to-haves exploratory analysis
yeah
so I think given that one always thinks of more things to spend df on later
(interactions!?!? arg)


10:15 AM I'd really drive to sticking with rough-grained levels

Sarah: so 3 factor states over 5-8
because then we could delve in deeper later
Heather: right
Sarah: or discuss the details in percents/tables/figures
Heather: right


10:16 AM Sarah: so, if i went with "data defined discipline" (i.e. classifying data types by our overarching disciplines), and Evolution was significant, then I could dive in with a univariate analysis or table of the Genre classifications 10:17 AM to say what's making that difference

Heather: yes, true, as exploratory analysis. post-hoc analysis, etc.
Sarah: ok
Heather: yes, I think that is your best bet. Otherwise it is tempting to be overly ambitious, and 270 datapoints (while a lot in some ways!) isn't a lot in other ways.


10:18 AM Sarah: yep agreed...and more fun in some ways for analysis

Heather: good.
Sarah: I like the big picture + fill in details where it matters approach
Heather: about how many sharing datapoints? more I would guess?
great
Sarah: yes
for sure
i haven't parsed it all out yet
Heather: ok, good.
Sarah: that's on the ticket for today


10:19 AM Heather: do you feel ok with a plan for how to deal with self citations in reuse?

Sarah: since I've figured it out for reuse, it should be a cinch
yes
i haven't run the stats to test for artifacts yet, but that's on the to-do today as well
Heather: good


10:20 AM Nic has been experimenting with posting R code and results on his OWW pages

 I think you've seen it?
Heather: no, that's ok, alas I understand
just wanted to point you there because he's done lots of the learning
Sarah: at the end of the day, i always intend to
ok


10:21 AM Heather: so might as well learn from him about how to link/embed gists, post figures, etc.

Sarah: ok
Heather: yeah, it definitely is a habit that is hard to start
Sarah: it seems like his code also codes the data into binary/factors, etc, whereas i've been doing most of that in excel where i'm more comfortable manipulating the data
but I'll try and post some prelim results and figures today


10:22 AM Heather: even if your (my) heart is in the right place... just takes some doing when it isn't in your natural workflow.

 yes, I think either way works there.
I pointed him at how to do it with R because wanted to give some easy concrete examples about how R works.
Also can help decrease typos and bugs in some ways


10:23 AM but advantages to teasing it out in Excel too. Makes your dataset a bit more transparent and stand-alone.

 ok, can I be a cat in other ways?
Sarah: yeah...on the depository


10:24 AM again, the question is how much to collapse

 i think for our purposes, just depository Y/N answers the major question of whether depositories are being used in the ways we anticipate
so, "are depositories the better way for sharing and reusing data?"


10:25 AM Heather: yes. though definitely an advantage to keep genbank data separable, because I think we'd also like insight into whether genbank is a special case 10:26 AM Sarah: but could that be another secondary univariate analysis or table/figure?

 especially with reuse, there are only like 4 treebase reuses and the rest of the depository ones are genbank
the story is a little different in sharing


10:27 AM Heather: yeah. in sharing I'd err on calling it out explicitly in the main analysis

 though I could probably be talked out of it
so that would be my inclination. do with it what you will.


10:28 AM Sarah: ok, i'll ponder more, but it's good to know your opinion

 regardless, i should keep "not indicated" as a separate category?
right?
b/c not indicated is a lot different than retrieval or sharing from a url
Heather: hmmmm I think yes. because it isn't yes, and it isn't no.


10:29 AM Sarah: yeah...and typically it's associated with other "sloppy" practices

Heather: yeah. and it is interesting in and of itself.
Sarah: ok.
i think that's all i have for right now
Heather: ok!
Sarah: this has been helpful for me to rationalize things "out loud"


10:30 AM Heather: in the call today I think I'll just ask each of you to summarize where you are and what you see as your next steps

 and if there is anything you need from other interns/mentors
anything else worth covering?
Sarah: ok...i'll try and do another run through my analysis so i can voice questions/results better.
is today a call or chat?


10:31 AM Heather: I'd ideally like to do some "end of internship" conversations, but I'm afraid I don't have much insight into how it works from here on out.

 we're going to try a chat with everyone. wish us luck
Sarah: ok. i'm totally open to post-internship chats
Heather: ok cool
Sarah: especially if we're essentially passing the datasets off to you
Heather: ok, talk to you later Sarah!
Sarah: ok. thanks again.


10:32 AM Heather: yeah, right! and I don't know that you are? I don't know. ???.

Sarah: well, regardless, i'm not planning on just dropping this the official day the internship ends
ok, talk to you later.


10:33 AM Sarah: ok. talk to you at noon.

## Chat Transcript July 27 (Afternoon)

1:15 PM Sarah: i'm good whenever you are

Heather: Me too.
Ok, so topic to cover
are the univariate/multivariate approach
anything else?
Sarah: yep.


1:16 PM so, i ran univariate on most of the factors and states somewhat by accident when i was trying to figure out ordinal regression

Heather: ok :)
Sarah: then we talked about how things that were significant there didn't turn up sig in the multivariate


1:17 PM Heather: right, true

Sarah: and how the multivar was better for distinguishing things like journal vs. depository vs. data type which overlap in some ways (i.e. Molecular Ecology = genbank = gene sequence)
Heather: right


1:18 PM so I think there may be a middle ground

 that is worth considering
so our hypotheses (and prelim observations) about your data suggest that there are a few vars that are indeed important
year
if it is genbank
are those the main ones?


1:19 PM if it is sysbio?

Sarah: yeah, so
year
journal
depository
datatype
Heather: well, actually before you run with that
and flesh it out to the whole columns with all of their possible variants


1:20 PM a different approach is to focus primarily on the univariate analysis

 but make it very selectively multivariate
by including year and is.genbank and is.sysbio or something
Sarah: hmm?
that's the r coding for it all, right
Heather: ok. so the idea would be


1:21 PM for each of your factors

Sarah: so, asking about a specific question
Heather: do a univariate analysis with all of the gory level details of that factor and your dependent variable
which I understand you did
Sarah: i.e. sysbio vs. genbank vs. year
Heather: I think so. I mean have one variable for the factor


1:22 PM plus another binary variable for is it genbank? another for is it sysbio? and a third for year

Sarah: but why is that better than year vs. journal vs. depository?
Heather: so attempting to keep it "univariate" but bowing to the fact that you have some dominant trends (at least we think you do)
I'm not sure what you mean by vs in that sentence?
Sarah: ok
i meant +, not vs
sorry


1:23 PM Heather: ok. yes plus

 right
Sarah: major typing problems
Heather: so journal uses up extra degrees of freedom, relative to just "is it sysbio"


1:24 PM So I'm not necessarily recommending this, but an approach would be

 look at each factor in turn
with all of its gory level detail
and just enough multivariate stuff to admit that there were major changes by year that would probably overwhelm more delicate trends
(and maybe genbank/sysbio/a few other things)


1:25 PM and then see how all of those gory levels show up in univariate analysis
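(A sketch of the "selectively multivariate" idea: binary indicators for the suspected dominant levels, alongside one detailed factor at a time. Column names and level labels here are hypothetical:)

```r
# Binary indicators for the suspected dominant trends
d$is.genbank <- as.numeric(d$depository == "genbank")
d$is.sysbio  <- as.numeric(d$journal == "Systematic Biology")

# One detailed factor, plus just enough covariates to absorb the big effects
m <- glm(reused ~ datatype + is.genbank + is.sysbio + year,
         data = d, family = binomial)
summary(m)
```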

Sarah: ok. i still see that as a supplementary analysis to the bigger multivar
Heather: and just pull the interesting things into the multivariate.
Sarah: i.e. picking out trends i've noticed or that the multivar allude to
and then going into more detail
Heather: yeah, so I think the univariate->multivariate treats the multivariate as the exploratory or supplementary analysis
Sarah: well, i guess that's what i mean


1:26 PM multivar to explore and establish big picture trends

Heather: yeah. so for what it is worth, I think the univariate-> multivariate path is more common
Sarah: then univar (more detailed factor states and specific trends like is.genbank + is.sysbio)
oh, i thought you told me the reverse in our previous discussion
Heather: I did.
Sarah: both ways make sense


1:27 PM Heather: well just a sec

 let me make sure I don't make it more confusing
what is most normal is to do univariate first
and if any thing just isn't interesting in univariate, then don't look at it any more
if things are interesting in univariate, then pull them together in a multivariate


1:28 PM Sarah: i see what you're saying

Heather: that is the normal approach
Sarah: from what i remember, though, datatype and journal and depository were interesting in univar
Heather: apologies that I didn't make that clear in the first place
Sarah: that's why i continued to include them
but, year wasn't
but it ended up being sig in multivar
so should I not include year then?
Heather: I sort of ran with the multivariate, and it no doubt would have been useful for me to step back and ask more questions about univariate first.
Sarah: in multivar i mean


1:29 PM Heather: well, one question I have

Sarah: b/c we think it's probably a meaningful factor
Heather: is about whether what we know from what you've done before
was done with the combined datasets or just snapshot or just timeseries
Sarah: just snapshot
i think


1:30 PM Heather: I think before we make decisions on what we've seen before

 it will make sense to post the code and results and stuff
(in part to help me keep it straight)
Sarah: yeah sure
Heather: and in part to help make it clear what data we used to run what
Sarah: apologies again
Heather: no no apologies
Sarah: i'll structure it with annotation too


1:31 PM Heather: if you only knew how many analyses I've run, meaning to post the code and not getting to it.....

Sarah: which has been my main hold up
i mean, it's messy right now but close to being intelligible
Heather: I think everything you've done has been really useful
and will be useful again
Sarah: yeah, but i'd like a clean version that actually demonstrates the analysis
not just me playing around
Heather: but probably a good chance to go through systematically


1:32 PM Sarah: exactly

Heather: right, yeah exactly
I think we should do that before making decisions on what we've seen before
so part of the systematically
is probably to look at everything in a univariate way before a multivariate way


1:33 PM Sarah: yeah...this is what i'm picturing

Heather: and for that you could either do strictly univariate
Sarah: 1. univariate with the more detailed factor states
Heather: or decide that you'll slip in a few dominating variables
Sarah: 2. multivariate with the relevant factors and their more general/broad factor states


1:34 PM 3. interaction between reuse and sharing

Heather: yup sounds good
Sarah: 4. univariate of dominant suspected trends
(i.e. sysbio + genbank + 2010)
or whatever
Heather: ok
one comment on 2
Sarah: i.e. the ones we think are driving the trends
Heather: 2.
Sarah: ok
Heather: I think what Todd was suggesting is that you use the results of 1.


1:35 PM to inform whether you use general/broad trends

 or specific trends
for each factor individually
Sarah: ok
Heather: so if there is a factor that seems to correlate strongly with the output in univariate analysis
then throw all the levels into the multivariate analysis!
Sarah: so maybe 1 = exploratory
1a =
Heather: that suggests it is a good way to spend degrees of freedom


1:36 PM Sarah: 1b = univariate of each individually with specific factors

 then 2 =
Heather: for other factors that look more boring, just include them broadly (or not at all)
Sarah: multivar with relevant factors
by boring, do you mean things like open access, funder, etc
that we weren't planning to explore initially
Heather: by boring in that case I meant


1:37 PM its levels didn't correlate with the outcome very strongly at all

Sarah: hmm?
its = ?


1:38 PM Heather: ok, so for each factor you have

 (including OA and funder too, if you want)
do a detailed univariate analysis of detailed levels


1:39 PM if the output variable (reuse patterns etc) correlates with the detailed levels in an interesting way

 then plan to a) include that factor in multivariate analysis
and b) keep a fine resolution of its levels
if the factor didn't correlate with the output variable at all, perhaps don't plan to include it in the multivariate analysis at all


1:40 PM Sarah: but, if it doesn't correlate at the specific level, should i also run it at the general level?

 or is that just fishing/mining?
Heather: if the factor did correlate with the output level, but only as a general trend or only with very broad resolution, then maybe a)
include it in the multivariate analysis, but b) collapse its levels to something more broad than you looked at in the univariate analysis


1:41 PM if it doesn't correlate at the specific level at all,

 or even much,
in univariate analysis
then looking at it in a general way won't make it correlate either
now if there is a trend
in the specific levels, but it isn't significant


1:42 PM then combining multiple subtrends may make it become significant...

 but yeah I wouldn't do that on purpose, for that goal


1:43 PM except maybe as a sidenote, saying "looked at separately x and y were weakly related to the outcome. 1:44 PM in post-hoc analysis, combining them into one category did demonstrate that the general concept had a statistically significant association"

 or something like that
am I just confusing you more?
Sarah: no, i think i'm following
i spent the morning recoding my data, so i'll dive back into analysis with all this in mind


1:45 PM Heather: it is probably easiest to talk this through in concrete terms

 ok
Sarah: should i also consider keeping factors and factor states (specific vs. broad)
the same for reuse and sharing?
Heather: sorry again, I feel like I led you down a multivariate-first path that was not the best choice.
or not obviously the best choice, anyway
Sarah: i.e. if something is sig for reuse but not sharing
Heather: and in stats it is always worth doing the obvious first unless there is a reason not to :)


1:46 PM Sarah: should it still be included for both

Heather: yeah, good question.
Sarah: to keep things analogous
Heather: so at this point I'd say let the data tell you.
keep the fine grain resolution the same for both (as it is)
and include that resolution in the univariate analysis
and then see what it shows


1:47 PM Sarah: ok.....but in terms of keeping things analogous, does that supersede factor/resolution decisions?

 or not?


1:48 PM Heather: good question. I think it depends. I think that one might be better to answer in a concrete situation than in theory. 1:49 PM Sarah: ok. well, i'll run things today and post the code later.

 i like to keep the outputs in the code as annotations
does that bother you?
Heather: ok.
Heather: what do you mean as annotations?
oh I see
Sarah: with an "#" in front
Sarah: yes
Heather: no, doesn't bother me.


1:50 PM if you put the code as a gist I think it will even colourcode them for you differently

 might make it easier to read
Sarah: ok, i prefer that so i don't have to rerun things but it makes reading through the R file a little different
Heather: when embeded on OWW
Sarah: yeah, i don't know how to make things pretty in R as well as i should
yep
do you/todd/etc want an email besides the RSS?
Heather: that's ok. I'm not a R expert either.


1:51 PM No, the RSS is fine

Sarah: ok.
well, i think i'm good to go
thanks again for the long conversations and help
Heather: No problem! I wish I always knew the best answers :)
Ok, talk to you later!


1:52 PM Sarah: thanks. bye!

## Chat Transcript July 29 Morning

12:50 PM Heather: Hi Sarah 12:51 PM had a quick look at your code and I have one suggestion at this point....

 Have you thought about making your year variable be a number rather than a factor?


12:55 PM Sarah: yep 12:56 PM that occurred to me as i ran the code

 i coded it that way at some point, just need to dig it out of the excel
Heather: yup
I think it can be easy to get out of R too


12:57 PM Sarah: i'm not as versatile in R

 do you know a simple code off the top of your head to convert a factor to a number?


12:58 PM Heather: I think as.numeric 12:59 PM but I'm not sure that's actually what you want in this case

 since you don't have examples of all the years
Sarah: yeah
Heather: the factor numbers don't quite line up with the years
Sarah: but i could change the numbers on some


1:00 PM i know R had a problem with the big 2000s for some reason when i first ran it

 so i was just using the last number (0-10)


1:00 PM Heather: maybe this:

 2010 - as.numeric(substr(YearCode, 2, 5))


1:01 PM Heather: btw in case you are worried that the year is actually nonlinear

 (and modeling it as an integer like this assumes it is linear)
one of the nice functions in the Design library is rcs()


1:02 PM you use it like this:

 lrm(ResolvableScoreRevised_Max ~ Journal + rcs(YearNum, 3)) 1:03 PM and it calculates a coefficient for each of 3 subintervals, allowing nonlinearity

 the nice thing is in the anova() they all collapse down again


1:03 PM Heather: anyway, not sure you want to go there, just wanted to let you know it exists


1:05 PM btw I like your qlogis function stuff, I hadn't seen that before. Makes for useful summary plots.
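(The "factor numbers don't quite line up with the years" issue is a standard R gotcha: `as.numeric()` on a factor returns the internal level codes, not the labels. A minimal sketch of the usual fix, with made-up years:)

```r
f <- factor(c(2000, 2005, 2010))
as.numeric(f)                # 1 2 3  -- the internal level codes, not the years
as.numeric(as.character(f))  # 2000 2005 2010  -- the actual year values
```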

Sarah: it was all in that tutorial you sent
Heather: cool.


1:06 PM I think I'll hold off on digging further into your code now, since it would be kind of an undirected dig

Sarah: ok. and I'm currently compiling a concatenated excel table on the univariates that makes it easier to read anyways
Heather: I'll wait till you give it a bit of structure, probably by pulling it into a (ROUGH) pub framework
ok
Sarah: i'll post that with my updated code later today


1:07 PM yeah...i'm working towards the writeup

Heather: another idea on that is just to save all of the univariate plots as pngs and embed them in an OWW page
Sarah: still floating a bit though with the unresolved analysis
Heather: maybe on your diary page for today or something
Sarah: i've got them saved as jpg, so definitely could
Heather: that way they are all there in a row?
Sarah: and i can track nics code for how he did it
yes to the previous question
Heather: yeah, I'd play with embedding your jpgs on an OWW page. that would help me anyway :)
Sarah: so a table for pvalues, one for coefficient, etc


1:08 PM then can see everything side by side

Heather: yeah, something like that. I also really like confidence intervals
Sarah: around the coefficients?
Heather: (and have some custom code for adding them to summary plots, but it doesn't seem to easily work on the fancy qlogit function version)


1:09 PM Sarah: suggested code?

Heather: yeah, around the coefficients
so the easy stuff that I've been doing with Nic has been using glm, not in the Design library


1:10 PM Sarah: i need to do some glm for my binomial stuff, so i can look in nic's code for that

Heather: confint(ologit4)
Sarah: ok great
Heather: or for Nic's stuff exp(confint(ologit4))
but confint doesn't seem to do the same magic with Design as glm


1:11 PM though there must be a way

Sarah: and if you could send me your custom code for plotting it, I could work on adapting it to Design
Heather: I don't have it at my fingertips
Sarah: ok. that's a starting point though
Heather: yeah, you know what? Let's call that out of scope because it is really ugly code to work with
and I don't want you falling down that rathole
Design is great, but its R-code is NOT well commented!
(or variable names well chosen, or....)
Sarah: whatever, but i can make a note of it in my code at least


1:12 PM Heather: that said, yup,

 hold on
here it is
http://gist.github.com/485585


1:13 PM I basically just typed

 plot.summary.formula.response
then call it like this
s = summary(y ~ x)
when would normally say
plot(s)
now I say
plot.summary.formula.response.CIs(s)

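The call sequence above, spelled out as a sketch using Hmisc's summary.formula (the function Design/Hmisc summary plots are built on); `y` and `x` are placeholders for an outcome and predictor:

```r
library(Hmisc)

# summary() on a formula dispatches to summary.formula; the default
# method = "response" summarizes y within levels of x.
s <- summary(y ~ x, method = "response", data = dat)
plot(s)  # the standard dot-chart of the summaries

# Heather's custom variant (from http://gist.github.com/485585) adds CIs:
# plot.summary.formula.response.CIs(s)
```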

1:14 PM Sarah: ok. i'll at least give it a try and keep the note with all the info in the code 1:15 PM Heather: yeah. though like I said, don't spend time on it. I'll probably dig into it sometime in the next month or so to get it to work, because the ability to superimpose plots of

 y>=1 and y>=2 is sweet, and I think I'll want it in the future :)


1:16 PM Sarah: ok. sounds good.

Heather: great. if you run into problems with the gist or embedding, ask me or nic.


1:17 PM talk more later today or tomorrow, then

Sarah: sounds good. thanks
Heather: yup, no prob. bye!