DataONE:Notebook/Summer 2010/2010/07/26

DataONE summer internships 2010

Chat, Heather and Nic
From: Nicholas Weber
Date: 2010/7/26
Subject: Chat with Nicholas Weber
To: hpiwowar@gmail.com

9:53 AM Heather: Hi Nic, how's it going?
10:12 AM Nicholas: Hi
10:13 AM Nicholas: It's ok. I tried what you suggested by making the publishers into categories, but none of them had a p-value < .05
10:21 AM Heather: Hi Nic. No problem; at this point all p-values should be taken with a grain of salt anyway, because it is not clear that we have all the relevant variables in our equation. Do you have some more time now, or later?
10:22 AM Heather: thinking we can go over how to compute predicted probabilities, and then what variables should be in your equation :) What do you think? Do you have other questions first, or? (oh, probably lunch time for you, eh? tell me when will be a good time)
10:23 AM Nicholas: hi, sorry, no, I was just in R trying to figure something out
10:24 AM Nicholas: now is good, or whenever. I'll turn my volume on so I hear it if you ping me
Heather: ok, now is good for me too
Nicholas: one quick question
Heather: yes
10:26 AM Nicholas: when I was trying to create a variable out of is.Ecology I kept getting NA's instead of 0's
Heather: ok, what is the code you are trying?
Nicholas: I was doing:
is.Ecology = 0
is.Ecology[sub("*Ecology*", ISI.Category)] = 1
nope, sorry, grep instead of sub
10:27 AM Nicholas: is.Ecology[grep("*Ecology*", ISI.Category)] = 1
and I set is.Ecology to 0 initially
10:28 AM Heather: ok, what is your exact code, what are you doing?
Nicholas: is.Ecology = 0
is.Ecology[grep("*Ecology*", ISI.Category)] = 1
10:29 AM Nicholas: but then when I try is.Ecology I get NA instead of 0
Heather: yes, ok. So I think this part is a bit tricky. The problem is that when you say is.Ecology = 0
Nicholas: so when I try to do mylogit = glm(requests ~ Impact.Factor + is.Ecology, family=binomial(link="logit"), na.action=na.pass) I get an error
Heather: it doesn't know if you want that to just be a number, or a "vector", or what
10:30 AM Nicholas: oh
Heather: when you say mydataframe$is.Ecology = 0, then it knows that you want every row in your data frame to have a 0 in it. But when you use attach and then add a new variable, R doesn't know if you want that to be a "data frame" sort of variable, or just a single number 0, or what, so it defaults to a single number 0, I think
10:31 AM Heather: there are a few fixes
Nicholas: shoot, I think I remember reading that in the intro too. ok, sorry
Heather: is.Ecology[!grep("*Ecology*", ISI.Category)] = 0 ... nope, sorry, I meant to type that into R first to try it out, and it doesn't seem to work
10:32 AM Heather: ummm, hmmm, there are lots of fixes, I'm just trying to think of the nicest. How about is.Ecology = rep(0, length(ISI.Category))
10:33 AM Heather: yeah, then your line is.Ecology[grep("*Ecology*", ISI.Category)] = 1. Does that seem to work?
Nicholas: yup, that worked
Heather: rep means repeat
10:34 AM Heather: so repeat the number 0 many times, the same number of times as there are items in ISI.Category. Make sense?
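The fix from this exchange can be condensed into a small sketch. The ISI.Category values below are invented examples, not data from the notebook, and note that a plain "Ecology" pattern is all grep() needs, since R patterns are regular expressions rather than the glob-style "*Ecology*" typed in the chat:

```r
# Invented example data standing in for the real ISI.Category column
ISI.Category <- c("Ecology", "Evolutionary Biology",
                  "Ecology; Environmental Sciences", "Geology")

# Heather's fix: rep() pre-allocates a vector of 0s of the right length,
# so R treats is.Ecology as one value per journal rather than a lone 0
is.Ecology <- rep(0, length(ISI.Category))
is.Ecology[grep("Ecology", ISI.Category)] <- 1

# grepl() returns a logical vector directly, which condenses the same idea
is.Ecology2 <- as.numeric(grepl("Ecology", ISI.Category))
```

Assigning into a data frame column (mydataframe$is.Ecology), as Heather mentions, sidesteps the length problem entirely because the column is recycled to the number of rows.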
Nicholas: yeah
Heather: cool. ok, probabilities, then talking variables. In the tutorial,
10:35 AM Heather: we'll skip the section about deviance residuals and log likelihood for now, and jump to the red part called "using predicted probabilities"
Nicholas: ok
10:36 AM Heather: try these three lines:
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
newdata1
Heather: do you get a table with some probability-looking things?
Nicholas: yes
10:37 AM Heather: ok, so what that means is that I gave our "model", our regression equation, the thing stored in mylogit, some example datapoints. I gave it datapoints where noOA is 1 and Impact.Factor is several possibilities, and it used the coefficients it estimated and plugged in those values
10:38 AM Heather: and computed the probability that a datapoint like that would have a sharing policy. So for a journal that is not OA and has an IF of 1, the probability is just 8%
10:39 AM Heather: but for a journal that is not OA and has an IF of 15, the probability is 60%. Now try these three lines:
newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
newdata1
Heather: this is the same, but noOA = 0... so in other words, it is a journal that has some OA
10:40 AM Heather: for a journal with some OA and an IF of 1, it has a 14% probability of a data sharing policy, and with an IF of 15 it has a 72% probability. Make sense? It is just a different way to interpret what the slopes/coefficients are actually telling us :)
10:41 AM Nicholas: ok
Heather: questions?
10:42 AM Nicholas: I think I added a column so I'm seeing something different, one sec
Heather: oh I see, yeah that would be confusing
10:43 AM Heather: here is the original that I'm using:
mylogit = glm(requests~Impact.Factor + noOA, family=binomial(link="logit"), na.action=na.pass)
then
10:44 AM Heather:
newdata1 = data.frame(noOA = c(1), Impact.Factor=c(1, 5, 10, 15, 20, 25))
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
newdata1
Heather: then do you get the first line ending with 0.08497077?
Nicholas: yup, ok, got it now
Heather: cool
10:45 AM Heather: ok, so for no OA, IF=1, prob of plan = 8%; no OA, IF=15, prob of plan = 60%. Make more sense now?
Nicholas: yup
10:46 AM Nicholas: as IF goes up, probability of a data sharing plan increases if no OA
Heather: right. Now what about if some OA? Then we would try it again with the noOA variable set to 0, like this:
newdata1 = data.frame(noOA = c(0), Impact.Factor=c(1, 5, 10, 15, 20, 25))
10:47 AM Heather:
newdata1$requestsP = predict(mylogit, newdata=newdata1, type="response")
newdata1
Heather: so as IF goes up, probability of a data sharing plan increases if some OA also, and the overall level is higher, right?
Nicholas: yes, the increase is more dramatic
10:48 AM Heather: yes. Looked at a different way, you could also be tempted to say, based on this data: at an IF of 15, prob of a plan if no OA is 60%, and prob of a plan if some OA is 72%
10:49 AM Heather: but you want to keep in mind, when you say that, that the noOA variable is pretty unclear in our model. Its p-value was higher than 0.05, its confidence intervals were wide, etc
10:50 AM Heather: so while there is a coefficient calculated, and the table of probabilities uses that coefficient, we want to be careful about saying things about trends in the noOA variable, whereas we feel quite confident about saying things about trends in IF, because its p-value was significant. See how the p-value and confidence interval interpretations need to be at the back of our mind as we play with the numbers? Confusing, or ok?
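The predict() workflow discussed above can be reproduced end to end on simulated data. The real dataset isn't reproduced in this notebook entry, so the variable values and coefficients below are invented stand-ins, and the resulting probabilities will not match the 8%/60% figures from the chat:

```r
set.seed(1)
n <- 307                                     # same size as Nic's dataset
# Simulated stand-ins for the notebook's variables
Impact.Factor <- rexp(n, rate = 0.3)         # skewed, like real impact factors
noOA <- rbinom(n, 1, 0.5)                    # 1 = journal publishes no OA content
requests <- rbinom(n, 1, plogis(-2 + 0.2 * Impact.Factor - 0.3 * noOA))

# Fit the same model form as in the chat
mylogit <- glm(requests ~ Impact.Factor + noOA,
               family = binomial(link = "logit"))

# Ask the fitted model for predicted probabilities at example datapoints:
# journals with no OA content, across a range of impact factors
newdata1 <- data.frame(noOA = 1, Impact.Factor = c(1, 5, 10, 15, 20, 25))
newdata1$requestsP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
```

Setting noOA = 0 in the same data.frame() call reproduces the second table from the chat, for journals with some OA content.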
10:51 AM Nicholas: no, I understand, but... this might be a naive question: why would we continue to play with the numbers if we know our p-value and conf int aren't sig?
Heather: yeah, it is a good question
10:52 AM Heather: some people would advocate for dropping noOA out of our model when we realize it isn't significant, rerunning our statistics, and calculating our probabilities based on the streamlined model
10:53 AM Heather: that said, the 5% threshold is in some ways arbitrary, and so others say we should keep the nonsignificant things in anyway, because we don't actually know enough to take them out. So... it depends.
10:54 AM Heather: mostly, whether you keep them in or take them out, you need to be very hedgy whenever you say anything about them. Does that help?
Nicholas: yes
10:55 AM Nicholas: so when you present stats like that, if you choose to, how do you present them? By present I mean write about them
Heather: yup, that's one of the things I love about the tutorial....
10:56 AM Heather: "These findings can also be interpreted using predicted probabilities. With all other variables held constant at their mean, the probability of admission for a gpa of 2.0 was .15, while a gpa of 3.0 resulted in a .26 probability of admission and a gpa of 4.0 was associated with a .40 probability of admission. Likewise, for gre scores of 400, 500, 600 and 700, the probabilities of admission were .22, .26, .31 and .37, respectively, while holding other predictors constant at their mean."
Heather: to put that in our language:
10:57 AM Heather: "For journals with no OA content, for IF values of 1, 5, 10, ..., the probabilities of having a data sharing plan were x, w, sd, respectively. Likewise, for journals with OA content, ....."
10:58 AM Heather: and in the first sentence I would say:
10:59 AM Heather: "A logit regression was used to predict whether a journal had a data sharing plan from the journal's impact factor and whether or not it publishes any articles open access."
Heather: "IF was a significant predictor of having a data sharing plan, but publishing OA content was not statistically significantly related to having a data sharing plan." Does that help?
11:00 AM Nicholas: yes, that's great.
Heather: cool. yeah, I get how having it very applied to your own case can help a lot :) ok, so one more caveat I want to emphasize before we start planning analyses
11:01 AM Heather: it can be tempting to really run with this stat analysis stuff, but important to remember its limitations. Lots of limitations, but the one I'll emphasize right now is that it is correlation, not causation. So it isn't necessarily true that having a high IF CAUSES the journal to have a data sharing plan
11:02 AM Nicholas: right
Heather: it could be that having a data sharing plan CAUSES it to have a high IF (ok, unlikely, but still), or having a high IF and having a data sharing plan are both correlated with being published out of England, or something, where they are good publishers and they love data sharing or something
11:03 AM Heather: anyway, just want to reiterate that so that you can practice writing things up accordingly
Nicholas: ok
Heather: it can be tempting to say "increases", but often more appropriate to say "is associated with increased" or something
11:04 AM Heather: disputes can be had about which is better scientific writing, and don't second-guess yourself much as you are writing this up. Just want to highlight the issue. ok?
Nicholas: that's very helpful
Heather: ok
11:05 AM Heather: now. What variables do you think you want to look at? We'll think about what variables, and how they will be coded. One detail is that you can't include oodles of variables, because you only have so much data. You have 307 datapoints, right?
Nicholas: right
11:06 AM Heather: so opinions differ, but a rule of thumb is to have about 30 datapoints for every coefficient you are trying to estimate. So that would be about 10 coefficients, and depending on your variables, that is probably fewer than 10 variables.
11:07 AM Heather: if, for example, you had a publisher column with publisher A, B, C, and Other, that is actually 3 coefficients (4 minus one for the base case). So you can see you can spend the 10 coefficients really fast
Nicholas: yes
Heather: so don't get too hung up on the 10 right now, but it does mean we need to focus on what is important
11:08 AM Heather: and probably leave most of the "nice to haves" out. Should I give you a bit more background about why you can't have 42 variables in the equation, or is that enough info on that for now?
11:09 AM Nicholas: um, I think we can probably move on. I don't know the why, but it makes sense that you can't calculate that many variables from a limited amount of datapoints
Heather: yeah, good. It doesn't make sense to go into the why on everything today, that is for sure :)
11:10 AM Heather: ok. So what would be the journal variables at the top of your list? And/or what ones do you think are not important?
Nicholas: ok, so as far as significant variables, IF and Requested/Required are definitely important
Heather: where "not important" is "not important to test"
Nicholas: I think the categories, Ecology, Env Sci and EvoBio, probably should be as well
11:11 AM Nicholas: but I don't know... we've already seen that the subscription model probably isn't sig.
Heather: Nope! We haven't seen that yet :) We just played around with it in some test stats, but those were done for learning and not for interpreting yet
11:12 AM Heather: so don't read anything into them please :)
Nicholas: the "has instructions how to cite data" has a really small number of observations
Heather: it could be that when you include subscription model in an equation that also has the ISI categories, the subscription model is actually relevant, for example.....
Nicholas: Affiliation would be interesting to have as well
11:13 AM Heather: where Affiliation would be "does it have an affiliation?"
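The coefficient budget described above is simple arithmetic, sketched here with the numbers from the chat:

```r
n_datapoints <- 307   # journals in Nic's dataset
per_coef <- 30        # rule of thumb: roughly 30 datapoints per coefficient
n_datapoints %/% per_coef   # about 10 coefficients to "spend"

# a categorical publisher variable with 4 levels (A, B, C, Other) costs
# 3 of those coefficients: number of levels minus one for the base case
n_levels <- 4
n_levels - 1
```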
Nicholas: yes, a society affiliation
11:14 AM Heather: ok, so "is it affiliated with a society". btw, do you get what I was saying about subscription model?
11:15 AM Heather: given that we don't know anything about it yet, do you think it is a must-have in the initial analysis?
Nicholas: yes. Just because it wasn't sig for what we just ran doesn't mean that it's not significant in all cases
Heather: right... but even more....
Nicholas: I do
11:16 AM Heather: we were just running things unofficially, as learning
Nicholas: right, we can't discard it just because we know it to not be sig right now
Heather: at the risk of belabouring the point, I'm going to say.....
11:17 AM Heather: "because we know it to not be sig right now"... nope... we DON'T know it to be not sig right now. We were playing with some data, but we weren't being careful yet, and so our stats inputs and methods probably weren't all proper yet, and so we don't actually know anything yet
11:18 AM Nicholas: right... yes, I think subscription model is important and it should be included
Heather: ok, cool.
Nicholas: could we divide the publishing group into Elsevier, Wiley, Springer, Taylor vs everyone else? Or would that not be sound?
11:19 AM Heather: I think that would be ok. Anything else?
11:20 AM Nicholas: I don't think so
11:21 AM Heather: yup, seems like a good list to me
11:22 AM Heather: so the variable we are regressing on, our dependent variable, is "does the journal have a data sharing policy that requests or requires data sharing", aka requests (or some other name?), and the indep variables are: IF, subscription model, categories, publisher, society affiliation. Did I get them all?
11:23 AM Nicholas: it's probably not significant to consider "instructions on how to cite data" because there are so few instances?
Heather: yes, and it isn't exactly clear how it relates to the purpose of this analysis. Well, that is too strong
Nicholas: oh, yes, that too :)
11:24 AM Heather: it isn't quite related to the purpose of this analysis, I think. Ideally we would do another, very similar analysis where, instead of predicting whether the journal had a policy to request data sharing, we look at whether they have a policy about how to cite data. Alas, we may not have enough +ve cases to have much of a story
11:25 AM Heather: another "ideally we would do" is to unlump requests from requires in the data sharing policy. I think let's put that one on the backburner for today, but keep it in mind. Write it down as a todo for tomorrow or something
Nicholas: ok, and how to do so
Heather: does that all make sense?
11:26 AM Heather: yes, so mostly I don't think I've shown you the tools yet to do requests vs requires well
Nicholas: ok, yes, that makes sense
Heather: it is a step more complicated than logistic regression, so let's hold off till we nail this one :)
11:27 AM Nicholas: good plan
Heather: btw, you do realize that you are currently doing Stats 301 or something, right? I mean, if you ever feel lost, it is with good reason. You are holding on well, Nic. Keep asking if/when/as it gets confusing
11:28 AM Nicholas: oh thanks, I don't know about that, but thanks... I've already lined up a stats tutor for next semester
Heather: and I'm guessing you'll feel all sorts of Ahhhh moments next term as you realize in retrospect all the things we are doing :)
Nicholas: fingers crossed
Heather: and no doubt some urggg moments as you realize some of the steps we are skipping, but oh well, c'est la vie :) ok.
11:29 AM Nicholas: quick question
Heather: yes?
Nicholas: have you taught stats before?
Heather: nope, I haven't taught anything before, actually
Nicholas: well, you're a patient instructor, you'd thrive
11:30 AM Heather: that is kind of you. I do like it.
Heather: I wish I'd done it before, though; clearly more practice about what order to talk about things, what examples to use, etc. is useful!
Nicholas: you can cut your teeth on google chat, and then take your refined skills to the classroom
11:31 AM Heather: yeah, though it feels like it would be a whole different world. Hard to tell if the students are really with you, no?
Nicholas: yeah, definitely... ok, thanks for indulging my curiosity
11:32 AM Heather: ok. Variables. I think we can do a bit here, and then you'll be equipped to go do a bit more R coding and running on your own, then we can sync up in a few hours again. Does that work?
Nicholas: sounds great
Heather: I'm around till 3 my time. ok
11:33 AM Heather: so let's think about our variables and what kind of variables they are. Impact factor is a real number, right? A float.
Nicholas: yes
Heather: now, ideally, when you have real numbers in a regression, they have the shape of a bell curve, or a normal distribution (same thing)
11:34 AM Heather: if you do hist(Impact.Factor) you can see in the graph that it doesn't look like a bell curve
Nicholas: no, it doesn't
11:35 AM Heather: now try this: hist(log(Impact.Factor)). Much more like it, eh?
Nicholas: yes
Heather: do you know enough about logs to know what that is doing?
Nicholas: no
Heather: ok, no problem
11:36 AM Heather: ok, I'm not going to explain it in much detail. It is actually pretty cool, useful, prevalent, but right now we are just going to treat it as a black box
Nicholas: that's fair
11:37 AM Heather: it is a "transform"
11:38 AM Heather: that takes the number 1 and turns it into 0: log(1). So an IF of 1 has a log(IF) of 0. It takes numbers between 0 and 1 and turns them into negative numbers...
Heather: more negative the smaller they are: log(0.1), log(0.01)
11:39 AM Heather: and it takes numbers bigger than 1 and turns them into positive numbers, but it shrinks the scale down, such that the logs of really big numbers aren't that much larger than the logs of medium-big numbers
11:40 AM Heather: log(10), log(10000), log(100000000). Make enough sense?
Nicholas: yes
Heather: so we are taking the log of all of the values in our impact factor and putting those into our model instead of the impact factor
11:41 AM Heather: and that will make the "fit a best fit line" math be more robust, because that math assumes that the data coming in has a normal distribution
11:42 AM Heather: the only tricky part is that it can't handle impact factors of 0: log(0). I'm guessing 0 is actually NA in this case?
Nicholas: right, I noticed that yesterday. There shouldn't be an impact factor of 0... it should be NA, right?
Heather: yes
11:43 AM Heather: for now, I'll work around it by just adding 0.1 to it as a temporary hack. So to add the log of the IF in the model, you do this:
mylogit = glm(requests~log(0.1 + Impact.Factor) + noOA, family=binomial(link="logit"), na.action=na.pass)
summary(mylogit)
11:44 AM Heather: it does make the coefficient a bit harder to interpret (though I think if you replace the 0 with an NA and take out the +0.1 hack it will get easier)
11:45 AM Heather: but you can see that now the noOA variable is a bit more significant, p=0.078. So still not sig at p<0.05... but more so. Modelling your data better can show things that are otherwise hidden
11:47 AM Heather: ok, so let's leave impact factor at that for now. Subscription model: it has three levels now, right?
Nicholas: yes
11:48 AM Heather: I think maybe for the purposes of this analysis we might want to collapse it into two, to save "the number of variables" and make it easier to interpret. I'll leave it up to you to decide what to collapse. This will make it a binary variable, something like hassomeOA = 1 or 0
11:49 AM Heather: depends on what you decide. ok?
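The log-transform behaviour described above can be verified directly at the R prompt; the last line shows the +0.1 workaround from the chat in isolation:

```r
# log() maps 1 to 0, values in (0, 1) to negatives, and compresses large values
log(1)                         # 0, so an IF of 1 has log(IF) = 0
log(0.1); log(0.01)            # negative, and more so for smaller inputs
log(10); log(10000); log(1e8)  # big numbers end up surprisingly close together
log(0)                         # -Inf, which is why an IF of 0 breaks the model

# the temporary hack from the chat: shift by 0.1 before logging, so an
# impact factor of 0 no longer produces -Inf (at the cost of shifting
# every value slightly, which makes the coefficient harder to interpret)
log(0.1 + 0)
```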
Nicholas: ok
Heather: ok, categories. There are how many main categories? And can journals belong to more than one?
11:50 AM Nicholas: yes. I was just interested in the three categories we originally used to gather this list
Heather: yes, cool
Nicholas: Ecology, EvoBio and Env Sci
11:51 AM Heather: then a binary variable for each of those? Since some journals are members of more than one?
Nicholas: ... I guess so
Heather: ok. It is true that some journals are members of more than one, right?
Nicholas: yes
11:52 AM Heather: (btw, I'm using the word binary. It just means two-valued, usually 0 and 1.) ok, cool. Then publisher: pick the big 3? The big 4? You still there?
11:53 AM Heather: for publisher, pick the big 3 or 4 and call the others "Other". Hi Nic, I think we lost the connection. Are you still there?
11:54 AM Nicholas: yes. ok, publisher
Heather: pick the big 3 or 4 and call the others "Other"
Nicholas: ok
Heather: have them all in one column. So not an is.Wiley column, but instead a
11:55 AM Heather: publisher_code column or something, that has one of four values in it: "Wiley", "Elsevier", "Other", or whatever. We'll need to work together to figure out how to interpret the results once you get this column
11:56 AM Heather: the society affiliation column is a binary one, just 0 or 1?
Nicholas: I have both a yes/no and a numeric 1/0
Heather: great!
Nicholas: for affiliation
Heather: ok, that's all of them? So if you can get the data in that format, write it up on your OWW page,
11:57 AM Heather: try to put them all in the model, separated by + signs
Nicholas: ok
Heather: then we'll see what we have :)
Nicholas: ok
Heather: remember to hold off interpretations still....
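The single publisher_code column described above can be built with ifelse() and factor(). The publisher names below are invented placeholders, not the actual list from the dataset:

```r
# Hypothetical publisher names, for illustration only
publisher <- c("Wiley", "Elsevier", "Springer", "Smithsonian",
               "Taylor & Francis", "NRC Research Press")
big4 <- c("Elsevier", "Wiley", "Springer", "Taylor & Francis")

# One column holding a handful of values, rather than an is.Wiley-style
# indicator per publisher; everything outside the big four becomes "Other"
publisher_code <- factor(ifelse(publisher %in% big4, publisher, "Other"))
table(publisher_code)
# When publisher_code goes into glm(), R expands the k-level factor into
# k-1 dummy coefficients, matching the "4 minus one for the base case"
# coefficient counting from earlier in the chat
```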
Heather: because doing data analysis takes several attempts to get right, esp when still a newbie (and I count as that too). Things like realizing, duh, my impact factor would really be better as a log transform
11:58 AM Heather: can change your results and thus your interpretations, so treat it all as a work in progress
Nicholas: ok
Heather: until we are convinced that the data is a good representation, doesn't have mistakes, the model is a good fit, etc etc etc
11:59 AM Heather: ok! I'll probably be away from chat for a bit, but then will be back
Nicholas: ok. I'll be here
Heather: for sure let's talk again in an hour or two....
Nicholas: until about 5 my time, then I have a conf to go to. ok
Heather: ok.
12:00 PM Heather: enjoy the conf, cool. See you in a bit! (Oh, btw, I'm assuming you are adding this to an .R file and putting it in git.... if not, it would be a good time now.... Have you figured out how to add new revisions to a git? To a gist, I mean?)
12:01 PM Heather: that way you can keep the same ID number and just update the contents
Nicholas: I've got a text file that I'll put into a gist
Heather: yeah, great, that would probably be an easy way for me to see/run your code. Play with the revisions stuff till you figure it out, or ask. ok, later!
12:02 PM Nicholas: bye, thanks again
Heather: my pleasure!

