# DataONE:Notebook/Summer 2010/2010/07/27

```
Heather: Hi Nic! Have you got a few minutes to chat stats, or when would be good?
10:52 AM me: Hi Heather, now is good
Heather: cool
 I love watching your evolving code, good stuff
 A few ideas
10:53 AM one idea is not to worry right now about what is statistically significant or not
 no need to call it out in your results
me: ok
Heather: you are still clearly getting the data fleshed out and cleaned, your columns rejiggered.... all of this makes the results change
 and really the results aren't to be trusted until all of that is done
10:54 AM me: yeah, I've spent a lot of time cleaning up columns this morning
Heather: and in some ways when you comment on it early it suggests that there is something worth looking at, when really there isn't yet :)
 what I mean is, there isn't anything worth looking at yet because it is all still transient
me: ok I understand
Heather: cool
10:55 AM ok, different topic: columns and whether they are separate or together
 I haven't explained "factors" to you yet, is that right?
me: that's right, we haven't gone over that
Heather: right. so let's do that a bit right now :)
me: ok
Heather: there are multiple datatypes for a column
10:56 AM binary
 real/float
 integer
 and "category"
 where category could be favourite colour
 and if you could only have one favourite colour then it would make sense to have a column called fav color
10:57 AM and within it it would say "Red" "Blue" "undecided" etc
 these are also called nominal variables
 and also called, in R, "factors"
 where the levels of the factors are the distinct values it can take
 make sense so far?
10:58 AM two more things about factors
 one is that they can be ordered
me: yes
Heather: so for example a journal policy can be "weak" "medium" or "strong"
 this isn't really an integer 0, 1, 2
10:59 AM because it doesn't make sense to do math on it.
 strong isn't medium*2
 but it is more than just a category, because it is ordered
 so in R this is called an ordered factor
 and when you have an ordered factor it can help to tell R that because then the stats can use that information
11:00 AM ok?
me: ok
Heather: to do that, you can see the command ?ordered
 or we can just talk about it when it is relevant :)
 a different conversation about factors is when to put them in the same column, and when to make a bunch of different binary columns
 if you allow people to have one fav colour, then you should just have one column
11:01 AM but if you let people have several fav colours, all of a sudden one column doesn't work very well
 and it works better to have multiple columns, is.red.a.fav is.blue.a.fav
 that are all binary
 so..... since a journal can have multiple ISI categories, each of the categories should have their own columns
11:02 AM but since they can only have one publisher, it makes most sense for the publisher to stay in a single column that has multiple factor "levels"
 that helps to interpret the stats
 I'll show you how to do that.
 make sense as a concept?
11:03 AM me: yes I think so
Heather: any questions about it? you seem a bit unsure?
11:04 AM me: no I think I get it
Heather: ok. so I think using your PubCode variable in the analysis directly would probably work, would it?
 how many different values can it take?
11:05 AM me: four
 other, elsevier, wiley, springer
Heather: and taylor? or not?
me: well I was finding that I had too many variables, so I collapsed taylor into Other
11:06 AM Heather: gotcha
 ok, so I think if you could rerun a glm including PubCode and post its results, I think we could go through them and I could show you how to interpret them.
11:07 AM one command that I've never used but I think would be helpful is relevel
 it tells R which level to use as the basis, the reference level
 I think your results would be most interpretable if that was "Other"
11:08 AM so I think (but am not sure) that the following code will work:
 relevel(PubCode, ref="Other")
 you'd put it right before the table(PubCode) command, before the glm call
11:09 AM let me know?
me: ok just one sec
11:13 AM ok, I just posted it
 http://www.openwetware.org/wiki/DataONE:Notebook/Data_Citation_and_Sharing_Policy/2010/07/27#Cleaner_Analysis
11:17 AM Heather: ok, so it does still have a taylor in it, is that right?
11:18 AM me: shoot I'm sorry I called the wrong file in
11:19 AM Heather: also, it looks like this line has an error, an extra ] at the end?
 > Afil = ifelse(Affiliation.Code > 0, 1, 0)] # Society Affiliation
 Error: unexpected ']' in "Afil = ifelse(Affiliation.Code > 0, 1, 0)]"
11:21 AM Nic, I think I made a mistake... I think you actually have to make it
 PubCode = relevel(PubCode, ref="Other")
 just
11:22 AM relevel(PubCode, ref="Other")
 isn't enough.... it has to be assigned back to the PubCode variable
 I'm learning too, clearly :)
me: ok let me fix that and the Afil
11:23 AM Heather: sorry about not seeing it before. your results up on your OWW page helped me figure it out :)
 ok
7 minutes
11:30 AM me: It might take me a few more minutes, I don't know why but it keeps showing Taylor in PubCode
Heather: ok, no prob
12 minutes
11:42 AM me: Ok, I posted what I ran in OWW-- I think there is a problem somewhere though
11:43 AM http://openwetware.org/wiki/DataONE:Notebook/Data_Citation_and_Sharing_Policy/2010/07/27#Cleaner_Analysis
11:45 AM Heather: what makes you think that?
11:46 AM the fact that there is no PubCodeother in the results is actually a good thing, in case that was it....
11:47 AM the reason that is true, is that "other" is used as the base case or the reference
 so...
 to interpret these other factors, using the "exp(confint(mylogit))" results
 PubCodeelsevier 1.69733323 11.6908318
11:48 AM means that, compared to "other" publishers (= ones coded as other), journals published by elsevier are 1.7 to 11.7 times as likely to have a data sharing policy
 whereas
 PubCodespringer 0.15399592 2.6087098
11:49 AM means that being published by springer, a journal is between 0.15 and 2.6 times as likely to have a data sharing policy.
 (since this goes from less than 1 to more than 1, it doesn't actually tell us anything interesting.... not coincidentally... the pvalue for PubCodespringer is large!)
 make any sense?
11:50 AM me: yes
Heather: cool
 ok, any quick questions before we zoom over to the group chat?
11:51 AM I'm going to ask you and Sarah both (Valerie will be joining a bit later, hopefully) to give everyone a brief rundown on what you've been doing and what your plans are. that sound ok?
me: sure
 no other questions right now
Heather: great!
11:52 AM you relatively comfortable with understanding the statistics you are running right now? on a scale of 0 to 10?
me: 6
Heather: nice. good.
11:53 AM me: I wouldn't say I totally get it, but it makes more sense when I reread our conversations
Heather: keep asking if there are things that you'd like to talk through some more. yup, makes sense. great!
 ok, off to hopefully try to join everyone in. wish us luck and strong connections!
me: ok
```
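
The ideas from the chat (factors, ordered factors, binary indicator columns, `relevel`, and reading the `exp(confint(...))` output) can be pulled together in a rough R sketch. Everything in it is a stand-in rather than the real analysis: the `journals` data frame, the `has.policy` outcome, the simulated probabilities, and the ISI category names are all invented for illustration. The lowercase level name `"other"` is an assumption based on the `PubCodeother` label mentioned in the chat, and `ref` has to match an existing level exactly. Only the two intervals at the end are copied from the conversation. Strictly speaking, exponentiated logistic coefficients are odds ratios, so "times as likely" is shorthand for "times the odds".

```r
# Toy data: every value below is invented for illustration only.
journals <- data.frame(
  PubCode = rep(c("other", "elsevier", "wiley", "springer"), each = 50),
  stringsAsFactors = FALSE
)

# A single-valued category (one publisher per journal) stays in one factor column.
journals$PubCode <- factor(journals$PubCode)

# relevel() returns a new factor, so the result must be assigned back.
journals$PubCode <- relevel(journals$PubCode, ref = "other")

# An ordered factor: strong > medium > weak, but no arithmetic is implied.
strength <- factor(c("weak", "strong", "medium"),
                   levels = c("weak", "medium", "strong"), ordered = TRUE)
strength[2] > strength[3]   # comparisons are allowed on ordered factors

# A multi-valued attribute (several ISI categories per journal) becomes
# one binary column per category. Category names here are hypothetical.
isi <- c("ecology", "ecology;genetics", "genetics")
is.ecology  <- as.integer(grepl("ecology",  isi, fixed = TRUE))
is.genetics <- as.integer(grepl("genetics", isi, fixed = TRUE))

# Simulated outcome so the model has something to fit (assumed probabilities).
set.seed(1)
p <- c(other = 0.2, elsevier = 0.6, wiley = 0.3, springer = 0.2)
journals$has.policy <- rbinom(nrow(journals), 1,
                              p[as.character(journals$PubCode)])

fit <- glm(has.policy ~ PubCode, data = journals, family = binomial)
names(coef(fit))      # no "PubCodeother": it is the reference level
exp(coef(fit))[-1]    # odds ratios of each publisher relative to "other"

# The two intervals quoted in the chat, already on the exp() scale.
ci <- rbind(PubCodeelsevier = c(1.69733323, 11.6908318),
            PubCodespringer = c(0.15399592, 2.6087098))
# An interval that stays on one side of 1 points to a clear difference from "other".
excludes.one <- ci[, 1] > 1 | ci[, 2] < 1
excludes.one
```

Running `levels(journals$PubCode)` before and after the `relevel` line is a quick way to confirm that the assignment took effect, which is the step that was missing in the first attempt in the chat.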