DataONE:Notebook/Summer 2010/2010/07/23
From OpenWetWare
12:07 PM me: Hi Heather, whenever is convenient to chat let me know
Heather: Hi Nic!
  I'm doing some stat analysis prepping for a talk with Sarah right now... but I have a few minutes
12:08 PM Heather: what is low-hanging fruit that we can cover so that you can keep going?
  I think how to do subsets?
  also, maybe I'll introduce you to boxplots
me: yes
Heather: ideally we would do analysis-planning but that might take longer :)
12:09 PM Heather: ok
  have you done any database SQL stuff before, by chance?
me: yeah, I need to focus in on that
  no I haven't
Heather: ok, no problem
  if you had, there is an sql-type way to do it
  so what are you trying and what isn't working?
12:10 PM me: what I am really confused about is how to represent required and request as different variables
  not variables, but setting >1 to required
  which corresponds with my code
Heather: are you confused about why we might want to do that? or how we do that? or both?
12:11 PM me: no I understand why, I just don't understand how
Heather: ok
  so you have a variable called Policy.request...require.code right?
  and it has 0 1 and 2s?
me: yes
12:12 PM Heather: and if you do
  table(journdat$Policy.request...require.code)
  you can see the relative numbers of 0s and 1s and 2s
me: yes
Heather: and if you do
  dim(journdat)
  you can see how big your whole "dataframe" is
  307 rows, 54 columns
12:13 PM me: right
Heather: now you don't want all 307 rows, you just want the ones with a code of 1 or 2
  so that would be 29+10 = 39, right
me: yes, but how to get rid of the 0's and not lose the 1's?
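[Editor's note: a minimal sketch of the table() and dim() checks Heather describes above. The real journdat is not part of this log, so the six-row data frame here is an invented stand-in; only the column name Policy.request...require.code comes from the chat.]

```r
# Toy stand-in for journdat. All values are invented for illustration;
# the real dataframe in the chat has 307 rows and 54 columns.
journdat <- data.frame(
  Impact.Factor = c(1.2, 3.4, 2.2, 5.1, 4.0, 0.9),
  Policy.request...require.code = c(0, 1, 0, 2, 1, 0)
)

# How many journals fall under each code (0, 1, 2)?
code.counts <- table(journdat$Policy.request...require.code)
print(code.counts)

# Size of the whole dataframe: rows, then columns
print(dim(journdat))

# Number of rows with a code of 1 or 2, the "29+10" step in the chat
n.with.policy <- sum(journdat$Policy.request...require.code > 0)
print(n.with.policy)
```

Summing a logical vector counts its TRUE values, which is a quick way to confirm how many rows a later subset should contain.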
12:14 PM Heather: another way to look at journdat is
  journdat[,]
  the [] notation is "indexing into the matrix" that is journdat
  journdat has two dimensions, rows and columns
12:15 PM Heather: and these are separated by a , inside the []s
  if you just do journdat[,] you get everything
  ie dim(journdat[,])
  but now let's say you want to be pickier
  you just want the ones where code is >0
  but you still want all the columns
12:16 PM Heather: so you make a "logical" that picks out the rows you want
  like this
  journdat$Policy.request...require.code > 0
  and then you put that inside the rows indexing
  like this
  dim(journdat[journdat$Policy.request...require.code > 0,])
12:17 PM Heather: or instead of the dim you could do str
  like this
  str(journdat[journdat$Policy.request...require.code > 0,])
  or
  summary(journdat[journdat$Policy.request...require.code > 0,])
  or whatever
  you can also assign this new subset thing to a new variable, like this
12:18 PM Heather: journdat.request.code.gt.0 = journdat[journdat$Policy.request...require.code > 0,]
  and then use "journdat.request.code.gt.0" wherever you used to use journdat in the summary etc
me: right
  but that gives us both required and requested right?
Heather: right
me: because it's everything over 0
Heather: right
12:19 PM Heather: and I was imagining that you'd want the dataset that includes both request and required... but
me: yes that's what I don't understand
Heather: when looking at it in summary, you'd look at things that distinguished request from required
  ok. with your new dataframe you can do
12:20 PM Heather: table(journdat.request.code.gt.0$Policy.request...require.code)
  and it just comes back with 1s and 2s as we expect
  so now just make a request vs required column that codes it in terms of 0s and 1s
12:21 PM Heather: (the reason I think you want it in terms of 0s and 1s rather than 1s and 2s is that it makes the summary plot easier to interpret, I think. I don't know this for sure.
Heather: you could try the summary plot
12:22 PM Heather: using "Policy.request...require.code" before the ~ and the new dataframe at the end.... but here's what I'd do...)
12:23 PM Heather: journdat.request.code.gt.0$is.required = 0
  then
  journdat.request.code.gt.0$is.required[journdat.request.code.gt.0$Policy.request...require.code > 1] = 1
  does that make sense?
  then use "is.required" before the ~ where we first experimented with "simple.var"
12:24 PM Heather: (for what it is worth, you can also easily get the same thing by doing
  journdat.request.code.gt.0$is.required = journdat.request.code.gt.0$Policy.request...require.code - 1
  )
12:25 PM Heather: have I totally confused you?
me: ok, I'm going to try this, I thought this was what I was doing, but I've gotten a little turned around
Heather: if so, that is fine.
  is your code for this up? I didn't obviously see it
me: no
  no, I understand making new dataframes
Heather: maybe I didn't click through the right thing
12:26 PM me: When you enter:
  journdat.request.code.gt.0$is.required = 0
  then
  journdat.request.code.gt.0$is.required[journdat.request.code.gt.0$Policy.request...require.code > 1] = 1
12:27 PM me: doesn't that put requested and no request both back to 0?
12:28 PM Heather: hmmm, what do you mean by "requested and no request"
12:29 PM me: all values below 2
Heather: yes
  isn't that what you want?
  then the is.required variable is 1 only for the datasets where sharing is required
  and 0 when it is either no policy or merely requested
12:30 PM me: .... yes?
Heather: you're not sure if that is what you were wanting?
  that's ok!
12:31 PM me: for some reason, I was trying to represent only requested with the 0's and only required with the 1's
Heather: oh I see.
  and you were confused about the fact that it would in theory make the "no policies" be 0 also?
12:32 PM me: yes, although I didn't really stop and think, I just kept plugging... (I think I've already done the plots for only required)
12:33 PM Heather: gotcha.
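[Editor's note: a small sketch of the subset-then-recode steps discussed above, run on an invented six-row stand-in for journdat (the real data is not in this log). The helper name `keep` and the vector `is.required.alt` are introduced here only for illustration; the other names follow the chat.]

```r
# Toy stand-in for journdat; values are invented for illustration only.
journdat <- data.frame(
  Impact.Factor = c(1.2, 3.4, 2.2, 5.1, 4.0, 0.9),
  Policy.request...require.code = c(0, 1, 0, 2, 1, 0)
)

# Logical vector: TRUE for rows whose code is 1 or 2 (hypothetical helper name)
keep <- journdat$Policy.request...require.code > 0

# Rows go before the comma, columns after; an empty column slot keeps all columns
journdat.request.code.gt.0 <- journdat[keep, ]
print(dim(journdat.request.code.gt.0))

# Recode: start everything at 0, then set 1 where the code is 2 (required)
journdat.request.code.gt.0$is.required <- 0
journdat.request.code.gt.0$is.required[
  journdat.request.code.gt.0$Policy.request...require.code > 1
] <- 1

# The shortcut from the chat: subtracting 1 maps codes 1, 2 to 0, 1
is.required.alt <- journdat.request.code.gt.0$Policy.request...require.code - 1

print(table(journdat.request.code.gt.0$is.required))
```

Because the 0-coded rows were already dropped, a 0 in is.required can only mean "requested" in this subset, which is the point Heather makes next. One caveat worth knowing: if the code column contains NA values, logical indexing returns all-NA rows for them, whereas subset() drops them.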
Heather: yeah, well I think that isn't really a problem in this case because there are no "no policies" in this subset because we got rid of them all :)
12:34 PM Heather: are you clear, or more fuzzy or ?
me: I think I'm clear
  I want to go back and make sure I understand... I'll post the code and plot on OWW
Heather: ok.
12:35 PM Heather: for what it is worth, I'm not 100% sure you actually want to do a lot with the subset
  it depends on research questions
  so if you get stuck on it, don't dwell, just keep plugging on other things
me: ok... so it might be better to spend my time looking at what I'm trying to get out of the stats
12:36 PM Heather: right
  and I want to show you another thing
me: ok
Heather: try this
  boxplot(Impact.Factor ~ Policy.request...require.code, journdat)
12:37 PM Heather: do you get a plot?
me: yes
12:38 PM Heather: ok, so what you are seeing on the x axis is the three levels of your code variable
  no policy, requests, required
me: right
  and impact factor on the y
Heather: and on the y axis it is plotting the impact factor
  right
  have you seen boxplots before?
12:39 PM Heather: the dark line is the median
me: I have not
Heather: the "whiskers" show the range of most of the applicable datapoints
  with a few outliers showing up as "o"s
  and the box itself shows where the middle half of the data is
12:40 PM Heather: so looking at this it says to me that the median impact factor is higher as policies get stricter
  though there isn't a whole lot of difference between levels 1 and 2
12:41 PM Heather: since their boxes mostly cover the same range of impact factors
  this is useful beyond what we were looking at in the summary dot tables because there we had collapsed policy request and required to be the same.... and here we can look at them individually
12:42 PM Heather: does that make sense?
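[Editor's note: a hedged sketch of the boxplot call above on invented data, since the real journdat is not in this log. It also checks what the dark line in each box represents: in base R, boxplot() draws the per-group median there. The names `bp` and `group.medians` are introduced here for illustration.]

```r
# Toy stand-in for journdat; impact factors are invented for illustration only.
journdat <- data.frame(
  Impact.Factor = c(0.9, 1.2, 2.2, 1.5, 3.4, 4.0, 2.8, 5.1, 3.9, 4.4),
  Policy.request...require.code = c(0, 0, 0, 0, 1, 1, 1, 2, 2, 2)
)

# When run as a script, send the plot to a null device so no file is written
if (!interactive()) pdf(NULL)

# One box per code level; boxplot() invisibly returns the numbers it draws
bp <- boxplot(Impact.Factor ~ Policy.request...require.code, data = journdat)

# Row 3 of bp$stats is the dark line in each box
print(bp$stats[3, ])

# Compare with the per-group medians computed directly
group.medians <- tapply(
  journdat$Impact.Factor,
  journdat$Policy.request...require.code,
  median
)
print(group.medians)
```

The two printed vectors should agree, which is a handy sanity check when reading a boxplot for the first time.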
me: yes
Heather: cool
  one more thing
12:43 PM Heather: that sort of plot (or others like it) is useful when you want to see a continuous variable, like impact factor, across more than two categories (like code)
  but you also have a lot of binary variables like is.Wiley
12:44 PM Heather: there are ways to show that info too
  in a table:
  table(journdat$is.Wiley, journdat$Policy.request...require.code)
  or in a funky plot
  plot(table(journdat$is.Wiley, journdat$Policy.request...require.code))
12:45 PM Heather: anyway, mostly here just wanted to expose you to the fact that there are ways we can analyze and plot your code variables while keeping their three levels
  I know we've collapsed them so far
  but there are advantages to looking at all three distinct levels at the same time
  so we'll try to do that too
  make sense?
12:47 PM me: yes
Heather: cool
  btw, did you get a chance to install Mondrian?
  it is kind of picky about getting data in
me: I can't remember, if not I'll do so
Heather: but after that has some nice data viz opportunities
12:48 PM me: oh I did
Heather: yeah. play, and if you have lots of trouble loading in your data then just get a few key columns, open them up in Excel, save as tab delimited and try to import
  it is really picky about not having any blank cells, fyi
  ok, if you have stuff to keep going on, maybe I'll leave it at that for a few hours while I go look at Sarah's stuff?
12:49 PM me: ok, yeah I still have the second half of that tutorial you sent me to look at
Heather: cool. don't get too hung up on the tutorial, it is without a doubt a hard read, beyond where you are at
12:50 PM me: yes, yes it is
Heather: if you can get mondrian going, though, that would be cool
me: so I load dataframes into mondrian?
Heather: it would be fiddly with getting data in.
  you can in theory save your dataset in R as a ".Rdata" file and then load directly into mondrian
  but fiddly.
12:51 PM Heather: ok, bye for now
me: ok thanks
Heather: (mondrian has some good docs.
Heather: mostly, try starting small, with just two clean variables maybe?
12:52 PM Heather: you can use ?select
  nope I mean ?subset
  to make a dataframe with just a few of your variables
  (or this is another way to pick just a subset of the rows....)
  ok, off now.......
12:53 PM Heather: oh, wait
  do you know about c()
  c is how you make a vector (a list of things)
  so if you wanted just two non-adjacent columns from your data you would say
12:54 PM Heather: small.dataframe = subset(journdat, select=c(is.Wiley, Policy.request...require.code))
  dim(small.dataframe)
  ok?
12:55 PM me: ok I'll try it
Heather: ok, cool.
  bye!
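[Editor's note: a sketch pulling together the last few suggestions: pick two columns with subset() and c(), cross-tabulate them, then drop rows with blanks and save a tab-delimited file. The data is an invented stand-in for journdat. The names `small.complete` and `out.file` are hypothetical, and whether Mondrian accepts exactly this file layout is an assumption, not something the chat confirms.]

```r
# Toy stand-in for journdat; values are invented, including one blank (NA) cell.
journdat <- data.frame(
  is.Wiley = c(0, 1, 0, 1, NA, 0),
  Impact.Factor = c(1.2, 3.4, 2.2, 5.1, 4.0, 0.9),
  Policy.request...require.code = c(0, 1, 0, 2, 1, 0)
)

# c() gathers the two non-adjacent column names; select= keeps only those columns
small.dataframe <- subset(
  journdat,
  select = c(is.Wiley, Policy.request...require.code)
)
print(dim(small.dataframe))

# Cross-tabulate the binary variable against all three code levels
print(table(small.dataframe$is.Wiley,
            small.dataframe$Policy.request...require.code))

# Drop any row containing a blank cell, since the chat says Mondrian is picky
small.complete <- na.omit(small.dataframe)

# Write a tab-delimited text file to a temporary location (hypothetical path)
out.file <- tempfile(fileext = ".txt")
write.table(small.complete, out.file,
            sep = "\t", quote = FALSE, row.names = FALSE)
```

Writing the file from R is an alternative to the Excel round trip mentioned in the chat; either way, starting with two clean columns keeps the import problems easy to diagnose.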