DataONE:Notebook/Summer 2010/2010/07/23

From OpenWetWare
12:07 PM
me: Hi Heather, whenever is convenient to chat let me know
Heather: Hi Nic!
I'm doing some stat analysis prepping for a talk with Sarah right now... but I have a few minutes
12:08 PM 
what is low-hanging fruit that we can cover so that you can keep going?
I think how to do subsets?
also, maybe I'll introduce you to boxplots
me: yes
Heather: ideally we would do analysis-planning but that might take longer :)
12:09 PM 
have you done any database SQL stuff before, by chance?
me: yeah, I need to focus in on that
no I haven't
Heather: ok, no problem
if you had, there is an sql-type way to do it
so what are you trying and what isn't working?
12:10 PM 
me: what I am really confused about is how to represent required and request as different variables
not variables, but setting >1 to required
which corresponds with my code
Heather: are you confused about why we might want to do that? or how we do that? or both?
12:11 PM 
me: no I understand why, I just don't understand how
Heather: ok
so you have a variable called Policy.request...require.code right?
and it has 0 1 and 2s?
me: yes
12:12 PM 
Heather: and if you do
you can see the relative numbers of 0s and 1s and 2s
me: yes
Heather: and if you do
you can see how big your whole "dataframe" is
307 rows, 54 columns
12:13 PM 
me: right
Heather: now you don't want all 307 rows, you just want the ones with a code of 1 or 2
so that would be 29+10 = 39, right
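The two commands Heather refers to here are not preserved in this log. A minimal sketch of one way to get the same information, assuming they were something like table() and dim(). The toy data frame is made up and much smaller than the real 307-row journdat:

```r
# Toy stand-in for journdat (hypothetical values, not the real data)
journdat <- data.frame(
  Impact.Factor = c(1.2, 3.4, 2.2, 5.1, 0.9, 4.0),
  Policy.request...require.code = c(0, 1, 0, 2, 0, 1)
)

# How many rows have each code (0 = no policy, 1 = requested, 2 = required)
code.counts <- table(journdat$Policy.request...require.code)
print(code.counts)

# Number of rows and columns in the whole dataframe
print(dim(journdat))

# Number of rows that would be kept if only codes 1 and 2 are wanted
n.keep <- sum(journdat$Policy.request...require.code > 0)
print(n.keep)
```

In the real data the same sum would be the 29 + 10 = 39 rows mentioned above.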
me: yes, but how to get rid of the 0's and not lose the 1's?
12:14 PM 
Heather: another way to look at journdat is
the [] notation is "indexing into the matrix" that is journdat
journdat has two dimensions, rows and columns
12:15 PM 
and these are separated by a , inside the []s
if you just do journdat[,] you get everything
but now let's say you want to be pickier
you just want the ones where code is >0
but you still want all the columns
12:16 PM 
so you make a "logical" that picks out the rows you want
like this
journdat$Policy.request...require.code >0
and then you put that inside the rows indexing
like this
dim(journdat[journdat$Policy.request...require.code >0,])
12:17 PM 
or instead of the dim you could do str like this
str(journdat[journdat$Policy.request...require.code >0,])
summary(journdat[journdat$Policy.request...require.code >0,])
or whatever
you can also assign this new subset thing to a new variable, like this
12:18 PM
= journdat[journdat$Policy.request...require.code >0,]
and then use ""
whereever you used to use journdat
in the summary etc
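A small sketch of the row-indexing steps above on made-up data. The name policy.subset is hypothetical, since the variable name used in the chat did not survive in this log:

```r
# Toy stand-in for journdat (hypothetical values)
journdat <- data.frame(
  Impact.Factor = c(1.2, 3.4, 2.2, 5.1, 0.9, 4.0),
  Policy.request...require.code = c(0, 1, 0, 2, 0, 1)
)

# The "logical": TRUE for rows to keep, FALSE for the rest
keep.rows <- journdat$Policy.request...require.code > 0
print(keep.rows)

# Logical goes in the row position; the empty column position means "all columns"
policy.subset <- journdat[keep.rows, ]  # policy.subset is a hypothetical name

# The subset can be used wherever journdat was used before
print(dim(policy.subset))
print(table(policy.subset$Policy.request...require.code))
```

If the code column contained missing values, wrapping the condition in which() would avoid getting rows of NA back.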
me: right but that gives us both required and requested right?
Heather: right
me: because it's everything over 0
Heather: right
12:19 PM 
and I was imagining that you'd want the dataset that includes both request and required... but
me: yes
that's what I don't understand
Heather: when looking at it in summary, you'd look at things that distinguished request from required
with your new dataframe you can do
12:20 PM 
and it just comes back with 1s and 2s as we expect
so now just make a request vs required column that codes it in terms of 0s and 1s
12:21 PM 
(the reason I think you want it in terms of 0s and 1s rather than 1s and 2s is that it makes the
summary plot easier to interpret, I think. I don't know this for sure. you could try the summary plot
12:22 PM 
using "Policy.request...require.code" before the ~ and the new dataframe at the end....
but here's what I'd do...)
12:23 PM
$is.required = 0
then $is.required[$Policy.request...require.code > 1] = 1
does that make sense?
then use "is.required" before the ~
where we first experimented with "simple.var"
12:24 PM 
(for what it is worth, you can also easily get the same thing by doing $is.required = $Policy.request...require.code - 1)
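A sketch of the recoding just described, on a made-up subset that contains only codes 1 and 2. The dataframe name policy.subset is hypothetical, because the name in front of each $ is missing from this log:

```r
# Toy subset with only codes 1 (requested) and 2 (required); values are hypothetical
policy.subset <- data.frame(
  Impact.Factor = c(3.4, 5.1, 4.0, 2.8),
  Policy.request...require.code = c(1, 2, 1, 2)
)

# Step 1: start with 0 in every row
policy.subset$is.required <- 0
# Step 2: overwrite with 1 only where the code is above 1 (that is, required)
policy.subset$is.required[policy.subset$Policy.request...require.code > 1] <- 1

# Shortcut: subtracting 1 maps code 1 to 0 and code 2 to 1
shortcut <- policy.subset$Policy.request...require.code - 1

print(policy.subset)
print(all(policy.subset$is.required == shortcut))
```

The two approaches agree only because this subset has no 0 codes. On the full journdat the shortcut would give -1 for "no policy" rows, while the two-step version would give 0.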
12:25 PM 
have I totally confused you?
me: ok, I'm going to try this, I thought this was what I was doing, but I've gotten a little turned around
Heather: if so, that is fine.
is your code for this up? I didn't obviously see it
me: no no, I understand making new dataframes
Heather: maybe I didn't click through the right thing
12:26 PM 
me: When you enter: $is.required = 0
then $is.required[$Policy.request...require.code > 1] = 1
12:27 PM 
doesn't that put requested and no request both back to 0?
12:28 PM 
Heather: hmmm, what do you mean by
"requested and no request"
12:29 PM 
me: all values below 2
Heather: yes
isn't that what you want?
then the is.required variable is 1 only for the datasets where sharing is required
and 0 when it is either no policy or merely requested
12:30 PM 
me: .... yes?
Heather: you're not sure if that is what you were wanting?
that's ok!
12:31 PM 
me: for some reason, I was trying to represent only requested with the 0's and only required with the 1's
Heather: oh I see.
and you were confused about the fact that it would in theory make the "no policies" be 0 also?
12:32 PM 
me: yes, although I didn't really stop and think, I just kept plugging...( I think I've already done the plots for only required)
12:33 PM 
Heather: gotcha. yeah, well I think that isn't really a problem in this case
because there are no "no policies" in this subset
because we got rid of them all :)
12:34 PM 
are you clear, or more fuzzy or ?
me: I think I'm clear, but I want to go back and make sure I understand... I'll post the code and plot on OWW
Heather: ok.
12:35 PM 
for what it is worth, I'm not 100% sure you actually want to do a lot with the subset
it depends on research questions
so if you get stuck on it, don't dwell, just keep plugging on other things
me: ok... so it might be better to spend my time looking at what I'm trying to get out of the stats
12:36 PM 
Heather: right and I want to show you another thing
me: ok
Heather: try this
boxplot(Impact.Factor ~ Policy.request...require.code, journdat)
12:37 PM 
do you get a plot?
me: yes
12:38 PM 
Heather: ok, so what you are seeing on the x axis is the three levels of your code variable
no policy, requests, required
me: right
and impact factor on the y
Heather: and on the y axis it is plotting the impact factor
have you seen boxplots before?
12:39 PM 
the dark line is the median
me: I have not
Heather: the "whiskers" show the range of most of the applicable datapoints
with a few outliers showing up as "o"s
and the box itself shows where the middle half of the data is
12:40 PM 
so looking at this
it says to me that the median impact factor is higher as policies get stricter
though there isn't a whole lot of difference between levels 1 and 2
12:41 PM 
since their boxes mostly cover the same range of impact factors
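A small sketch of the numbers behind a boxplot, using made-up data. In base R the dark line is the median of each group, the box runs between the lower and upper hinges (roughly the quartiles), and points beyond the whiskers are drawn as "o". Calling boxplot() with plot = FALSE returns those numbers instead of drawing:

```r
# Toy stand-in for journdat (hypothetical values)
journdat <- data.frame(
  Impact.Factor = c(1, 2, 3, 4, 20,  2, 4, 6, 8, 10,  5, 6, 7, 8, 9),
  Policy.request...require.code = rep(c(0, 1, 2), each = 5)
)

# plot = FALSE returns the summary statistics rather than drawing the plot
bp <- boxplot(Impact.Factor ~ Policy.request...require.code,
              data = journdat, plot = FALSE)

# One column per code level; rows are:
# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)

# Points beyond the whiskers (shown as "o" on the plot)
print(bp$out)
```

Comparing the third row of bp$stats across columns is the numeric version of comparing the dark lines by eye.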
this is useful beyond what we were looking at in the summary dot tables
because there we had collapsed policy request and required to be the same....
and here we can look at them individually
12:42 PM 
does that make sense?
me: yes
Heather: cool
one more thing
12:43 PM 
that sort of plot (or others like it)
is useful when you want to see a continuous variable, like impact factor, across more than two categories (like code)
but you also have a lot of binary variables like is.Wiley
12:44 PM 
there are ways to show that info too
in a table:
table(journdat$is.Wiley, journdat$Policy.request...require.code)
or in a funky plot
plot(table(journdat$is.Wiley, journdat$Policy.request...require.code))
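A sketch of what the two commands above produce, on made-up data. table() with two variables gives a cross-tabulation, and calling plot() on a two-way table draws a mosaic plot where each tile's area reflects its count:

```r
# Toy stand-in for journdat (hypothetical values)
journdat <- data.frame(
  is.Wiley = c(0, 1, 0, 1, 0, 0),
  Policy.request...require.code = c(0, 1, 0, 2, 1, 2)
)

# Rows are is.Wiley (0/1); columns are the policy code (0/1/2)
wiley.by.code <- table(journdat$is.Wiley, journdat$Policy.request...require.code)
print(wiley.by.code)

# Mosaic plot of the same counts
plot(wiley.by.code)
```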
12:45 PM 
anyway, mostly here just wanted to expose you to the fact that there are ways we can analyze and plot
your code variables while keeping their three levels
I know we've collapsed them so far
but there are advantages to looking at all three distinct levels at the same time
so we'll try to do that too
make sense?
12:47 PM 
me: yes
Heather: cool
btw, did you get a chance to install Mondrian?
it is kind of picky about getting data in
me: I can't remember, if not I'll do so
Heather: but after that has some nice data viz opportunities
12:48 PM 
me: oh i did
Heather: yeah. play, and if you have lots of trouble loading in your data
then just get a few key columns, open them up in Excel, save as tab delimited and try to import
it is really picky about not having any blank cells, fyi
ok, if you have stuff to keep going on, maybe I'll leave it at that for a few hours while I go look at Sarah's stuff?
12:49 PM 
me: ok, yeah I still have the second half of that tutorial you sent me to look at
Heather: cool. don't get too hung up on the tutorial, it is without a doubt a hard read, beyond where you are at
12:50 PM 
me: yes, yes it is
Heather: if you can get mondrian going, though, that would be cool
me: so I load dataframes into mondrian?
Heather: it would be fiddly with getting data in. you can in theory save your dataset in R as a ".Rdata" file and then load directly into mondrian
but fiddly.
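A hedged sketch of preparing a small file from R instead of going through Excel. It drops rows with missing values (so there are no blank cells) and writes a tab-delimited file. The data and file paths are made up, and whether Mondrian accepts the result as-is is an assumption, not something tested here:

```r
# Toy two-column dataframe with some missing values (hypothetical values)
small.dataframe <- data.frame(
  is.Wiley = c(0, 1, NA, 1),
  Policy.request...require.code = c(0, 2, 1, NA)
)

# Drop every row that has a missing value, so no blank cells are written
clean.dataframe <- na.omit(small.dataframe)

# Write a tab-delimited text file; tempfile() is used only for illustration
out.file <- tempfile(fileext = ".txt")
write.table(clean.dataframe, file = out.file,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back in to check what was written
check <- read.delim(out.file)
print(check)

# The same dataframe saved in R's own binary format
rdata.file <- tempfile(fileext = ".RData")
save(clean.dataframe, file = rdata.file)
```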
12:51 PM 
ok, bye for now
me: ok thanks
Heather: (mondrian has some good docs. mostly, try starting small, with just two clean variables maybe?
12:52 PM 
you can use
nope I mean
to make a dataframe with just a few of your variables
(or this is another way to pick just a subset of the rows....)
ok, off now.......
12:53 PM 
oh, wait
do you know about c()
c() is how you make a list (a vector) of things
so if you wanted just two non-adjacent columns from your data you would say
12:54 PM 
small.dataframe = subset(journdat, select=c(is.Wiley, Policy.request...require.code))
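A sketch of c() and subset() on made-up data. The commands Heather started to type a few lines earlier are not preserved, so the row-picking use of subset() shown last is an assumption about what she meant:

```r
# Toy stand-in for journdat (hypothetical values)
journdat <- data.frame(
  Impact.Factor = c(1.2, 3.4, 2.2, 5.1),
  is.Wiley = c(0, 1, 0, 1),
  Policy.request...require.code = c(0, 1, 0, 2)
)

# c() combines values into one vector, here two column names
wanted <- c("is.Wiley", "Policy.request...require.code")

# subset() with select= keeps only the named columns
small.dataframe <- subset(journdat, select = c(is.Wiley, Policy.request...require.code))

# The same columns using [] indexing with the character vector
same.columns <- journdat[, wanted]

# subset() can also pick rows by a condition
policy.rows <- subset(journdat, Policy.request...require.code > 0)

print(small.dataframe)
print(policy.rows)
```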
12:55 PM 
me: ok
I'll try it
Heather: ok, cool. bye!