Join Barton Poulson for an in-depth discussion in this video Recoding variables, part of Learning R.
When you've taken a thorough look at your variables, you may find that some of them are not in the most advantageous form for your analyses. Some of them may require, for instance, rescaling to be more interpretable. Others may require transformations, such as ranking, logarithms, or dichotomization, to work well for your purposes. In this movie, we're going to look at a small number of ways that you can quickly and easily recode variables within R. For this one, we're going to be using the data set we've used before, social network, and I'm going to load that by simply running line 12 here.
And then I'm going to be using the psych package, because it gives me some extra options for what I want to do here. So, I'm going to run line 15 to install it, and then run line 16 to load it. Now, what I'm going to do right here is I'm going to first take a look at the variable times; the number of times that people say they log in to their site each week. The easiest way to do this is with a histogram, because it's a quantitative variable. I'm going to run line 19. What we have here is an extraordinarily skewed histogram.
You see, for instance, that nearly everybody is in the bottom bar, which says they log in somewhere between 0 and 100 times per week. We have somebody in the 100 to 200 range, and then we have another person we saw before in the 700 to 800 range. The normal reaction to this might be simply to exclude those two people, because they are such amazing outliers, and you can do that, but I want you to see that there are other ways to deal with it. The first thing I'm going to do is one common transformation; it actually doesn't change the distribution, it just changes the way that we write the scores, and that's to turn things into z-scores, or standardized scores.
And what that does is it says how many standard deviations above or below the mean each score is. Fortunately, we have a built-in function for that, and it's called scale. So, what I'm going to do is I'm going to create a new variable called times.z, for z-scores of Times, and I'm going to use scale, and then sn for the social network data frame, and then the variable Times. So, I'm going to run line 24 here, and you see that on the right side, in your workspace, a new variable has popped up. It's actually a double matrix, which is an interesting thing.
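As a minimal sketch of that z-score step, here is what scale does on a small made-up vector. The vector times is a hypothetical stand-in for sn$Times, not the course data, and times.z.manual is a name introduced only for this illustration.

```r
# Hypothetical stand-in for sn$Times (made-up values, not the course data)
times <- c(0, 1, 1, 2, 5, 10, 20, 150, 704)

# scale() subtracts the mean and divides by the standard deviation
times.z <- scale(times)

# The same z-scores computed by hand, for comparison
times.z.manual <- (times - mean(times)) / sd(times)

# scale() returns a one-column matrix, which is why the workspace
# shows a matrix rather than a plain numeric vector
is.matrix(times.z)

# as.numeric() drops the matrix structure so the two can be compared
all.equal(as.numeric(times.z), times.z.manual)
```

If you would rather store a plain vector, wrapping the call in as.numeric is one option.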
I'm going to run line 25, and get a new z distribution; a histogram. You see, it should look the same as the Times distribution. It's pretty similar, but it's binned differently. And so, some of the people who are in the 0 to 100 range, if they were in, for instance, the 50 to 100, they got put into a different bin, but you still see that we have these two incredible outliers here. I'm going to get a description of the distribution. This is where I have the trimmed mean, and the median, and so on, and so forth. One of the interesting ones here is at the end of the first line, where you see the level of skewness.
Now, a normal distribution has a value of zero for skewness. This distribution has a level of over 10, which is enormous for skewness. Even more extreme, on the next line, is kurtosis, which you don't always talk about. Kurtosis has to do with how peaked or pinched the distribution is, and it's affected a lot by outliers. For a normal distribution, a bell curve, kurtosis is zero, and here we have this incredibly high value of 120. Anyhow, that just gives us some idea of what we're dealing with here, and the ways that we can transform it.
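To make those two numbers a little more concrete, here is a rough sketch of simple moment-based skewness and excess kurtosis in base R. The functions skew and excess_kurtosis and both vectors are hypothetical names and made-up data for illustration; packages such as psych may apply slightly different small-sample adjustments, so the values need not match describe exactly.

```r
# Simple moment-based skewness: third central moment over variance^1.5
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3,
# so that a normal distribution comes out at zero
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Made-up data: one symmetric vector, one with extreme high values
symmetric    <- c(-2, -1, 0, 1, 2)
with_outlier <- c(0, 1, 1, 2, 5, 10, 20, 150, 704)

skew(symmetric)                # zero for perfectly symmetric data
skew(with_outlier)             # positive: long tail on the high end
excess_kurtosis(with_outlier)  # pushed up by the outliers
```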
Okay, what I'm going to do next is sometimes when you have a distribution with outliers on the high end, it can be helpful to take the logarithm. You can take the base 10 logarithm, or the natural logarithm. I'm using the natural log here, and what I'm going to do is I'm going to create a new variable here called times.ln0, and this just takes the straight natural logarithm of the values. Now, I'm going to do this twice, because there's a reason why this one doesn't work. I'm going to just show it to you. I'm going to run line 29, and now you see on the workspace on the right I've got a new variable, and I'm going to get a histogram.
The histogram is really nice. You can tell it's almost like a normal distribution. It's a lot closer, but if I run the describe, I get some very strange things. For the mean, we have negative infinity, and we have not a number for all sorts of things, and the descriptions don't work well. The problem here is that if you have zeros in your data set, the logarithm of zero is undefined, and R gives you negative infinity. And so a workaround for this that is adequate is to take all of the scores and add 1. That's what I'm doing right here.
Now I'm going to create a new variable called times.log1, and what I'm going to do is I'm going to take the value of Times, and add 1 to it, so there are no more zeros. The lowest value is going to be 1, the highest is now going to be 705, and I'm going to take the logarithms of those. So, I'm going to hit that, and run line 33, and then I'll take a look at the histogram. You see the histogram is very different, because the last one simply excluded all the people who said they had zeros. Now they're in there, and so you can see that the bottom bar has bumped up.
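Here is a small sketch of why the straight logarithm breaks and why adding 1 helps. The vector times is made up for illustration, and the names ln0 and ln.plus1 are hypothetical, not the variable names in the course file.

```r
# Made-up weekly login counts, including zeros (not the course data)
times <- c(0, 0, 1, 2, 5, 10, 20, 150, 704)

# Straight natural log: log(0) is -Inf, which poisons the summaries
ln0 <- log(times)
mean(ln0)                # -Inf

# Add 1 first, so the smallest value becomes log(1) = 0
ln.plus1 <- log(times + 1)
is.finite(mean(ln.plus1))

# Base R also has log1p(), which computes log(1 + x) directly
all.equal(ln.plus1, log1p(times))
```

Note that hist silently leaves out the infinite values, which is consistent with the first histogram looking fine even though the descriptives did not.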
I'm going to run describe now. Now I actually get values, because there are no more infinite or not-a-number values. If you have zeros, adding 1 can make the difference between being able to successfully run a logarithm transformation or not. The next step is to actually rank the numbers, and this forces them into a nearly uniform distribution. What I'm going to do here is I'm going to use the ranking function. I'm going to call it times.rank, and so it's going to convert it into an ordinal variable from first, to second, to third, to fourth.
If I just run it in its standard form, you see there it's created a new variable over there; I'm going to get the histogram of that. Now, what's funny about this histogram is, theoretically, if we have one rank for each person, there should be a totally flat distribution, and that's obviously not what we have here. The reason for that is because we have tied values. A lot of people put zero, a lot of people put 1, and so on. I'm going to run the describe just in case. There are a lot of ways in R for dealing with tied values. In line 41, you see, for instance, the choices are to give the average rank, to give the first one, to give a random value, to give the max, the min, and all of these are used in different circumstances.
I'm going to use random for right now, because what it does is it really flattens out the distribution, so I'm going to run line 42. Now it's going to be times.rankr, for random. Then I said I'm going to rank it, but I'm specifying how I'm going to deal with ties. So, ties.method; in this case, I'm going to use random. I run that, and if you look over here in the workspace, I now have that variable down at the bottom. I'm going to come back to the editor, and run line 43, and now look; that's totally flat.
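As a quick sketch of how the tie options differ, here is rank applied to a short made-up vector with several tied values. The vector times and the seed are assumptions for illustration; only the ties.method choices themselves come from R.

```r
# Made-up data with ties at 0 and at 1 (not the course data)
times <- c(0, 0, 0, 1, 1, 2, 5, 10)

# Default: tied values share the average of the ranks they span
rank(times)                        # 2 2 2 4.5 4.5 6 7 8

# "min" gives every tied value the lowest rank in its group
rank(times, ties.method = "min")   # 1 1 1 4 4 6 7 8

# "first" breaks ties by order of appearance
rank(times, ties.method = "first") # 1 2 3 4 5 6 7 8

# "random" breaks ties at random, so every case gets its own rank;
# set.seed() is only here to make the sketch repeatable
set.seed(1)
times.rankr <- rank(times, ties.method = "random")
sort(times.rankr)                  # 1 2 3 4 5 6 7 8
```

Because every case ends up with a distinct rank from 1 to n, the histogram of the random-ties version comes out flat, and its mean is (n + 1) / 2.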
If I run describe, you see, for instance, that the mean's 101.5, which is what we would expect with this distribution, and it's just flat all the way. Skewness is zero. We have a negative kurtosis, because this is actually what's called a platykurtic distribution. Anyhow, that's exactly what we would expect with a totally ranked distribution with no ties. The last thing I'm going to do is I'm going to dichotomize the distribution. Now, a lot of people get very bent out of shape about dichotomization. They say you should never do this, because you're losing information in the process, and that's true.
We're going from a ratio level variable down to a nominal or ordinal level variable. So, we are losing some information. On the other hand, dichotomization, when you have a very peculiar distribution, can make it more analytically amenable. More to the point, it's easier to interpret the results. I do not feel that it is never appropriate to dichotomize; to split things into two. I feel there's a time and a place for it. Just use it wisely, know why you're doing it, and explain why you did it.
Anyhow, it would feel like the appropriate way to do this would be to say, for instance, if x is less than this value, then put them in this other group, but that doesn't work properly. You'll get some peculiar results. Instead, you need to use this one-line function in R; it's called ifelse, and it's written as one word. And in line 48, what I'm going to do is create a new variable. It's called time.gt1, because I'm going to dichotomize it on whether they log in more than once per week. So, gt1 stands for greater than one.
And then I have the assignment operator, and then I use the function ifelse. And then what you do is you have in parentheses three arguments. The first one is a test, and so I'm going to say, is Times greater than one; sn is the data frame, the dollar sign says I'm going to use a variable, and then Times is the name of the variable. The second argument is what to do if that test is true; then give them a one on the variable time.gt1. The third argument is what to do if it's false: if their score on Times is not greater than one, so if it's zero or one, then give them a zero.
So, I'm going to run line 48, and now you can see over here I have got a new variable, time.gt1, and then I'm going to get the description of that one by just writing its name. And what you can see here is it's printed out the entire variable. It's taken all the people who said they logged in zero or one times, and it's given them zeros. Everybody who logged in two or more times got a one, and the people who didn't respond to the question in the first place still have their NAs, for not available. And so, that's a form of dichotomization of a distribution that can be done in a way that advances your purposes, and can be done, I feel, with integrity, if it's done thoughtfully.
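Here is a minimal sketch of that ifelse step on a made-up vector, including a missing value. The vector times is a hypothetical stand-in for sn$Times; the point is that ifelse tests every element at once, whereas a plain if statement evaluates only a single condition.

```r
# Made-up login counts with one missing response (not the course data)
times <- c(0, 1, 2, 5, NA, 1, 704)

# ifelse(test, value if TRUE, value if FALSE), applied element by element
time.gt1 <- ifelse(times > 1, 1, 0)

# Zero or one login becomes 0, two or more becomes 1, and NA stays NA
time.gt1   # 0 0 1 1 NA 0 1

# Count each group, keeping the missing responses visible
table(time.gt1, useNA = "ifany")
```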
These are some of the options for manipulating the data, and getting it ready for your analyses, and of course, there's an extraordinary variety of what's available, but these are some of the most common choices, and hopefully some of them will be useful for you.
The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
- What is R?
- Installing R
- Creating bar charts for categorical variables
- Building histograms
- Calculating frequencies and descriptives
- Computing new variables
- Creating scatterplots
- Comparing means