Once the data are entered into R, the first task in any analysis is to examine the individual variables. Now, the purpose of this task is threefold: first, to check that the data were entered correctly; second, to check whether the data meet the assumptions of the statistical procedures that you've planned to use; and third, to check for any potentially interesting or informative observations or patterns in the data. For a categorical variable, that is, a nominal or an ordinal variable, such as a respondent's gender or a company's economic sector, the easiest and most informative way to check the data is to make a bar chart, and so that's where we turn first.
The unfortunate thing about R is that it's not really set up to do bar charts from a raw data file. It wants to do them from a summary data file, where you say, this is the category, and this is how many people are in that category. On the other hand, if you have raw data, where you're simply listing category 1, 2, 1, 1, 2, 2, 2, there's an easy way to work around it, and that's what I'm going to show you here. I'm going to be using the social network data that I've used before, and I'm going to get that loaded. The way I'm going to do this is I'm going to use the same read.csv function that I've used before. That's because I'm dealing with a comma-separated values spreadsheet, and I'm going to feed it into a data frame called sn, for social network.
I am going to set it up a little bit differently, though, because you may recall in the previous versions, I specified explicitly the entire file path from C on. I want to use a shortcut version. I am going to show you how to set that up. If you go up to Tools, down to Options, one of the choices you have in the General window is the Default working directory; that is, when you're not in a project that explicitly puts it somewhere else. Even though we have a little tilde here, this actually is currently going to my Documents folder, but I'm going to go to Browse, and I'm going to change it temporarily to the Desktop, because I've copied the files over to the Desktop.
Then I click Select Folder, and now you see it has C:/Users/Barton Poulson/Desktop, and I can just press OK. And now I can just have a very short version, where I give just the file name without the entire file path. I still need to use read.csv, and I still need to say that I have a header, but otherwise it's more abbreviated than before. So, I'm going to read that in right now, and now that it's loaded in, we can move on to the next part. You see in the console that it ran, and you see on the top right under Workspace that I now have a data frame, sn, with 202 observations of 5 variables.
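As a rough scripted counterpart to the menu steps above, the working directory can also be set with setwd so that read.csv needs only a bare file name. This is a minimal sketch, not the course's own script: the file name social_network.csv, the temporary folder, and the three-row contents are all invented so the example runs anywhere.

```r
# Hypothetical setup: write a tiny CSV into a temporary folder.
demo_dir <- tempfile("sn_demo")
dir.create(demo_dir)
write.csv(
  data.frame(Site = c("Facebook", "None", "Facebook")),  # invented rows
  file.path(demo_dir, "social_network.csv"),             # assumed file name
  row.names = FALSE
)

# Point R at that folder; setwd() returns the previous directory.
old_dir <- setwd(demo_dir)

# With the working directory set, the bare file name is enough.
sn <- read.csv("social_network.csv", header = TRUE)

# Restore the previous working directory.
setwd(old_dir)

str(sn)
```

Note that setwd only affects the current session, whereas the Tools > Options setting changes the default for future sessions.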
What I have here now is a bunch of comments that R doesn't work with raw data; it can't do it directly from the categorical variables. We first have to create a table with frequencies, and I'm going to use a table function to do this. In line 25, this is where I create the table. What I do is I specify the name of the new table, and that's going to be site, because I'm looking at the Web sites that people say are their primary social networking sites; .freq for frequency.
And then I have the assignment operator, gets, and then table is the function. And then I am specifying in the parentheses the data set, sn, that's my data frame, with the dollar sign; I use that to specify which variable I'm using to create the table. In this case, I'm using site. Please note the capitalization. R is capitalization sensitive. You've got to make sure that the capitalization is the same all the way through. So, I'm going to run that command, and now you see it ran down in the console, and on the right, I now have values.
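The table step described above might look like the following sketch. The data frame here is invented toy data (25 rows rather than the real 202), and the capitalized column name Site is an assumption based on the remark about capitalization.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)

# Count how many rows fall into each category.
site.freq <- table(sn$Site)
print(site.freq)

# Names are case sensitive: sn$site (lowercase) does not match the
# column Site, so it returns NULL.
is.null(sn$site)
```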
I have a table now with 6 values in it. What I'm going to do now is create the default bar chart. This is one where I simply take a barplot, and I just run it exactly as it is. So, that's barplot, and then you put the table in there, site.freq, and then I run that one. In the bottom right here, you see that it's opened up, and there are a few things that are going on. Number one is it's gray. It doesn't have any titles. There's only every other label. The scale only goes up to 80, and there are some other issues.
You can see it bigger if you want to. Just come down and click on Zoom. Now it fills up the whole space, and you can see all of the labels. There are a lot of options within barplot that allow you to control the color, the font, the orientation, the order; a ton of things. I'm actually going to take just a second here to show you how you can find out more about that. I've got, here on line 28, the question mark, a space, and then barplot. This is how you find help on any of R's functions, and I'm just going to run that line.
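For reference, the default chart and the help lookup could be sketched like this, again on invented toy data with an assumed column name Site.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Default bar chart: gray bars, no titles, categories in alphabetical order.
# barplot() also returns the midpoint of each bar.
mids <- barplot(site.freq)

# In an interactive session, this opens the help page for barplot:
# ?barplot
```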
Now you see it brings up the Help window here that talks about all the functions and the options available in barplot. And so, I'm going to show you a few of these. I'm not going to run through all of them, because there's an enormous number, especially because barplot feeds into some other more general options, such as this one here that talks about graphical parameters, which gives you just an incredible amount of control of things you want to specify. Mostly I want to show you just this very basic one, and I'm going to make a few variations on it.
The first thing I'm going to do, and I think it's really important, is to put the bars in descending order. Unless there is some sort of inherent and necessary order in your data, a descending order is a really convenient way to do it. The way to do that is to tell it that I'm going to be drawing a barplot, and I'm going to be using this data, but I want to order it according to this variable, because theoretically you could order it according to a different variable, and then I'm going to use a decreasing order. So, decreasing = TRUE.
So, I come over here, and I'm going to run this line, and now you see that it's in decreasing order. That's good. And if you want to see it bigger, you can zoom in again. What we have here is a lot of people who reported using Facebook. The next biggest was people who said they used None, but they still answered the survey. Then, you can tell this is a few years old, because we have people saying they used MySpace, and then we have LinkedIn and Twitter with just a couple of people each, and I'm willing to bet that all those things have changed since this data was first gathered. I am going to close that window. Now, it's better that it's in order, but we still have an issue with the labels, the scale is not long enough, and we have no titles.
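The descending sort described above can be sketched as follows: order returns the positions that sort the table, and indexing the table by those positions rearranges the bars. The data are invented toy counts, and the name ordered.freq is introduced only for this illustration.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Reorder the table by its own counts, largest first.
ordered.freq <- site.freq[order(site.freq, decreasing = TRUE)]

# Bars now run from the most common site to the least common.
barplot(ordered.freq)
```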
I'm going to show you some of these other things. What I'm going to do first is put the bar chart horizontally, which I often like to do, because then the scale is in the same direction that it is on a lot of other analyses. So, what I do then is I'm going to do barplot, and I'm still going to order them, except I'm not doing them decreasing, because it needs to be increasing when you're dealing with horizontal, because it starts at the bottom and goes up. But this time I have horiz, for horizontal, = TRUE. So, I'm going to run that command, and now I have a horizontal one, but you see I lost even more of the labels.
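A minimal sketch of the horizontal version, on invented toy data: because horizontal bars are drawn from the bottom up, sorting in increasing order puts the largest category at the top. The name asc.freq is introduced only for this illustration.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Increasing order (the default for order), so the biggest bar ends up on top.
asc.freq <- site.freq[order(site.freq)]

# horiz = TRUE draws the bars horizontally, bottom to top.
barplot(asc.freq, horiz = TRUE)
```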
Now, I also want to do something about the color here. For instance, Facebook has a distinctive color of blue associated with it, and so it would be nice to highlight it with that color. So, what I'm going to do is I'm going to come down here, and I need to create a vector; a collection of color specifications. And the way I do that is I first give it a name. So, it's like a new variable; a new object. I'm calling it fbba: fbb for Facebook blue, and then a for ascending, because if I were doing this as a vertical bar chart, I would need to go descending.
Then I have the assignment operator, and that's the arrow, and then c is for concatenate; sometimes collection, or combined. And then, I'm going to have six colors in here. Five of them are going to be identical; they're going to be gray. And so, I could write gray, gray, gray, gray, gray, or I can use this other option; that's rep, and that's for repeat. And what I do is I put down rep, and then I put in parentheses what it is I want repeated, and I want the word gray in quotation marks repeated.
And then after a comma, how many times I want to repeat it, and I want it five times. Then, after the comma, I can put the last color that I want, and I am going to do that one in particular way. First off, in order to get the Facebook blue, I want to specify it exactly, and I've got what are called the RGB codes; the red, green, blue codes. And that's 59 for red, 89 for green, 152 for blue, but I also need to tell R that I'm working on a 0 to 255 8-bit color scale. And so, that's what the maxColorValue is for, and then I finish the command.
This is also the first time, I think, that I've broken code across two lines. The reason for that is this is a long line of code, but it's all a single command, and so this is one way of making it easier to follow, by breaking it into pieces. So, I'm going to highlight both of those lines, and then hit Ctrl+Return to run them. Now I'm going to do a modified version of the barplot, where I'm adding this bottom line here that says, col, that's for color, and I'm saying use the vector fbba, and I'll highlight the whole thing, and I'm going to run it.
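The color vector and the col argument described above might be written like this. It is a sketch on invented toy data; the two-line layout of the fbba assignment mirrors the line break mentioned in the narration.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Five gray bars, then Facebook blue as the sixth color.
# maxColorValue = 255 tells rgb() the inputs are on a 0 to 255 scale.
fbba <- c(rep("gray", 5),
          rgb(59, 89, 152, maxColorValue = 255))

# Colors are applied in bar order, so with ascending counts the last
# (top) bar, Facebook, gets the blue.
barplot(site.freq[order(site.freq)],
        horiz = TRUE,
        col = fbba)
```

The sixth element of fbba is the hex string that rgb produces for those three values.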
And you'll see that in my chart on the bottom right, the top one, which is Facebook, turned blue. Now, it doesn't say Facebook, because it's small. If I click on Zoom, then you can see that it's Facebook. There are some other issues with this chart. Number one, I'd like to turn off the borders around the bars. Also, I need titles; I like to have a subtitle. The scale on the bottom goes from 0 to 80, but the bars go farther than that, so I'd like to change it, so it goes up to 100. I happen to know that the maximum value is just under 100. And that's why I'm adding several other arguments to this function.
So, this is the same barplot function, and I'm making a chart of the site frequency. I'm going to order it by site frequency, and this one says make it horizontal. This one says use the Facebook color vector. border = NA; that means no borders at all. xlim; that's the limits for the X. This one needs to be its own little vector, and so I have c, for concatenate, and I say it goes from 0 to 100. And then I have one that says main, and that means the main title.
That one is kind of long. I didn't want to break it across lines. So, let me scroll through here. And what I'm saying is Preferred Social Networking Site, and then the \n is a way of inserting a line break in the middle of it. So, there will be a second line to this one that says, A Survey of 202 Users. Then xlab at the bottom means the label for x that's going to appear underneath the scale. So, when I highlight all of those lines, and run them, you see now the borders have gone away, the scale has extended to 100, I have a title on the top, and I have a scale label on the bottom.
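Putting those arguments together, the full call might look roughly like this. It is a sketch on invented toy data, so the x limit and the second title line are adjusted to the toy counts; the xlab text "Number of Users" is an assumption, since the narration does not say what the label reads.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Five gray bars plus Facebook blue for the largest (top) bar.
fbba <- c(rep("gray", 5),
          rgb(59, 89, 152, maxColorValue = 255))

mids <- barplot(site.freq[order(site.freq)],
                horiz  = TRUE,
                col    = fbba,
                border = NA,          # no outlines around the bars
                xlim   = c(0, 10),    # the narration uses c(0, 100) for the real data
                main   = "Preferred Social Networking Site\nToy Data, 25 Users",
                xlab   = "Number of Users")  # assumed label text
```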
If I make this bigger, you can then see all of the site names. If I wanted to spend some more time on this, I would turn the labels, the Facebook, and None, so that they were horizontal. I would probably move Other and None down to the end. There are a lot of other things that I could do here. That's why you want to be able to explore the options that come through barplot; that's why I had the question mark, space, barplot. And then also the parameters that are the general graphics parameters. They give you an immense amount of control.
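Those two follow-up tweaks are not shown in the video, but one possible sketch is below, on invented toy data. The graphical parameter las = 1 draws axis labels horizontally, and the catch-all categories are placed first in the vector so they sit at the bottom of a horizontal chart. The names catch.all, named.freq, and custom.freq are introduced only for this illustration.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Separate the catch-all categories from the named sites.
catch.all  <- c("Other", "None")
named.freq <- site.freq[!(names(site.freq) %in% catch.all)]

# Catch-all categories first (bottom of the chart), then named sites
# in increasing order so the largest ends up on top.
custom.freq <- c(site.freq[catch.all], named.freq[order(named.freq)])

# las = 1 keeps all axis labels horizontal.
barplot(custom.freq, horiz = TRUE, las = 1)
```

Long category names may still be clipped on the left; widening the left margin through the mar graphical parameter is one way to address that.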
You can basically make this do whatever you want, but this is an example of some of the modifications that are possible. There's just one other thing I want to show, and that's how to export these charts, because right now it's a chart that's just inside R. You see right here, we've got a really easy thing. It says Export. This is one of the advantages of using RStudio. I can say, for instance, save it as a PDF, and I can tell it how big I want it. Let's say I want it to be 8 inches by 6 inches. Then I can give that file a name: snPlotpdf.
One of the great things about RStudio is that it gives you options for exporting your graphics. So, for instance, let me zoom in on this graphic. We've got what we need there. I'm going to close it, and I can export it as a PDF. And that's something that the regular version of R does, but also, I can save the plot as an image, and I have a lot of choices here, from PNG, JPEG, TIFF, and so on. I can choose my own width and height, which is hard to do in a regular version of R. I can view it after I save it, and make it big enough so you can see all the labels.
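The video uses RStudio's Export menu, but a scripted alternative exists in base R: open a pdf graphics device, draw the plot, and close the device. The sketch below uses invented toy data and a hypothetical output file name in a temporary folder.

```r
# Invented toy data standing in for the course's sn data frame.
sn <- data.frame(
  Site = rep(c("Facebook", "None", "MySpace", "Other", "LinkedIn", "Twitter"),
             times = c(9, 6, 4, 3, 2, 1))
)
site.freq <- table(sn$Site)

# Hypothetical output path; any writable location would do.
out_file <- file.path(tempdir(), "site_barplot.pdf")

# Open a PDF device that is 8 inches wide and 6 inches tall.
pdf(out_file, width = 8, height = 6)
barplot(site.freq[order(site.freq)], horiz = TRUE)

# Closing the device finishes writing the file.
dev.off()

file.exists(out_file)
```

The png, jpeg, and tiff functions work the same way for image output, although their width and height default to pixels rather than inches.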
Anyhow, I'm just going to press Cancel right now. The idea here is that you have a lot of control over these bar charts, and that RStudio in particular gives you a lot of options for exporting and sizing your charts. That is really one of the first things you want to do when you're dealing with a categorical variable is to make the chart so you get a feel for your data to see how well you meet the assumptions, and to see whether it got entered correctly, and to lead in to the later analyses that you're going to do.
The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
- What is R?
- Installing R
- Creating bar charts for categorical variables
- Building histograms
- Calculating frequencies and descriptives
- Computing new variables
- Creating scatterplots
- Comparing means