Once you've taken a look at all of your variables individually, and you've gotten them into the shape that you need for your analyses, the next step is often to start looking at associations between variables. A very common kind of association is between group membership and scores on a quantitative outcome. I'm going to use an example to show a couple of different ways of depicting group distributions: with box plots, and also with bar charts. For this one, I'm going to be using a data set that is based on Google searches by state.
The idea here is that the Google search data shows how many standard deviations above or below the national average each state is in its relative interest in a search term. The first thing I'm going to do is load a data set called google_correlate.csv. I put it into a data frame called Google. There are 51 observations, because there are 50 states plus D.C. Next, I'm going to just run this to see what the names of the variables are. That's line 7. What we have is State, that's the name of the state, then the state_code, that's like CA for California.
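As a rough sketch of that loading step, here's something you can run without the course file. The tiny CSV written below is a made-up stand-in for google_correlate.csv with only three of the columns, the values are invented, and the lowercase object name google is an assumption.

```r
# Toy stand-in for google_correlate.csv; the values are invented for
# illustration and are NOT the course's data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "state_code,region,data_viz",
  "CA,West,0.5",
  "NY,Northeast,1.2",
  "TX,South,-0.3",
  "OH,Midwest,-0.6"
), csv_path)

# Read the file into a data frame; the first row holds the variable names
google <- read.csv(csv_path, header = TRUE)

nrow(google)    # number of observations (51 in the real file, 4 here)
names(google)   # names of the variables
```

With the real file, you would point read.csv at google_correlate.csv instead of the temporary path.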
Then we have their relative interest in data visualization; so, how often do they search for that relative to their other searches? Then we also have searches for Facebook, searches for NBA, and, for fun, I put down whether that state had an NBA team. There's also the percentage of people in that state with a college degree, whether that state had a K-12 curriculum for statistics, and the region of the country. Let's take a closer look at that with structure; that's str.
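A minimal sketch of what str reports, using a small invented data frame rather than the real one (the column names state_code, region, and data_viz come from the description above; the values and the object name google are assumptions):

```r
# Invented example rows; region is stored as a factor so str() reports levels
google <- data.frame(
  state_code = c("CA", "NY", "TX", "OH"),
  region     = factor(c("West", "Northeast", "South", "Midwest")),
  data_viz   = c(0.5, 1.2, -0.3, -0.6)
)

# Compact overview: each variable's type, number of factor levels,
# and the first few values
str(google)
```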
If I hit that, and make this bigger, it gives you an idea of how many levels there are. It gives you the first few data values. So, that's a way of seeing what we're dealing with. I'm going to clear that out, because it's pretty busy. Put that back down. One of the interesting questions might be, do the responses to one of these vary by region? I thought I'd look at data visualization, and I want to see whether it varies by region in the United States. So, the easiest way to do this is to first create a new object, a list, where I split the data by region.
So, what I'm going to do in line 12 is create my new object, named for searching for data visualization, with .reg for region, and then we're going to get the distributions. I'm going to use the R function split, and then I tell it what it is that I'm going to split. I'm going to use the data set Google, and the variable data_viz; the dollar sign joins those two, and I'm going to split it by the variable region that's in the Google data set. I'm going to run line 12 now. You see how that shows up in the Workspace on the right.
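Here's a hedged sketch of that split step on invented data. The object name viz.reg.dist is hypothetical, as is the lowercase google; only data_viz and region are names taken from the data set as described.

```r
# Invented values: two states per region, just to show the shape of the result
google <- data.frame(
  region   = c("West", "West", "Northeast", "Northeast",
               "South", "South", "Midwest", "Midwest"),
  data_viz = c(0.9, -0.7, 0.6, 0.4, -0.2, 0.1, -0.5, -0.1)
)

# split(values, groups) returns a list with one vector of values per group
viz.reg.dist <- split(google$data_viz, google$region)

names(viz.reg.dist)   # one element per region, in alphabetical order
viz.reg.dist$West     # the data_viz scores for the western states
```

The result is a list rather than a data frame, because the regions don't have to contain the same number of states.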
So, I have this new list. Then I'm going to draw boxplots by region. I'm going to use a boxplot here, and I'm going to go back to my new data frame or list for interest in data visualization. I'm also going to color it lavender. There we have it. What this shows us is the distribution for each region. So, for instance, you can see here that the box indicates the range of the middle 50% of states in that region; their relative interest in data visualization.
So, we see that there's a lot of variation in the west, because its box is longer than the others. There's less variation among the middle 50% in the northeast. That's because the box is tighter. But we have outliers in the northeast. We have one that's unusually low, and one that's unusually high. Interestingly, the state with the highest relative interest in data visualization is in the south, and that's where we have a z-score of over three. You can see the northeast is generally higher than the others, with the exception of that one low outlier. So, that's one way to get a feel for the variations and distributions by groups.
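The boxplot call described above might look something like this. The list below is invented so the example runs on its own, and viz.reg.dist is a hypothetical name for the list that split produces.

```r
# Invented z-scores grouped by region (stand-in for the output of split)
viz.reg.dist <- list(
  Midwest   = c(-0.8, -0.4, -0.1, 0.2),
  Northeast = c(0.2, 0.4, 0.5, 0.7),
  South     = c(-0.6, -0.2, 0.1, 3.1),
  West      = c(-1.2, -0.3, 0.6, 1.4)
)

# Passing a list to boxplot() draws one box per list element
bp <- boxplot(viz.reg.dist, col = "lavender")

bp$names   # the group labels shown along the axis
```

Each box spans the first to third quartile of its group, which is the "middle 50%" being read off the chart.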
Another very common way is to do barplots for means. That's what I'm going to do down here. I'm going to create another object here where I'm going to use means. And so line 18 says viz.reg, so visualization, and the .reg is for region, except this time I'm doing the means. This makes it so I can do the bar chart. I'm going to use the R function sapply. Then I'm going to tell it what I'm dealing with, and that's the list that I got in the last step. This time I'm going to be calculating the mean.
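A minimal sketch of the sapply step, again on an invented list. Both object names here, viz.reg.dist and viz.reg.mean, are hypothetical.

```r
# Invented scores per region (stand-in for the list produced by split)
viz.reg.dist <- list(
  Midwest   = c(-0.5, -0.1),
  Northeast = c(0.6, 0.4),
  South     = c(-0.2, 0.1),
  West      = c(0.9, -0.7)
)

# sapply() applies mean() to every element of the list and simplifies
# the result to a named numeric vector, which is what barplot() wants
viz.reg.mean <- sapply(viz.reg.dist, mean)

viz.reg.mean
```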
So, I'm going to do that in line 18. Then I'm going to run a barplot. And so I'm telling barplot what it is I'm charting. I'm going to color it beige, and I'm going to give it a title that's rather long here. I'll scroll to the end for a moment. There we go. By the way, this right here means to break it into a new line. The backslash is the escape character, and n is the new line. Then this backslash right here means I actually want to print these quotes, because otherwise it thinks I'm done with the title, and then I have to do it again at the end of data visualization.
This one, because it's not escaped, it means it's the end of that command. So, I'm going to go back to the beginning, and I'm going to run that command by itself, barplot, by highlighting those three lines, and then pressing run. So, now I've got a barplot. It shows where the average is for each of these groups. On the other hand, there is one thing that's missing that would be really nice, and that is we don't have a zero axis line. Fortunately, I can add that manually with this abline function. All I've got to do is put the height. It's at zero.
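Put together, the barplot and the reference line might look like the sketch below. The means are invented, viz.reg.mean is a hypothetical name, and the title text is only a guess at the kind of title being described; what matters is the \n for the line break and the \" for the printed quotes.

```r
# Invented regional means (stand-in for the output of sapply)
viz.reg.mean <- c(Midwest = -0.3, Northeast = 0.5, South = 0.1, West = -0.2)

# \n starts a new line in the title; \" prints a literal quotation mark
# instead of ending the string
mids <- barplot(viz.reg.mean,
                col  = "beige",
                main = "Average Google Search Share of\n\"Data Visualization\" by Region")

# Add a horizontal reference line at height zero
abline(h = 0)
```

abline draws on the plot that is already open, which is why the barplot and abline lines need to be run together.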
If I highlight all of that, and run it together, now I get the means plot, and this time, it has the reference line at zero, which is a lot easier to read. Finally, it would be nice to have the actual numbers that go with each of these things. What I'm going to do to facilitate this is I'm going to use the psych package again. The first one installs it, and this one loads it for use. Then I'm going to do describeBy. It says, I want to take the variable data_viz, and I want to break it down by region.
This is based on describe. It just does it by group. I'm going to make this one down here bigger. As you can see, for each area, I know that there are 12 states in the midwest, 9 in the northeast, 17 in the south, 13 in the west, and this gives me the mean for each of these. So, for instance, you see that for the midwest, the mean score is -0.32. That's what we see over here. This bar comes down to -0.32. In the northeast, the mean is 0.45; it's positive, and we come up here.
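Here's a hedged sketch of getting those group counts and means. The data is invented, the lowercase google is an assumption, and the describeBy call only runs if the psych package is already installed; the tapply lines are a base R alternative that I've swapped in so the example works either way, not something from the course script.

```r
# Invented values: two states per region
google <- data.frame(
  region   = c("West", "West", "Northeast", "Northeast",
               "South", "South", "Midwest", "Midwest"),
  data_viz = c(0.9, -0.7, 0.6, 0.4, -0.2, 0.1, -0.5, -0.1)
)

# install.packages("psych")   # one-time install, if needed
if (requireNamespace("psych", quietly = TRUE)) {
  library(psych)
  # Descriptive statistics for data_viz, reported separately for each region
  print(describeBy(google$data_viz, google$region))
}

# Base R alternative: count and mean per region
region_n    <- tapply(google$data_viz, google$region, length)
region_mean <- tapply(google$data_viz, google$region, mean)

region_n
region_mean
```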
Again, these are z-scores indicating relative interest in searching on Google for data visualization compared to all of the other searches in that area. Anyhow, these box plots and these means plots are one way of looking at how a quantitative variable differs from one group to another, and it can often be an important step in an analysis.
The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
- What is R?
- Installing R
- Creating bar charts for categorical variables
- Building histograms
- Calculating frequencies and descriptives
- Computing new variables
- Creating scatterplots
- Comparing means