Join Barton Poulson for an in-depth discussion in this video, Creating crosstabs for categorical variables, part of Learning R (2013).
When you're looking at the associations in your data set, a lot of times you're going to want to look at the associations between two categorical variables, and that's when you want to use a cross tabulation, and usually a chi square test of significance. That's the simplest possible version of it. In this example, I'm going to be using the social network data, though I need to mention, I did make one modification to it. There was one case that did not have information on gender. Since I'm using gender here as a predictor variable, I wanted to have that missing case out, so I deleted the one case.
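As a rough sketch of that cleanup step, one way to drop a case with a missing gender value in base R is shown below. The data frame here is a tiny made-up stand-in, not the course file, and the column names `ID`, `Gender`, and `Site` are assumptions; the transcript does not show how the case was actually removed.

```r
# Made-up stand-in for the social network data (NOT the course's file).
# Column names are assumed for illustration.
sn <- data.frame(
  ID     = 1:5,
  Gender = c("Female", "Male", NA, "Female", "Male"),
  Site   = c("Facebook", "None", "Facebook", "MySpace", "Twitter")
)

n.before <- nrow(sn)           # number of cases before cleanup

# Keep only the rows where Gender is not missing.
sn <- sn[!is.na(sn$Gender), ]

n.after <- nrow(sn)            # one case fewer in this toy example
c(before = n.before, after = n.after)
```

If the missing value were stored as an empty string rather than `NA`, the filter condition would need to test for that instead.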
So, we're going to go from 202 cases to 201. I'm going to list the names of the variables. We have ID, gender, age, their preferred social networking Web site, and the number of times that they log in per week. I'm going to be looking at the association between gender and site to see, for instance, if men and women report different Web sites as their preferred method for social networking. The easiest way to do this is by creating a contingency table. I'm going to call it sn.tab.
That's for social network dot tabulation, or table. I'm using the table function that's part of R. All I need to say is what my two categorical variables are. I'm using gender, where the sn and the dollar sign mean it's from the sn data set, and I'm using Site. So, I'm just going to run line number 11. You see that the table shows up in the Workspace there on the right. Then on line 12, I just have sn.tab. That's just going to print it out. So, there I have the number of men and women who report Facebook, LinkedIn, MySpace, None, Other, and Twitter.
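The two steps just described (build the table, then print it) might look roughly like this. The data frame is a small made-up example rather than the course data, and the exact column names `Gender` and `Site` are assumed.

```r
# Made-up stand-in for the social network data (NOT the course's file).
sn <- data.frame(
  Gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male"),
  Site   = c("Facebook", "Facebook", "MySpace", "MySpace",
             "Facebook", "None", "None", "Twitter")
)

# Cross-tabulate two categorical variables:
# the first argument gives the rows, the second gives the columns.
sn.tab <- table(sn$Gender, sn$Site)

# Typing the object's name prints the table of cell counts.
sn.tab
```

Each cell holds the count of cases with that combination of row and column category.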
Looking at this, you can see there are a couple of interesting things. First off, identical numbers of men and women prefer Facebook. LinkedIn, Twitter, and Other are so small as to be negligible here. Again, this data set is a few years old. You see that MySpace has a much higher number of women reporting it as their primary method, and then for None, there are a lot more men who say they use None. These fit with some expected patterns. Now, these are just the frequencies or the counts; the cell frequencies. On the other hand, it can be really nice to get marginal frequencies, which are the totals for the rows and the columns, and it can also be nice to get percentages or proportions.
So, what I'm going to do is I'm going to scroll down here. First, just get the marginal frequencies. I'm going to get the row frequencies, and that's going to be just the number of men and women. So, I have 98 women and I have 103 men. The two groups are closely balanced, so the fact that they both have 46 in Facebook means the share is essentially the same in each. Now I'm going to look at the column marginal frequencies, and that tells me the overall number of people who prefer each social networking site. We've seen this before when we've done bar charts for this variable, but now a more interesting one is to get the proportions of people within each cell, and also the proportions who report using each one of these.
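The transcript does not name the function used for the marginal frequencies; `margin.table` is one base R function that produces them, so the sketch below assumes it. The counts are invented for illustration and are not the course data.

```r
# Made-up 2 x 5 table of counts (NOT the course data).
sn.tab <- as.table(matrix(
  c(40, 3, 16,  8, 3,
    40, 2,  4, 30, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Site   = c("Facebook", "LinkedIn", "MySpace", "None", "Twitter")
  )
))

# Row marginal frequencies: total cases for each gender.
row.totals <- margin.table(sn.tab, 1)
row.totals

# Column marginal frequencies: total cases for each site.
col.totals <- margin.table(sn.tab, 2)
col.totals
```

The second argument picks the dimension to keep: 1 for rows, 2 for columns.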
To do this, I'm going to use prop.table. That's proportions for the table, but I'm wrapping it up in a function that rounds off the number of decimal places. By default it gives a huge number of decimal places, and I only want two. What I'm doing with each one of these is, to get the cell percentages, I'm doing prop.table right here, and I tell it that I want to use sn.tab, that's the table for social network, as my data set, and I'm wrapping it in round to two decimal places. So, I'm going to run line 20.
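A minimal sketch of that call, wrapping `prop.table` in `round`, is below. The counts are invented for illustration and are not the course data.

```r
# Made-up 2 x 5 table of counts (NOT the course data).
sn.tab <- as.table(matrix(
  c(40, 3, 16,  8, 3,
    40, 2,  4, 30, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Site   = c("Facebook", "LinkedIn", "MySpace", "None", "Twitter")
  )
))

# Cell proportions: each count divided by the grand total,
# rounded to two decimal places for readability.
cell.prop <- round(prop.table(sn.tab), 2)
cell.prop
```

With no second argument, `prop.table` divides by the grand total, so the unrounded values sum to 1; the rounded ones can be off by a small amount.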
23% of respondents in this data set are women who said they like Facebook. 1% are men who said they like Twitter. All together, these 12 numbers (two genders by six sites) add up to 100%. Now let's look at the row percentages. Similar procedure, but now they add up to 100% going across. Say, for instance, we had dramatically different numbers of men and women. This would allow us to compare the relative interest in each of these sites, even with unbalanced marginal frequencies.
You can see, for instance, that MySpace, the numbers mirror what we saw earlier. 18% of the women like MySpace, whereas only 4% of the men. Then finally, line 22, let's just do a similar thing going in the other direction. Now these percentages add up going down. So we see, for instance, that for MySpace, 82% of the people who said they like MySpace were female; 18% were male. So, these are ways of looking at the data in several different dimensions. The last thing that I'm going to do is I'm going to actually do an inferential test to see if the distribution of preferred networking sites differs by gender.
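The row and column percentages described above both come from the second argument of `prop.table`. A short sketch, again on invented counts rather than the course data:

```r
# Made-up 2 x 5 table of counts (NOT the course data).
sn.tab <- as.table(matrix(
  c(40, 3, 16,  8, 3,
    40, 2,  4, 30, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Site   = c("Facebook", "LinkedIn", "MySpace", "None", "Twitter")
  )
))

# Row proportions: each row sums to 1 going across,
# so genders can be compared even when group sizes differ.
row.prop <- round(prop.table(sn.tab, 1), 2)
row.prop

# Column proportions: each column sums to 1 going down,
# showing the gender split within each site.
col.prop <- round(prop.table(sn.tab, 2), 2)
col.prop
```

In this toy table, 16 of the 20 MySpace respondents are female, so the Female/MySpace column proportion is 0.8.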
This is a statistical significance test, and I'm using chi-square in this particular case. The function for this is chisq.test, because we're doing the inferential test, and the data it takes is the tabulation, the table that I'm working with, sn.tab. I run that one, and it does Pearson's Chi-squared test. It tells me what data I'm using, and then it gives the X-squared. So, the value for chi-squared is 13.2076, with 5 degrees of freedom. The probability value, and that's the one that I'm really interested in here, is 0.02.
That's less than 0.05, which is the standard cutoff for statistical significance. So, this tells me that the variations between men and women in their preferred social networking sites are bigger than we would expect by chance; that there are, in fact, likely reliable differences between men and women in what they prefer. This shows up in that women are much more likely to prefer MySpace than men are, and men are much more likely to report that they have no preferred site. This warning message on the bottom says that the chi-squared approximation may be incorrect; that has to do with the fact that I have a relatively small sample, and I have some of what are called sparsely populated cells.
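A sketch of running the test and pulling out the pieces mentioned above (the statistic, the degrees of freedom, and the probability value) follows. The counts are invented, so the numbers will not match the 13.2076 reported for the course data. `suppressWarnings` is used here only to keep the example quiet; dropping it shows the sparse-cell warning.

```r
# Made-up 2 x 5 table of counts (NOT the course data).
sn.tab <- as.table(matrix(
  c(40, 3, 16,  8, 3,
    40, 2,  4, 30, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Site   = c("Facebook", "LinkedIn", "MySpace", "None", "Twitter")
  )
))

# Pearson's chi-squared test of independence on the contingency table.
result <- suppressWarnings(chisq.test(sn.tab))
result

# Individual pieces of the result object.
result$statistic   # the X-squared value
result$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
result$p.value     # the probability value

# Compare against the conventional 0.05 cutoff.
significant <- result$p.value < 0.05
significant
```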
Normally, for a chi-square to be reliable, you're going to want an expected frequency of at least five or ten cases per cell; not observed frequencies, but expected frequencies, which is a different thing. So I may want to exclude some of these social networking sites from the analysis, or combine them, so I can bump up the expected frequencies, and better meet the requirements of the chi-square. That being said, I still have good evidence that suggests that there are gender differences in preferred social networking site by using the cross-tabulated data, and the chi-squared test for significance.
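One way to act on that advice is to inspect the expected frequencies and then combine sparse categories before re-running the test. The sketch below uses invented counts, and the choice to merge LinkedIn and Twitter into a hypothetical `OtherSite` column is only an illustration, not what the course did.

```r
# Made-up 2 x 5 table of counts (NOT the course data).
sn.tab <- as.table(matrix(
  c(40, 3, 16,  8, 3,
    40, 2,  4, 30, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Site   = c("Facebook", "LinkedIn", "MySpace", "None", "Twitter")
  )
))

# Expected frequencies under independence (row total * column total / N).
expected <- suppressWarnings(chisq.test(sn.tab))$expected
round(expected, 1)
has.sparse <- any(expected < 5)   # TRUE here: LinkedIn and Twitter are sparse

# Combine the two sparse sites into one column (illustrative choice).
collapsed <- cbind(
  Facebook  = sn.tab[, "Facebook"],
  MySpace   = sn.tab[, "MySpace"],
  None      = sn.tab[, "None"],
  OtherSite = sn.tab[, "LinkedIn"] + sn.tab[, "Twitter"]
)

# Re-run the test on the collapsed table and re-check expected counts.
result2 <- chisq.test(collapsed)
round(result2$expected, 1)
result2
```

Collapsing trades detail for stability: the test now has fewer degrees of freedom, but every expected count in this toy table is at least five.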
The course continues with examples on how to create charts and plots, check statistical assumptions and the reliability of your data, look for data outliers, and use other data analysis tools. Finally, learn how to get charts and tables out of R and share your results with presentations and web pages.
- What is R?
- Installing R
- Creating bar charts for categorical variables
- Building histograms
- Calculating frequencies and descriptives
- Computing new variables
- Creating scatterplots
- Comparing means