Learn how to summarize categorical data.
- [Instructor] Categorical data is described by how observations are distributed across the variable's categories. A very simplistic approach to sentiment analysis could involve web scraping public product reviews, then classifying certain words found in the scraped data as positive and others as negative. Lastly, you do a categorical word count on the product review data to score a product review or feedback as either good or bad. Categorical variables only assume a fixed number of values.
For example, think of a fruit. A fruit can be an apple, an orange, a lemon, a pear, a prune, et cetera. But there are not infinite options. In other words, fruits fall into one category or another. Categorical variables are easily summarized using counts, grouping, variable descriptions, or cross-tabulations. Before going into the demonstration, I want to explain to you a little bit more about cross-tabulation. In practice, these tables are usually just called crosstabs.
A crosstab is a cross-tabulation of two or more features. By default, a crosstab table shows frequency counts for features. As you can see here, we generated a crosstab of two variables, the am variable and the gear variable. When we create a crosstab of those variables, what you get is a frequency count. The am variable assumes one of two values, either zero or one. The gear variable has three values: three, four, or five.
What you're seeing here in this crosstab is, for example, the first row of results, where the am variable has a value of zero. Zero actually represents an automatic transmission. Up here, we've got cars with three gears, four gears, and five gears. So what this is really saying is that there are 15 cars that have an automatic transmission and three gears. There are four cars that have an automatic transmission and four gears. And there are zero cars that have an automatic transmission and five gears.
Now let me show you how to use Python to describe categorical variables. In this demonstration, we're going to use numpy and pandas. So we'll import those. And this example is also going to use the cars dataset. So we just need to load that data like we did in the last video. I generated a head of the first 15 records just so we can get an idea of what's in there. Okay, let's look at the carb variable. This represents the number of carburetors a car has. From the records shown here, it looks like cars have between one and four carburetors.
But let's use the value_counts method just to double check. So we'll first isolate the carb variable, cars.carb, and then call the value_counts method off of it. By looking at these results here, you can see that when we eyeballed the cars data frame, we missed these two cars here, the one with eight carburetors and the one with six carburetors. That's exactly why it's a good idea to use the value_counts method to quantify your dataset and to describe it.
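As a rough sketch of that value_counts step: the small data frame below is a made-up stand-in for the cars data (the names and values are assumptions for illustration, not the real dataset), so the counts will differ from what you see in the demonstration.

```python
import pandas as pd

# Hypothetical stand-in for the course's cars data frame; these rows are
# invented for illustration and are not the real dataset.
cars = pd.DataFrame({
    "car_names": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "carb": [4, 4, 1, 2, 8, 6],
})

# Isolate the carb variable and count how many cars fall into each category.
carb_counts = cars.carb.value_counts()
print(carb_counts)
```

Every distinct value gets its own row in the result, so rare categories like the eight and six here show up even if they never appear in the first few records.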
That way, you don't miss anything you'd overlook by just eyeballing the data. So now I'm going to make a small subset of the cars data frame that includes the categorical variables only, and we'll print that out. So we'll say cars_cat and we'll choose the variables that we want to include: cyl, vs, am, gear, and carb. And then just to print the first few records, we'll say cars_cat, that's the name of our data frame. And then we'll call the head method off of it.
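Here is a minimal sketch of that subsetting step. The cars data frame is again a made-up stand-in (the mpg column and all values are assumptions for illustration); only the column selection and the head call mirror what is described above.

```python
import pandas as pd

# Hypothetical stand-in for the cars data frame; values are illustrative only.
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 15.8],
    "cyl": [6, 4, 8, 8],
    "vs": [0, 1, 0, 0],
    "am": [1, 1, 0, 1],
    "gear": [4, 4, 3, 5],
    "carb": [4, 1, 2, 4],
})

# Keep only the categorical variables by passing a list of column names.
cars_cat = cars[["cyl", "vs", "am", "gear", "carb"]]

# Print the first few records of the new subset.
print(cars_cat.head())
```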
And what you're seeing here is just the first few records in our new subset data frame. Now I want to show you how to group this new data frame by the gear variable. We'll call the output of this gears_group and then we'll write the name of our new data frame, which is cars_cat. Call the groupby method off of it and pass in the name of the gear variable. Then we'll call the describe method off of gears_group to get a description of each group.
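The grouping step might look something like this. It's only a sketch, assuming a small made-up cars_cat data frame in place of the real one, so the statistics it prints are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the cars_cat subset; values are illustrative only.
cars_cat = pd.DataFrame({
    "cyl": [6, 4, 8, 8, 4, 8],
    "am": [1, 1, 0, 0, 1, 1],
    "gear": [4, 4, 3, 3, 5, 5],
    "carb": [4, 1, 2, 4, 2, 8],
})

# Group the rows by the number of gears.
gears_group = cars_cat.groupby("gear")

# Generate summary statistics for every other column within each gear group.
summary = gears_group.describe()
print(summary)
```

The result has one row per gear value and, for each remaining column, a block of statistics such as count, mean, min, and max.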
We covered the describe method in the last video. As you can recall from our earlier discussion, we knew that there were cars that had either three gears, four gears, or five gears. Now what the groupby method has done is group our data frame into these three subgroups and then generate a statistical description for each of the variables, broken out by subgroup. It's time to look at transforming variables to the categorical data type.
To create a categorical variable, you can call the Series constructor on an existing data set. Just make sure to pass in the dtype equals category argument. This tells pandas to assign the new variable a data type of category. Here we create a new categorical variable from the cars dataset's gear variable. After we have created this variable, we add it as an additional column at the end of the cars dataset. We will call the new column group. Let me show you how to do this in code. So we write the name of our data frame.
We're going to add a new column called group. This group is going to be assigned a new Series object. So we call the Series constructor. And then we're saying we want the series to be made up of data from the gear variable in the cars data frame. But we want the new column to have a data type of category. Looks like I had an S there where it shouldn't have been, so we'll just fix that and re-run it.
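A minimal sketch of that assignment, followed by a quick check of the resulting data type and counts. The cars data frame here is a made-up stand-in with only a gear column (an assumption for illustration), so the counts are not the real ones.

```python
import pandas as pd

# Hypothetical stand-in for the cars data frame; values are illustrative only.
cars = pd.DataFrame({"gear": [4, 4, 3, 3, 3, 5]})

# Build a categorical series from the gear column and add it as a new column.
# Note the argument is dtype (no trailing s) when constructing the series.
cars["group"] = pd.Series(cars.gear, dtype="category")

# Confirm the data type of the new column.
print(cars["group"].dtypes)

# Count how many cars fall into each gear category.
group_counts = cars["group"].value_counts()
print(group_counts)
```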
Let's check out our new variable and see what kind of data type it is. In order to see the data type of a variable, write dtypes. And you can see, we've created a categorical variable called group. Now let's use value_counts to describe that variable. And there we have it. So we've got cars with three different counts of gears. We've got three gear, four gear, and five gear. You can see the distribution of how the cars fall into each of those groupings. The last thing I wanted to show you was how to describe categorical data with crosstabs.
Creating crosstabs is really simple. You just call the crosstab function on the variables you want included in the output table. Like this. So we'll say pd.crosstab and then we'll pass in the name of our variables that we're interested in including in the output table. So we'll do one for am variable and one for gear variable. And when we print this out, you see we get our crosstab table back that we've discussed earlier in this lesson.
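To make that concrete, here's a small sketch of the crosstab call. The data is a made-up stand-in for the cars data frame (an assumption for illustration), so the counts differ from the table discussed earlier in this lesson.

```python
import pandas as pd

# Hypothetical stand-in for the cars data frame; values are illustrative only.
cars = pd.DataFrame({
    "am": [0, 0, 0, 1, 1, 1],
    "gear": [3, 3, 4, 4, 5, 5],
})

# Cross-tabulate transmission type (rows) against number of gears (columns).
# Each cell holds the frequency count for that combination.
table = pd.crosstab(cars["am"], cars["gear"])
print(table)
```

Combinations that never occur in the data, such as am equal to zero with five gears in this sample, show up as a count of zero.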
Now that you know how to describe categorical data, let's look at different ways to assess the correlation between variables.