Join Barton Poulson for an in-depth discussion in this video, Naive Bayes classifiers, part of Data Science Foundations: Fundamentals.
- [Voiceover] Another common choice in classification algorithms is the Naive Bayes classifier. Now, you can think of this as the unreasonable effectiveness of naivete, or of simple solutions. The term "Naive Bayes" requires a little explanation. It's a classification method, so it's trying to take the cases in your data set and place them into categories. It's "Bayes," or "Bayesian," because it uses Bayes' theorem: it's trying to get the probability of the class that a case should be in, given the data.
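In symbols, that's just Bayes' theorem applied to classification:

    P(class | data) = P(data | class) × P(class) / P(data)

The classifier picks whichever class makes that posterior probability largest.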
And it's "naive" because it ignores the relationships between predictors. Now that might seem like a problem, but one of the interesting things is it's very effective nonetheless. Despite the naivete, or ignoring the relationships between variables, Naive Bayes still works very well, and in fact it works even better with data preparation. That can include things like balancing class size. It can include normalizing the classification weights to compensate for dependence among features. It can include transforming the data, getting transformations to emulate a power-law distribution.
There are some other choices, but these are the main ones that can contribute to it being an extremely effective approach. Let's take a quick look at a very simple example of Naive Bayes in R. To do this, I'm going to use two external packages. I'm going to use one called "e1071," which is named after a course for which it was developed. That has the Bayesian approach that we're going to use. And I'm going to use some data from another package called "mlbench," which is short for "machine learning benchmarks." Install those if you need to, and then we'll load them.
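A minimal sketch of that setup step, assuming a standard R session, might look like this:

    # Install once if needed, then load both packages
    install.packages(c("e1071", "mlbench"))

    library(e1071)    # provides naiveBayes()
    library(mlbench)  # provides the HouseVotes84 data set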
And then we're going to use some data called HouseVotes84, which has to do with votes in the House of Representatives in the U.S. Congress in 1984. So I'm going to load that data, and then let's take a look at the first six cases. I'll make this a little bigger here. And what we have is Class, which is the political party of the person in the House of Representatives, and then we have 16 variables, V1 through V16, which are votes on 16 particular bills. The NAs are for people who abstained or weren't present.
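Loading the data and looking at those first six cases is just the standard data() and head() calls:

    # Load the 1984 congressional voting records from mlbench
    data(HouseVotes84, package = "mlbench")

    # First six cases: Class (party) plus votes V1-V16, with NAs for abstentions
    head(HouseVotes84)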
And so we have yeses and nos, and we're going to use those to try to categorize people as Democrat or Republican. Let's go back to our model here. As I've done with other situations, I'm going to split the data into a training set and a testing set. We'll set a seed for reproducibility, and then I'll split the data into two separate sets: a training set and a testing set. Now we'll come down and build the classifier. I'm going to create an object called "nbc," for "Naive Bayes classifier," tell it what data to use, and then we can actually look at the classifier.
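A sketch of that split-and-fit step; the seed value and the 70/30 proportion are my assumptions, since they aren't named in the narration:

    # Reproducible train/test split (seed and 70/30 ratio are assumptions)
    set.seed(123)
    train_rows <- sample(nrow(HouseVotes84), size = 0.7 * nrow(HouseVotes84))
    train <- HouseVotes84[train_rows, ]
    test  <- HouseVotes84[-train_rows, ]

    # Fit the Naive Bayes classifier: predict party from all 16 votes
    nbc <- naiveBayes(Class ~ ., data = train)
    nbc  # printing shows the a-priori and conditional probabilities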
I'll open this up so you can see a little more of what's going on here. First off, it's giving us the a-priori probabilities, that is, what percentage of the overall training sample falls into each class: 59% Democrat and 41% Republican. The conditional probabilities are telling us what percentage of the Democrats and what percentage of the Republicans voted yes or no on each of our 16 bills. So, for instance, with V1, the first bill, 36% of the Democrats voted no and 64% voted yes, while about 18% of the Republicans voted yes and 82% voted no.
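If you want those same numbers programmatically, the fitted object stores them as components (this assumes the usual e1071 object structure):

    # A-priori class distribution: counts, then proportions (e.g. ~59% / ~41%)
    nbc$apriori
    prop.table(nbc$apriori)

    # Conditional probability table for the first bill, V1
    nbc$tables$V1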
And you can see, for instance, that on the second bill both parties are split nearly 50/50, and then we get some enormous splits on other ones. We can then use this model to take the cases in our training data and see if we can correctly classify them according to their political party. What I'm going to do first is check how well the model works on the training data. So I'm going to create a table here, and we can see that not everybody was classified correctly. Then I'm going to put it into proportions by using prop.table, and round and multiply to get percentages.
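That training-set check might look like this; the table orientation, with the actual party in the rows, is my assumption about what's on screen:

    # Confusion table on the training data: actual party vs. predicted party
    train_table <- table(actual = train$Class,
                         predicted = predict(nbc, train))
    train_table

    # Row percentages: share of each actual party classified correctly
    round(prop.table(train_table, margin = 1) * 100, 0)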
And we can see that 94% of the Democrats were classified correctly in the training data, versus 88% of the Republicans. Now let's try it on the test data. All I'm going to do is create a new table, using the same model we had, nbc, my Naive Bayes classifier, but applying it to the testing data. And here we see that, again, the classification's not perfect. Let's look at the percentages with the round function. And now what we get is that 96% of the Democrats in our testing data are correctly classified based on their voting records, and 77% of the Republicans were correctly classified.
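And the same pattern on the held-out testing data, again as a sketch:

    # Apply the same fitted model, nbc, to the testing data
    test_table <- table(actual = test$Class,
                        predicted = predict(nbc, test))
    test_table

    # Percent of each actual party correctly classified in the test set
    round(prop.table(test_table, margin = 1) * 100, 0)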
So we might want to, of course, include a larger sample. We might want to include more observations to get a more accurate classification, but that's the basic idea of how a Naive Bayes classifier works. And so what conclusions can we draw from this? First, Naive Bayes is simple, but it's an effective approach. Second, it works with a variety of predictors: you can have quantitative predictors, you can have categorical predictors. And third, it's easy to interpret the results. All of these together make Naive Bayes a great choice for classification in machine learning.