Join Keith McCormick for an in-depth discussion in this video What is bagging?, part of Machine Learning & AI: Advanced Decision Trees.
- [Instructor] Now, let's talk about a very influential technique called bagging, which is a kind of homogeneous ensemble. We'll be demonstrating it in Modeler. I'm going to begin by opening a stream. We're going to use the Quest stream. Bagging can be applied in many situations, not just Quest. So we're simply using Quest as an example of this technique. I'm going to go inside the Quest modeling node, and over to Build Options, and we're going to go right to generating the model.
And as you notice below, there's a checkbox that says that we can go ahead and enhance model stability through bagging, but initially what I'd like to do is go ahead and run this model as is. There it is. And I'm going to sever this link, because I'm going to go ahead and now create a second model using bagging so that we can compare and contrast the two in a couple of moments. So, I go back in and I choose bagging.
And now I've got two models. The first one without bagging and the second one with. Before we examine the model that utilized bagging, let me explain what bagging is all about. Bagging, or bootstrap aggregating, has been around for many years. In fact, it was first proposed in the early '90s by Leo Breiman. If you recognize that name, it's perhaps because he was the gentleman who came up with CART.
So, bagging is a special case of model averaging. The notion is that we build multiple models and then combine all of those results. The key to understanding bootstrap aggregating is to understand what bootstrap sampling is. There's no point in building multiple models, in this case 10, if all the models are identical. There has to be some variation.
Bootstrap sampling is sampling with replacement, so if you imagine a lottery drum, the notion is that you pick out a number, you record it, but then you put it back. That's critical, because there has to be a possibility that some numbers will get picked not at all, or more than once. Let's take a look at an example. I've taken some of the passengers from the Titanic passenger list and I've gone through exactly this process.
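The lottery-drum idea above can be sketched in a few lines of Python. This is a minimal illustration, not anything from Modeler: the passenger names are stand-ins, and the fixed seed is just so the run is reproducible.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Stand-ins for the 20 Titanic passenger names mentioned above
passengers = [f"Passenger {i}" for i in range(1, 21)]

# Sampling WITH replacement: each name goes back in the "lottery drum"
# after it is drawn, so a name can appear several times or not at all.
bootstrap_sample = [random.choice(passengers) for _ in range(len(passengers))]

picked = set(bootstrap_sample)
never_picked = [p for p in passengers if p not in picked]

print(len(bootstrap_sample))  # 20 draws, same size as the original list
print(len(never_picked))      # typically around 7 names are never drawn
```

Running this repeatedly with different seeds shows exactly the pattern described with the passenger list: some names come up twice or more, while others never appear at all.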
I took just the top 20 names, and I used a randomizer to pick a name, and then pick another, but again, with the possibility of picking some names twice. Let's take a closer look at three resamples that I've done. Resampling is a term that's often used for the same thing. So take a look. We have three different samples. Each would be the basis of a different tree, and look at the amount of repetition. So, for instance, Miss Allen was chosen twice in the first sample, but not at all in the second or the third.
Mrs. John Jacob Astor was picked once in the first sample, four times in the second sample, and once again in the third sample. One more passenger was picked once in the first and second samples, and then four times in the third sample. So these samples really become quite different from each other, and that's the basis of the different models. 10 different models in this case, built on 10 different resamples of our data. In fact, if you look inside the modeling node, and go to Ensembles, you can see that the default setting in Modeler is to build 10 of these models, each with their own resample.
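The "10 models on 10 resamples" process described above can be sketched by hand in Python. This is a hedged illustration: scikit-learn's `DecisionTreeClassifier` (a CART-style tree) stands in for Quest, which scikit-learn does not implement, and the breast cancer dataset stands in for the Titanic data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

trees = []
for _ in range(10):                   # Modeler's default: 10 component models
    idx = rng.integers(0, n, size=n)  # one bootstrap resample (with replacement)
    trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

# Aggregate: majority vote across the 10 trees' predictions
votes = np.stack([t.predict(X) for t in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print((bagged_pred == y).mean())  # training accuracy of the bagged ensemble
```

Each tree sees a slightly different resample, so the trees differ, and the majority vote is the "aggregating" half of bootstrap aggregating.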
Let's take a look at the resulting model. There are a number of interesting details. We can go down to Predictor Importance and recognize that this is the Predictor Importance across all 10 of the models. A particularly interesting lens into this is to click on the next tab here in Modeler, and what we can actually see is that Sibling/Spouse was used in the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, and tenth model.
It was used in all 10. We have a fairly modest data set, so in all 10 instances, Quest has chosen to use all of the variables at some point in the tree. Now, let's add an Analysis node and compare the performance of the single Quest tree to the 10 bagged trees. We'll run that.
The performance of the first model was 83% accurate on the training data, and a bit lower at about 79% on the testing data. That would be considered pretty good. We don't want the drop between training and testing performance to be more than 5%, so we're doing okay. But you can see the motivation behind bagging. The performance of the second model is better. The second model we see has a performance of 83% on the training data, but 81% on the testing data.
Its performance on the testing data is pretty good, compared to the single tree, so we've accomplished the kind of thing that we want here. We have a better model through combining the 10. What are some potential disadvantages, though, of bagging? Why don't we just always do it? Well, one disadvantage, unfortunately, is that you can examine and learn about your data through looking at a single tree, but it's a lot harder to do with 10. So essentially, once you bag, you've turned Quest, in this case, into a black box technique.
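The train/test comparison just described can be sketched in Python. Again, this is a stand-in for the Modeler workflow: scikit-learn's CART-style trees replace Quest, and the dataset and split are hypothetical, so the exact percentages will differ from the ones in the video.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# One single tree versus an ensemble of 10 bagged trees
single = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                           n_estimators=10, random_state=1).fit(X_tr, y_tr)

# The train-to-test drop is what we watch for overfitting; bagging
# typically narrows that gap relative to a single unpruned tree.
for name, m in [("single tree", single), ("bagged x10", bagged)]:
    print(name, round(m.score(X_tr, y_tr), 3), round(m.score(X_te, y_te), 3))
```

The single unpruned tree tends to fit the training data almost perfectly and give back some of that accuracy on the test set, while the bagged ensemble usually holds up better, which mirrors the 79% versus 81% test result above.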
So that's a potential downside. Another potential downside that we're not experiencing here is that sometimes, bagged models can get overfit. I think we can imagine why that's the case. We keep on building the model over and over again on slightly varying data sets, and sometimes what will happen is you'll get an overfit tree, which we would recognize by a high training performance but a poor testing performance, and that large drop between the two would indicate that, in fact, we were overfit.
In this case, would we go with the bagged model? Well, unless the black box issue were a concern, I probably would. Notice, however, that the agreement between the two is actually quite high. The single model makes the same predictions as the bagged model 95 to 96% of the time. So in this case, we probably could go with either. If transparency was important, we would probably go with the single tree. If we wanted that little bit of extra accuracy, then the bagged version would also be a good choice.
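The agreement rate mentioned above is simple to compute: it is just the fraction of cases where the two models make the same prediction. A quick sketch with hypothetical prediction arrays:

```python
import numpy as np

# Hypothetical predictions from the single tree and the bagged ensemble
# for 10 cases (1 = survived, 0 = did not survive)
pred_single = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred_bagged = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Agreement: fraction of cases where the two predictions match
agreement = (pred_single == pred_bagged).mean()
print(agreement)  # 0.9 for these toy arrays (they differ on one case of ten)
```

On the real data in the video this comes out at roughly 0.95 to 0.96, which is why either model is a defensible choice here.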