Join Keith McCormick for an in-depth discussion in this video Costs and priors, part of Machine Learning & AI: Advanced Decision Trees.
- [Instructor] Now I'd like to talk about costs. Costs are a feature that almost all the decision tree algorithms have, but I'm going to demonstrate them via CART, because CART has a really nice feature with regard to costs that some of the other algorithms don't have. I'm going to start by opening a stream that I've created for just this purpose. The first CART model, already placed on the stream here, was run on default settings, so there's nothing particularly special about that model.
Now, in the Settings for the CART modeling node, I'm going to show you where the settings for costs are. You can check off Use misclassification costs, and I've increased the value in the upper-right corner from the original 1.0 to 5.0. What's going on here? Well, that corner of the matrix means the following: those are folks who actually died in the accident but whom the model said would survive.
That's a particular kind of mistake. The opposite kind of mistake would be predicting that they were going to die when they actually survived. The reason you focus on costs is that those two mistakes might not be equally important to you. Let's say you had an intervention strategy. We'll have to imagine that the situation is slightly different from the Titanic itself, which is a historical incident; we'd have to imagine that this was an ongoing series of cruise trips.
Maybe what's happening is that folks aren't getting to safety the way they should, so when the model tells us that a particular passenger is at risk of not getting to the safety boats, we're going to have a crew member help them. We have to identify the passengers who are at risk of being hurt in the accident so that we can help them. That means we really have to figure out who's at risk. Therefore, the mistake where we predict they're going to survive but they don't is more serious than the other kind of mistake, so I'm going to give it five times the weight.
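The same idea can be sketched outside SPSS Modeler. As an assumption for illustration: scikit-learn's CART-style DecisionTreeClassifier doesn't accept a full misclassification cost matrix, but its class_weight parameter is the closest analogue, letting you make errors on one class (here, class 0, "died") count five times as much, mirroring the 5.0 entered in the video. The data here is synthetic, not the Titanic dataset.

```python
# Sketch (assumption): weighting one kind of mistake more heavily with
# scikit-learn's CART-style tree via class_weight, the nearest analogue
# to SPSS Modeler's misclassification cost matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 0 = died, 1 = survived.
X, y = make_classification(n_samples=1000, weights=[0.4, 0.6], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default tree: both kinds of error cost the same (weight 1.0 each).
default_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Cost-adjusted tree: mistakes on class 0 ("died") count five times as much.
# Because the weights enter the impurity calculation, the splits themselves
# can change, not just the cutoff points.
cost_tree = DecisionTreeClassifier(
    max_depth=4, class_weight={0: 5, 1: 1}, random_state=0
).fit(X_tr, y_tr)
```

Note the design point this illustrates: because class_weight feeds into the split criterion during training, the cost-adjusted tree can grow a different shape, which is the CART behavior discussed next.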
Now here's the really interesting feature that CART has. Not only will CART change the cutoff points for the predictions (all the techniques do that), CART will actually change the shape of the tree to help reduce the mistake that we're worried about. Let's take a look at the two trees side by side. We do indeed see that there's a difference. They both first branch on gender, passenger class, and age.
But in the tree on the left, which was the default-settings tree, we then split on Fare and Sibling Spouse. The tree on the right, after the costs have been adjusted, now splits on Sibling Spouse on both sides, and beneath it there's now a new branch. Let's see how that impacts the accuracy of these two trees. I'm going to run the Analysis node, and I've chosen a non-default setting, Coincidence matrices.
This will allow us to see what kinds of mistakes we're making. Let's take a look. Remember, the first model is the default model, and in the second model we've applied costs. Notice that the overall accuracy for the first model is over 81%, and the overall accuracy for the second model is substantially less. This is guaranteed to happen: when you start to play around with costs, you're saying that you care deeply about a particular kind of mistake and want to avoid it.
But your overall accuracy is going to suffer. Let's look at the test data, because that's really where the action is. In the default model, notice that it says rows show actuals. The zeroes are folks who actually died, and there are 14 folks who actually died that were predicted to survive in the default model. Looking down at the new model, where costs have been applied, of those individuals who actually died, only five, not 14, were predicted to survive.
It's done exactly what we asked it to do. We told it that that kind of mistake was what was important to us, and even though we had to sacrifice overall accuracy, it has reduced that mistake by several cases.
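The coincidence matrix the Analysis node shows is just a confusion matrix with rows as actuals and columns as predictions, so the cell at row 0, column 1 is exactly the "actually died, predicted to survive" mistake. A minimal sketch of reading that cell, using made-up prediction vectors (not the video's actual 14-versus-5 results) for illustration:

```python
# Sketch: reading a coincidence (confusion) matrix. Rows are actuals,
# columns are predictions; cm[0, 1] is "actually died, predicted survived".
# The label vectors below are hypothetical, chosen only to illustrate the
# trade-off: fewer died-predicted-survived errors, lower overall accuracy.
from sklearn.metrics import accuracy_score, confusion_matrix

y_actual  = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # 0 = died, 1 = survived
y_default = [1, 1, 0, 0, 1, 1, 1, 1, 1, 1]  # default model
y_costs   = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]  # cost-adjusted model

cm_default = confusion_matrix(y_actual, y_default)
cm_costs = confusion_matrix(y_actual, y_costs)

# The mistake we weighted: actually died, predicted to survive.
print(cm_default[0, 1])  # 2 such errors under default settings
print(cm_costs[0, 1])    # 0 after applying costs

# The price paid: overall accuracy drops.
print(accuracy_score(y_actual, y_default))  # 0.8
print(accuracy_score(y_actual, y_costs))    # 0.7
```

This mirrors what the Analysis node showed: the targeted cell shrinks, while errors of the opposite kind (and therefore overall accuracy) get worse.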