Join Keith McCormick for an in-depth discussion in this video, "How does C&RT weigh purity and balance?", part of Machine Learning & AI Foundations: Decision Trees.
- C&RT weighs two factors equally, purity and balance. Purity is typically measured with a variation of the Gini coefficient. Balance is simply the left branch and the right branch having the same or similar number of cases. C&RT always produces binary splits, meaning that it always splits into two. Let's take a look at an example. This is the age variable from the Titanic data set. This split shows where C&RT wants to split.
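To make the two factors concrete, here is a minimal Python sketch of how each could be measured for a binary split. The names `gini_impurity` and `balance_ratio` are hypothetical helpers written for this illustration, not part of SPSS Modeler, and the min/max ratio is only one simple way to express "same or similar number of cases." The video does not show how C&RT scores balance internally.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 0.0 is perfectly pure, higher is more mixed."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def balance_ratio(n_left, n_right):
    """Hypothetical balance measure: smaller branch size over larger branch size.

    1.0 means both branches hold the same number of cases; values near 0.0
    mean almost all cases fell into one branch.
    """
    larger = max(n_left, n_right)
    if larger == 0:
        return 0.0
    return min(n_left, n_right) / larger


# A node that is two thirds one class and one third the other.
print(gini_impurity([2, 1]))
# The branch sizes of the forced, balanced split discussed in the video.
print(balance_ratio(238, 276))
```

A pure node scores 0.0 on impurity, and a perfectly even split scores 1.0 on this balance ratio.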
Note that it's not terribly balanced. However, purity has changed. The root node shows roughly two thirds blue and one third red. Remember that red is survival here. The leaf node on the right is similar, but the leaf node on the left has moved in a new direction, showing nearly two thirds red. This would favor age as a potential predictor. It is showing a sharp contrast between the very young and everyone else. If this process were to continue, we'd eventually end up with leaf nodes that were much more pure than the root node, which, of course, is what we want.
Let's take a look. I've forced it to be more in balance, so, out of the 514 passengers, 238 are on the left and 276 are on the right. So, certainly, we would give ourselves a good score for balance, but notice that we're really not achieving purity, in the sense that we have two thirds blue in the root node, we have two thirds blue in node 19, and we have two thirds blue in node 20. It's only moved a little bit. So, as far as progress on purity goes, not as good as where C&RT wanted to split.
Finally, be careful what you wish for. Here's an instance where we've achieved perfect purity in node 22, but there's only one passenger, obviously not something that's desirable. This is the main reason that balance is so important to weigh equally with purity. If we didn't consider both factors, this phenomenon might occur quite frequently, where we would get nodes that had only one passenger in them. In the next video, we're going to be taking a closer look at how C&RT handles continuous, nominal, and ordinal variables.
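The three splits discussed above can be compared with a short Python sketch. It uses the size-weighted Gini decrease that CART-style trees commonly use, which is an assumption here, since the video does not show the exact score SPSS Modeler computes. The class counts are made up for illustration: only the total of 514 passengers, the 238/276 branch sizes, and the rough two-thirds proportions come from the video. `gini_decrease` is a hypothetical helper name.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 0.0 is perfectly pure."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n_left, n_right = sum(left), sum(right)
    n_total = n_left + n_right
    weighted_children = (
        n_left / n_total * gini_impurity(left)
        + n_right / n_total * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Counts are (blue, red). ILLUSTRATIVE values only, not the real Titanic figures.
root = (343, 171)  # roughly two thirds blue, 514 passengers in total

splits = {
    # Unbalanced, but the small left node is nearly two thirds red.
    "contrasting": ((14, 26), (329, 145)),
    # Forced balance: 238 vs. 276 cases, both still about two thirds blue.
    "balanced": ((155, 83), (188, 88)),
    # Perfect purity on the left, but the node holds a single passenger.
    "single_passenger": ((0, 1), (343, 170)),
}

scores = {name: gini_decrease(root, left, right)
          for name, (left, right) in splits.items()}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.5f}")
```

Under these made-up counts, the contrasting split scores highest. The single-passenger split earns only a tiny improvement despite its perfectly pure node, because that node's weight is 1 out of 514. That is one way branch size can temper purity in a split score.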
- Using the SPSS Modeler
- Building a CHAID model
- Adding a second model with C&RT
- Analysis notes
- Using a lift and gains chart
- Exploring algorithms
- Building a tree interactively
- The Bonferroni adjustment
- Handling nominal, ordinal, and continuous variables
- Examining the CHAID tree
- The Gini coefficient
- Weighing purity and balance
- Understanding pruning
- Examining the C&RT tree
- Applying stopping rules
- Using the Auto Classifier tuning trick