Join Keith McCormick for an in-depth discussion in this video, How CHAID handles nominal variables, part of Machine Learning: Decision Trees.
- [Instructor] Let's take a closer look at how CHAID is going to split a nominal variable. We'll start with Embarked. Notice that when it scanned the data, Modeler found four values for Embarked: blank, C, Q, and S. See what CHAID has done? It has grouped the values automatically. I did not tell it how to group them. Queenstown and Southampton are combined. But Cherbourg is on its own, along with the missing data, which is indicated in Modeler with the word blank.
I can actually force it to break the data into all three groups, plus a separate node for missing. So why did it form these particular groups? Notice that Q and S have very similar survival rates, rates whose difference is not statistically significant. So CHAID has automatically combined them. On the other hand, the values on the left side, Cherbourg and blank, don't look very similar in terms of survival rates, so why did CHAID combine them? It's because sample size is a factor.
There's only one passenger whose embarkation point we don't know, so there isn't enough data to get a statistically significant difference. So CHAID has taken the one person with missing embarkation and lumped them in with Cherbourg. Next, let's take a look at how CHAID splits ordinal variables.
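The merging behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Modeler's implementation: it repeatedly finds the pair of categories whose survival rates differ least (the highest Pearson chi-square p-value) and merges them if that difference is not significant. The function names `pair_p_value` and `merge_categories`, the 0.05 merge threshold, and the counts are all assumptions made up for illustration; they are not the figures shown in the video. The sketch also leaves out parts of full CHAID, such as the Bonferroni adjustment and re-splitting merged groups.

```python
from itertools import combinations
from math import erfc, sqrt


def pair_p_value(a, b):
    """Pearson chi-square p-value (1 degree of freedom) for a 2x2 table.

    a and b are (survived, died) counts for two categories.
    """
    rows = [a, b]
    row_totals = [sum(r) for r in rows]
    col_totals = [a[0] + b[0], a[1] + b[1]]
    total = sum(row_totals)
    # A zero margin gives no evidence of a difference; treat it as p = 1.
    if 0 in row_totals or 0 in col_totals:
        return 1.0
    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def merge_categories(counts, alpha_merge=0.05):
    """Merge the most similar pair of categories until every remaining
    pair differs significantly. alpha_merge is an assumed threshold."""
    groups = {frozenset([label]): tuple(c) for label, c in counts.items()}
    while len(groups) > 1:
        # Nominal variable: any two groups are allowed to merge.
        g1, g2 = max(
            combinations(groups, 2),
            key=lambda pair: pair_p_value(groups[pair[0]], groups[pair[1]]),
        )
        if pair_p_value(groups[g1], groups[g2]) <= alpha_merge:
            break  # even the most similar pair is significantly different
        s1, d1 = groups.pop(g1)
        s2, d2 = groups.pop(g2)
        groups[g1 | g2] = (s1 + s2, d1 + d2)
    return [sorted(group) for group in groups]


# Hypothetical (survived, died) counts, invented for this sketch.
counts = {
    "blank": (1, 0),
    "C": (90, 70),
    "Q": (30, 50),
    "S": (210, 390),
}

print(merge_categories(counts))
```

With these made-up counts, Q and S merge first because their rates are close, and the single blank passenger then joins C: one row of data is too little to produce a significant difference from any group, so it falls in with the group it resembles most. The remaining two groups differ significantly, so merging stops there.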
- Using the SPSS Modeler
- Building a CHAID model
- Adding a second model with C&RT
- Analysis notes
- Using a lift and gains chart
- Exploring algorithms
- Building a tree interactively
- The Bonferroni adjustment
- Handling nominal, ordinal, and continuous variables
- Examining the CHAID tree
- The Gini coefficient
- Weighing purity and balance
- Understanding pruning
- Examining the C&RT tree
- Applying stopping rules
- Using the Auto Classifier tuning trick