From the course: Machine Learning and AI Foundations: Decision Trees with SPSS

How CHAID handles nominal variables

- [Instructor] Let's take a closer look at how CHAID is going to split a nominal variable. We'll start with Embarked. Notice that when it scanned the data, Modeler found four values for Embarked: blank, C, Q, and S. See what CHAID has done? It has automatically grouped them. I did not tell it how to group them. Queenstown and Southampton are combined. But Cherbourg is on its own, along with the missing data, which is indicated in Modeler with the word blank. I can actually force it to break the data into all three groups, plus a separate node for missing. So why did it form these particular groups? Notice that Q and S have very similar survival rates, rates whose difference is not statistically significant. So CHAID has automatically combined them. On the other hand, the two values on the left side don't look very similar in terms of survival rates, so why did CHAID combine them? It's because sample size is a factor. There's only one passenger whose embarkation point we don't know, so there isn't enough data to get a statistically significant difference. So CHAID has taken the one person with missing embarkation and lumped them in with Cherbourg. Next, let's take a look at how CHAID splits ordinal variables.
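
The merging logic described for Embarked can be sketched in a few lines of Python. This is a simplified illustration, not Modeler's implementation: it runs a plain Pearson chi-square test on one pair of categories at a time, whereas CHAID repeats the comparison across all eligible pairs. The function names, the `alpha_merge` threshold of 0.05, and the passenger counts are all assumptions made up for the example; they are not taken from the course data.

```python
import math

def chi_square_2x2(a_survived, a_died, b_survived, b_died):
    """Pearson chi-square test (1 degree of freedom, no continuity
    correction) comparing two categories on a binary target."""
    observed = [[a_survived, a_died], [b_survived, b_died]]
    row_totals = [sum(row) for row in observed]
    col_totals = [a_survived + b_survived, a_died + b_died]
    n = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability of the
    # chi-square distribution equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def should_merge(p_value, alpha_merge=0.05):
    """Merge two categories when their difference is NOT significant.
    alpha_merge is a hypothetical name and value chosen for this sketch."""
    return p_value > alpha_merge

# Hypothetical (survived, died) counts per embarkation value.
counts = {
    "Q": (30, 47),
    "S": (217, 427),
    "C": (93, 75),
    "blank": (1, 0),  # a single passenger with missing embarkation
}

for left, right in [("Q", "S"), ("C", "S"), ("blank", "C")]:
    stat, p = chi_square_2x2(*counts[left], *counts[right])
    verdict = "merge" if should_merge(p) else "keep separate"
    print(f"{left} vs {right}: chi2={stat:.2f}, p={p:.3f} -> {verdict}")
```

With these made-up counts, Q and S are merged because their survival rates are close, C and S stay separate, and the single blank record is merged with C: one observation cannot produce a significant difference, whatever its survival rate looks like. The chi-square approximation is also unreliable with counts this small, which is another reason not to read much into that one record.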
