Join Keith McCormick for an in-depth discussion in this video, Winnowing attributes, part of Machine Learning & AI: Advanced Decision Trees.
- [Instructor] Now I'd like to demonstrate a special feature of C5.0 that allows you to identify your most important variables before the tree is built. It's called winnowing attributes. I'm going to open a stream that I've prepared for you called C5.str. There are a couple of quick things to double-check. In the source node, you should verify that the data is where the source node expects to find it. You can do that quickly just by clicking on Preview.
It looks like it's okay. Also, if you go into the type node, you'll see that I've declared the level of measurement and the role the way that we need them to be, and I've placed a C5.0 modeling node on the canvas, but I've kept it on default settings. Let's take a look. Winnowing attributes is actually found under the expert settings, but the idea behind it is really straightforward. For now we're going to keep it on the default setting, which means that it's turned off.
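The details of C5.0's winnowing procedure are more sophisticated than this, but the core idea can be sketched as a pre-screen: score every candidate attribute against the target before any tree is grown, and discard the weak ones. Here is a rough Python analogue on synthetic data, using information gain as the score; the threshold, data, and function names are illustrative assumptions, not the actual C5.0 algorithm.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    """Information gain from splitting labels y on categorical feature x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

def winnow(X, y, threshold=0.02):
    """Keep only the features whose information gain clears the threshold."""
    gains = [info_gain(X[:, j], y) for j in range(X.shape[1])]
    return [j for j, g in enumerate(gains) if g >= threshold]

# Synthetic binary data: feature 0 is informative, feature 1 is a noisier
# copy of feature 0, and feature 2 is pure noise.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, 500)
x0 = y ^ (rng.random(500) < 0.10)   # ~90% agreement with the target
x1 = x0 ^ (rng.random(500) < 0.20)  # correlated with x0, but weaker
x2 = rng.integers(0, 2, 500)        # unrelated noise
X = np.column_stack([x0, x1, x2]).astype(int)

print(winnow(X, y))  # the noise column should be screened out
```

Only the surviving column indices would then be offered to the tree-growing step, which is the sense in which winnowing happens before the tree is built.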
Now I'm going to run the model and look at the results. We'll come back in a moment to look at the tree, but for now I just want to point out that all seven variables have been used. We can actually see that by looking at the predictor importance. So passenger class, sex, sibling spouse, embarked, age, fare, and parent child are all being used by the model. If we go back into the C5.0 settings, we can now select winnow attributes and see whether it changes the result.
We find that not all seven variables are being used. We can verify that in a different way: in Modeler, we can go to the Summary tab and expand Fields and Inputs to confirm that only five of the seven available variables are being used. So what have we accomplished? Let's take a look at the tree and discuss. If you look over at node 22, passenger class has been chosen for that branch.
What you want to remind yourself of is that we're not looking at all the data; we're looking at 80% of the data, the 80% that's in the train partition. So you can imagine it's like pulling the handle of a slot machine. We've drawn a particular 80%, but if we were to pull that handle again, we would get a somewhat different 80%. So what's another variable that's similar to passenger class? Well, we've seen that fare is certainly similar to passenger class.
So at the moment node 24 is third-class passengers, but if we pulled that slot machine handle and got a different 80%, perhaps C5.0 would choose fares under 25 pounds instead of third class. On one level you might think, no big deal, the variables are similar. But remember that the choice affects all the variables beneath it as well; that entire branch of the tree is affected. So in large, complex situations with lots of attributes, winnowing prevents this kind of competition between similar variables to explain the same information content.
And as a result, the tree will be both more accurate and more generalizable, meaning that it will have a greater ability to make predictions even about data that it has not seen.
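The slot-machine effect can be simulated directly. This is a hedged sketch on synthetic data, not the real Titanic columns: `third_class` and `low_fare` are stand-in features built to be roughly equally informative about survival, and we record which one wins the root split across repeated 80% training samples.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    """Information gain from splitting labels y on categorical feature x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

rng = np.random.default_rng(7)
n = 1000
survived = rng.integers(0, 2, n)
# Two stand-in features with similar information content, plus pure noise.
third_class = survived ^ (rng.random(n) < 0.15)
low_fare = survived ^ (rng.random(n) < 0.15)
noise = rng.integers(0, 2, n)
X = np.column_stack([third_class, low_fare, noise]).astype(int)

# Pull the slot machine handle 200 times: draw a fresh 80% training sample
# each time and record which feature would win the root split.
wins = {"class": 0, "fare": 0, "noise": 0}
names = list(wins)
for _ in range(200):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)
    gains = [info_gain(X[idx, j], survived[idx]) for j in range(3)]
    wins[names[int(np.argmax(gains))]] += 1

print(wins)  # the two similar features compete; noise should never win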