Join Keith McCormick for an in-depth discussion in this video, "How QUEST handles ordinal and continuous variables," part of Machine Learning & AI: Advanced Decision Trees.
- [Narrator] Let's discuss how QUEST handles ordinal and continuous variables. We have two issues. The first issue is how does QUEST handle the ranking of variables? How does it determine where Fare will fall in the list of variables from most to least important? Well, in the case of QUEST, for continuous and ordinal variables it uses the F-test, which produces a test statistic and a P-value. Since categorical variables are handled with chi-square, which also produces a P-value, all the variables in the dataset can be ranked from most important to least important using these P-values.
The next issue is how does QUEST determine where the cut points should be? In other words, what Fare amount in British pounds should separate the members of Node One from the members of Node Two? Let's talk about both of these issues in a little bit more detail. How does the F-test work? Well, the first thing is to be clear about what we're attempting to do. We're trying to predict a scale dependent variable using a categorical predictor. So in this case, we're actually trying to use Survived to predict Fare.
And that's how we're generating that P-value. F-tests can be thought of as a signal-to-noise ratio. And if you look, those that survived, indicated by the number one, paid about 50 British pounds for their ticket on the Titanic. Those that died paid considerably less, on average around 22 British pounds. The gap between these two groups, from around 48 or 50 on the high end to 22 on the low end, represents the signal that we're measuring in our signal-to-noise ratio.
Now let's talk about the noise. The error bars show that the upper bound of the 95% confidence interval is probably about 56 pounds for the survivors. And the lower bound of that confidence interval is more like 43 or 44 British pounds. The width of that confidence interval gives us some indication of the noise in the system, both the variety of the prices paid and also a sense of things like standard deviation and even factors like sample size.
And of course, if we look at the folks that did not survive, we see variation there as well. What the F-test is doing is comparing signal to noise, literally in the form of a ratio. Let's take a look at formal F-test results, the way that they would appear in statistical software. There are a couple of things that I want to draw your attention to. First, we can see the exact averages of our two groups. About 22 British pounds paid by those that did not survive and just over 48 pounds paid by those who did.
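The signal-to-noise idea described here can be sketched in a few lines. This is a minimal illustration of a one-way F-ratio, not the output of any particular statistical package: the function name `f_ratio` and the fare values are assumptions made up for the example, and they are not the Titanic data.

```python
from statistics import mean


def f_ratio(groups):
    """One-way F-ratio: between-group variance over within-group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (2 for Survived: 0 or 1)
    n = len(all_values)    # total number of cases

    # Signal: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: how much cases vary around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)   # between-group mean square
    ms_within = ss_within / (n - k)     # within-group mean square
    return ms_between / ms_within


# Made-up fares for illustration only (not the Titanic data).
fares_died = [7.0, 8.0, 13.0, 26.0, 30.0]
fares_survived = [10.0, 26.0, 52.0, 80.0, 90.0]

print(f_ratio([fares_died, fares_survived]))
```

A wider gap between the group means raises the numerator, while more spread inside each group raises the denominator. The P-value would then come from the F distribution with (k - 1, n - k) degrees of freedom, which this sketch leaves out.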
Also, notice the actual F-ratio. And it is indeed a ratio. The larger that becomes, the more significant the difference between the two groups is. And we see an F-ratio of 63. Finally, we see the P-value of 0.000, which is how the software displays a P-value that rounds to zero at three decimal places. These two values are exactly the values that are going to be reported in our QUEST tree, indicating that this is exactly what's happening under the hood in our analysis. Now let's revisit the issue of how QUEST chooses a cut point.
Well, as it turns out, QUEST uses something called Quadratic Discriminant Analysis to determine the split point for all variables, including categorical variables. This can be difficult to see in some cases. Remember that when we have lots of variables, we're not simply in two-dimensional space. But it's actually not too difficult to see how this works on scale variables. Also, forgive me a bit of poetic license here. I'm going to show this to you visually.
And what I'm going to be showing to you visually is a bit more like linear discriminant analysis than it is like QDA. To begin, what we're trying to do with Quadratic Discriminant Analysis is predict a categorical dependent variable with a scale predictor. So we are reversing the situation now. We are predicting Survived using Fare. And as you can see, the cut point that's been chosen is 72.5. So how would that be determined? Let's take a look at it visually.
I've placed that exact value as a reference line on this histogram. Of course, this is a special histogram. We have green on the right indicating those that survived and blue on the left indicating those that died. And of course, we see lots of both colors below the line, but look above the line. What discriminant analysis has identified here is that above the line, the large majority of the passengers are green. More specifically, what we see is that above that threshold of 72.5, 73% survived.
But below that threshold, only 34% survived. And that's why discriminant analysis has chosen that value as the cut point.
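To make the idea of a discriminant cut point concrete, here is a minimal sketch of two-class quadratic discriminant analysis on a single scale predictor. It assumes each class is roughly normal and uses the class proportions as priors. It is not a reproduction of QUEST's exact procedure, and the names `discriminant_score`, `qda_cut_points`, `low_fares`, and `high_fares`, along with the fare values, are hypothetical.

```python
import math
from statistics import mean, variance


def discriminant_score(x, m, v, prior):
    """Log of (prior * normal density) at x, dropping shared constants."""
    return math.log(prior) - 0.5 * math.log(v) - (x - m) ** 2 / (2 * v)


def qda_cut_points(group_a, group_b):
    """Return the x values where the two classes score equally."""
    n_a, n_b = len(group_a), len(group_b)
    m_a, m_b = mean(group_a), mean(group_b)
    v_a, v_b = variance(group_a), variance(group_b)
    p_a, p_b = n_a / (n_a + n_b), n_b / (n_a + n_b)

    # Setting score_a(x) == score_b(x) gives a*x^2 + b*x + c = 0.
    a = 1 / (2 * v_b) - 1 / (2 * v_a)
    b = m_a / v_a - m_b / v_b
    c = (m_b ** 2 / (2 * v_b) - m_a ** 2 / (2 * v_a)
         + math.log(p_a / p_b) - 0.5 * math.log(v_a / v_b))

    if abs(a) < 1e-12:
        # Equal variances: the boundary is linear, as in LDA.
        return [-c / b] if b != 0 else []
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # the two score curves never cross
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# Made-up fares for illustration only (not the Titanic data).
low_fares = [5.0, 8.0, 10.0, 12.0, 15.0]
high_fares = [20.0, 40.0, 60.0, 80.0, 100.0]

cuts = qda_cut_points(low_fares, high_fares)
# In this sketch, the root lying between the two group means is the usable cut.
between = [x for x in cuts if mean(low_fares) < x < mean(high_fares)]
print(cuts, between)
```

When the two groups have equal variances, the quadratic term vanishes and the boundary reduces to a single linear cut, which matches the remark that the visual explanation is closer to linear discriminant analysis than to QDA.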
- Understanding QUEST functions and applications
- C5.0 concepts and practical applications
- Understanding information gain
- Random forests
- Boosting and bagging
- Costs and priors