Join Keith McCormick for an in-depth discussion in this video Construct proof, part of The Essential Elements of Predictive Analytics and Data Mining.
- [Instructor] The next element is proof that we're right. We've already discussed that data mining, by its very nature, does not have a priori hypotheses. But it does need proof. A priori is just a fancy Latin phrase meaning that we got our hypothesis from theory and not experience. So we know we're not doing that. But we do need proof. Is this some kind of contradiction? Well, the most fundamental requirement of data mining is that the same data that was used to uncover the pattern must never be used to prove that the pattern applies to future data.
We have to have more than one data set. We're performing a kind of experiment on that second data set. It's not exactly the same as the drug study, but it does produce the confidence in the result that we need. The standard way of doing this is to divide our data randomly into portions, building the model on the train data set to find the pattern and then verifying the pattern on the test data set. The test data set is often simply called the unseen data.
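As a rough illustration of the random split described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the course: the synthetic data, the 70/30 split, the helper names `train_test_split`, `fit_threshold`, and `accuracy`, and the one-variable threshold "model". The point is only that the pattern is found on the train portion and then judged on the unseen test portion.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly divide rows into a train portion and a test portion."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, rows):
    """Share of rows where the rule 'x >= threshold' matches the label."""
    hits = sum((x >= threshold) == label for x, label in rows)
    return hits / len(rows)


def fit_threshold(train):
    """Pick the cutoff on x that best separates the labels in the train data."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(t, train))


# Hypothetical synthetic data: a noisy relationship between x and a yes/no label.
data_rng = random.Random(0)
rows = []
for _ in range(200):
    x = data_rng.uniform(0, 10)
    label = (x + data_rng.gauss(0, 1.5)) >= 5
    rows.append((x, label))

train, test = train_test_split(rows)

# The pattern is uncovered on the train data only...
threshold = fit_threshold(train)
train_acc = accuracy(threshold, train)

# ...and verified on the test data, which played no part in finding it.
test_acc = accuracy(threshold, test)

print(f"threshold={threshold:.2f} train={train_acc:.2f} test={test_acc:.2f}")
```

If accuracy on the test portion is far below accuracy on the train portion, that is a sign the pattern may not carry over to new data.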
And this is the essence of data mining, because it gives us the freedom to explore the train data set and uncover its mysteries, awaiting eventual judgment on the test data set that we did it right. That is how we know our model and our findings will generalize successfully to new data. It's actually pretty straightforward. But if you need a little bit more convincing, consider this. What we're doing is going to be deployed. In most projects we do a partial deployment.
So, for instance, you might roll out a model just to one regional sales office and collect feedback about how useful the model is. And, you guessed it, you're also measuring very carefully how accurate the model is. And you're doing all of this before you deploy the model to all locations. So it turns out you have the train data set and the test data set, and then on top of that, you're validating it in the field as well. So when you do this correctly, you have plenty of evidence before the model goes live to all locations.
- What makes a successful predictive analytics project?
- Defining the problem
- Selecting the data
- Acquiring resources: team, budget, and SMEs
- Dealing with missing data
- Finding the solution
- Putting the solution to work
- Overview of CRISP-DM