In this video, get a high-level introduction to cross-validation, another tool in the toolbox.
- [Instructor] So we've been talking about using training, validation, and test sets to measure our model's ability to generalize. In this lesson, we're going to add one more layer on top of our evaluation framework with something called cross-validation. We'll dive into cross-validation here, and in the next lesson we'll zoom back out and put all the pieces together. Let's start with a couple of quick definitions and then walk through an example. A holdout test set is essentially a generalization of the test set we've been talking about. It's any dataset that was not used in fitting a model and is instead set aside for evaluating the model's ability to generalize. With that idea, both the validation set and the test set we've been talking about qualify under this general term of holdout test set. The next definition is a new one: k-fold cross-validation. This is a process by which the data is divided into k subsets and the holdout test method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are combined to train the model. This is another tool that is really helpful for gauging a model's ability to generalize to unseen examples. Let's look at an example to clarify this a bit. In this example, we'll start with a full dataset of 10,000 examples and say we want to run five-fold cross-validation; in other words, k in k-fold is equal to five. The first step is to split those 10,000 examples into k, or five, subsets. So now we have five subsets of data, each with 2,000 examples. I just want to clarify that this is sampling without replacement, so no single example will appear in two different subsets, and all 10,000 original examples are still accounted for across the subsets. These subsets also remain the same throughout the entire process, so an example in subset one stays in subset one all the way through to the end. Now we assign one of these five subsets as the test set, indicated in red here, and the other four, indicated in blue, as the training set. We fit a model on the 8,000 training examples in blue, evaluate it on the 2,000-example test set, and record the performance metric. To be clear, this is all handled under the hood in scikit-learn, so you don't have to implement any of these steps manually, but it's really useful to understand what is actually happening and why we use it; there's a sketch of this loop after the walkthrough. After the first iteration, it stores the model's performance on that holdout test set. Here we're saying that's 0.867, so in other words, the model was able to generalize and make a correct prediction on 86.7% of the examples it did not see during training. Next we move on to the second iteration, where the fourth subset is now the test set and subsets one through three, along with the fifth, make up the training set. We refit a brand-new model on these 8,000 training examples, evaluate it on the 2,000 test examples, and store the performance metric. Here we're saying that's 0.884, so again, this model was able to generalize and make a correct prediction on 88.4% of the examples it did not see during training. Then in the third iteration, the third subset is our test set and the model is trained on the 8,000 examples in the first, second, fourth, and fifth subsets.
We train the model on those subsets, evaluate it on the third subset, and store the performance metric of 90.1%. We do the same thing for the fourth iteration, the same process as before, but now the second subset is our holdout set. And last but not least, the first subset is our test set and subsets two through five are our training data: fit a model, evaluate it on the first subset, and store the evaluation metric. It's worth noting that at this point every subset, and thus every example, has been used in the training set four times and in the evaluation set once. So we've now used this algorithm configuration to fit on all the different combinations of these examples, and we've evaluated it on every single point in the dataset without training on that point. You can see how this would be a really powerful tool to gauge a model's ability to generalize. Lastly, you can either output the full array of scores that you see here or just output the average across the array. Either way, this gives us a more robust gauge of the likely outcome, since we have now tested this model configuration's ability to generalize on every single point in the dataset. Now suppose we were instead using a single holdout test set to gauge the performance of the model we're going to use for our business. From the lowest score to the highest, so from 86.7% to 90.1%, that's a difference of 3.4 percentage points, which may not seem like a lot, but in a business setting it could potentially translate to thousands or even millions of dollars. So having a read on how this model performs over five separate test sets is a leg up on truly understanding our model's performance, instead of only seeing how it generalizes to a single test set. This gives us more confidence in how the model is performing, and even a range of plausible outcomes, or simple error bars, on the projection.
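To make the walkthrough concrete, here is a minimal sketch of the loop that scikit-learn runs under the hood. The synthetic dataset, the logistic regression classifier, and accuracy as the metric are illustrative assumptions, not the exact setup from the course; any estimator with fit and predict would work the same way.

```python
# A minimal sketch of the k-fold loop described above (k = 5).
# The data and model here are placeholders, not the course's exact setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# 10,000 examples, as in the walkthrough
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # k = 5 subsets, no replacement
scores = []

for train_idx, test_idx in kf.split(X):
    # 8,000 training examples and 2,000 holdout examples per iteration;
    # across the five iterations, every example lands in exactly one test fold
    model = LogisticRegression(max_iter=1_000)  # brand-new model each time
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(scores)  # one score per fold, analogous to the 0.867, 0.884, ... values above
```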
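If you'd rather not write the loop yourself, scikit-learn's cross_val_score handles the fold bookkeeping and returns the full array of scores, which you can then average or use to gauge the spread between the best and worst fold. Again, the estimator and data below are placeholder assumptions:

```python
# Same idea using scikit-learn's built-in helper instead of a manual loop.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y, cv=5)

print(scores)                        # full array: one score per fold
print(scores.mean())                 # average across the five folds
print(scores.max() - scores.min())   # spread between folds: simple error bars on the projection
```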