In this video, learn the difference between a training set and a test set, and how relying only on training set performance can yield a sub-optimal model.
- [Instructor] Now that we have a decent idea of how to explore and clean our data, we want to start setting the stage for building a model by understanding what it means to measure the success of that model. The first step is walking through why it's necessary to split up our data before we build a model.

Let's first revisit the definition of machine learning that we settled on back in the first chapter: fitting a function to examples and using that function to generalize and make predictions about new examples. Again, this part about generalizing to make predictions about new examples is the core of machine learning. The model needs to learn the underlying pattern that will allow it to make predictions about future examples. You could fit a model that simply memorizes the examples that you give it, but that's not going to be very useful when it's presented with new examples.

So let's revisit the linear regression model that we looked at previously. The question we're concerned with is whether this line, or model, will generalize to new examples. We can't say for sure, because we don't have any additional examples, but based on the data we're looking at right here, it looks like it's doing a pretty good job of capturing the underlying pattern, or correlation. Now, as a counterexample, what if we had a line that's really squiggly and runs through every single point on the plot? That would be an example of a model that's just memorizing the examples it has seen but has never really learned the underlying pattern.

Okay, so now the question becomes: how do we make sure that the model is learning the underlying pattern and not just memorizing the examples? This is something we're going to continue to talk about throughout the rest of this course, but as I mentioned back in slide two, we don't know how well our model will generalize because we don't have any additional data to test it on. We can solve that by splitting our dataset into three separate segments.

The first is called the training data. This is the data, or the examples, that the model will learn general patterns from. In the previous slide, that would be all the points we see on the plot. For a linear regression like the one on that plot, the model would learn the trend, or underlying pattern, from the data and come up with the best-fit line.

The second dataset is what's called the validation set. That represents the first attempt to understand how the model will generalize to new examples. Generally what will happen is we'll fit a number of different models on our training data and then use this validation set to determine which models generalize best. It's important to emphasize that these models have not learned in any way from the validation set; these are completely unseen examples. I just mentioned fitting a number of different models, and we'll talk about this more in the following lessons, but in machine learning we'll typically experiment with a variety of algorithms and hyperparameter settings. Maybe we'll test out five, 10, 50, or potentially even hundreds of different models. They'll all learn from the same training set, but they'll be tuned in ways that let them learn different things about that training set, and then we'll treat the validation set as their testing ground.
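To make the splitting step concrete, here is a minimal sketch of carving a dataset into training, validation, and test segments. The transcript doesn't show any code, so scikit-learn's `train_test_split`, the toy regression data, and the variable names below are illustrative assumptions, using the 60/20/20 ratio the instructor describes in a moment.

```python
# A minimal sketch of a 60/20/20 train/validation/test split, assuming
# scikit-learn (the course does not name a library here). The toy
# regression dataset is purely illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=42)

# First hold out 40% of the examples for validation + test,
# leaving 60% for training.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Split the held-out 40% in half: 20% validation, 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```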
So, in the linear regression example we looked at before, this would be like giving the model an instance where there was 110 millimeters of rain. The model didn't see this at all in the training set, but if it was fit well and learned the underlying pattern, then it would be able to generalize and make an accurate prediction about that example, and more broadly, about the other examples in the validation set.

Lastly, the test dataset is used to provide an unbiased evaluation of how the model will perform in its real environment. By this point, you've fit a bunch of different models on the training data, evaluated them on the validation set, and selected the best model. The only difference between the validation set and the test set is that you used performance on the validation set to select your best model. So the test set is just one more check to make sure that performance does not deviate too much from what you saw on the validation set.

Okay, let's zoom back out for a minute. This can vary, but a standard way to split your full dataset is to assign 60% of the data to the training set, 20% to the validation set, and the remaining 20% to the test set.

We'll talk more about this as we move forward, but this is what the high-level process looks like. Moving left to right, you start at the bottom of the training circle by adjusting and cleaning the features, then you train the model on the training data and evaluate it using the validation data. At that point, you have a decision to make, over on the right-hand side of the validation circle. If none of your models are any good based on their performance on the validation set, then you need to revisit the training phase and consider some new variables or new models. If the performance is quite good, then you can select your best model and pass it on to the testing phase. If you evaluate it on the test set and performance is what you expect, then that model is ready to go. However, in rare cases, the performance on the test set may deviate heavily from the performance you saw on the validation set, and in that case, you need to circle back and dig a little deeper to understand what's going on.

The risks throughout this process are that your model will overfit or underfit the data, two related concepts, either of which could give you an inaccurate picture of how well the model will generalize. We're going to dig into these concepts quite heavily in the next chapter.
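Continuing the split sketch above, here is one way the fit-many-models, select-on-validation, confirm-on-test workflow could look in code. The candidate models and the error metric are assumptions for illustration, not the course's specific choices.

```python
# Illustrative model-selection loop (a sketch; the candidate models and
# the mean-squared-error metric are assumptions, not the course's choices).
# Reuses X_train, y_train, X_val, y_val, X_test, y_test from the split above.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
}

# Every candidate learns from the training set only.
for model in candidates.values():
    model.fit(X_train, y_train)

# The validation set, never seen during fitting, decides which model
# generalizes best.
best_name = min(
    candidates,
    key=lambda name: mean_squared_error(y_val, candidates[name].predict(X_val)),
)
best_model = candidates[best_name]

# One final, unbiased check on the held-out test set.
test_error = mean_squared_error(y_test, best_model.predict(X_test))
print(f"Selected {best_name}; test MSE: {test_error:.3f}")
```

Note that the test set is touched exactly once, after the best model has already been chosen, so its score stays an unbiased estimate of real-world performance.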