Learn how to create a test/train split in scikit-learn.
- [Narrator] Let's look at train model part 2.py. When training a machine learning model, we always need to do two things with our data set. First, shuffle the data so it's in a random order, and second, split the data into a training data set and a test data set. Because shuffling the data and splitting it into train and test groups is such a common operation, scikit-learn provides a built-in function to do this in one line of code. This command will shuffle all of our data so it's in a random order, and then split it into two groups. The test_size=0.3 parameter tells it we want to keep 70 percent of the data for training and pull out 30 percent of the data for testing.
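As a rough sketch of the one-line shuffle and split described above: the transcript does not name the function, so this assumes it is scikit-learn's `train_test_split`. The arrays `X` and `y` are hypothetical stand-ins for the house features and prices, and `random_state=0` is added only so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 houses with 2 features each, and one price per house.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) * 1000

# Shuffle the rows, then hold out 30 percent of them for testing.
# random_state is an assumption added here to make the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print("training rows:", len(X_train))
print("test rows:", len(X_test))
```

With 10 rows, this leaves 7 for training and 3 for testing, and each feature row stays paired with its own price.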
A 70/30 split is pretty typical. Splitting the data into testing and training groups allows us to keep the test data hidden from the machine learning system until we're ready to verify its accuracy. If we verified its accuracy with training data it had already seen, it wouldn't be much of a test. Testing with data the model hasn't seen before shows that the model actually learned general rules for predicting house prices and didn't just memorize the answers for the specific houses it had seen before.
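To illustrate why scoring on the training data isn't much of a test, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the synthetic square footage and prices, the choice of `DecisionTreeRegressor` (picked because an unrestricted tree can memorize its training rows), and the use of mean absolute error as the score. It is not the model from the course file.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: price depends on square footage plus random noise.
rng = np.random.default_rng(0)
sq_feet = rng.uniform(500, 4000, size=(200, 1))
price = 150 * sq_feet[:, 0] + rng.normal(0, 20000, size=200)

# Keep 30 percent of the houses hidden from the model during training.
X_train, X_test, y_train, y_test = train_test_split(
    sq_feet, price, test_size=0.3, random_state=0
)

# An unrestricted decision tree can memorize every training house.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# Error on houses the model has seen versus houses it has not.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print("error on training data:", round(train_mae, 2))
print("error on test data:", round(test_mae, 2))
```

The error on the training rows comes out far lower than the error on the held-out rows, because the tree has effectively memorized the houses it was trained on. Only the test error says anything about how the model handles new houses.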
- Setting up the development environment
- Building a simple home value estimator
- Finding the best weights automatically
- Working with large data sets efficiently
- Training a supervised machine learning model
- Exploring a home value data set
- Deciding how much data is needed
- Preparing the features
- Training the value estimator
- Measuring accuracy with mean absolute error
- Improving a system
- Using the machine learning model to make predictions