Learn how to split your data into train and test parts in order to avoid overfitting. You will also learn about other methods of splitting your data for more reliable results.
- [Instructor] We cheated earlier when we trained on a set of data and then checked our accuracy on that same data. In real life, however, our model will need to predict on data it hasn't seen. We risk what is called overfitting, which means our model is accurate on the data it was trained on but will fail on new data. The common practice is to split the data into training data and test data. We train the model on the training data and then use the test data to evaluate the model. Scikit-learn makes this process easy, so let's implement it: from sklearn.model_selection import train_test_split.
As the name suggests, this will split our data into a training part and a testing part, and it will do so randomly. So let's do x_train, x_test, y_train, y_test equal train_test_split of boston['data'] and boston['target'], and we'll say that we'd like the test size to be a third. train_test_split returns four values.
The first two are the breakup of the features of the data. The features variable is traditionally called x, so we call the parts x_train and x_test. The same happens for the labels, or target, which is usually named y. Let's see how we do now. So clf is a RandomForestRegressor, and we now fit it only on the x_train and y_train data, and then we'll score it on the x_test and y_test data.
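The split-fit-score workflow just described can be sketched as follows. One caveat: the Boston housing dataset used in this course (load_boston) was removed in scikit-learn 1.2, so this sketch substitutes a synthetic regression dataset from make_regression as a stand-in; the steps are otherwise the same.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in for boston['data'] and boston['target']: load_boston was
# removed in scikit-learn 1.2, so we generate similar-shaped data.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0,
                       random_state=42)

# Hold out a third of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

# Fit only on the training part...
clf = RandomForestRegressor(random_state=42)
clf.fit(X_train, y_train)

# ...and evaluate (R^2 score) on data the model has never seen.
print(clf.score(X_test, y_test))
```

Evaluating on the held-out third gives an honest estimate of how the model will do on new data, which is the whole point of the split.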
Still not so bad. There are several other methods for splitting data into train and test. For example, in KFold, we split the data into K parts, then take one part out, train on the rest, and score on the part that's left out. We do so for every part, and the final score is the average score. Scikit-learn comes with KFold and other methods available, and the documentation is great. Note that sometimes splitting data into test and train randomly might not be a good strategy. For example, if you try to find credit card fraud, most of the transactions in the data will not be fraudulent, and there's a good chance that we'll pick only non-fraudulent samples to train on.
Then our model will just learn to say "no fraud." So learn about your data and figure out the best way to split it to get a good model.
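Both ideas above can be sketched in a few lines. This is a minimal illustration, again using a synthetic stand-in for the Boston data (load_boston is gone from recent scikit-learn) and fabricated 1%-fraud labels: cross_val_score runs the KFold train-and-average loop for us, and train_test_split's stratify parameter keeps the class proportions in both parts instead of drawing purely at random.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the course's Boston data.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0,
                       random_state=42)

# KFold: split into K parts, hold one out, train on the rest, score the
# held-out part, and average the K scores.
scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=42))
print(scores.mean())

# For imbalanced classes (e.g. fraud), stratify the split so train and
# test both keep the original class proportions.
y_fraud = np.array([0] * 990 + [1] * 10)   # 1% "fraud" labels (made up)
X_fake = np.arange(1000).reshape(-1, 1)
_, _, y_tr, y_te = train_test_split(X_fake, y_fraud, test_size=1/3,
                                    stratify=y_fraud, random_state=0)
print(y_tr.sum(), y_te.sum())  # fraud cases land in both parts
```

Without stratify, a random split of such rare labels could easily leave the training set with no fraud at all, which is exactly the failure mode described above.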
From the course: NumPy Data Science Essential Training with Charles Kelly (3h 54m, Intermediate)