When working with data, information that doesn't belong can slip into your model. In this video, learn how to detect and prevent data leakage.
- [Instructor] A common rule of thumb in machine learning: if your result looks too good to be true, it probably is. In these cases, the primary culprit is usually data leakage. Data leakage can be thought of as any time information from outside of your training set enters your model. Data leakage is especially prevalent when working with time series data and in environments where there are data cleanliness issues. The end result is that you may be fooled into thinking your model generalizes much better than it really does. So how can we detect and prevent data leakage? Here's an example. Let's say you are working to predict customer cancellations and you have a theory that a recently introduced product makes for stickier customers. If you formulate the problem with historical data, you will likely find that none of the canceling customers bought this product, simply because many of them canceled before the product even existed. As you validate this model on new unseen data, you'll find it performs very poorly. So the first question I want you to ask yourself is, are there any features in my model that are surprisingly correlated with my target variable? Something that really stands out to you? A good way to see this really quickly in your data is to use the corr function in pandas. Here, we import the tips dataset from seaborn and call the corr function to see the correlation between each variable and every other variable. Similarly, after you train your model, review the feature importances to see if anything stands out. Next, if you're using time series data, be sure to train-test split along your date variable. It is not appropriate to do a random train-test split as you normally would. As an example, let's import the sunspots dataset. By calling the head function, you see it's a time series dataset with one row per month. To perform your train-test split along the date variable, first we'll want to make sure that it's sorted in ascending order. I've decided to use a 75/25 train-test split.
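The correlation check described above can be sketched like this. A small toy frame stands in for seaborn's tips dataset (the real `sns.load_dataset("tips")` needs a download); the idea is the same: scan the pairwise correlation matrix for anything suspiciously close to 1.0 against your target.

```python
import pandas as pd

# Toy stand-in for the tips dataset; the column names mirror it,
# but the values here are illustrative only.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "size":       [2, 3, 3, 2, 4, 4],
})

# corr() returns the pairwise correlation matrix for numeric columns.
# A feature correlated near 1.0 with your target is a red flag for leakage.
print(df.corr())
```

The same check scales to any numeric DataFrame; on the real tips data you would simply call `df.corr()` after `df = sns.load_dataset("tips")`.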
Our train_len variable identifies the length of our train set. We then subset our data for train and test accordingly. By printing the shape of our train and test sets, we see this worked as intended. Another aspect I want you to be mindful of: when scaling, fit your scaler to your training group only, then transform both the training and the test group. Here you see we're calling fit_transform on our train group and transform on our test group. Note that when employing K-fold cross validation, you'll want to repeat the preprocessing steps within each fold separately to prevent data leakage. Preventing data leakage is something you'll always have to be mindful of in machine learning. Remember, if your result looks too good to be true, pump the brakes and follow these important steps before sharing out your model results.
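The sort-split-scale workflow above can be sketched end to end. The monthly series here is synthetic (a stand-in for the sunspots data, so the snippet runs offline), and the Ridge model and TimeSeriesSplit at the end are illustrative choices for the per-fold preprocessing point, not the video's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic monthly series standing in for the sunspots dataset
# (one row per month, as in the video).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("1990-01-01", periods=120, freq="MS"),
    "activity": rng.normal(50.0, 10.0, 120),
})

# Step 1: sort by date so the split respects time order.
df = df.sort_values("date").reset_index(drop=True)

# Step 2: split 75/25 along the date axis -- no shuffling.
train_len = int(len(df) * 0.75)
train, test = df.iloc[:train_len], df.iloc[train_len:]
print(train.shape, test.shape)  # (90, 2) (30, 2)

# Step 3: fit the scaler on the train group only, then
# transform both groups with the train-fitted parameters.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["activity"]])
test_scaled = scaler.transform(test[["activity"]])

# For K-fold cross validation, wrap preprocessing in a Pipeline so the
# scaler is re-fit inside every fold, never on the full dataset.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

X = np.arange(train_len).reshape(-1, 1)  # illustrative feature: time index
y = train["activity"].to_numpy()
pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5))
```

Using `TimeSeriesSplit` instead of plain K-fold keeps each validation fold strictly later than its training folds, which is the time-series analogue of splitting along the date variable.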
This course was created by Madecraft. We are pleased to host this training in our library.