In this video, learn how to clean up continuous features by filling missing values and combining related features into new ones.
- [Instructor] Previously we talked about how exploratory data analysis will inform our data cleaning. In this lesson we'll take what we learned in the last two lessons and actually implement some of the necessary cleaning. So let's start by importing our data. We discussed in the last lesson how passenger ID doesn't really factor into whether somebody survived, so we're going to go ahead and drop this feature in place, just like we did in the previous section. We do that by calling the drop method on the titanic data set and telling it to drop passenger ID. We pass in axis=1, which tells it to drop the column instead of trying to drop rows. And lastly, we'll pass in inplace=True, which tells pandas to alter the titanic data set as it stands now instead of creating a new data frame. So we'll run that, and you can see that passenger ID is no longer in our data frame.

Now, in the last lesson we learned that age has some missing values. We also checked whether the missing age values were correlated with any other variables, to see if the missingness might actually mean something, or whether the values were just missing at random. As a refresher, let's look at the output of that line of code again: there are some differences, but probably not enough to conclude that this isn't just missing at random. So we'll treat it as missing at random, and we'll use one of the most naive but useful methods for filling in missing values: replacing them with the average value for that feature. That satisfies the model by making sure there's a value in there, but because it's the average value, it doesn't bias the model toward one outcome or another. In other words, because the age value is just the average, the model will rely on all of the other features to decide whether a given person survived.

The way we fill in these missing values is to call the Age feature from the titanic data set and use the fillna method; all we have to do is tell it what to fill the missing values with. We want to fill them with the mean of the Age feature, so again we call titanic, the Age feature, and then the mean. That takes the entire column and computes the average value. The fillna method will cycle through all the age values in that column, find where the missing values are, and fill them with that mean. Lastly, we pass in the inplace=True argument again, to tell it to alter titanic as it stands now rather than creating a new data frame.

The last thing we're going to do is call titanic.isnull().sum(), which tells us where we have missing values across each of the columns. This is just a check to make sure the fill did its job. Let's go ahead and run that, and you can see that age now has zero missing values. Let's check it another way: we'll add another cell below this and run titanic.head(10), which prints out the first ten rows. What we're checking for here is that where there used to be missing values, there's now the average value of age. So let's go ahead and run that.
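For reference, here's a minimal sketch of the steps described above. The file name titanic.csv is an assumption; adjust the path to wherever your copy of the data lives:

```python
import pandas as pd

# Assumed location of the Titanic data; adjust the path to your copy.
titanic = pd.read_csv('titanic.csv')

# Drop PassengerId: axis=1 targets a column (not rows), and
# inplace=True alters the existing DataFrame instead of returning a new one.
titanic.drop('PassengerId', axis=1, inplace=True)

# Fill missing ages with the mean age for the column.
titanic['Age'].fillna(titanic['Age'].mean(), inplace=True)

# Check that the fill worked: Age should now report zero missing values.
print(titanic.isnull().sum())

# Spot-check the first ten rows; formerly missing ages should now show
# the column mean, a float standing out among the integer ages.
print(titanic.head(10))
```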
And you can see here, in this row, we have 29.699, and that's the average value. You can tell because the rest of the values are integers and this one is a float.

Lastly, we saw that SibSp and Parch both represent the size of a family, and they have a similar relationship with the target variable, so we want to combine them into one feature. Again, we talked about this last lesson, but whenever we can reduce the number of features while maintaining the information those features provide, we should do it, because it clarifies the picture for the model. All we want to do here is declare where we want to store this new feature, so titanic, and we'll call it Family_cnt. Then all we have to do is add together the two features we want to combine: SibSp added to Parch. What this gives us is a count of the number of siblings, spouses, parents, and children for each passenger.

The last thing we have to do, and this is pretty important, is drop the SibSp and Parch features. Now that Family_cnt is representative of both of them, they become redundant, and that creates what's called multicollinearity. When you have multiple features that account for the same information, in this case family size, the model has a very hard time assigning proper value to those features, since they all look the same to the model because they're so highly correlated. This can have negative consequences on the performance of the model. That's why, when you combine multiple features into one, you almost always want to drop the original features. So again, we call titanic.drop and pass in a list of the features we want to drop: SibSp and Parch. Just like we saw above, we tell it axis=1 and inplace=True. Now that those are dropped, let's print out the first five rows. We can see Family_cnt here, and we no longer see SibSp or Parch.
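Continuing with the titanic DataFrame from the sketch above, here's a minimal version of the feature combination and cleanup just described:

```python
# Combine SibSp (siblings/spouses) and Parch (parents/children)
# into a single family-size count.
titanic['Family_cnt'] = titanic['SibSp'] + titanic['Parch']

# Drop the originals so Family_cnt doesn't introduce multicollinearity.
titanic.drop(['SibSp', 'Parch'], axis=1, inplace=True)

# Confirm Family_cnt is present and SibSp/Parch are gone.
print(titanic.head())
```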