From the course: Python: Working with Predictive Analytics

Handling missing values - Python Tutorial

Handling missing values

- [Instructor] Looking at our roadmap, we scratched the surface of data understanding by looking at the data types in the previous videos. We will now start the Data Preparation step. It's often said to be 80% of what data science is. As they say, "Garbage in, garbage out," if we're not careful during the data prep. In real life, we seldom have completely full data sets to work with. In the Python world, missing values are represented as NaN, which stands for "not a number." Most prediction methods cannot work with missing data, so we need to fix the problem of missing values. We have quite a few methods to handle this. The three options we will mention here are: first, drop the entire column where the NaN values exist; second, drop the rows with NaN values; and finally, fill in the NaN values. There's no right answer for every data set. One or the other may be appropriate, depending on the conditions.

Let's open the Begin file in Spyder. Let's first look at how many missing values each column has. For that, let's type count_nan = data.isnull().sum(). Then we will print the count_nan values where count_nan is larger than zero. Let's click run, and we can see in the console that we are missing five values from the bmi column. This time we will fill in the NaN values with the mean value of the bmi column, so let's do that: data['bmi'].fillna. And this is where we specify what we are filling in the missing values with. In this case, we are filling them with the mean value. And inplace equals True. When I say inplace equals True, it will make the changes on the data frame itself. Okay, let's select this cell, right-click, and run the cell. So we just filled in the values. To make sure, we will do the same thing again. We will say count_nan = data.isnull().sum() to check if we have any missing values left. Then we will print count_nan where count_nan is larger than zero. Let's hit run, and you will see that we have zero missing values.

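
The steps just described (count the missing values, fill the bmi column with its mean, then count again) can be sketched as follows. This is a minimal sketch, not the course's Begin file: the small data frame is invented for illustration, and only the names data, bmi, and count_nan come from the video. It also assigns the filled column back instead of using inplace=True, since recent pandas versions may warn about in-place changes on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data set; all values are invented.
data = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31],
    "bmi": [27.9, np.nan, 33.0, 22.7, np.nan, 25.7],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86, 3756.62],
})

# Count missing values per column and show only the columns that have any.
count_nan = data.isnull().sum()
print(count_nan[count_nan > 0])

# Fill the missing bmi values with the mean of the observed bmi values.
# Assigning back is used here in place of inplace=True (an assumption of
# this sketch, not the form typed in the video).
bmi_mean = data["bmi"].mean()
data["bmi"] = data["bmi"].fillna(bmi_mean)

# Re-check: no column should report missing values now.
count_nan_after = data.isnull().sum()
print(count_nan_after[count_nan_after > 0])
```

The second print shows an empty result, which is the "zero missing values" check from the video.
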
As I've mentioned earlier, there are quite a few other methods to fill in the NaN values. In fact, this is a very extensive topic that could be studied as its own class. Please review the Finish_visualize version of the code to learn more about the other methods to drop or fill in the NaN values. Let's open it up now; here in the Finish_visualize code, you can see the other methods to handle the missing values.

A picture is worth a thousand words. After we've filled in the missing values, it's important to visualize the data we are working with in order to absorb it quickly and understand the next steps for making good predictions. Examples of next steps include removing outliers and finding trends. Let's run the code and enlarge the console to see the visualization. Let's scroll up to the beginning of the visualization. Let's first see the distribution of the charges. The response variable, as we see, is right-skewed. Let's scroll down to see the count plots, and we see that we have far fewer smokers than non-smokers. When we scroll down further, we see the pair plots. In the pair plots, in the lower left corner, age versus charges seems to have a good correlation, with layers. Let's look at this a little more closely. From the linear model plot, we can see that smokers clearly do have higher charges. Let's scroll down to see the correlation matrix. Looking at the correlation matrix, we can say that the biggest correlation is with age, at 0.3, in the lower left corner.

Handling missing values is a must, as most prediction models require full data sets. There are three main methods for fixing missing values: we can drop the entire column, drop the rows with NaN values, or fill in the NaN values. Looking at the data visually also helps us gain insight.
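
The three options summarized above can be compared side by side. The following is a minimal sketch under stated assumptions, not the contents of the Finish_visualize file: the data frame and the names dropped_columns, dropped_rows, and filled are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with missing values in the bmi column only.
data = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [27.9, np.nan, 33.0, np.nan],
    "smoker": ["yes", "no", "no", "no"],
})

# Option 1: drop every column that contains at least one NaN.
dropped_columns = data.dropna(axis=1)

# Option 2: drop every row that contains at least one NaN.
dropped_rows = data.dropna(axis=0)

# Option 3: fill the NaN values, here with the mean of the bmi column.
filled = data.fillna({"bmi": data["bmi"].mean()})

print(dropped_columns.columns.tolist())  # bmi column is gone
print(len(dropped_rows))                 # only fully observed rows remain
print(filled["bmi"].isnull().sum())      # no missing bmi values left
```

Each call returns a new data frame and leaves data unchanged, so the three results can be inspected before deciding which one suits the data set.
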
