Start free trial Sign in

From the course: Data Science Foundations: Data Engineering

Handling missing values

From the course: Data Science Foundations: Data Engineering

Start my 1-month free trial

Handling missing values

“

- [Voiceover] Now let's take a look at actually handling missing values. So the first thing we want to do is filter out any rows that have a missing OrderId. OrderId is a key field that we need for our later analysis, so, if there's a missing OrderId we can't use that row. Then, anytime we have an invalid date, which we've already identified, we want to replace that with the stub value, kind of a value that we'll know that we can use as an identifier later, and that isn't real, it's something that we made up as a way to identify these rows. Lastly, we'll do the same but with a product key, so with a string field. So we're going to replace dates with a value that we know to be placed there by us. Same with product keys, so that those values are filled in, however, we know that they originally were missing. So here in my virtual machine I have the exercise file three underscore two dot sequel. And I have several steps here, at the top we're going to make sure that we're using the back office database, then step one we're going to filter rows with missing OrderIds. So this is similar to the query we had before. We're using insert overwrite, however, which is going to replace all of the data in this cleansed sales table with whatever follows. So if you want to actually overwrite that data that's the key statement you need to add to you query here. Okay, so then we have our select statement, we cast everything into the correct data type, and then down at the bottom we have where OrderId is not equal to blank. So, essentially, we're going to pull in all the data from our staging table except those where the OrderId is blank. Step two, as I mentioned, was to replace the invalid dates with nine nine nine nine 12 31. And in order to do that we use a case statement. So when the OrderId equals blank, then I'm going to return this value, nine nine nine nine 12 31, otherwise I'll return the actual order date. So, this query now, anytime there'd be that missing value, it will return a value instead of having a blank there which would be really better for our analysis later if we want to count things. Then step three was to replace the blank product keys with unknown. So in this step what we're doing, similar to the one with dates, we're using a case statement when the product key equals blank, then we're going to return pound unknown, otherwise we'll return the actual product key. This will make sure that we have consistent data in our table. And lastly, what I've done is combine all three of these in one step. This is the statement we're actually going to run here. We're going to do insert overwrite, so again it'll replace all the data that's currently in that table, it'll process the dates, it'll process the product keys, and at the very bottom it'll filter out any missing OrderIds. I'm going to copy this over to our queue editor, paste it in, and hit play. Alright, and now that that's done, we'll go take a look at our cleansed sales table. And we have data. Looks like the data is typed correctly and everything's good. Alright, so we've been able to actually handle our missing values now.

Contents