Learn about common methods for handling different data types.
- [Voiceover] Alright, now that our data is loaded into the staging environment and we've performed some data quality testing, it's time to add some data types. We're actually going to be cleansing and conforming the data a little bit here. So we're going to read in that data from the staging table for our sales orders. Then we're going to convert the numeric values to numeric types, so integers, doubles, et cetera. And then we will convert date values into date types. Simple enough, basically converting the data from its raw format into something that's more analytical.
Here in my virtual environment I'm going to switch over to my back office database. And then from there I'm going to open up the 3_1.sql file from the exercise files. And then we're going to create our cleansed sales table. So I'll copy that first statement, paste that in, and if we look, it's essentially the same schema as before, with a new name, cleansed_sales instead of stage_sales, so it's the second step in our process. Then instead of having strings for everything, we have integers, small ints, floats, and we even have date values down here for our date fields.
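To give a sense of what that statement looks like, here's a minimal sketch. Only RowId, ClientId, OrderId, and OrderDate are mentioned in the walkthrough; the remaining columns and exact types are hypothetical stand-ins, and the exercise file will have the full list.

```sql
-- Minimal sketch of the cleansed table. Columns beyond RowId,
-- ClientId, OrderId, and OrderDate are hypothetical stand-ins.
CREATE TABLE cleansed_sales (
    RowId     SMALLINT,
    ClientId  INT,
    OrderId   INT,
    OrderDate DATE,
    Quantity  INT,      -- hypothetical numeric column
    UnitPrice FLOAT,    -- hypothetical numeric column
    ShipDate  DATE      -- hypothetical date column
);
```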
Okay, so let me run this. We've created the table, and it should appear here in the list of tables on the side. Yep, cleansed_sales. I can click on that. Okay, looks good: we have dates, we have good data types. I'll delete this. Now we want to actually pull the data in from our staging table. So I'll just copy this in and then walk through it real quickly. What we have here is basically just a simple select statement from our staged table; however, we're doing cast operations on a lot of the fields.
So RowId is going to be cast as a small int. ClientId doesn't exist yet, so it's just null, but we are casting it as an int. OrderId is an int, order date is a date, et cetera. So we cast these as the data types that we want them to be in the end. We're transforming our data from its raw form into something that our analysts will be able to use more effectively. Alright, once that's all done and everything's set up there, I'm just going to run this statement here.
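A sketch of what that cast-and-load statement might look like: the casts on RowId, ClientId, OrderId, and OrderDate follow the walkthrough, while the remaining columns are the same hypothetical stand-ins as above.

```sql
-- Load the staged (all-string) data into the typed cleansed table.
-- ClientId doesn't exist yet in the source, so it's cast from NULL.
INSERT INTO cleansed_sales
SELECT
    CAST(RowId     AS SMALLINT),
    CAST(NULL      AS INT),       -- ClientId: placeholder for now
    CAST(OrderId   AS INT),
    CAST(OrderDate AS DATE),
    CAST(Quantity  AS INT),       -- hypothetical column
    CAST(UnitPrice AS FLOAT),     -- hypothetical column
    CAST(ShipDate  AS DATE)       -- hypothetical column
FROM stage_sales;
```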
And now that that's done, let's go do a quick check of our cleansed sales table. Select star from cleansed_sales, limit 100. Make sure the data actually made it in there. Alright, and watch for null columns, with the exception of the one we know is null because we made it that way. Anytime you have a problem converting data types, you'll see nulls pop up. So if I saw order dates that were null, that would typically mean the original raw data, which was typed as a string, didn't convert cleanly into the new date type.
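If you want to go beyond eyeballing the first hundred rows, a query along these lines can count nulls per converted column; this is a sketch, and the column names are the same assumptions as before.

```sql
-- Count rows that failed to convert (came through as NULL).
-- A nonzero count on a column we didn't intentionally null out
-- (like ClientId) signals a cast problem in the raw strings.
SELECT
    COUNT(*) - COUNT(OrderDate) AS null_order_dates,
    COUNT(*) - COUNT(OrderId)   AS null_order_ids,
    COUNT(*) - COUNT(UnitPrice) AS null_unit_prices  -- hypothetical column
FROM cleansed_sales;
```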
That's why we do all the data quality testing in the first step, before we try to convert any of the values over. So you can see everything came across and it looks good. We have dates, we have numbers, there are some things missing, which we've already identified. Those aren't deal breakers though. Alright, so our data's good. The last step, as always, is to do a little audit check. This time we're going to do just what we did originally, but now we'll take a row count from this table and log it as the cleansed sales table. This is really important, because if things go bad later in our processing, you're going to want to go back and know how many rows made it. As you start to apply these types and add filters, that's when things really start to drop out.
So if, in the end, you have some erroneous analytical data, you're going to want to know the lineage of that data. This statement here simply does a row count, which I recommend doing any time you move data from one step to the next, so you can compare those numbers. And with that done, we've essentially cleansed our data and are good to move on to the next step.
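The audit statement itself isn't shown in full in the walkthrough, but it might look something like this sketch; the audit_log table and its columns are assumptions, not the exercise file's exact schema.

```sql
-- Log the row count for this step so later steps can be reconciled.
-- audit_log and its columns are hypothetical; adjust to your schema.
INSERT INTO audit_log (table_name, row_count, logged_at)
SELECT 'cleansed_sales', COUNT(*), CURRENT_TIMESTAMP
FROM cleansed_sales;
```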