Learn how to perform lookups against master data tables for conformed values.
- [Instructor] Alright, now we're going to get in and look at how to actually do some master data lookups. First, we're going to create a client staging table for our client information. Then we'll create the cleansed version of that, so the second step in actually processing our client data. We'll load that table, and then we'll use that table to update the ClientID in Cleansed_Sales. So, we have a client table which has the IDs and then we have our sales table, which doesn't have them. So what we're going to do essentially is setup that clients table and then update our sales table.
So here in my virtual environment, I have part of 4_1.sql from the exercise files. I'm going to create this table, stage_clients. As before it's just a raw dump, so everything is a string. Then we're using that custom CSV processor, and we've already loaded the data in an earlier clip into this location, so if you haven't done that, go back and check that out. Basically, we're just copying our clients CSV into this location in our Hadoop environment. I'm going to copy this, paste it in here.
Make sure that I'm in the back office database, and then hit play. Good to go. That table should appear here in a second. There it is. And we'll just do a check real quick, stage_clients. Good to go. Now let's go back, and we need to add data types, so this is the cleansed version of that where I create basically the same table but I have different data types. So something that's better for analytics. Run that one, that one should appear here as well. There it is, so we have cleansed_clients.
Looks good. So then let's load that table, just a simple select from our stage table. This time we're actually going to convert those certain fields over to the correct data types. Copy this, past it into our Hive environment, and run. And with that done, let's just go do a real quick check, select star from cleansed_clients, and it looks like we have data which is a good sign.
Scroll over, you can see that we have client IDs, we have the name, all the good information there. Alright last step, we want to update our sales table with these new client IDs. So, we're going to override the data in our cleansed sales table, and we're going to join to this new client table we just created based on the company name, and return the client ID. So essentially, all the data that's already been cleansed and is ready to go in our sales table, is going to be overridden with this new data just adding in that client ID.
We'll copy this whole statement here, go over to Hive, paste that in, and hit run. And now that that's done, let's just go do another check of our cleansed sales table. And you can see now in the client ID column that it's populated there with actual IDs there. Alright, so we were able to upload our client master table, and then perform a lookup and then update our data in our sales table. So we can continue on processing this data.
- Working with systems and schemas
- Managing of a good data pipeline
- Setting up an environment
- Loading and profiling data
- Testing quality
- Adding data types
- Handling missing values and inferred members
- Performing master data lookups
- Loading schemas and tables
- Creating views