Learn about the major components of a good data engineering pipeline.
- [Instructor] Let's take a look and figure out what the components of a good data pipeline are. Looking at our overall data pipeline, first we have our stage step. In a good data pipeline, we create data profiles showing basic information about the incoming data feed. Row counts are logged for future reference in the process, and quality checks are run to test for things like the percentage of null values in important columns. After our staging step is done, we get into cleansing, where we actually start to manipulate the data.
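As a rough illustration of the staging checks just described (a logged row count plus the percentage of null values in important columns), here is a minimal sketch. The function names, the sample feed, and the 20 percent threshold are all hypothetical and chosen only for this example; a real pipeline would read from its own staging tables and log the profile somewhere durable.

```python
from typing import Any


def profile_rows(rows: list[dict[str, Any]], important_columns: list[str]) -> dict[str, Any]:
    """Build a basic profile: row count plus percentage of nulls per important column."""
    row_count = len(rows)
    null_pct: dict[str, float] = {}
    for col in important_columns:
        # A missing key is treated the same as an explicit None.
        nulls = sum(1 for row in rows if row.get(col) is None)
        null_pct[col] = (100.0 * nulls / row_count) if row_count else 0.0
    return {"row_count": row_count, "null_pct": null_pct}


def failed_null_checks(profile: dict[str, Any], max_null_pct: float) -> list[str]:
    """Return the columns whose null percentage exceeds the allowed maximum."""
    return [col for col, pct in profile["null_pct"].items() if pct > max_null_pct]


# Hypothetical incoming feed, for illustration only.
feed = [
    {"customer_id": 1, "city": "Austin"},
    {"customer_id": None, "city": "Dallas"},
    {"customer_id": 3, "city": None},
    {"customer_id": 4, "city": "Houston"},
]

profile = profile_rows(feed, ["customer_id", "city"])
print(profile)                                          # logged for later phases
print(failed_null_checks(profile, max_null_pct=20.0))   # columns that fail the check
```

Keeping the profile as a plain dictionary makes it easy to write to a log and compare against later phases.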
At the end of the cleansing phase, we need to make sure that we log everything we can, so we start logging things like null values, any IDs that are missing, or anything else, like a city and state combination that doesn't match up. If need be, the pipeline will be able to roll back any operations here that may cause a more systemic failure down the line. In the third phase, conforming, we get more serious about the monitoring and perform a go/no-go check. This check will automatically determine if it's safe to proceed.
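A go/no-go gate of this kind might be sketched as below, assuming it is driven by the issue counts logged in the earlier phases plus a list of free-form red flags. The `LoggedIssue` type, the `go_no_go` function, and the 1 percent error-rate threshold are hypothetical names and values for illustration, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class LoggedIssue:
    phase: str   # phase that logged the issue, e.g. "stage" or "cleanse"
    kind: str    # e.g. "missing_id" or "city_state_mismatch"
    count: int   # number of affected rows


def go_no_go(
    issues: list[LoggedIssue],
    red_flags: list[str],
    row_count: int,
    max_error_rate: float = 0.01,  # assumed threshold, tune per pipeline
) -> tuple[bool, list[str]]:
    """Return (proceed, reasons). Any reason at all means no-go."""
    reasons = list(red_flags)
    total_errors = sum(issue.count for issue in issues)
    # An empty feed is treated as a failure rather than a division by zero.
    error_rate = total_errors / row_count if row_count else 1.0
    if error_rate > max_error_rate:
        reasons.append(f"error rate {error_rate:.2%} exceeds {max_error_rate:.2%}")
    return (not reasons, reasons)


# Hypothetical issues logged during cleansing.
issues = [
    LoggedIssue("cleanse", "missing_id", 3),
    LoggedIssue("cleanse", "city_state_mismatch", 2),
]

proceed, reasons = go_no_go(issues, red_flags=[], row_count=1000)
print(proceed, reasons)

blocked, why = go_no_go(issues, red_flags=["row count dropped sharply"], row_count=1000)
print(blocked, why)
```

Returning the reasons alongside the decision gives the notification step something concrete to send to the data ops team.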
Typically I design these systems to look at the errors that have already been logged and any other red flags that may have appeared in the previous two steps. Our notifications also start to get personal here with SMS messages going out to the data ops team. Lastly, after our data has been delivered, our good data pipeline practices aren't over yet. We need to run some additional quality checks and, again, if need be, perform an automated rollback. A great example here is checking the overall sales figures, website traffic, or any other important stats for your business.
If they fall outside two standard deviations of a four-week moving average, they're probably wrong. Not always, but often. At this point I would delete the updates, roll back those changes, and notify everyone involved, including your users, that there is a delay in the daily processing of your jobs. After that happens, your data engineering team will have a full account of what occurred and where to go to resolve any issues that came up during the process. Now, a word of warning. If you start your data journey trying to implement each of these different checks and steps along the way, you're going to fail.
It will simply take far too long to build every single one of these before users get any value out of the data. So my recommendation is to have a plan up front of all the checks and controls you want to put in place, then strategically add them as necessary to protect your users. If you're building something for the CFO, who is going to use it on an earnings call heard by market analysts who could affect the stock price of your company, then yes, all of these checks are important. However, if you're giving some data to a marketing team that's just experimenting with a marketing campaign, no, you don't necessarily need every one of these.
So as you go along, you'll get a better understanding of how important these checks are and how different processes call for different requirements.
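The post-delivery check mentioned earlier, flagging a figure that falls outside two standard deviations of a four-week moving average, could be sketched roughly as follows. This assumes one value per day (so a 28-value window) and uses the sample standard deviation; the function name and the sales numbers are made up for illustration.

```python
from statistics import mean, stdev


def is_suspicious(
    today: float,
    history: list[float],
    window: int = 28,     # assumed: four weeks of daily values
    sigmas: float = 2.0,  # two standard deviations, as described in the talk
) -> bool:
    """Flag today's figure if it is more than `sigmas` standard deviations from the recent average."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough history to estimate a spread
    average = mean(recent)
    spread = stdev(recent)  # sample standard deviation
    return abs(today - average) > sigmas * spread


# Hypothetical daily sales totals for the previous four weeks.
daily_sales = [100.0, 104.0, 96.0, 102.0, 98.0, 100.0, 100.0] * 4

print(is_suspicious(103.0, daily_sales))  # close to the recent average
print(is_suspicious(150.0, daily_sales))  # far from the recent average
```

A flagged value is only a signal that the load is probably wrong, so the result would typically feed the rollback and notification steps rather than being treated as proof of an error.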
- Working with systems and schemas
- Managing of a good data pipeline
- Setting up an environment
- Loading and profiling data
- Testing quality
- Adding data types
- Handling missing values and inferred members
- Performing master data lookups
- Loading schemas and tables
- Creating views