Join Martin Guidry for an in-depth discussion in this video Introduction to incremental ETL, part of Implementing a Data Warehouse with Microsoft SQL Server 2012.
- View Offline
In this section, I'd like to talk about implementing an incremental ETL process. Rather than have our ETL process run as one huge process, we can break it down into smaller pieces, making it easier to maintain and troubleshoot. There's two different techniques for this. The first would be to create a separate SSIS package for each data source. So, if you're importing five data sources into your data warehouse, you would get five different packages, each one doing the extract, transform, and load, on one data source.
The other technique is to do separate packages for each step. We will have one package that handles all of the extraction from all of our data sources, we'll have a second package that handles all of the transformation on all the data, and a third package will do all of the load. Let's talk about each of these techniques, individually. When we have one package for each data source, each individual package needs to handle E, T, and L. It needs to handle extract, transform, and load, for that data source. This is a good option when each data source is fairly independent.
So, each package is going to have very little awareness of what's going on in the other packages. Therefore, each data source needs to be able to be processed independently. This is also a good option when we have one data source that is troublesome or changes often. If we have one package that handles that data source, then when trouble comes up, we only have to troubleshoot that one package. All of our other data sources and all of the other packages that deal with all of those other data sources can continue to run fine without any modification.
We've, basically, narrowed this down to the smallest possible set of things that work with one data source, and that'll minimize our troubleshooting. The other option, separate packages for each step, is going to require a staging database. So, after you do the extraction, you have to dump that data somewhere. It's not ready to be dumped into the data warehouse, so we'll put it into a staging database. Inside of the staging database, we can do our transformations. This is a good option when we have a small window of time to load.
Normally, in an ETL process, when we do an extract transformation and load at once, that can be a long-running process. If we split them up into three components, then each component takes the minimal of time, and in particular, the last one, the load, should take very little time. There's some business scenarios where you might have a limited window of time for a load. If your data warehouse needs to be online close to 24 hours a day, seven days a week, you probably want to minimize the amount of time you take to load, so having a separate package just for the load can be a good idea.
This is also a good idea when different data sources need heavy integration. In other words, we have data coming from two separate sources that needs to be merged, and maybe put into the same table, or maybe we need to build a relationship, a foreign key, primary key relationship, between those two tables. By having the staging database, it'll make it easier to perform that transformation. We can load the data from the two separate data sources into one staging database. And then, when we run the transform, we're not worried about the data coming from separate locations, because now that it's all in the staging database, it all appears in one location.
- Typical databases vs. vdata warehouses
- Choosing between star and snowflake design schemas
- Working with appliances and "big data"
- Exploring source data
- Implementing data flow
- Debugging an SSIS package
- Extracting and loading modified data
- Enforcing data quality
- Consuming data in a data warehouse