Join Martin Guidry for an in-depth discussion in this video Overview of data warehousing, part of Implementing a Data Warehouse with Microsoft SQL Server 2012.
In this section, I'd like to talk about a basic working definition of a data warehouse. Some characteristics commonly associated with data warehousing is that we will integrate data from multiple sources. The existing data in the data warehouse does not change, or changes very infrequently. We typically have new data loaded periodically, most commonly, once per day. At night, we run a process that will add new data to the data warehouse, but it doesn't have to be daily. It could be weekly, monthly, pretty much any timeframe.
The data warehouse should have all of the data necessary to answer certain questions about our business, sometimes very simple questions like, who is our best customer? In other words, what customer bought the most product from us in a certain timeframe? Maybe we would be interested in looking at something like, which product sells best in a particular location? Our data warehouse, by having a list of everything that's been sold, a list of every customer, and a list of the location for all of those customers, could, in fact, answer both of these questions fairly easily.
You may sometimes see the term Data Warehouse, and other times, see the term Enterprise Data Warehouse. In some instances, these phrases would be synonymous, but there can be a difference between a DW, a Data Warehouse, and an EDW, an Enterprise Data Warehouse. For example, if the marketing department of a large company wanted their own data warehouse, for their own internal use, to handle, primarily, marketing data, that would be a Data Warehouse. Contrast that to an Enterprise Data Warehouse, which, in addition to marketing data, would contain data throughout the enterprise, in inventory, sales, accounting data, everything.
It would contain data for entities throughout the enterprise and be consumed by entities throughout the enterprise. The process to add additional data to the data warehouse, that typically happens once per day, is often called an ETL process, which stands for Extract, Transform, Load. So, this is a three-step process. Step 1, we pull data out of various locations. That's called extraction, extracting the data. Step 2, we may want to change the data in some way.
That would be called a transformation. You don't have to change the data if you don't want to, but typically, we do things to make the data more similar to other data sources. So, for example, if one data source writes out months as the full name, and another data source uses a three-letter abbreviation for months, and a third data source uses a numeral to indicate which month, you may want to do some sort of transformation to make all three of those look the same. The last step is loading, which is actually inserting the data into the data warehouse.
This gives us a three-step process, extract, transform, and load. This can be handled inside of one business process, or it could be three separate processes that run sequentially. We'll get into more detail about ETL processes later in this course.
- Typical databases vs. vdata warehouses
- Choosing between star and snowflake design schemas
- Working with appliances and "big data"
- Exploring source data
- Implementing data flow
- Debugging an SSIS package
- Extracting and loading modified data
- Enforcing data quality
- Consuming data in a data warehouse