Join Alan Simon for an in-depth discussion in this video Examining traditional data warehousing, part of Transitioning from Data Warehousing to Big Data.
- On the surface, the data warehouse architecture appears to be very simple, and, in fact, it is. But it is this very simplicity that often leaves us struggling to support the broad range of business intelligence and analytics, and the underlying data management capabilities, that organizations need today. There are three major aspects of the data warehouse architecture: the source systems, the data feeds, and the data warehouse itself. Let's look at each of those to ground ourselves in how data warehousing works, so that later, when we look at big data, we'll be able to see where the differences are.
The typical data warehouse architecture is a familiar one: source systems feed data into the data warehouse as it's built. In a data warehouse, we bring in data from only some of the possible sources of information that are out there. Even within each of those sources, we never bring in all of the data; instead, we select the information that we've decided we need for reports and analytics. We do so through significant upfront requirements analysis involving many people from both the business side and the IT side of the organization.
And then finally, adding a new data source, even though it might seem to be a simple process, is actually time consuming and, in fact, often requires us to re-architect at least part of the data warehouse itself and its supporting systems. The data feeds are the means by which data makes its way from the source systems into the data warehouse, through a process most of us know as extraction, transformation, and loading, or ETL. One of the key aspects of the ETL process is to help ensure that only clean data is fed into the data warehouse.
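To make the ETL idea concrete, here is a minimal, hypothetical sketch of one batch ETL step. The "source system" is just a list of dictionaries and the "warehouse" an in-memory list; the function names, field names, and the specific cleansing rule are illustrative assumptions, not from any particular product or from this course.

```python
# Hypothetical batch ETL sketch: extract only the fields we need,
# transform by applying a cleansing business rule, then load.

def extract(source_rows):
    """Extract: pull only the fields the warehouse actually needs."""
    return [{"customer_id": r["id"], "amount": r["amt"]} for r in source_rows]

def transform(rows):
    """Transform: apply business rules so only clean data moves forward."""
    clean = []
    for row in rows:
        # Illustrative business rule: reject missing or negative amounts.
        if row["amount"] is not None and row["amount"] >= 0:
            clean.append(row)
    return clean

def load(warehouse, rows):
    """Load: append the cleansed rows to the warehouse table."""
    warehouse.extend(rows)

source = [{"id": 1, "amt": 250.0}, {"id": 2, "amt": -40.0}, {"id": 3, "amt": None}]
warehouse = []
load(warehouse, transform(extract(source)))
print(warehouse)  # only the clean row survives
```

Note that the extract step already discards most of the source fields, which mirrors the point above: a warehouse holds only the selected data, not everything the source systems contain.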
The ETL process, by tradition, is batch oriented, though as we've evolved business intelligence from purely strategic usage into operational usage, some data warehouses now use real-time data feeds. Doing so, however, often requires a different architecture than the traditional batch-oriented ETL processes. ETL is heavily driven by business rules: all of the data transformation that has to occur needs to conform to the rules of different organizations across the overall enterprise.
This is where the ETL process often bogs down, not for technology reasons but for business reasons: trying to get different organizations and different stakeholders to agree. And then finally, as the data warehouse expands and grows over time, the performance of the ETL process becomes difficult to manage as more data makes its way into the environment. You might think that the data warehouse itself, once you address all the issues in the source systems and the ETL process, would be relatively straightforward, but we find a number of complications there as well.
The original set of rules that governed how data warehouses were architected, structured, and built was very rigid. And even though those rules have relaxed over the years, they still affect the way many data warehouses are built. Essentially, a data warehouse is a read-only copy of some of the data from around our enterprise and, in some cases, from outside it. For many reports and analytics, that's enough, but for many others, it's not detailed enough, or the architecture isn't robust enough.
Data warehouses themselves are heavily driven by business rules, just as the ETL process is, because as we bring in our data and structure it, we need to understand how data elements relate to one another so we can access information in a meaningful way and make the kinds of decisions we need to. As you might expect, we often find a lack of clarity and agreement on those business rules. And finally, despite our best efforts to ensure that only cleansed data makes its way into the data warehouse, problematic data is typically found anyway: people run reports out of data warehouses that, upon further review, aren't as accurate as they should be.
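One way problematic data surfaces after the fact is through a post-load audit that rechecks business rules against what's already in the warehouse. The sketch below is a hypothetical illustration of that idea; the rule predicates and field names are assumptions made up for this example, not part of any real warehouse toolset.

```python
# Hypothetical post-load audit: recheck business rules on warehouse rows,
# since bad data is often discovered only after it has been loaded.

def audit(warehouse_rows, rules):
    """Return the rows that violate any of the business-rule predicates."""
    return [r for r in warehouse_rows if not all(rule(r) for rule in rules)]

rules = [
    lambda r: r.get("amount", 0) >= 0,           # no negative amounts
    lambda r: r.get("customer_id") is not None,  # every fact needs a customer
]
rows = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": None, "amount": 5.0},
]
print(audit(rows, rules))  # flags the row with no customer_id
```

Encoding the rules as named predicates like this also gives the business and IT sides a concrete artifact to argue about, which is exactly where, as noted above, agreement tends to be hardest.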
All three of these components, the source systems, the data feeds and the data warehouse, become increasingly complicated on an enterprise scale, as we try to grow our data warehouses from where they begin to the point at which they address a broader set of cross-functional needs.