Join Alan Simon for an in-depth discussion in this video Looking at two Hadoop DW approaches, part of Transitioning from Data Warehousing to Big Data.
- Hadoop is increasingly becoming part of the overall data warehousing architecture for many organizations, but we actually have two different approaches we could follow for how we're going to use Hadoop. One approach uses Hadoop as a sort of super-sized data staging area sitting alongside a traditional relational data warehouse. Or, we could shift away from relational databases entirely and instead build our newest generation of Enterprise Data Warehouses directly on top of Hadoop. Let's look first at the super-sized data staging area approach.
Here we see our standard data warehousing architecture, with our ETL process feeding data from our source applications into our data warehouse, which is built on top of a relational database system. However, if we look more carefully at our architecture, our ETL process actually moves the data from our source applications into a relational data staging area first, and only then is it sent to the database where users will access their information for their reports and dashboards. This staging area could be a separate set of tables in the same database the users go to, or it could be a separate database instance, another SQL Server database for example. But regardless, we have our relational data staging area, which is the first place our data lands once it leaves our source applications.
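The staging-first ETL flow just described could be sketched as follows. This is only an illustrative sketch: SQLite stands in for both the relational staging area and the warehouse database, and all table names, columns, and rows are invented, not taken from the course.

```python
# Minimal sketch of staging-first ETL: raw data lands in a staging
# table first, then is cleansed (deduplicated, type-cast) on its way
# into the warehouse table that users query. SQLite is a stand-in for
# SQL Server or any other relational engine; the schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Relational data staging area: data lands here exactly as extracted,
#    untyped and possibly duplicated.
cur.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT)")
source_rows = [("1001", "250.00"), ("1002", "99.50"), ("1002", "99.50")]
cur.executemany("INSERT INTO stg_orders VALUES (?, ?)", source_rows)

# 2. Transform-and-load: cleanse on the way into the user-facing table.
cur.execute("CREATE TABLE dw_orders (order_id INTEGER PRIMARY KEY, amount REAL)")
cur.execute("""
    INSERT INTO dw_orders
    SELECT DISTINCT CAST(order_id AS INTEGER), CAST(amount AS REAL)
    FROM stg_orders
""")
conn.commit()

# The duplicate staging row was removed during the transform step.
print(cur.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0])
```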
With the super-sized data staging area approach, we no longer have a relational data staging area and then our ETL process, as we'll talk about later, is replaced by an ELT process. Here's what happens instead using Hadoop. We've already seen that with Hadoop and big data, our data will come from hundreds if not thousands of sources. Instead of using a relational staging area, we've replaced it with a Hadoop-based data staging area that's no longer constrained by all the rules and all the restrictions of a relational database.
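The ELT pattern described here, where raw data is loaded first and transformed only at read time, could be sketched like this. A local directory stands in for the HDFS landing zone, and the directory layout, field names, and records are all invented for illustration.

```python
# Sketch of "load first, transform later" (ELT): land raw records
# immediately, with no upfront schema or cleansing, and defer any
# transformation until the data is read. A temp directory stands in
# for the Hadoop landing zone; everything here is hypothetical.
import json
import pathlib
import tempfile

landing_zone = pathlib.Path(tempfile.mkdtemp()) / "raw" / "clickstream"
landing_zone.mkdir(parents=True)

# Extract + Load: write source records as-is (schema-on-read style).
raw_events = [
    {"user": "u1", "page": "/home", "ts": 1700000000},
    {"user": "u2", "page": "/cart"},  # missing "ts" -- accepted anyway
    {"user": "u1", "page": "/checkout", "ts": 1700000100},
]
(landing_zone / "part-0000.jsonl").write_text(
    "\n".join(json.dumps(e) for e in raw_events)
)

# Transform happens later, at read time, inside the cluster.
events = [
    json.loads(line)
    for line in (landing_zone / "part-0000.jsonl").read_text().splitlines()
]
complete = [e for e in events if "ts" in e]
print(len(events), len(complete))  # all landed rows vs. rows with a timestamp
```

The point of the sketch is that the incomplete record is still accepted at load time; deciding what to do with it is deferred to the transform step.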
Now, we can stream data in from all these different sources without having to work through the requirements process or the other constraints we've dealt with before, and bring it into Hadoop as quickly as possible. Then, when we build our relational data warehouse, we pull the data out just as we would from a relational staging area into the place where users will go for their reports and their analytics. We're not done yet, though, because once we build the data staging area, it also serves as what we could call an analytic sandbox.
We can run some of our rudimentary data mining routines here even though we haven't cleansed or standardized our data yet. We still need to treat the insights we get out of that dirty data as hypotheses rather than facts from which we would immediately make critical business decisions, but we can still run some of our predictive and descriptive analytics, at least to some extent, out of there. Meanwhile, the data used for the reports, dashboards, and visualizations that we would normally have in a relational data warehouse still hasn't changed under this architecture.
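A sandbox-style descriptive pass over uncleansed staging data might look like the following sketch. The rows are invented, and the inconsistency in them is deliberate: it shows why results from the sandbox should be treated as hypotheses rather than facts.

```python
# Sketch of rudimentary descriptive analytics in the sandbox: a rough
# per-page view count over raw, uncleansed rows. The data is invented;
# the inconsistent casing is intentional, to mimic dirty staging data.
from collections import Counter

raw_rows = [
    {"page": "/home"},
    {"page": "/Home"},  # inconsistent casing -- data not yet standardized
    {"page": "/cart"},
    {"page": "/home"},
]
view_counts = Counter(r["page"] for r in raw_rows)

# Because nothing has been cleansed, "/home" and "/Home" count
# separately -- a reminder that these numbers are hypotheses, not facts.
print(view_counts.most_common(1))
```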
If we look at our five essential needs for modern analytics, we'll see that two of them are addressed very well by this particular approach. We now can bring in as much data as we need and we can also quickly add new data sets. We've also partially addressed the operational business intelligence and real-time data needs by bringing in data as quickly as possible. And within that staging area, we're no longer restricted to just structured data, we can bring in semi-structured and unstructured data as well.
We still have some challenges fully addressing our predictive and discovery analytics needs: we can run those algorithms, but we haven't necessarily processed the data, so we're running them against data in a staging area rather than cleansed or standardized data. But for the most part, we've at least made some progress in addressing some of the analytic needs that we've had issues with under relational technology. Let's take a look now at our second approach, the non-relational Enterprise Data Warehouse built on top of Hadoop. In this architecture, we bring the data into Hadoop via streaming from all of our different sources, just as we do in the staging area approach, but what we don't do is pull the data out of that staging area and copy it into a relational data warehouse.
The Hadoop Enterprise Data Warehouse is our endpoint, and our reports, our analytics, and our dashboards will all run against the Hadoop data warehouse. As well, we still have the analytics sandbox in there. Following this approach, we've addressed all five of those essential needs for analytics. In some ways, our architecture is more complex, because we've moved away from the standard relational database technology that we've been using for data warehousing for so many years into this new generation of technology.
But in other ways, our architecture is simpler, because we've eliminated some of the components and some of the data flows and consolidated some of the functionality into the new Hadoop environment. What this means is that regardless of which of the two architectural approaches we select, we've started enabling an entirely new generation of data-driven insights.