Join Alan Simon for an in-depth discussion in this video, Exploring the evolution of DW with Hadoop, part of Transitioning from Data Warehousing to Big Data.
- So far we've seen several different approaches to the transformation from the world of traditional data warehousing into the big data era, and how Hadoop can play a critical role in not only handling that transition but also delivering an entirely new generation of data-driven insights. In many organizations, though, there might as well be a large sign hanging on the front door telling everybody that Hadoop is not welcome inside. Why would that possibly be? Here are three of the most common reasons, or rather misconceptions, about Hadoop and its supposed deficiencies when it comes to helping out with data warehousing and business intelligence.
First, Hadoop is way too slow for business intelligence. Despite its power and its ability to handle incredibly large amounts of data, the thinking goes, Hadoop is totally unable to deliver our standard BI reports and dashboards with anywhere near the response time that plain old relational databases can. A second strongly held belief is that Hadoop should only be used for advanced data mining analytics, not reporting or dashboards or visualizations or anything that would be delivered through a normal business intelligence tool.
Third, we've seen the Hadoop architecture and how it's intended to ingest as much data as quickly as possible. Many people believe that all that uncleansed data in Hadoop is useless when it comes to reporting and business intelligence. Why would so many people have these misconceptions about Hadoop? Simple. Because those misunderstandings are actually based on the first generation of Hadoop technology, not what is available today. We do need to understand, though, that these statements were accurate at some point.
An article from not that long ago listed some of the shortcomings of the first generation of Hadoop, specifically when it came to configuring and supporting environments, as well as the skills that were needed to build and support those systems. In fact, if we look at the skills profile for BI and data warehousing versus first-generation Hadoop, we don't see much in common between these two camps, so it's not a surprise that many experienced BI and data warehousing leaders couldn't quite get their heads around the value proposition of Hadoop for what they had been working with for so many years.
Earlier we looked at Hive and HiveQL, or HQL, and its role in providing SQL access to underlying Hadoop data. Hive essentially was a first-generation data warehousing infrastructure, as it was called, sitting on top of Hadoop, or more specifically, on top of the Hadoop Distributed File System, HDFS. It was SQL-like, but it was not fully compliant with the SQL standard. The queries from Hive were converted into batch-oriented MapReduce jobs, which for the most part resulted in query performance that was unacceptably slow when compared with just plain old business intelligence tools against relational data warehouses.
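To make the idea of a query becoming a MapReduce job a little more concrete, here is a minimal Python sketch of how a SQL-like aggregation such as `SELECT region, SUM(amount) FROM sales GROUP BY region` breaks down into map, shuffle, and reduce steps. This is only an illustration of the pattern, not how Hive actually plans or runs a query; the `sales` table, its columns, and all of the function names are hypothetical. In a real cluster each of these steps runs as a batch stage across many machines, with job startup and intermediate data written to disk along the way, which is a large part of why interactive BI queries felt so slow.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a "sales" table stored in HDFS.
SALES = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]


def map_phase(rows):
    """Emit one (key, value) pair per input row, as a mapper would."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the values collected for each key; here the aggregate is SUM."""
    return {key: sum(values) for key, values in grouped.items()}


def run_query(rows):
    """Chain the three phases, mimicking a GROUP BY ... SUM(...) query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(SALES))  # {'east': 150, 'west': 250}
```

Even for this tiny aggregation, every row has to pass through all three phases before a single result comes back, which is the batch-oriented behavior described above.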
If we looked at trying to apply first-generation Hadoop to large-scale enterprise data warehouses, we really only had one option. Remember, there are two approaches for bringing Hadoop into the world of data warehousing, one of which is to use it as a supersized staging area as well as an analytic sandbox, while our business intelligence and dashboards and reports would all come out of a relational enterprise data warehouse. That was the only approach that really was viable.
Trying to use Hadoop as the enterprise data warehouse itself really wasn't suitable because of performance and other issues. The same article looked at the second generation of Hadoop technology that began showing up around 2013, and it noted that the independent software vendor ecosystem, the major players in the Hadoop space, were definitely broadening and deepening the products that they were bringing to market to address many of those first-generation issues. The top priority of many of those vendors was to provide much better, much faster SQL access on top of Hadoop and bring about a much tighter integration between the two different components.
Earlier we saw how vendors such as Cloudera, Pivotal, and IBM all have their own enhancements and extensions to provide SQL access to HDFS-based data without going through MapReduce, as the first generation did, and with significantly better performance. If we look at applying Hadoop to the world of enterprise data warehouses with the second generation, the approach in which Hadoop is the enterprise data warehouse itself as well as the staging area is definitely a viable option, and many organizations have already begun building enterprise data warehouses using this particular architecture.
- Exploring big data, Hadoop, and analytics
- Examining the shortcomings of traditional data warehousing
- Comparing big data architectures for next-generation data warehousing
- Understanding alternatives
- Building a roadmap
- Managing big data-driven projects
- Monitoring and measuring success