Join Alan Simon for an in-depth discussion in this video Applying SQL to Hadoop, part of Transitioning from Data Warehousing to Big Data.
- If we've decided to build a new enterprise data warehouse on top of Hadoop rather than a relational database, as we've done for years, that means we will now be running our reports, dashboards, and visualizations, all of our time-tested Business Intelligence, directly out of Hadoop. But we've also seen that Hadoop is an entirely new technology that is not relational in nature. So how will we go about connecting our BI needs with our data that's now residing in Hadoop? Let's take a look at how old meets new, how we bring SQL and Hadoop together, and where data warehousing intersects with Big Data.
Hadoop is an open source project that vendors adapt and then bring to market. One of the key components of open source Hadoop is known as Hive, which has a query language called HiveQL, more commonly known as HQL. HQL is a language that's almost identical to SQL, not quite, as we'll see in a moment, but it's very similar. Software programs and BI tools that issue SQL to access data in a relational database can, with some minor modifications, issue HQL instead, which in turn gets translated to access the underlying data sitting within the Hadoop environment, and specifically within HDFS, the Hadoop Distributed File System.
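To give a feel for how close HQL is to SQL, here is a minimal sketch of an analytical query; the table and column names (sales_facts, region, revenue, sale_year) are hypothetical, not from the course, and the same statement would run against either a relational database or a Hive table:

```sql
-- A typical BI-style aggregation query; this syntax is valid in both
-- standard SQL and HiveQL, which is why BI tools need only minor changes.
SELECT region,
       SUM(revenue) AS total_revenue
FROM   sales_facts
WHERE  sale_year = 2023
GROUP  BY region
ORDER  BY total_revenue DESC;
```

Where the two differ is mostly at the edges: early versions of Hive, for example, did not support row-level UPDATE and DELETE statements the way a relational database does, since the data underneath is files in HDFS rather than managed storage.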
Over the years, some vendors have taken the original Hive code and added extensions to improve performance, but for the most part, Hive is the original mechanism that brought relational-style access to HDFS, or Hadoop-based, data. More recently, though, some of the major Big Data vendors have created their own next-generation solutions that bypass some of the core components of Hadoop, which we'll look at in a moment. For example, Cloudera has a system called Impala, Pivotal has one called HAWQ, spelled H-A-W-Q, and IBM has Big SQL, which accesses their Hadoop environment.
All of these are SQL access systems to the underlying Hadoop data. What a Business Intelligence tool or a program does is issue the same type of SQL it would normally use to access relational data, and these systems translate that SQL into code that can access the data from Hadoop, or from HDFS. A key point to remember about the core Hive capabilities: HQL, or HiveQL, is based on the SQL standard. There are some differences between SQL and HiveQL, but for the most part, it provides relational-like access to the underlying data.
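As a concrete illustration of how this relational layer sits on top of HDFS files, here is a hedged HiveQL sketch; the table name, columns, and HDFS path are assumptions chosen for illustration only:

```sql
-- Define a relational-style table over delimited files already sitting
-- in HDFS. Hive stores only the schema; the data stays where it is, and
-- queries against this table get translated into jobs that scan the files.
CREATE EXTERNAL TABLE web_clicks (
    user_id    BIGINT,
    page_url   STRING,
    click_time TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/warehouse/web_clicks';  -- hypothetical HDFS directory

-- A BI tool can now issue familiar SQL-style queries against it
SELECT page_url, COUNT(*) AS clicks
FROM   web_clicks
GROUP  BY page_url;
```

The EXTERNAL keyword here is the design point: dropping the table removes only the schema, not the underlying HDFS files, which is what lets Hadoop data serve many tools at once.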
However, this earliest generation of SQL-like solutions went through the Hadoop component known as MapReduce, which is a batch-oriented interface to the underlying HDFS data. For many of our data-mining algorithms, MapReduce works very well at churning through very large amounts of data, but when it comes to the performance that Business Intelligence tools and Business Intelligence users require, we saw some mismatches in the earlier days in terms of access time and the relative performance of accessing the data and producing reports.
So today's solutions, the ones from Cloudera, Pivotal, IBM, and other Big Data vendors, bypass MapReduce and instead use their own proprietary interfaces to the underlying HDFS data. That means many of the performance constraints of the earlier years of relational access to Hadoop data no longer apply, and we're starting to see excellent performance from Business Intelligence tools going against Hadoop data.
To your average Business Intelligence user, what this means is that while they're sitting at a tool accessing the underlying data, they don't necessarily care, nor even have to know, that the data is coming out of Hadoop rather than a relational database. Just remember, though, that having SQL or SQL-like interfaces to Hadoop does not mean that Hadoop is the same as a relational database. Even with that abstraction layer, Hadoop is a very different technology that follows an entirely different set of rules and has different capabilities, in some cases far better capabilities than the relational technology we've used for many years.