Join Lynn Langit for an in-depth discussion in this video Introducing Impala, part of Hadoop Fundamentals.
- As we continue on with our journey into the Hadoop library ecosystem, we're gonna look next at a set of libraries that have to do with queries. The first one is Impala and as the name might imply, Impala has something to do with speed. In fact, the need for speed is driving a number of these libraries. What kind of speed? Speed of query execution. Although Hive might be convenient because users can write almost ANSI SQL and get information back out of a Hadoop cluster, they really don't wanna wait for batch processes to execute.
They're used to the speed of relational database queries. And so, there have been several attempts in the Hadoop ecosystem to replicate that speed, but against HDFS-scale data. So, Interactive Hive is another way to look at Impala. It uses a query language that is a subset of ANSI SQL and it gets a result running against Hadoop data 10x-100x faster. Sounds great, doesn't it? Sounds like everybody would wanna use it. What's the catch? The catch is that it's in-memory: it makes use of the memory across the cluster to get the speed increase.
So, there are limits to Impala because, of course, disk space is cheaper than memory. However, if used intelligently, it can really be a dramatic speed increase for getting Hadoop data out of the cluster. Interestingly, it also uses a columnstore format, which is a wide column you might remember from when we discussed HBase, a key and a wide column, and the reason for this is that it's designed for analytic-style queries. So it's actually an addition to Hive, but not a replacement for Pig, which, you might remember, is a data manipulation language designed to make changes to or clean the data.
Another interesting aspect of Impala is that, because it's running in-memory, it does not generate MapReduce jobs, and we'll see this in the demonstration. So, why might you wanna use Impala? Well, I've given you probably the most compelling reason already: dramatically faster query results, on the order of the 10x-100x I mentioned. However, the limitation is that Impala requires use of memory on the cluster, and a lot of memory for large data queries. Impala is an implementation of an improved Hive that is specific to the Cloudera distribution.
I selected Impala even though it's vendor-specific because it's an implementation that I'm familiar with, that I've worked with customers on, and that they're excited about. There are some alternatives to Impala and I'll be discussing these in subsequent movies. As I mentioned, Impala is specific to the Cloudera distribution and it actually ships with it. In the sample Virtual Machine that we have, Impala is set up by default. Now, there are a couple of intricacies of working with it that I'll be discussing. It's not really difficult, but it's just not as intuitive as you might think.
So, I'll mention those. The first of the details around working with Impala is that it relies on information about the data that it's querying from the shared metastore, and you might remember from our earlier discussion of Hive that Hive relies on this type of information as well for its queries. Now, one of the interesting aspects of working with Impala is that if you make changes to the data that's stored in the Hadoop cluster, you need to update the metastore or you will not be able to query those tables with Impala. As I mentioned, this metastore is also used by Hive, so it's a shared metastore in this particular implementation.
If you add data to your cluster by importing and the data does not show up under the data browser, you'll need to open up the Impala query browser, run the invalidate metadata command and then refresh. Then you will see the data and you'll be able to query it. This is because the metastore does not refresh automatically for the Impala view. Now, there are a number of ways to optimize Impala queries. There are query tuning hints and timeouts, and Impala actually works with compressed (zipped) data.
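As a rough sketch of the metadata step described above, these are the kinds of statements you might run in the Impala query editor after importing data. The table name `web_logs` is hypothetical and used only for illustration. It isn't clear from the talk whether "refresh" means the `REFRESH` statement or simply reloading the browser view, so both statements are shown here with their usual roles.

```sql
-- Hypothetical table name (web_logs), used only for illustration.

-- After a table is created or imported outside Impala (for example through Hive),
-- tell Impala to reload its cached copy of the shared metastore information.
INVALIDATE METADATA;

-- Or limit the reload to a single table, which is cheaper on a large metastore.
INVALIDATE METADATA web_logs;

-- When only new data files were added to a table Impala already knows about,
-- REFRESH is the lighter-weight statement.
REFRESH web_logs;

-- The table should now be listed and queryable.
SHOW TABLES;
SELECT COUNT(*) FROM web_logs;
```

Which statement you need depends on what changed: a brand-new table generally calls for `INVALIDATE METADATA`, while new files in an existing table usually only need `REFRESH`.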
Of course, an obvious way would be to allocate more memory or to work with more memory, since Impala is a memory-based query system. In addition to optimizing Impala itself, there are emerging alternatives, both vendor-specific libraries (for example, Hortonworks has something called Stinger) and open source libraries, and we'll be talking about those in subsequent movies. So, this is a very active space in the Hadoop ecosystem, as you might guess, because people want to use SQL queries rather than writing MapReduce jobs and they want them to run faster.
So, be sure, for whichever library or libraries you're considering using, that you look at both the open source documentation and the vendor documentation, because the releases literally are coming in months rather than years.
- Understanding Hadoop core components: HDFS and MapReduce
- Setting up your Hadoop development environment
- Working with the Hadoop file system
- Running and tracking Hadoop jobs
- Tuning MapReduce
- Understanding Hive and HBase
- Exploring Pig tools
- Building workflows
- Using other libraries, such as Impala, Mahout, and Storm
- Understanding Spark
- Visualizing Hadoop output