Join Ben Sullins for an in-depth discussion in this video Why use Hive, part of Analyzing Big Data with Hive.
- [Narrator] First I want to cover the basics and answer the question, "Why use Hive?" Now, we need to define what Hive is to start, though. And Hive is a SQL-like abstraction that allows for interactive queries on top of Hadoop data. What this means is that we can use the language of a data analyst, SQL, and ask questions of our data living in Hadoop's file system known as HTFS. One important thing to note is that Hadoop is not a database. Hadoop is an ecosystem of programs working together to support a large scale application by using distributed fault-tolerant workflow approach.
All Big Data type jargon that I'm going to try to clear up for you here as we go through this course and learn how to use Hive on top of your Hadoop data. So, what you've maybe heard about is that Hadoop is a database and it's going to replace everything else you have going on, and that's just really not true. Now, it does offer some luxuries, such as SQL support and distributed reliable backup, so even if one of your servers goes down in your cluster your system is still functioning.
All these things are why Hadoop is much more than just one single thing, it is an whole ecosystem of things. Now, this is where Hive comes into play. Let's look at the benefits of using Hive. First, Hive offers a familiar language for analysts and developers. Previously, Hadoop required Java to work with your data, and now, with Hive, you don't have to have any Java knowledge whatsoever, you can use SQL, which is just the language all of us data analysts have been using for years to analyze data, so it really opened up Hadoop and grew the adoption of Hadoop because it was much more accessible.
Along with that, we also have tools that can work with it. Because it has a standard, SQL-like interface, business intelligence and analytics platforms can pull directly from your data living in Hadoop. Now this has opened up the floodgates, and now anyone that knows how to use Excel can actually analyze data living in Hadoop because of Hive. Now, that isn't to say that Hive isn't without its limitations. All those benefits put together, there're still a few things here that I want to call out that are limitations of Hive and things that, in working with it, you are going to run into, and so I want to make you aware of them now so that way you know how to work around them later.
Now the first one is that Hive is not fully supported of the ANSI standard of SQL. The ANSI standard of SQL is something that was created many years ago and has evolved several times, which is supported by almost every database. That's why SQL is such a powerful language to learn, because, once you know it and once you know the standard, you can really use it against any other type of database or anything like Hive that supports that SQL-like language. Now, because Hive doesn't support the full ANSI standard, what that means is you may try to run some functions or some queries that you're used to doing in other database systems that just don't work in Hive.
Another thing is that Hive is batch-oriented. So, what happens is, you write some SQL, and it converts it to MapReduce. Now, MapReduce is the language of Hadoop. It's the interface that was originally created to work with data in Hadoop, and it's incredibly slow because it's designed to scale, it's not designed to be a fast, analytical database. So, while Hive makes it accessible because it just punches in SQL and converts it to MapReduce, it still takes a long time to even run simple-type queries.
Another thing is that Hive supports schema-on-read. What that means is that it doesn't, what that means is that it doesn't define the structure of the data in advance, it does it when you actually query it. Now this is great, because your schemas can be flexible, and you can make things more dynamic, however, the challenges is that they become fragile. Because what they're dependent upon is this underlying structure in Hadoop. So, if that underlying structure kind of moves out from underneath you, and you're not aware of it, you may run a query and get invalid results or an error, because things changed and because there was no schema defined in advance, there was no way to know that, prior to actually running your query.
So, I look at Hive as a really accessible way to work with Hadoop data, but it does have some limitations, like I've pointed out here. So what do we do if we don't want to use Hive, or we're worried about some of these things? Well, there are other options. The first and fastest growing option is to use the Apache Spark platform, which offers SQL language interface, as well as many other options, such as Python and Scala. This really is taking over the Hadoop world and displacing MapReduce entirely in lots of companies. Another option is a Hive competitor from Cloudera called Impala.
Now, Impala supports all the SQL-like functions as Hive, but it runs much faster since it doesn't require MapReduce and has its own database engine running on each data node in your Hadoop cluster. Hive itself can be faster when run on Tez, which is another platform for executing the jobs and distributing the queries differently than the traditional method. The jury's still out on how this compares to Impala and Spark, but many organizations like this because they're already invested in Hive, and this is just a way to make it run faster.
Another newer platform gaining popularity is Presto. Presto provides full ANSI SQL support, so not the limitations that you have with Hive or any of the other ones that support a version of SQL. And it dramatically increases performance without having to move your data outside of Hadoop. That's kind of the holy grail. Your data lands in Hadoop, and it never leaves. You can analyze it. You can restructure it. You can transform it. Do whatever you need to do inside of Hadoop, and you don't have to worry about syncing things with anyone else. Now, this last option is one of those that, unfortunately, does have that, where you have a third-party database system.
Vertica is probably the most popular database in the space right now, and the idea here is that you take your data from Hadoop, and you transform it into these analytical data sets that are conformed and consistent and really, really great for analysts, and you put that into a database specifically designed to handle analytical workloads. And that's great, however, it does add more maintenance cost, and now you have to sync the two systems and those things can kind of fall down. So, there is a lot to consider, but it is a really valid other option that a lot of companies are exploring.
This course shows how to use Hive to process data. Instructor Ben Sullins starts by showing you how to structure and optimize your data. Next, he explains how to get Hue, the Hadoop user interface, to leverage HiveQL when analyzing data. Using the newly configured option, he then demonstrates how to load data, create aggregate tables for fast query access, and run advanced analytics. He also takes you through managing tables and putting functions to use. This course is designed to help you find new ways to work with datasets so you can answer the tough data science questions that come your way.
- Defining data structures in Hive
- Selecting data
- Joining tables
- Manipulating data
- Filtering results
- Aggregating data
- Using built-in aggregate functions
- Mastering built-in table-generating functions
- Using CUBE and ROLLUP
- Using clauses: WHERE and HAVING
- Using LIKE, JOIN, and SEMI JOIN
- Using functions: String, math, date, and conditional