At the end of this video the learner will have an understanding of managed vs. external tables in Hive.
- [Narrator] Now let's understand a little bit more about table structures in Hive. First, a table is a definition that we can use to run SQL queries against, remember Hive typically just does things on Read. So, the schemas defined at the time that you actually issue your query. Now when we think about a table in Hive, there are essentially two main types. First we have managed tables. And a managed table is where Hive actually owns the data. And I'll show you what that means here in a second.
Now with that, the data no longer is in its original location, when you create a managed table in Hive, the system actually moves the data from its original location into the Hive warehouse. Now with managed tables, when we create them, the format follows the typical SQL convention. So the same way you're used to creating a table in another database platform is the same way you typically do it here. This data lives in a specific location in Hadoop, in HDFS and it's known as the Hive warehouse.
The path is User, Hive, and Warehouse. And we'll take a look at that in a second. Now the other type of table are external tables. An external table is just the definition. So Hive doesn't actually own the data. It points to files that live elsewhere in HDFS. I say these are more fragile because what happens is, the location of those files can change. The data in the files can change, all of that can happen out from underneath you when you're just going to do your analysis and now you're getting errors or data that doesn't seem to match up with the data you were expecting.
This data, as opposed to the Hive warehouse data, can live anywhere in HDFS. Now let's take a look and actually see some of these tables in HUE, look at some of the files and identify where the data lives. So back here in HUE, I have my examples that we loaded, our Hive samples, and we have up top the data browser and the Metastore tables. So what this is, this actually shows me everything in my Hive warehouse. Everything that's defined as a table. And I can see the tables down below, and if I click on one of them, say Customers, I can see some basic stats for this.
There's one file, there's 15,000 bytes of data, the file format is parquet and it's not compressed. So if I scroll down just a little bit what you can see is the columns that are in the table and even some of the sample data. The key thing though is that we want to know where does the data actually live? So if I click on this link, for location, I can see exactly where this data lives here. Under user, Hive warehouse, and then a separate folder called Customers. That's typical when you create a managed table that that table, the name of the table, becomes a folder underneath the Hive warehouse.
And that's actually where your data lives. So this is how Hive stores data in its managed tables.
This course shows how to use Hive to process data. Instructor Ben Sullins starts by showing you how to structure and optimize your data. Next, he explains how to get Hue, the Hadoop user interface, to leverage HiveQL when analyzing data. Using the newly configured option, he then demonstrates how to load data, create aggregate tables for fast query access, and run advanced analytics. He also takes you through managing tables and putting functions to use. This course is designed to help you find new ways to work with datasets so you can answer the tough data science questions that come your way.
- Defining data structures in Hive
- Selecting data
- Joining tables
- Manipulating data
- Filtering results
- Aggregating data
- Using built-in aggregate functions
- Mastering built-in table-generating functions
- Using CUBE and ROLLUP
- Using clauses: WHERE and HAVING
- Using LIKE, JOIN, and SEMI JOIN
- Using functions: String, math, date, and conditional