Union two subsets with the same schema.
- [Voiceover] Even when Hive is not running on top of Tez, it's capable of doing these types of operations. I wouldn't call them lookup operations, but they are similar in the amount of time that they take. So let's say you just wanted to get the top 10 lines for a particular file. In Pig, it would launch a MapReduce job for that, I believe, whereas in Hive, it would just return the top 10 lines. In this case, we're doing a Sort and Limit, so no matter what, it is going to have to launch a MapReduce job. So we're going to select user ID and the sum of scores as user score from the Stack Overflow comments table, but here's the magic.
We're grouping by user ID and then sorting in descending order of their user score. Cool, so we're going to do Sort and Limit now, so we're just gonna pull up the file. So we're doing Select User ID, so we're just gonna grab every user ID and the sum of their scores, which is going to make sense in a moment, as user score, from the Stack Overflow DB in the comments table.
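The Sort and Limit query described above can be sketched roughly as follows. This is a minimal illustration, not the course's actual script: it runs the SQL against an in-memory SQLite database as a stand-in for Hive, and the table name `comments`, the column names `userid` and `score`, the sample rows, and the limit of 2 are all assumptions made for the example. The same SELECT, GROUP BY, ORDER BY, and LIMIT clauses are also valid HiveQL.

```python
import sqlite3

# SQLite stands in for Hive here; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (userid INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(1, 5), (2, 3), (1, 2), (3, 10), (2, 1)],  # made-up sample rows
)

# Sum each user's scores, sort descending by that sum, keep the top rows.
QUERY = """
SELECT userid, SUM(score) AS userscore
FROM comments
GROUP BY userid
ORDER BY userscore DESC
LIMIT 2
"""

top_users = conn.execute(QUERY).fetchall()
print(top_users)
```

Because the rows have to be aggregated and sorted before the limit can be applied, this kind of query cannot be answered by simply reading the first few lines of a file, which is why the transcript notes that a MapReduce job is launched no matter what.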
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administrating clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups