Join Jack Dintruff for an in-depth discussion in this video, Sort and limit, part of Data Analysis on Hadoop.
- [Voiceover] Sort literally just sorts stuff in ascending or descending order, which you can specify. Limit will give you however many elements you ask it for. So if I give it 10,000 rows and I say limit 10, it will give me the first 10 rows. The reason you would use these together is, let's say, you wanted to calculate the top K for some particular thing. So we're also going to leverage the aggregation function to calculate the score at the user level, then order by score in descending order, and then grab the top 10.
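The steps just described (aggregate a score per user, sort in descending order, limit to the top K) can be sketched in plain Python. This is only an illustration of the idea, not the script shown in the video; the sample records and the `top_k_users` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample of (user_id, score) comment records; not the course's data.
comments = [
    ("u1", 5), ("u2", 3), ("u1", 7), ("u3", 10),
    ("u2", 1), ("u4", 2), ("u3", 4),
]


def top_k_users(records, k):
    """Return the k users with the highest total score, highest first."""
    # Aggregate: sum the score per user.
    totals = defaultdict(int)
    for user_id, score in records:
        totals[user_id] += score
    # Sort: order users by total score, descending.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # Limit: keep only the first k rows of the sorted result.
    return ranked[:k]


top_users = top_k_users(comments, 2)
print(top_users)
```

The limit is applied only after the sort, which is what makes the result a "top K" rather than an arbitrary K rows. If fewer than K users exist, the slice simply returns all of them.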
Very often you've got a lot of users and you want to figure out who is the most impactful user. In this case, we're using Score as a proxy for that, as a signal for how active this person is. So we are going to sort and limit the comments data set by user ID, and we're going to get the top 10 users based on their score. Let's go ahead and take a look at this script that we've got here. As you can see, we're just loading up the data, like normal. Then we're going to group all of…
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administrating clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups