Join Jack Dintruff for an in-depth discussion in this video, Filters, part of Data Analysis on Hadoop.
- [Voiceover] So, in Hive, if you have a table that's partitioned by something, very often you can do a selective filter that reads only the partition where you know the data exists. That's a pretty unique thing in Hive; I don't think Pig has that capability. A lot of the basic features and functionality in Hive, like filtering, sorting, and limit, and a lot of the detail around how they are implemented, are there to make lookups really, really fast and to let you treat HDFS as though it were an actual DB.
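As a rough sketch of the partition filter described above: the table `stackoverflow.posts_by_year` and its columns are hypothetical and are not part of the course dataset. The idea is that a predicate on the partition column lets Hive read only the matching partition directory instead of the whole table.

```sql
-- Hypothetical table, partitioned by the year a post was created.
-- Each distinct creation_year value is stored in its own HDFS directory.
CREATE TABLE IF NOT EXISTS stackoverflow.posts_by_year (
  id    INT,
  title STRING
)
PARTITIONED BY (creation_year INT);

-- Filtering on the partition column lets Hive prune partitions:
-- only the creation_year=2012 directory needs to be read.
SELECT id, title
FROM stackoverflow.posts_by_year
WHERE creation_year = 2012;

-- One way to check which partitions a query would touch.
EXPLAIN DEPENDENCY
SELECT id, title
FROM stackoverflow.posts_by_year
WHERE creation_year = 2012;
```

A filter on a non-partition column, such as `title`, would not get this benefit and would still have to scan every partition.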
So, now we're gonna filter for this intriguing user that we found in the last module. Cool, so we're gonna select all columns from stackoverflow.users, which is the user table, where the id is equal to 267, which is the user id for the user that we discovered in the last module. So, we're gonna launch that query. Cool, and so you can see, this didn't even have to launch a MapReduce job to do this. Instead, it was able to just literally find that row by doing a DB scan, and it only took 8.7 seconds. So, this is the entry for this user in the user DB, and so this has everything that the site knows about them.
So, it has their user id, it has the time that their account was created, it has the time since their last activity, and it also says that apparently they're in Washington, DC, and their name is Al E. That's really all the information that this DB contains.
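The lookup walked through above can be written roughly as follows. The table name and id value come from the video; the `SET` line is an assumption about configuration, shown only because `hive.fetch.task.conversion` is the Hive setting that typically lets a simple select-and-filter run as a direct fetch rather than a MapReduce job.

```sql
-- Assumption: allow simple SELECT / WHERE / LIMIT queries to run as a fetch task.
-- Many Hive installations already default to this value.
SET hive.fetch.task.conversion=more;

-- Select all columns for the user found in the last module.
SELECT *
FROM stackoverflow.users
WHERE id = 267;
```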
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administrating clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups