In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop Distributed File System (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administering clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups
Skill Level: Intermediate
- [Voiceover] Hi, I'm Jack Dintruff, and welcome to Data Analysis on Hadoop. In this course, we'll look at how to analyze data on Hadoop using Pig and Hive. We'll also learn how to interact with HDFS and delve a bit into how YARN works. I'll start by showing you how to load your data onto HDFS and manipulate it from the command line once it's there. Then I'll show you how to read that data in Pig, where you can start aggregating your data to reveal valuable insights. We'll also see how to create a Hive table from a query using the Hive query language. We'll be covering features of both the Pig and Hive platforms while highlighting the similarities and differences along the way, so that you can choose the platform that's right for you and your data.
Now, let's get started with Data Analysis on Hadoop.