Join Jack Dintruff for an in-depth discussion in this video, MapReduce processing, part of Data Analysis on Hadoop.
- [Voiceover] So, MapReduce is a concept that derives from functional programming. So in functional programming there are three main functions. There's map, reduce, and filter. So filter isn't especially applicable in this case. So we're just gonna talk about map and reduce, and how they work together. So, the way MapReduce works is you have some function that performs some task. Let's say it's counting the occurrence of particular words in a text file. What you do is map that function to every single piece of data in your data set.
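As a rough illustration of the word-count example described above, here is a minimal Python sketch using the built-in `map` and `functools.reduce`. It is plain Python rather than Hadoop code, and the sample lines and the helper names `count_words` and `merge_counts` are made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: each string stands in for one piece of the data set.
lines = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps",
]

def count_words(line):
    """Map step: count the words in a single piece of data."""
    return Counter(line.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Apply the same function to every piece, then fold the results together.
partial_counts = map(count_words, lines)
total_counts = reduce(merge_counts, partial_counts, Counter())

print(total_counts["the"])  # 3
print(total_counts["fox"])  # 2
```

The map step never looks at more than one piece at a time, and the reduce step only combines results that already exist. That separation is what lets the two steps "work together" in the sense the transcript describes.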
And you often do this by splitting the data set up into many pieces. So MapReduce isn't for something that is iterative. It's really made for things that can be parallelized, where the processing, or the function is the same across the entire data set. So, as on the HDFS side we had the name node, which was the master for all of the data nodes. On the MapReduce side we have a very similar structure, where the resource manager is what manages all of the resources as you would think.
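The point about splitting the data set and running the same function on every piece can also be sketched in Python. This is only an analogy under stated assumptions: a local thread pool stands in for the worker nodes of a cluster, and `split_into_chunks`, `count_chunk`, and the sample words are hypothetical names and data, not part of Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks roughly equal pieces."""
    size = max(1, -(-len(items) // num_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_chunk(chunk):
    """The same function runs on every chunk, independently of the others."""
    return Counter(chunk)

words = "to be or not to be that is the question".split()
chunks = split_into_chunks(words, 3)

# Assumption: worker threads stand in for cluster nodes.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_chunk, chunks))

# Merge the independent partial results into one final count.
totals = sum(partials, Counter())

print(totals["to"], totals["be"])  # 2 2
```

Because no chunk depends on the result of another, the chunks can be processed in any order. An iterative computation, where each step needs the previous step's output, does not split up this way.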
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop Distributed File System (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administrating clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups