Join Lynn Langit for an in-depth discussion in this video Understanding MapReduce 1.0, part of Learning Hadoop.
- So you've traveled far in our journey of understanding Hadoop and we're really at a critical juncture now. We need to start talking about the nitty-gritty of the programming paradigm which is MapReduce. So MapReduce is really what has caused Hadoop to be such a sensation in the data world. So let's start diving in so that we can understand it. I'm gonna think of it like an onion 'cause there's a lot of layers to this. So we're gonna start on the top and we're gonna just keep peeling in to deepen our understanding. So first of all, MapReduce is a method of programming. It's a way of thinking and it's a way of approaching data problems.
It was originally created out of research at Google and it was designed to solve a single problem, and that is how to index or get meaning out of all the information that is available on the internet. There's two parts to MapReduce. I bet you can guess what they are: the Map and the Reduce part. So let's first think about the Map part of MapReduce. Technically, you're gonna be writing a function in Java that is gonna perform some action on some data. That function, when MapReduce is run as a job, will be run on each node where the data lives.
You might remember from our discussion about a production Hadoop cluster that an HDFS node is normally replicated three times. So yeah, the map function will go to each and every node and while that might seem inefficient, if you think about the scale at which Hadoop clusters are run in production situations like Facebook or Yahoo, their need is to be able to process the data and to have high accessibility or availability of that data, and they don't wanna mess with if a single node goes down. You might remember from our discussion on files that HDFS is self-healing so if a node goes down, the map job will just be retried on another node that has that data.
This idea of bringing the compute to the data rather than bringing the data to the compute that is running the map job on each node is one of the keys to understanding MapReduce. Once the map or functions run on each of the nodes, they'll take some input and they'll output something very simple: a set of key and value pairs on each node. The second part of MapReduce is the Reduce part. Like the mapper, this will execute the reduce function on some set of data.
Now this is different than the mapper in that the reducer executes only on some of the nodes of the cluster. You might remember that mapper will execute on at least one node for each set of data. So really, in essence, all of the nodes. What the reducer does it will aggregate the sets of key-value pairs on some of the nodes and the output of the reducer is a single combined list. So I think it's important when we're learning new programming concepts to have a visualization, and I really like this one.
So you can see that we have three nodes underneath the mapper, and we have various types of data and that's represented by the various colors on the mapper. So these are three physical separate machines. Now in reality the implementation is probably gonna be more complex because that's the minimum amount. Most production Hadoop implementations have at minimum 50 to 100 nodes, but it's more common to have several thousands of nodes and the data is triple replicated and that's really not shown here. This has just simply shown different chunks of data on different map machines.
You'll see that once MapReduce runs, the output is sorted. Now I didn't discuss up until this point the magic in the middle which is called the shuffle and sort phase. The shuffle and sort phase, as you might expect, is quite complex and one of the key aspects of MapReduce. How do those lists that come out of the mappers get aggregated into their reducer? Is it an automatic process? Is it something that we have to write code for? Well, the reality is the mixture of both. When we are writing our MapReduce jobs, we'll work both with the default implementation of the shuffle and sort, and also we'll learn in what situations and how we wanna override those defaults in order to get better performance and for specific data types.
As we continue to think about MapReduce, I'll remind you that there are two major versions of this programming paradigm. I think when you're first getting started, it's helpful to start with Version 1. Version 1 is really an implementation of what Google open-sourced so many years ago that they had success with in indexing the Web. It's really very simple and focused. It's not for every big data problem but it's a great way to start. Version 2, which we'll also cover in this course, builds on Version 1.
So let's think about the key aspects of MapReduce Version 1. It is extremely distributed and scalable and can be very, very inexpensive because it runs on commodity hardware. We've talked about this over and over with the HDFS file system. It's resilient because if one of the nodes goes down, HDFS is self-healing. It really lends itself to parallel processing because as we just learned, the map portion of MapReduce executes on each of the nodes. So it allows for great scalability and the Reduce portion then aggregates everything on up.
So hopefully you're getting excited to start coding this. So let's talk about some of the steps. The first step is you're gonna create a class to hold your MapReduce. Now just like any other code, as you become more sophisticated about this, you're gonna refactor and you're gonna have separate classes for separate pieces of functionality. But in our samples, we're gonna keep it really simple and just really straightforward. So we're gonna create a single class that has the map class, the reduce class, and a main function with an implementation of a job runner which will then call an instance of the map and reduce classes.
Now you'll notice that I put static here. You might not be familiar with that if you're not a Java programmer; static just means global. So parts of this programming paradigm is that the mapper and the reducer have to be static. So here's some pseudocode to get us warmed up. So if you're unfamiliar with Java, we're gonna take it slow and gentle. If you've got C# or C++, it should be not a difficult sort of transition. The code itself I don't find too difficult. It's the programming paradigm because really what this is, is this is using OOP code or object-oriented code where we have objects and we have instances of objects and so on and so forth, but it's being used in a functional way.
I think that's where really the challenge is for so many programmers who come out of the OOP world like the C# or the C++ or the JavaWorld to be able to write effective MapReduce code. What I mean by functional is a type of code where state is not shared. You might remember back when I was discussing the different JVMs, the different demons, that I emphasized the fact that the task trackers that run on each node run in a single JVM and JVMs don't share their state. Well, as I just explained, when we run the map function, the map function runs in a JVM on each of those nodes.
So for most OOP programmers, they're used to being able to share a state when they do data manipulation. For example, they could create a customer object and they could continue to manipulate the object several times in several ways and then just output the results when they're done. That kind of programming won't be possible on MapReduce. In MapReduce, it's a single type of input in, run your mapper, get your key-value pairs out, run your reducer, get your combined list, and then go on to the next one. So what does that mean in reality? What that often means is methods that you use in your relational database-backed programs will need to be split apart into several different steps because you wanna think of it as single input, single output.
Very simplistically, you literally can think of it like functions in, for example, an Excel spreadsheet where you have a set of numbers, you do a function or manipulation and you get a new set of numbers, and then you go to the next cell; it is literally like that. So if we take a look at the code here, we've got our public class for MapReduce and then we have our main function, and we would call a job runner instance, and we would call a map instance and a reduce instance, and we would have our map and our reduce which would have a mapper and a reducer. So what is the Hello World for MapReduce? It's called word count and again, to understand it, we need to think about where MapReduce came from.
It was Google trying to solve the problem of counting all the words on the Web. So the example you'll see over and over and over and you'll wanna be familiar with as you're learning the different languages, libraries, and tools is Hello World. Now it's a starting point, you wanna go beyond there, but it's a great place for comparison. It's actually used for benchmark performance testing like simplistic benchmarks, and it's just de facto standard for getting started with Hadoop programming. So how does it work? Well, not a big surprise. It takes some text as input.
It produces a list of words and counts them. So let's take a kinda silly example. I'm gonna try to read this, let's see if I can. How much wood could a woodchuck chuck if a woodchuck could chuck wood? Now with the standard implementation of MapReduce, you can see the output after going through the mappers and the reducers. There's one word how, one word much, two words of wood, two words of could, so on so forth. Now the key question looking at this is what constitutes a word? This is where you, as a programmer, come in because you're gonna have to define that logic.
There's no magic here. So a couple of considerations just looking at this simple sentence: What's gonna happen to the question mark? Are words case-sensitive? Notice we have woodchuck which is counted as a single word, but we also have the word wood and the word chuck. So what happens when you have duplication? These are all considerations around the logic of the implementation of the mappers and the reducers. To that end, let's take a look at some pseudocode for word count. So you can see in this Word Count MapReduce Pseudocode, in the mapper we have a filename and file contents and we have some kind of loop to iterate over the information.
We're just using a "for each" and here we have "word in file-contents." We wanna emit that word and 1, so we're just splitting it out. Now you notice what's missing here the implementation of what constitutes a word. What we're gonna see as we get into the actual code is there's quite a lot of complexity here. It's very commonly implemented in a regular expression or some other kind of filter and you'll see as we go along this is a common source of errors, bugs, and also performance problems when we are scaling these mappers across hundreds and thousands of nodes.
The pseudocode for the reducer takes the output from the mappers, which is a list of keys and values, and in our case it's the words and how many of them. Then we start with zero as an initializer and again we have a loop. For each value in values, we take the sum and then we add a value and we emit the aggregated count. Now there's other ways to do this and, again, we're gonna look at the actual code in terms of correctness, effectiveness, and performance.
- Understanding Hadoop core components: HDFS and MapReduce
- Setting up your Hadoop development environment
- Working with the Hadoop file system
- Running and tracking Hadoop jobs
- Tuning MapReduce
- Understanding Hive and HBase
- Exploring Pig tools
- Building workflows
- Using other libraries, such as Impala, Mahout, and Storm
- Understanding Spark
- Visualizing Hadoop output