Join Lynn Langit for an in-depth discussion in this video Exploring use cases for Pig, part of Hadoop Fundamentals.
- There are several different ways to run a Pig script. Let's go over them here. The first way is in Script, or Batch, mode: you just run your script from the Hadoop shell. The second way, and this is just goofy, is in Grunt, or Interactive, mode: from the Hadoop shell, you type the word pig, which starts the Pig shell. The third way is Embedded mode, and that's within Java. Now you might be wondering, why do we have these ways? When do you use them? Script and Grunt are used during the testing phase, and Embedded mode is very often used in production, because it's very common to combine scripts, chaining them together in a process that is repeatable and quite complex when you're processing real-world data.
Let's take a look at WordCount, or Hello World for the Hadoop ecosystem using Pig. So I'm going to just go through this simplified example so that we can compare and contrast this to WordCount in some of the other languages and libraries. So first, we're using the concept of aliases. So we have variables, so lines, words, grouped, wordcount, those are aliases. And then we're setting them equal to values that are generated by Pig. So lines is equal to calling the LOAD keyword. We're just loading some text.
And then we're loading it as a character array, which is a data type in Pig. And we need to load it as that type to be able to use it with the functions that process it. For the words variable, we're using the FOREACH keyword against the lines input, and then we're calling several functions. So this, I think, is really quite elegant, and I want to contrast this with what I showed earlier in Hive, where we used a (mumbles) which, to me, was kind of unreadable. So we're going to generate a new output, we're going to flatten that output, which comes back in a bag, and then we're going to TOKENIZE, which I think is kind of beautiful.
I'm a little bit of a nerd, but we're tokenizing the line, which is doing a basic split of the text into words. Now the TOKENIZE function separates on spaces, and that might not be sophisticated enough for your particular implementation so you might have to write your own TOKENIZE, but for a lot of implementations of basic text processing, it's good enough. And then you get a word, and then we have the grouped variable, which is going to group the words by word, and then for the wordcount, for each grouped, we're going to generate a group.
And we're going to count the words, and then we're going to dump out the results. Now are you starting to see the patterns? Are you starting to be able to see where the map would occur, and where the reduce would occur? For me, it's easier to think backwards. If I look at the wordcount variable, it's clear to me that's where the aggregate is going to be generated. So we're going to have these key-value lists that are generated, and we're going to group them together, so that's going to be the reducer. So the words variable and the grouped variable are where the mappers are going to be generated.
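The slide itself is not part of this transcript, so here is a minimal sketch of the data flow just described, written in plain Python rather than Pig. Each function is named after the step it imitates, and the comment above it shows the Pig statement it is assumed to correspond to. The input path placeholder, the sample text, and the function names are illustrative assumptions, and a plain whitespace split stands in for TOKENIZE.

```python
from collections import defaultdict


def load(text):
    # Assumed Pig equivalent: lines = LOAD '<input path>' AS (line:chararray);
    return text.splitlines()


def tokenize_and_flatten(lines):
    # Assumed Pig equivalent:
    #   words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    # Splitting each line yields a "bag" of words; the nested loop flattens
    # those bags into one word per row.
    return [word for line in lines for word in line.split()]


def group_by_word(words):
    # Assumed Pig equivalent: grouped = GROUP words BY word;
    grouped = defaultdict(list)
    for word in words:
        grouped[word].append(word)
    return dict(grouped)


def count_groups(grouped):
    # Assumed Pig equivalent:
    #   wordcount = FOREACH grouped GENERATE group, COUNT(words);
    return {group: len(bag) for group, bag in grouped.items()}


sample_text = "the pig ate\nthe apple"  # hypothetical input
lines = load(sample_text)
words = tokenize_and_flatten(lines)
grouped = group_by_word(words)
wordcount = count_groups(grouped)

# Assumed Pig equivalent: DUMP wordcount;
for word, count in sorted(wordcount.items()):
    print(word, count)
```

Each alias is just the named result of the previous step, which is why the script reads top to bottom as a pipeline.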
Again, it's really critical, as you're working with Hadoop data, that you think in terms of MapReduce. That's why we took the time in the earlier modules to go down to that low level and write the MapReduce code. You may not use it in production; you may use Pig, or you may use Hive. More and more customers do, because they are simpler and more flexible. However, when you get into real production situations, you need to be able to translate back down to mappers and reducers, so you can find out where bottlenecks occur and debug and fix them.
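To make that translation concrete, here is a rough single-process sketch of WordCount split into explicit map, shuffle, and reduce steps. It is plain Python, not the Hadoop API: the function names and the sample input are assumptions for illustration, and a real job would run the mappers and reducers on different nodes.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    for word in line.split():
        yield (word, 1)


def shuffle(pairs):
    """Sort pairs by key and collect the values for each key.

    This imitates the sort-and-group step that sits between the map
    and reduce phases.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reducer(key, values):
    """Aggregate all the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Map phase: every line is processed independently of the others.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle and reduce phases: one reducer call per distinct word.
    return dict(reducer(key, values) for key, values in shuffle(mapped))


result = word_count(["the pig ate", "the apple"])  # hypothetical input
print(result)
```

Seen this way, the per-line tokenizing is the map side and the per-word count is the reduce side, with the grouping in between as the boundary where data has to move between them.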
Now we're going to take a look at Pig by example. We're going to, again, use the Cloudera distribution on the virtual machine, the Hue tool, and we're going to look at the Pig samples.
- Understanding Hadoop core components: HDFS and MapReduce
- Setting up your Hadoop development environment
- Working with the Hadoop file system
- Running and tracking Hadoop jobs
- Tuning MapReduce
- Understanding Hive and HBase
- Exploring Pig tools
- Building workflows
- Using other libraries, such as Impala, Mahout, and Storm
- Understanding Spark
- Visualizing Hadoop output