Join Lynn Langit for an in-depth discussion in this video Understanding the parts and pieces, part of Hadoop Fundamentals.
- So as you might be thinking, setting up a development environment for Hadoop can be quite complicated. I've found the best way to approach it is to break it down into steps. So let's do that here. The first part is you need to figure out how you want the Hadoop binaries to be accessible. So you really have a couple of different considerations. The first thing is, do you want to work with the open source, plain-vanilla Apache Hadoop binaries, or do you want to work with a vendor distribution? Usually you'll work with a free version or a developer edition, and you pick the vendor for one of three reasons: you have a preference because you have worked with that vendor before, say, Microsoft; or you think the vendor is leading the market, as Cloudera is said to be by many analysts and consumers at this time; or there's a feature that the vendor has that you want to use.
So you need to figure out which distribution of Hadoop you want. Once you've figured out which vendor's distribution you're getting, then you want to figure out which version number you want. Usually you want the latest stable release, which is 2.4 as of the date of this recording, but it does change very rapidly, so you should check the Apache Hadoop site that I showed in an earlier recording to see what the latest version is and what's going to work for you. And probably the most complicated part of this is to figure out where you want to put your Hadoop cluster or server for development.
I've found that it's often faster, and in some ways cheaper, to spin up a Hadoop server or a Hadoop cluster on the Cloud, because as I showed in the previous movies, it's just a couple of clicks once you understand all the different options. I have tried to do a local install a couple of times, but I really ran out of time. It took more than a day the first time I tried to do it. So even though it's completely free to install the Hadoop binaries on a Linux box, it really is complicated, especially if you're coming, like I was, out of the Windows ecosystem, and time is money.
One thing I will tell you, if you do decide to go with a Cloud-based version of Hadoop, is that the Cloud vendors do charge you. Amazon charges by the minute, and it's a really common new-developer mistake to forget to turn off a Cloud instance, and you can run up a lot of charges, so you really want to make sure that you turn it off when you're done. In the middle here is the option that I most often use, which is a local virtual machine. I don't run a Linux box natively, and unless I used Hortonworks, which runs on Windows, I would need a Linux box for Hadoop.
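To make the point about forgotten Cloud instances concrete, here is a minimal sketch of the arithmetic. The per-minute rate is a made-up placeholder, not a real price from Amazon or any other vendor, and the function name `forgotten_instance_cost` is hypothetical.

```python
def forgotten_instance_cost(rate_per_minute: float, minutes_left_running: int) -> float:
    """Return the charge for an instance that was left running.

    Assumes simple per-minute billing; real pricing models vary by vendor.
    """
    if rate_per_minute < 0 or minutes_left_running < 0:
        raise ValueError("rate and minutes must be non-negative")
    return round(rate_per_minute * minutes_left_running, 2)


# Hypothetical rate of $0.01 per minute, left on over a two-day weekend.
weekend_minutes = 2 * 24 * 60
print(forgotten_instance_cost(0.01, weekend_minutes))
```

Even at a small placeholder rate, the charge scales linearly with every minute the instance stays on, which is why turning it off when you're done matters.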
And I will usually use the Cloudera virtual machine on my laptop. In order to do that, I have to have virtualization software. So I'm going to show you the steps to set that up with a local virtual machine in a subsequent movie. Now there are a couple more considerations for your Hadoop development environment. The next one is the data storage. I find that when you're first starting, it's actually simplest if you just set up Hadoop in standalone mode, which uses the native file system. That might seem kind of like you're wimping out, because you're not using HDFS, but as you'll see when we get into it, writing the MapReduce algorithms is pretty complicated, never mind testing that code, debugging it, and deploying it to production. So when you're just starting, it's good to start with, I think, the simplest possible situation.
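To give a feel for the shape of a MapReduce algorithm before any cluster is involved, here is a minimal sketch of a word count that simulates the map, shuffle, and reduce phases in plain Python on a local machine. This is not Hadoop code and does not use any Hadoop API; the function names are illustrative assumptions only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on in-memory input."""
    return reduce_phase(shuffle(map_phase(lines)))


sample = ["hadoop stores data", "hadoop processes data"]
print(word_count(sample))
```

Working through the logic on local data like this is in the same spirit as starting with the native file system: you get the algorithm right first, then worry about distribution.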
Now, quickly, you're probably going to move to pseudo-distributed mode, which is a single HDFS file system on a single machine. I find that I very seldom use a fully distributed implementation for writing sample code or proof-of-concept code. That, of course, I will do if I'm in a production situation, but I find that when I'm just getting started, setting up a cluster of either virtual machines or Linux boxes is kind of overkill for development. The other situation is, if you are using the Cloud, then you need to figure out if you're going to use HDFS in, let's say, Amazon EMR, Elastic MapReduce, or if you want to just use S3 in the case of Amazon, or Blob storage in the case of Azure.
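One way to picture the storage choices described above is as different values of Hadoop's default file system setting, `fs.defaultFS`, in `core-site.xml`. The sketch below generates that fragment for each option. The `file:///` and `hdfs://localhost:9000` values follow the Apache single-node setup guide; the S3 entry is an assumption for illustration, with a hypothetical bucket name, and the exact URI scheme depends on your Hadoop version and connector. The mode names and the function `core_site_xml` are also hypothetical.

```python
import xml.etree.ElementTree as ET

# Mode names are labels for this sketch, not official Hadoop identifiers.
DEFAULT_FS_BY_MODE = {
    "standalone": "file:///",                       # native local file system
    "pseudo-distributed": "hdfs://localhost:9000",  # single-machine HDFS
    # Hypothetical bucket; the scheme (s3n, s3a, ...) varies by Hadoop version.
    "cloud-object-store": "s3n://example-dev-bucket",
}


def core_site_xml(mode: str) -> str:
    """Build a minimal core-site.xml fragment for the given storage mode."""
    config = ET.Element("configuration")
    prop = ET.SubElement(config, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    # Raises KeyError for a mode that is not in the table above.
    ET.SubElement(prop, "value").text = DEFAULT_FS_BY_MODE[mode]
    return ET.tostring(config, encoding="unicode")


for mode in DEFAULT_FS_BY_MODE:
    print(mode, "->", core_site_xml(mode))
```

The point of the sketch is that moving between the native file system, single-machine HDFS, and a Cloud store is largely a configuration change, so the code you write in development can stay the same.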
And again, just like using the local file system rather than HDFS when you are developing on a local machine, I do find that sometimes when I'm using a Cloud-based distribution, I'd rather just use a Cloud-based file system. So that's kind of a lesson I've learned over time in working with Hadoop: when you're in development, you don't always need to use the HDFS file system. The third consideration is which libraries you want to have as part of your installation. Which of the features are you going to be writing code against? So, the core question is which version of MapReduce, 1.0 or 2.0? Because there's so much richness around MapReduce, I'm actually going to show both in this course.
We're going to start with 1.0, and then we're going to move to 2.0 when we're writing our MapReduce jobs. You'll also want to consider which additional libraries you need. As you've seen in previous movies, it's most common to have Hive or Pig at minimum. There are a number of other libraries that modern developers will use, and we'll touch on those as we get into that as well. And then finally, you're going to need some developer tools. So we're going to talk about that in the next movie.
- Understanding Hadoop core components: HDFS and MapReduce
- Setting up your Hadoop development environment
- Working with the Hadoop file system
- Running and tracking Hadoop jobs
- Tuning MapReduce
- Understanding Hive and HBase
- Exploring Pig tools
- Building workflows
- Using other libraries, such as Impala, Mahout, and Storm
- Understanding Spark
- Visualizing Hadoop output