Join Lynn Langit for an in-depth discussion in this video Introducing Hadoop cluster components, part of Learning Hadoop.
- We've got a lot of background information about the Hadoop ecosystem and it's time to take a look at it in its full glory. It's a little bit of a scary picture with all these boxes and components. One of the aspects of working with Hadoop is understanding what's possible and picking the various components that are right for your particular business situation. I'm gonna take a couple minutes and go through the core components. We'll be hearing about them again and again throughout the course because it's a little bit overwhelming when you first see it. At the bottom you can see that we have HDFS, the Hadoop Distributed File System, which we just covered in terms of files.
HDFS is the most commonly used, but as I mentioned in the previous movie, sometimes the file system, particularly in cloud implementations, is actually a more standard store like Amazon S3 or Azure Blobs, depending on the cloud vendor. But it is common to use HDFS. Sitting on top of HDFS, the second core part of a Hadoop implementation is MapReduce. You'll see another new term sitting on top in this diagram, which is MapReduce version 2. This is also called YARN, and YARN stands for Yet Another Resource Negotiator.
We'll be covering both MapReduce 1 and 2 in this course. At the time of this recording, MapReduce 2 has become the standard and is more commonly used, but it builds on top of MapReduce 1, so MapReduce 1 is a good way to learn the processing framework. You can see in addition to this, on the right, we have HBase, which we covered in a previous movie. It's very commonly used to query out of a column-store abstraction over the top of the file system. Next to that is Hive, with HQL, the SQL-like query language that is used to query data stored in Hadoop.
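To make the MapReduce processing model concrete, here is a minimal word-count sketch in plain Python that mimics the map, shuffle, and reduce phases Hadoop runs across a cluster. The function names are illustrative only, not Hadoop APIs, and everything here runs on one machine rather than distributed:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key, as Hadoop
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values -- here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real Hadoop job, the map and reduce functions run as tasks on many nodes, and the framework handles the shuffle, fault tolerance, and scheduling (via YARN in MapReduce 2).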
There are other libraries shown here as well. Things like Pig, which is a scripting language that's used for ETL-like processes, or Extract, Transform and Load. You also see we have the Mahout library, which is for machine learning or predictive analytics. We have Oozie, which is for workflow or coordination of jobs. And that works in combination with Zookeeper, which coordinates groups of jobs, and we'll see both of those things. Sqoop is for data exchange between Hadoop and other systems, particularly relational systems like SQL Server.
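As a rough illustration of the Extract, Transform, Load pattern that Pig scripts express, here is the same idea in plain Python rather than Pig Latin. The data and field names are made up for the sketch; in Pig you would express the filter and aggregation with FILTER and SUM over data loaded from HDFS:

```python
import csv
import io

# Extract: read delimited input (in Pig, a LOAD from HDFS).
raw = "id,amount\n1,10\n2,20\n3,5\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: keep only the large records (like Pig's FILTER),
# then aggregate them (like Pig's SUM).
big = [r for r in rows if int(r["amount"]) > 9]
total = sum(int(r["amount"]) for r in big)

# Load: write the result out (here, just print it).
print(total)  # 30
```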
Flume is a log collector, because Hadoop jobs produce a large amount of log information about job progress; since the jobs run in batch, they take time to run. In this particular diagram we also have Ambari, which is for provisioning, managing and monitoring Hadoop clusters. This is a representation of some of the open-source libraries in the Hadoop ecosystem. There are actually more; these are the core libraries. To contrast this, we're gonna take a look at a commercial distribution.
You can see for Cloudera's commercial distribution that in the center is Hadoop, and that's assumed to be HDFS and MapReduce 2. Surrounding that are Pig and Hive. To the right you see HBase. You see Hive up above HBase as well, because it's used both for metadata and for query. You see something called Hue sitting on top of that. That's specific to Cloudera. We're gonna actually take a look at that coming up here in just a few minutes. It's a graphical user interface that makes interacting with Hadoop information simpler.
You also then see Oozie, which is workflow and scheduling. You see Zookeeper for coordination, and Flume and Sqoop. When you're selecting a distribution, it's an important consideration to understand which libraries, and which versions of those libraries, are supported by that particular vendor, so you can match the capabilities of the libraries with your particular business needs.
- Understanding Hadoop core components: HDFS and MapReduce
- Setting up your Hadoop development environment
- Working with the Hadoop file system
- Running and tracking Hadoop jobs
- Tuning MapReduce
- Understanding Hive and HBase
- Exploring Pig tools
- Building workflows
- Using other libraries, such as Impala, Mahout, and Storm
- Understanding Spark
- Visualizing Hadoop output