Join Barton Poulson for an in-depth discussion in this video A brief introduction to Hadoop, part of Big Data Foundations: Techniques and Concepts.
Any discussion of big data will invariably lead to a mention of Hadoop. Hadoop is a very common, very powerful platform for working with data, but it can be a little hard to get a grip on exactly what it is and what it does. This movie is designed to give the briefest possible introduction to Hadoop, a topic that could fill several courses all on its own, and with that, here is the bare minimum on Hadoop. The very first question is, what is Hadoop? It sounds like it's an untranslatable word for big data or for transformative business practice.
Instead, Hadoop was the name of the stuffed animal that belonged to the son of one of the developers. It was a stuffed elephant, which explains the logo as well. But what is Hadoop, and what does it do? Most significantly, Hadoop is not a single thing. It's a collection of software applications that are used to work with big data. It's a framework or platform that consists of several different modules. Perhaps the most important part of Hadoop is the Hadoop Distributed File System, or HDFS. What this does is take a collection of information and spread it across a bunch of computers.
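To make the idea of spreading one collection of information across many computers concrete, here is a minimal sketch in Python. It is not how HDFS actually chooses machines: the round-robin placement, the tiny block size, the two copies per block, and the node names are all assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(block_count: int, nodes: list[str], copies: int = 2) -> dict[str, list[int]]:
    """Assign each block index to `copies` different nodes, round-robin.

    Hypothetical placement rule for illustration only.
    """
    if copies > len(nodes):
        raise ValueError("cannot keep more copies than there are nodes")
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for index in range(block_count):
        for copy in range(copies):
            node = nodes[(index + copy) % len(nodes)]
            placement[node].append(index)
    return placement


data = b"x" * 1000                        # stand-in for a large file
blocks = split_into_blocks(data, 256)     # four blocks: 256, 256, 256, 232 bytes
nodes = ["node-a", "node-b", "node-c"]    # hypothetical machine names
placement = place_blocks(len(blocks), nodes)

for node, held in placement.items():
    print(node, "holds blocks", held)
```

Keeping more than one copy of each block means the data survives if a single machine fails, which is the reason the sketch stores every block on two nodes.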
That can be dozens or hundreds of computers, or tens of thousands in certain cases. So it's not a database, because a database usually implies a single file. Especially if you're talking about a relational database, it's a single file with rows and columns. Hadoop can have hundreds or millions of separate files that are spread across these computers and all connected to each other through the software. MapReduce is another critical part of Hadoop. It's a process consisting of mapping and reducing, and it's a little counterintuitive, but here's how it works.
Map means to take a task and its data and split them into many pieces. You do that because you want to send the work out to various computers, and each one can only handle so much information. Let's say you have 100 gigabytes of information and each of your computers has 16 gigabytes of RAM. You're going to need to split it up into at least six or seven different pieces, and send those out to the different computers that you're renting from Amazon Web Services or wherever. Map splits it up and sends it out to work in parallel on these different computers.
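The arithmetic behind that split can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the `usable_fraction` parameter are assumptions, added because a machine usually cannot devote all of its RAM to the data.

```python
import math


def min_pieces(data_gb: float, ram_gb: float, usable_fraction: float = 1.0) -> int:
    """Smallest number of pieces so that each piece fits in one machine's usable RAM."""
    usable_gb = ram_gb * usable_fraction
    if usable_gb <= 0:
        raise ValueError("usable RAM must be positive")
    return math.ceil(data_gb / usable_gb)


# 100 GB of data, 16 GB of RAM per machine, all of it usable (optimistic).
print(min_pieces(100, 16))                       # 7

# Same data, but assume only half of each machine's RAM is free for the data.
print(min_pieces(100, 16, usable_fraction=0.5))  # 13
```

In practice the pieces tend to be much smaller than a machine's full memory, so the real number of pieces is usually well above this lower bound.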
The reduce process takes the results of the analyses that you've done on each of these dozens of different computers and combines the output to give a single result. Now, the original MapReduce program has been replaced by Apache Hadoop YARN, which stands for Yet Another Resource Negotiator. Sometimes people just call it MapReduce 2, and YARN allows a lot of things that the original MapReduce couldn't do. The original MapReduce did batch processing, which meant you had to get everything together at once, you split it out at once, you waited until it was done, and then you got your result.
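The split, work in parallel, then combine pattern described above can be sketched on a single machine in plain Python. Word counting is a conventional teaching example rather than anything from this video, the function names are made up, and the chunks are processed one after another here instead of on separate computers. The grouping step in the middle is usually called the shuffle.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: turn one chunk of text into (word, 1) pairs. Runs independently per chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key (the word)."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for the piece of data sent to one computer.
chunks = ["big data big tools", "data tools data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 3, 'tools': 2}
```

Because `map_phase` never looks at any chunk other than its own, each call could run on a different machine, which is what makes the approach scale.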
YARN can do batch processing, but it can also do stream processing, which means things are coming in as fast as possible and going out simultaneously. It can also do graph processing, which covers things like social network connections. That's a special kind of data. Next is Pig. Pig is a platform in Hadoop that's used to write MapReduce programs, the process by which you split things up and then gather back the results and combine them. It uses its own language, called the Pig Latin programming language. Probably the fourth major component of Hadoop that is most frequently used is called Hive. Hive summarizes, queries, and analyzes the data that's in Hadoop.
Hive uses a SQL-like language called HiveQL, for Hive Query Language, and this is the one that most people are going to use to actually work with the data. So between the Hadoop Distributed File System, MapReduce or YARN, Pig, and Hive, you've covered most of what people use when they're using Hadoop. On the other hand, there are other components that are available. For instance, HBase is a NoSQL database for Hadoop, meaning a nonrelational, or "not only SQL," database.
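To give a feel for what a SQL-like query over data looks like, here is a small stand-in using Python's built-in `sqlite3` module. This is SQLite, not Hive: running HiveQL requires a Hadoop setup, and the two dialects differ in their details. The `page_views` table and its rows are invented for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for data that would live in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [
        ("ana", "home"),
        ("ben", "home"),
        ("ana", "pricing"),
        ("cy", "home"),
        ("ben", "docs"),
        ("cy", "docs"),
    ],
)

# Summarize the data: how many views did each page get?
rows = conn.execute(
    """
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
    ORDER BY views DESC, page
    """
).fetchall()
conn.close()

print(rows)  # [('home', 3), ('docs', 2), ('pricing', 1)]
```

The appeal of a SQL-like layer is that an analyst writes a short declarative query like this one and leaves the splitting and combining of work to the platform.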
Storm allows the processing of streaming data in Hadoop. Spark allows in-memory processing. This is actually a big deal because it means you're taking things off of the hard drive and putting them into the RAM of your computer, which is much, much faster. In fact, in-memory processing can be a hundred times faster than on-disk processing, although you do have to get through the process of putting the information into the RAM, which usually isn't counted when people are doing these statistics. Spark is often used with Shark, something that enables the in-memory processing.
And then there's Giraph, spelled like graph with an i, which is used for analyzing graph data such as social network data. Now, there are maybe 150 different projects that can all relate to Hadoop. These are just some of the major players. So the question also is, where does Hadoop go? It can be installed on any computer. You can put it on your laptop if you want. In fact, a lot of people do, so they can practice with it and get things set up, and then send it out to a cloud computing platform. In fact, that's where it usually is.
Among cloud computing providers, Amazon Web Services is the most common, but Microsoft Azure has its own form of Hadoop, and there are a lot of other providers that allow you to install Hadoop and run it on their computer systems. Who uses Hadoop? Basically anybody with big data. Yahoo!, not surprisingly, because they developed it, is the single biggest user of Hadoop. They have over 42,000 nodes running Hadoop, which is mind-bogglingly huge.
LinkedIn uses a huge amount. Facebook uses a bunch, and Quantcast, which is an online marketing analysis company, has a huge installation as well, and there are a lot of others. Finally, it's worth pointing out that Hadoop is open source. While it was developed by engineers at Yahoo!, it's now an open source project from Apache. So you'll often hear it called Apache Hadoop, or Apache Hive, or Apache Pig. One of the things about open source projects like this is that they're free, which explains in part Hadoop's popularity.
Also, anyone can download the source code and modify it, which explains why there are so many modifications, extensions, and programs that work along with Hadoop and make the most of its capabilities. The takeaway message of this presentation is that Hadoop is not just one thing, but a collection of things that collectively make it much easier to work with big data, especially when it's used on a cloud computing setup. Hadoop is extremely popular in the big data world, and there's very, very active development for Hadoop, but there's also very stiff competition for the market.
Not everybody is willing to just stand by and let Hadoop have everything. This should make it a very exciting situation for companies and consumers who want the best tools for working with their big data projects.
- Evaluate the demand for data science in business, research, and consumer technology.
- Assess the careers and skills in data science.
- Review the ethical issues in data science.
- Explore data visualization with graphing tools.
- Discover how data scientists use tools such as Hadoop and Excel.