Join Lynn Langit for an in-depth discussion in this video Introducing Hadoop, part of Learning Hadoop.
- What is Hadoop? It consists of two components, and it is often deployed alongside other projects as well. What are those components? The first is open-source data storage, or HDFS, which stands for Hadoop Distributed File System. The second is a processing API called MapReduce. Most commonly, professional Hadoop deployments include other projects or libraries, and there are many, many of these. I think there are over 25 now.
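To make the MapReduce component concrete, here's a minimal sketch of the programming model in plain Python. This is an illustration only: real Hadoop jobs are typically written in Java against the MapReduce API, with input read from HDFS rather than a local list, and the names here (`map_phase`, `reduce_phase`) are my own, not Hadoop's.

```python
from collections import defaultdict

def map_phase(document):
    # The map step emits (key, value) pairs -- here, (word, 1) for each word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # The reduce step aggregates all values that share the same key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = [
    "Hadoop stores data in HDFS",
    "Hadoop processes data with MapReduce",
]
pairs = [pair for doc in docs for pair in map_phase(doc)]
result = reduce_phase(pairs)
print(result)  # 'hadoop' and 'data' each appear twice
```

The separation matters: because the map step treats every input record independently, Hadoop can run it on many machines at once, which is the idea behind the processing API.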
The ones I see most commonly, and that we'll be covering in detail in this course, are HBase, Hive, and Pig. In addition to understanding the core components of Hadoop, it's important to understand what are called Hadoop distributions. Let's take a look at those. The first set of distributions is 100% open source, and you'll find those under the Apache Foundation. The core distribution is called Apache Hadoop, and there are many, many different versions. I think we're up to 3.4 at the time of this recording, and there are many minor versions.
The Hadoop version release cycle is quite aggressive. As a consideration when you're implementing Hadoop, most enterprises stay one to two full versions behind the currently released version, because they consider the newest open-source software to be immature and not ready for use in a professional setting. Because of this, there are several commercial distributions, and these are the ones I most often work with for my customers. They differentiate themselves from the open-source distribution by wrapping around some version of it and providing additional tooling, monitoring, and management, along with other libraries.
The most popular of these are from the companies Cloudera, Hortonworks, and MapR. We'll be taking a look at all three of these most popular commercial distributions in this course. In addition to that, it's quite common for businesses to use Hadoop clusters on the cloud. The cloud distributions that I use most often are from Amazon Web Services or from Microsoft with the Windows Azure HDInsight. Here's where it gets a little bit confusing so let me clarify.
When you're using a cloud distribution you can use an Amazon Distribution which implements the open source version of Hadoop, so Apache Hadoop on AWS with a particular version, or you can use a commercial version that's implemented on the AWS cloud such as MapR on AWS. Not all commercial versions are available on all clouds. That's a consideration when you're selecting a cloud-based Hadoop distribution.
We'll also be taking a look at the Windows Azure HDInsight distribution as it's gaining in popularity particularly with Microsoft customers. As a reminder there are several factors that cause businesses to use Hadoop, and I like to say it quickly this way, Cheaper, Faster, Better. Again, it's very important to consider the appropriate kinds of Big Data problems. As I mentioned in a previous movie, those that are related to behavioral data rather than transactional or line of business data are most commonly a better fit.
But if you have those kinds of data situations or problems, the Hadoop ecosystem can be tremendously cheaper, as it runs on commodity hardware and scales to petabyte size or more. It can also be faster, because it uses the MapReduce processing algorithm, which we're going to be looking at in quite some detail in this course, and which allows for parallel data processing. Even though the processing is implemented in batch, it runs on each of the nodes, which can result in much faster overall processing of large amounts of data.
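To see why that parallelism helps, here is a simplified sketch in Python where threads stand in for cluster nodes: each "node" counts words in its own split of the data independently, and the partial results are merged afterward, which mirrors the shuffle/reduce step. This is an analogy under my own assumptions, not actual Hadoop code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def process_split(lines):
    # Each "node" works on only its own split, with no coordination needed.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Three splits of the input, as if distributed across three nodes.
splits = [
    ["big data on node one"],
    ["more big data on node two"],
    ["and big data on node three"],
]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(process_split, splits))

# Merging the per-node partial counts plays the role of the reduce step.
total = sum(partials, Counter())
print(total["big"])  # 3
```

Because no split depends on any other, adding more nodes shortens the batch: that independence is what lets Hadoop scale processing out across commodity hardware.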
In considering Hadoop business problems, I wanna give you some examples. These are various types of business situations for which Hadoop could be a good data solution. The first one is risk modeling. If you think about it in terms of insurance companies or financial companies when they're determining whether they're gonna give you a loan, their business is making the best decision about where to allocate their resources. The more data they can have, both transactional and behavioral, the better results they can get.
Many clients in these industries are already working in the Hadoop ecosystem because they're storing massive amounts of data. Another one is credit card activity. If you've ever had that call from the credit card company warning you about a purchasing pattern that seems to be out of the normal range and asking you to validate it because it could be fraudulent, they're most probably using some big data solution, and oftentimes it could be Hadoop. Another one is customer churn analysis. It costs a lot more to gain a new customer than to keep a current one, so it's in the best interest of many companies to collect as much information as possible: both transactional, such as when the customer actually left, and also behavioral, what activities the customer was doing shortly before they left, so that they can reduce the number of customers who leave. Recommendation engines. Many of us enjoy Netflix; this is probably the classic recommendation engine. Another is Amazon's "you might like." These are engines, or data solutions, that take massive amounts of not only your own data but also data from customers who match your profile, so that they can make recommendations that are useful for you.
You'll hear a common theme as I'm going through the use cases: it's behavioral data. Over and over again, Hadoop solutions make use of behavioral data so that companies can make better decisions. Let's look at a couple more use cases. Ad targeting. Ads are annoying and we live with them, but it's in the interest of those ad companies to get ads in front of us that we'll actually click on. How do they do that? They collect large amounts of data when we're on social media sites to see what we're doing, or large amounts of data when we're actually shopping.
It's common now, when you go into a brick-and-mortar retail store, for certain store chains to make use of the behavioral data they can get from various sources, whether it's your phone, your location activity, or other types of sensors they might have, so that they can put ads in front of you that are gonna be compelling. Transactional analysis. We talked about relational databases as being the stores for your current transactions. What about your history of transactions? What if you could analyze the history of all transactions across all locations, at the click of a button? If you are, for example, some kind of coffee shop, that might help you predict what you should order, so that you would have the appropriate supplies to serve your customers.
You'd be able to look at all transactions and then determine what customers purchased in a certain time period in similar locations. Again, behavioral data resulting in better business decisions. Threat analysis. This is very similar to risk modeling, and it goes along with the credit card example that I talked about. Search quality. We've got a lot of search engines out there. Of course, this technology came from the premier search engine, but Google has competitors. How do competing search engines differentiate themselves? Well, they can capture your transaction, in other words, what you searched for.
But they could also capture your behavior: what you started typing but didn't press search on, for example. This is something Facebook has been somewhat notorious for for a while now, capturing all of your keystrokes so that they can understand not only when you post, but also when you think about posting and then abandon that post, and try to figure out why you abandoned it, because, of course, it's in their interest for you to interact with their environment as much as possible. Speaking of Facebook, as I mentioned in a previous movie, Facebook is the largest known user of Hadoop, or at least the largest public user of Hadoop.
There are many other businesses out there that use Hadoop, and here are some that have gone public about it. Yahoo, of course, is a huge user of Hadoop. In fact, as we look at the distributions, the company Hortonworks was founded by former Yahoo employees, maintains a very close connection to Yahoo, and tests all of its distributions on Yahoo data sets, which is kind of interesting. Amazon, of course, is a huge user of Hadoop; we talked about the recommendation engine. eBay is a huge user, for similar reasons. American Airlines has also publicly announced its use of Hadoop, collecting behavioral data on their flights.
The New York Times, the Federal Reserve Board, IBM, and the Orbitz travel company are others. And there are literally hundreds of companies making use of Hadoop, augmenting their line-of-business data with behavioral data to make better decisions.
- Understanding Hadoop core components: HDFS and MapReduce
- Setting up your Hadoop development environment
- Working with the Hadoop file system
- Running and tracking Hadoop jobs
- Tuning MapReduce
- Understanding Hive and HBase
- Exploring Pig tools
- Building workflows
- Using other libraries, such as Impala, Mahout, and Storm
- Understanding Spark
- Visualizing Hadoop output