Understand why big data systems were invented and why it can be difficult for them to perform tasks quickly. Being "big" means that the data and processing load are widely distributed. These systems were designed primarily with scale and reliability in mind, not performance.
- [Narrator] One thing about Big Data that always gives me a little chuckle is when people think of it as something that is blazing fast, letting us do all these amazing things with it. I don't think people really understand that Big Data wasn't originally designed for speed. Let's go back and take a look at the description of Hadoop. Hadoop is a framework for distributed processing of large data sets, and I mean very large data sets. Companies like Facebook and Google have hundreds of petabytes of data in their clusters.
Now this information is distributed across clusters of computers using simple programming models. So the deal there is that the data is spread out across a very wide array of machines, which means you have to bring that data back together whenever you want to ask a question or retrieve a piece of it. Let's take a look at some of the speed challenges here. The first challenge with the speed of Big Data systems, and this one is related to Hive specifically, is its dependency on MapReduce.
Hive is arguably the most widely adopted Big Data analytics interface, and yet the majority of implementations make it essentially an abstraction on top of MapReduce. What that means is that you issue a SQL query and it gets translated into a MapReduce job. Why is that a limitation? Well, MapReduce is batch oriented: it's a slow process that's meant to operate on the entire batch at a single time. So if you're working with a large data set or a really complex query, it's going to process that whole thing in one fell swoop, which just means it's slow.
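To make that concrete, here's a minimal sketch in plain Python, not Hive's actual query planner, of how a simple GROUP BY query decomposes into MapReduce's map, shuffle, and reduce phases. The table and sample data are invented for illustration. Notice that every record must pass through all three phases before the first result appears, which is the batch-oriented behavior we're talking about.

```python
from itertools import groupby
from operator import itemgetter

# Conceptual sketch of how a query like
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
# maps onto MapReduce. Input records are invented for illustration.
visits = ["home", "cart", "home", "checkout", "home"]

# Map phase: emit a (key, 1) pair for every input record.
mapped = [(page, 1) for page in visits]

# Shuffle phase: sort/partition so identical keys end up together.
# On a real cluster this means moving data between nodes over the network.
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce phase: sum the counts for each key.
for page, pairs in groupby(shuffled, key=itemgetter(0)):
    print(page, sum(count for _, count in pairs))
# cart 1
# checkout 1
# home 3
```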
Now you can run Hive on different execution engines that don't translate queries into MapReduce, and those have shown some improvements, but the large majority of implementations out there don't do that. They're still dependent upon MapReduce. The next challenge is simply the fact that the data is distributed. By its nature, this system is designed to be scalable and resilient, so that a failure in your nodes doesn't cause any loss of data. Think about a Big Data system that, let's say, runs on 10,000 hard drives.
10,000 disks that are spread across different geographic locations. If 1% of those, or even a fraction of a percent, are failing at any given time, that means you will always have some number of hard disks that are bad, malfunctioning, or otherwise causing errors. So to protect against that when you want to scale to this size, you have to keep multiple copies of the data. That means that when you want to retrieve that data, run a query against it, or even just access a piece of it, you may have to look in a lot of different places, and perhaps reach far across the country or even across the world to retrieve those bits, which decreases the speed at which the system performs.
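Here's a toy model of that tradeoff in Python. This is not the actual read path of HDFS or any particular system, and the replica locations, latencies, and failure rate are invented numbers, but it shows the idea: a block is stored on several replicas, and a read falls back to ever more distant copies when nearby disks are down.

```python
import random

# Toy model (numbers invented): a data block stored on three replicas,
# similar in spirit to HDFS's default 3-way replication.
replicas = [
    {"location": "same rack",        "latency_ms": 2,   "healthy": True},
    {"location": "same data center", "latency_ms": 10,  "healthy": True},
    {"location": "other region",     "latency_ms": 120, "healthy": True},
]

def read_block(replicas):
    # Simulate the fraction of disks that are failing at any given time.
    for r in replicas:
        r["healthy"] = random.random() > 0.01
    # Try replicas in order of increasing latency; a remote read is
    # dramatically slower than a local one.
    for r in sorted(replicas, key=lambda r: r["latency_ms"]):
        if r["healthy"]:
            return r["location"], r["latency_ms"]
    raise IOError("all replicas unavailable")

print(read_block(replicas))  # usually ('same rack', 2), sometimes worse
```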
Related to that as well is the low-cost disk that gets used. To keep costs down while handling such a high volume of data, these systems are often built on really cheap, low-powered hard disks. That's great when the data is cold, mostly just sitting there, and we don't need to access it frequently. But when you want to use it for data analytics or data science, it can be a real performance killer. Getting back to the idea of why these systems were created: when you want to perform analytical functions like a count distinct or, let's say, calculate a moving average, the system really has a hard time.
And often these systems don't even support those types of queries at all, meaning you either need to move the data into a different system, or you need to write some custom code to execute them. All of these, again, are things that keep Big Data systems from performing quickly.
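As an example of that custom code, here's a small Python sketch of a single-pass moving average, the kind of thing you might write after exporting the data. The sample values are invented. Notice that it depends on seeing the values in order, with overlapping windows, which is exactly the access pattern that data scattered across thousands of disks makes expensive.

```python
from collections import deque

def moving_average(values, window=3):
    # Keep a sliding buffer of the last `window` values and a running
    # total, so each new value costs O(1) to process.
    buf, total = deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()
        if len(buf) == window:
            yield total / window

# Invented sample data.
print(list(moving_average([10, 12, 11, 15, 14, 13], window=3)))
# -> 11.0, 12.67, 13.33, 14.0 (rounded)
```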