Learn how complex and expansive big data really is. The media often talks about "big data" as if it were a singular thing, like a spreadsheet. In truth, these systems are distributed computing platforms made up of potentially hundreds of subsystems that make them run.
- [Instructor] Often when people refer to Big Data, they talk of it as if it were a singular thing. And that's just not true. If we look at just one of the most popular frameworks, Hadoop, we can really see how Big Data is an entire framework of things. In fact, if you look at how Hadoop is defined and described, we find that Hadoop is a framework for distributed processing of large data sets across clusters of computers using simple programming models.
So nothing in that definition describes a singular thing, unless you consider a framework a singular thing. But by its nature, a framework is a combination of things. If we take a look at the Apache Hadoop Ecosystem, what we'll find is a number of systems that all work together to achieve this goal of a distributed computing platform. If you want to implement Hadoop, you'll need to understand most of these platforms at least at a basic level. For instance, to ingest data there are Flume and Sqoop.
These are agents that run as separate processes, watching and waiting for data to come in, and loading that data into the Hadoop Distributed File System, known as HDFS. Now the data in HDFS is replicated into multiple copies across many different nodes. And what that means is that the data is reliable: if any one of those nodes were to go down, you would still have at least two other copies. To actually lose data with Hadoop, you'd have to have a complete failure of every node holding a copy, which is why you spread your nodes across data centers and across regions, so you essentially never lose data.
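That replication idea can be sketched in plain Python. This is a toy model, not the real HDFS implementation; the node names, the placement rule, and the `place_replicas` helper are all invented for illustration, though the default replication factor of 3 matches HDFS.

```python
# Toy model of HDFS-style block replication (illustrative only).
# Real HDFS defaults to a replication factor of 3.
REPLICATION_FACTOR = 3

def place_replicas(block_id, nodes, factor=REPLICATION_FACTOR):
    """Assign a block to `factor` distinct nodes, round-robin by block id."""
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(factor)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_replicas(block_id=7, nodes=nodes)
print(placement)  # three distinct nodes each hold a copy of block 7

# Losing any single node still leaves two live copies:
surviving = [n for n in placement if n != placement[0]]
print(len(surviving))  # 2
```

The point of the sketch is simply that every block lives on several distinct nodes, so one node failing never takes the only copy with it.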
Any problems with your network, any partitions that happen, all of those things are fine, because HDFS distributes the data across multiple nodes, so you're always protected. Now, with the data in HDFS, it's kind of tough to get to. For a long time, the main way people have worked with data in HDFS has been the MapReduce framework, a Java-based programming framework for writing jobs that process this data. Well, analysts and data scientists and a lot of other folks really had a hard time with this.
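Real Hadoop MapReduce jobs are written in Java, but the two-phase model itself is small enough to sketch in Python. This is the classic word-count example; the function names here are illustrative, and the real framework also shuffles the intermediate pairs between nodes in the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data is not one thing", "big data is a framework"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'is': 2, 'not': 1, 'one': 1, 'thing': 1, 'a': 1, 'framework': 1}
```

You can see why this felt heavy for analysts: even counting words means writing map and reduce code, which is exactly the friction Hive was created to remove.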
So Hive was created. Hive is a platform that runs on top of HDFS and allows you to write a SQL query, SQL being a query language that's really familiar to data scientists and data analysts. So if you know SQL and your Hadoop Ecosystem is running Hive, you can issue SQL queries against it and retrieve results. With this capability, other tools can do this for you too, such as Tableau and R, which are really popular analytics and statistical processing packages.
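Hive's dialect (HiveQL) looks much like standard SQL. To show how familiar the syntax is without a Hadoop cluster, here is a comparable query run against an in-memory SQLite database; the `page_views` table and its columns are invented for this example, and Hive would run the same kind of query over files in HDFS instead.

```python
import sqlite3

# Stand-in for a Hive table: SQLite is used here only to demonstrate
# the SQL syntax an analyst would write; it is not Hive itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 120), ("about", 30), ("home", 80)])

# The same shape of query an analyst would issue against Hive:
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 30), ('home', 200)]
```

The draw for analysts is that nothing in that query reveals whether the data lives in one file or across a thousand nodes.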
Now, if you need to treat Hadoop like a relational database, you'll want to run HBase, which simulates the ability to execute a transaction against your data, and then stores that data in HDFS behind the scenes. All of this, plus many more options, exists, not to mention the thousands of platform vendors and platform-as-a-service options from companies like Microsoft, Amazon, and Google, just to stand up your Big Data platform. So when you think about Big Data, know that there are countless varieties and platforms inside that one term, and here we're only really looking at one section of them, the Hadoop Ecosystem.
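HBase's data model can be sketched as rows addressed by a row key, with each value stored under a `family:qualifier` column name. The class, row keys, and column names below are all invented for illustration; real HBase persists these cells to HDFS behind the scenes rather than keeping them in a dictionary.

```python
# Toy sketch of HBase's row-key / column-family data model.
# Not the real HBase client API; names here are illustrative.
class ToyHBaseTable:
    def __init__(self):
        self.rows = {}  # row_key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Store a cell value under the given row key and column."""
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        """Fetch a cell value, or None if the row or column is absent."""
        return self.rows.get(row_key, {}).get(column)

table = ToyHBaseTable()
table.put("user#42", "info:name", "Ada")
table.put("user#42", "info:city", "London")
print(table.get("user#42", "info:name"))  # Ada
```

The design point is the lookup path: everything is addressed by row key first, which is what makes random reads and writes fast even when the underlying storage is HDFS.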
There are many more ecosystems like Hadoop that fall into this Big Data category. So the myth that Big Data is one thing is really just false.