It's important to understand where platforms come from and what they are designed to do. In this video, explore the origins of HBase and why it was created.
- [Presenter] Let's get started here talking about where HBase comes from. We can't really talk about HBase without talking about Google and its contributions to this platform. Google published papers describing its internal storage layer, the Google File System, and its programming model for processing that data, known as MapReduce. Yahoo then championed an open source implementation of those ideas, called Hadoop. So with what we now call HDFS, or the Hadoop Distributed File System, and MapReduce, big data was born.
And companies started to adopt this new way of thinking about how to work with vast quantities of data. HDFS works pretty much the same way as Google's original Google File System, and MapReduce soon became the de facto standard for working with large quantities of data in these file systems. What HDFS does is spread data across a bunch of different nodes in its cluster.
So in this way, it is distributed. Now, the data it stores doesn't have a schema like a regular database does; no schema is defined at all. In fact, it's just documents, it's files. And because the data is split and replicated across many different nodes in the cluster, it is fault tolerant. In fact, by design, Hadoop expects that some portion of the data nodes holding these files will be failing at any given time.
That's because the design is meant to operate at such a large scale that even a 1% failure rate would mean some portion of the system is always offline. Now, HDFS is storage only, so there really isn't much more to it than distributing and storing data in a way that assures you won't lose anything, even when parts of the system are having problems. MapReduce works by identifying where the bits of data live in your cluster and what operations need to be performed on them, and then coming up with a plan to execute those operations.
So the nature of MapReduce is that it's all focused on data processing. To write MapReduce jobs, you need to know Java, because it was the prevailing language at the time and in many ways still is across all kinds of platforms. When you execute a MapReduce job, it finds the data as I mentioned, comes up with a list of tasks it needs to run, and then executes them using what are called mappers and reducers. Now, one of the big downsides of this method is that it's batch oriented.
If you need only one row from one logical table, or one document in HDFS, it will need to read that entire file just to get that one bit of data. In fact, you shouldn't really even think of Hadoop as having rows like a regular database. Rather, it's like having files on your computer's hard drive. If you just want to find one order from a customer history, and all those orders live in an Excel spreadsheet, you need to open up the entire spreadsheet just to find that one row. So you can think of HDFS as being very batch oriented, just as it would be if you were trying to look for one line in a Word document on your computer.
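To see what working at that level looks like, here is a minimal sketch of the classic word count job written against the Hadoop MapReduce Java API. The class names are just illustrative, and the input and output HDFS paths are assumed to come in as command-line arguments.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs out where the data blocks live, emits (word, 1) per word.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sums the counts for each word collected from all mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that even this trivial job reads every file under the input path in full. That is exactly the batch orientation we just described.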
So this leaves us with the big question: well, where's my database? I thought Hadoop was a data platform, and data platforms are built on databases, so where is it? Well, what we've identified is that Hadoop has some limitations. One is that its data is unstructured or semi-structured, meaning we can't really depend on getting a specific type of data back when we ask for a bit of information. And because we pull in entire files when we read data from Hadoop, we can't just grab little bits of data here and there; we don't have random access.
So if we want to look at one little bit of data, or do some analytics on one segment of data, we have to bring in so much more information just to make that possible that it really doesn't work. This is where the batch-oriented nature shows itself: it's slow. Another thing Hadoop lacks is the ability to support transactions. And transactions are key to many database applications. Think about your bank: if you log in and make a trade, or move money from one account to another, you can't later have that transaction disappear or be represented incorrectly.
Because of the way Hadoop works, spreading everything out and processing it in batches, it doesn't really support applications that need consistent transactions while users are working with them. This is where HBase comes in, modeled on yet another Google contribution, the Bigtable paper. We often call HBase the Hadoop database. Unlike Hadoop or HDFS, HBase has a schema, which means we can depend on the types of data we're going to get back when we ask for them from our database.
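To make the idea of a schema concrete, here is a rough sketch of defining a table with the HBase 2.x Java client. The customers table and its orders column family are hypothetical names for illustration; in HBase, the schema is essentially the table name plus its column families.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateCustomersTable {
  public static void main(String[] args) throws Exception {
    // Connect using whatever hbase-site.xml is on the classpath.
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      // The schema declares the table and its column families up front;
      // individual columns within a family can still vary from row to row.
      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("customers"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("orders"))
              .build());
    }
  }
}
```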
Now, it distributes the data just like everything else, but it also has an in-memory component that allows us to read information very quickly. Just like a real database, it gives us the ability to isolate very fine-grained bits of data and work with just them: random bits spread throughout many different tables, which could themselves be spread across thousands and thousands of nodes in our cluster. And unlike Hadoop, it also allows us to perform what are known as CRUD operations.
That is: Create, adding a new record. Read, pulling information into our application or process. Update, actually changing a value, which is very useful. And Delete, where we can remove that data from our system entirely. In addition, HBase leverages all of the scalability and reliability that HDFS gives you, because that's where it actually stores the data. So you get the benefits of Hadoop on top of the typical benefits you see with a relational database.
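To make those four operations concrete, here is a minimal sketch using the HBase Java client. The table, row key, column family, and column names continue the hypothetical customers example from above.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CrudExample {
  public static void main(String[] args) throws Exception {
    byte[] family = Bytes.toBytes("orders");      // hypothetical column family
    byte[] column = Bytes.toBytes("total");       // hypothetical column
    byte[] rowKey = Bytes.toBytes("customer-42"); // hypothetical row key

    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("customers"))) {

      // Create: write a new cell under this row key.
      table.put(new Put(rowKey).addColumn(family, column, Bytes.toBytes("19.99")));

      // Read: random access to just this one row, no full-file scan required.
      Result result = table.get(new Get(rowKey));
      System.out.println(Bytes.toString(result.getValue(family, column)));

      // Update: in HBase, an update is simply another put to the same cell.
      table.put(new Put(rowKey).addColumn(family, column, Bytes.toBytes("24.99")));

      // Delete: remove the row from the system entirely.
      table.delete(new Delete(rowKey));
    }
  }
}
```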
This course can help professionals further their career in big data analytics using HBase and the Hadoop framework. Learn to describe HBase in the context of the NoSQL landscape, build simple architecture models, and explore basic HBase commands. Instructor Ben Sullins shows how all the concepts fit together, resulting in the kind of distributed big data storage you need for scalable, enterprise-level applications.
- What is HBase?
- Who uses HBase?
- Comparing HBase and an RDBMS
- How data is stored in HBase
- Data model operations
- HBase architecture
- Creating tables
- Querying data