From the course: Learning Data Science: Understanding the Basics

Let go of the past with NoSQL

Relational databases still dominate many organizations. They're the backbone for online transactions, and many still see data warehouses as the cornerstone of enterprise analytics. Relational databases have had a good run, but newer applications have challenges that exceed the relational model. Often, a data science team needs a more flexible way to store its data.

Remember that relational databases rely on a schema. You need to know a lot about your data before you put it into the database, so you need to plan ahead. You have to know whether your data is an audio file, a text file, or even video. Then you organize those fields into tables. Finally, the tables need relationships.

Think about our website that sells running shoes. Imagine that you're a customer and you've found a shoe. The website joins your shoe to an address, and now you're ready to check out on the order page. That one page needs access to four different database tables: the shoe table, the customer table, the address table, and finally, the shipping table. That's a lot of work for a transactional database. The harder your database works, the slower it makes your website. It's also difficult to speed things up. Do you buy a bigger server, split your tables across several servers, or run several servers that synchronize across the network? When you're talking about really large websites, these options start to seem cumbersome.

Now imagine a database that stores everything on that checkout page as one transaction. You create a single record for your shoe, your customer, their address, and shipping, all in one transaction. You don't have to worry about splitting up the data into tables. You don't have to worry about creating relationships. Just dump it in and you're done. That's the idea behind NoSQL. The name caught on as a Twitter hashtag for developers who wanted to move beyond relational databases. It's actually not a slam against SQL.
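To make the checkout comparison concrete, here is a minimal sketch in Python. It is only an illustration: the table names, column names, and sample values are assumptions rather than anything from the course, SQLite stands in for the relational database, and a plain dictionary stands in for a NoSQL document store.

```python
import json
import sqlite3

# Relational version: the checkout page has to join four tables.
# All table names, column names, and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shoe     (id INTEGER PRIMARY KEY, model TEXT, size REAL);
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE address  (id INTEGER PRIMARY KEY, customer_id INTEGER,
                           street TEXT, city TEXT);
    CREATE TABLE shipping (id INTEGER PRIMARY KEY, shoe_id INTEGER,
                           customer_id INTEGER, address_id INTEGER, method TEXT);
    INSERT INTO shoe     VALUES (1, 'Trail Runner', 10.5);
    INSERT INTO customer VALUES (1, 'Pat Example');
    INSERT INTO address  VALUES (1, 1, '1 Main St', 'Springfield');
    INSERT INTO shipping VALUES (1, 1, 1, 1, 'ground');
""")
relational_row = conn.execute("""
    SELECT shoe.model, customer.name, address.city, shipping.method
    FROM shipping
    JOIN shoe     ON shoe.id = shipping.shoe_id
    JOIN customer ON customer.id = shipping.customer_id
    JOIN address  ON address.id = shipping.address_id
    WHERE shipping.id = 1
""").fetchone()

# Aggregate version: the whole checkout is one self-contained record.
checkout = {
    "shoe": {"model": "Trail Runner", "size": 10.5},
    "customer": {"name": "Pat Example"},
    "address": {"street": "1 Main St", "city": "Springfield"},
    "shipping": {"method": "ground"},
}

# A plain dict keyed by order ID stands in for a document store.
document_store = {"order-1": json.dumps(checkout)}
aggregate = json.loads(document_store["order-1"])
```

The relational query needs three joins to rebuild the page, while the aggregate comes back with a single lookup by key. The trade-off is that the aggregate repeats customer and address data in every order instead of storing it once.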
In fact, NoSQL doesn't have very much to do with SQL at all. It's about the limitations of the relational database model. In general, a NoSQL database should be nonrelational, schemaless, cluster-friendly, and hopefully, open source. All of these qualities should appeal to a data science team.

When a database isn't relational, it's easier to change and simpler to use. There doesn't have to be a big difference between how your web application works and the way you store data in your database. You won't have to go through the ugly process of splitting existing tables into smaller ones, commonly referred to as normalizing your database, and then joining them back together to create different views.

Without a schema, you don't have to worry about knowing everything up front. Let's say your running shoe website was bought by a larger company. This company wants to add your customers to its frequent buyer program. With a relational database, this is a serious architectural challenge. Should you identify the frequent buyer in the customer table? Maybe you need to create a whole new table of frequent buyer numbers. Can a customer have more than one buyer number? Can two customers share the same buyer number? All of this needs to be sorted out. You have to rework the database and figure out how to correct for missing data. Without a schema, adding new fields becomes almost trivial. You can just store everything as one transaction. If the customer has a frequent buyer number, then it's loaded as part of the transaction. If they don't, then the field doesn't exist.

Finally, a NoSQL database should be cluster-friendly. You should be able to store the data across several hundred, or even several thousand, database servers. In a NoSQL database, the records saved in a transaction are called an aggregate. These aggregates hold all of the data: the shoe, the customer, the address, and the shipping information. These aggregates are easier to synchronize across many database servers. Many of the servers work in clusters.
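The frequent buyer example above can be sketched in a few lines of Python. This is a hedged illustration, not a real NoSQL client: the field name `frequent_buyer_number`, the helper function, and the sample customers are all hypothetical.

```python
# Two checkout aggregates. Only the second customer has joined the
# (hypothetical) frequent buyer program, so only that record has the field.
orders = [
    {
        "customer": {"name": "Pat Example"},
        "shoe": {"model": "Trail Runner"},
    },
    {
        "customer": {"name": "Sam Sample", "frequent_buyer_number": "FB-1001"},
        "shoe": {"model": "Road Racer"},
    },
]


def frequent_buyer_number(order):
    """Return the buyer number if the field exists, otherwise None."""
    return order["customer"].get("frequent_buyer_number")


numbers = [frequent_buyer_number(order) for order in orders]
```

No table was altered and no existing record was backfilled. The cost is that the code reading the data now has to handle records where the field is missing.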
Working in clusters lets the servers synchronize and then send out their updates to other clusters. The word cluster should sound familiar. It's also how Hadoop works with its data sets. In fact, HBase, an open source NoSQL database, is built on top of Hadoop. When you're working on a data science team, you're almost certainly going to run into NoSQL. For many organizations, this is the best way to deal with large data sets. Because of its simpler design, it's also easier for developers to create web applications. These applications can quickly grow to an enterprise scale.
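One reason aggregates are cluster-friendly is that each one can live whole on a single server, chosen by its key. The sketch below assumes a simple hash-and-modulo placement scheme, which the course does not describe. The server names and the `save` and `load` helpers are hypothetical, and production systems typically add replication and a more careful placement strategy.

```python
import hashlib

SERVERS = ["db-0", "db-1", "db-2"]  # hypothetical server names

# Each "server" is just an in-memory dict in this sketch.
cluster = {name: {} for name in SERVERS}


def server_for(order_id, servers=SERVERS):
    """Pick a server from a stable hash of the aggregate's key."""
    digest = hashlib.sha256(order_id.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def save(order_id, aggregate):
    """Store the whole aggregate on the one server its key maps to."""
    cluster[server_for(order_id)][order_id] = aggregate


def load(order_id):
    """Fetch the aggregate back from the same server, with no joins."""
    return cluster[server_for(order_id)][order_id]


save("order-1", {"shoe": {"model": "Trail Runner"}, "customer": {"name": "Pat Example"}})
save("order-2", {"shoe": {"model": "Road Racer"}, "customer": {"name": "Sam Sample"}})
```

Because a checkout never spans servers, reading one back takes a single lookup on a single machine. A relational design split across servers would have to gather rows from several tables, possibly on different machines.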
