Join Dan Sullivan for an in-depth discussion in this video, The limits of relational databases, part of Advanced NoSQL for Data Science.
- [Instructor] We're here to talk about NoSQL and its role in Data Science. But before we get too far into digging into why we're using NoSQL, it's really important to look at what's wrong with SQL: what are the limits of relational databases, and why should any of us even bother turning our attention to NoSQL databases? Relational databases have been around since at least the 1970s, and they work really well, and there are several reasons for that. The data's fairly well structured: records are organized into tables. Tables consist of rows, which are identified by unique keys, or primary keys.
We can organize our data into tables and then join them together, or link them together, so we don't have to lump all of our data into one large structure. Another important feature is support for something called transactions. And there's an acronym we use for that: ACID. And I'd like to dig into that a little bit. Atomicity means that all the steps of a multi-step operation, like transferring funds from your checking account to your savings account, have to occur for the transaction to succeed; if any step fails, none of them takes effect. Consistency means that the database is always kept in a consistent state.
It follows all of the rules and constraints that you've specified for your database. Isolation means that transactions don't interfere with each other. And finally, Durability means that data is stored persistently, so that you don't have to worry about losing your data if power is lost or your server crashes. Now, those were some of the main features of relational databases that are important, but another one that is also important is that the data models are normalized, which means we structure the data in ways that minimize the chance of introducing mistakes, or anomalies, as they're known.
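To make the atomicity property described above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are assumptions made up for illustration; they are not from the course. The credit is applied first on purpose, so that a failed debit shows the earlier step being rolled back.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move funds between two accounts; both updates commit or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as part of the same transaction.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical schema: balances may never go negative (a consistency rule).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("checking", 100), ("savings", 50)],
)
conn.commit()

ok = transfer(conn, "checking", "savings", 30)       # enough funds
failed = transfer(conn, "checking", "savings", 500)  # would overdraw checking
final = balances(conn)
print(ok, failed, final)
```

In this sketch the second transfer fails the balance check, so the credit to savings is undone as well and the balances stay where the first transfer left them. The CHECK constraint also doubles as a small example of the consistency rules mentioned above.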
Now, relational databases have a lot of advantages. They're widely used across industries and scientific domains. The normalized data minimizes the chances of introducing data problems. There's a really effective query language called SQL, which is useful in a lot of different domains. And also, relational databases are widely supported by other application development tools and programming languages. Now, there are some disadvantages. For example, in a relational database our schema is fixed, that is, our data structures are fixed, and we have to know what our schema's going to be before we start building our programs.
Another disadvantage is that joins are computationally costly. We're limited in the kinds of data structures that we can store in tables. And finally, relational databases are difficult to scale, and this is becoming a problem as data volumes become bigger and bigger. Now, how can we work around this? There are a few ways. Denormalization, for example, is one technique that allows us to avoid joins. And the way that we do that is by expanding the number of columns in a single table and organizing those tables so that when a query is executed, it only needs to read a single table.
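As a rough illustration of denormalization, the sketch below uses Python's sqlite3 module to answer the same question two ways: with a join across two normalized tables, and with a single read from a wider table that repeats the customer columns on every order row. The table names, columns, and sample rows are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: each customer's details are stored exactly once.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Denormalized: customer columns are copied onto every order row.
    CREATE TABLE orders_wide AS
        SELECT o.id AS order_id, c.name AS customer_name,
               c.city AS customer_city, o.total
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# The normalized read needs a join at query time.
normalized = conn.execute(
    "SELECT o.id, c.name, c.city, o.total "
    "FROM orders o JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# The denormalized read touches a single table.
denormalized = conn.execute(
    "SELECT order_id, customer_name, customer_city, total "
    "FROM orders_wide ORDER BY order_id"
).fetchall()
same_rows = normalized == denormalized

# The cost: a partial update leaves two cities recorded for one customer.
conn.execute("UPDATE orders_wide SET customer_city = 'Paris' WHERE order_id = 10")
cities_for_ada = {
    row[0]
    for row in conn.execute(
        "SELECT customer_city FROM orders_wide WHERE customer_name = 'Ada'"
    )
}
print(same_rows, cities_for_ada)
```

Both reads return the same rows, but the wide table now holds two different cities for the same customer, which is the kind of data anomaly that normalization is meant to prevent.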
Denormalization does improve read performance, but it also introduces the possibility of data anomalies. Sharding is another technique. This is a way of breaking up a database and storing pieces of the database on different servers. This has the advantage that we can now query subsets of the data, so we don't have to query across the entire data set. This improves both read and write performance, but it is more complex to organize and manage. Replication is another technique. In this case, we make copies of data that's stored in tables and indexes, and we store those copies on different servers so that the servers can be used to respond to different queries.
Now, replication really improves read performance, but it introduces the possibility of inconsistencies between the copies, so that's something we need to manage. So, what we've seen is that relational databases are really quite useful and have many features, but there are some disadvantages that we can sometimes work around. But these workarounds also come with some disadvantages. So what we're doing now, as we move on to NoSQL, is looking for a database system that naturally allows us to denormalize the data. We want something that will support scalability.
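The sharding workaround discussed above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: each "server" is just an in-memory dictionary, and the ShardedStore class and its hash-modulo routing rule are hypothetical choices for illustration, not how any particular database implements sharding.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys across several shards."""

    def __init__(self, num_shards):
        # Each dictionary stands in for a separate server.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Use a stable hash so the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        # Only one shard is consulted, not the entire data set.
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(num_shards=4)
for i in range(100):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:42"))
print([len(shard) for shard in store.shards])
```

Reads and writes for a given key touch only one shard, which is where the performance benefit comes from. The management complexity shows up as soon as the number of shards changes, because with this simple modulo rule most keys would map to a different shard and have to be moved.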
We're also willing to accept some trade-offs in exchange for these benefits. For example, relaxing ACID constraints. ACID is very important to many application areas, but not all. And NoSQL is a good option for those applications that don't require full ACID compliance. And also, NoSQL databases support sharding, which makes it very easy to scale and improve read and write performance. And finally, perhaps most importantly, NoSQL databases offer us new ways to query our data. This is especially important in Data Science, where we're dealing with large, complex data sets.
Sometimes simple queries across tables are insufficient, but with NoSQL databases, we have new ways to find patterns in document databases and hierarchical structures, and even to navigate and traverse graph structures. So NoSQL databases offer us new ways to query more complex data structures than we're able to in relational databases. And that's one of the key drivers for using NoSQL in Data Science.
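To give a feel for the two query styles mentioned above, here is a small sketch in plain Python. Nested dictionaries stand in for documents and an adjacency dictionary stands in for a graph. The helper names (get_path, find, shortest_path) and the sample data are assumptions for illustration only; they are not the query APIs of MongoDB, Neo4j, or any other database.

```python
from collections import deque

def get_path(doc, dotted_path, default=None):
    """Follow a dotted path such as 'address.city' into a nested document."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current

def find(docs, dotted_path, value):
    """Return the documents whose nested field equals the given value."""
    return [d for d in docs if get_path(d, dotted_path) == value]

def shortest_path(graph, start, goal):
    """Breadth-first traversal returning one shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Documents need not share a schema: the third one has no address at all.
docs = [
    {"name": "Ada", "address": {"city": "London"}},
    {"name": "Grace", "address": {"city": "New York"}},
    {"name": "Alan"},
]
londoners = find(docs, "address.city", "London")

# A tiny "follows" graph stored as adjacency lists.
follows = {"Ada": ["Grace"], "Grace": ["Alan"], "Alan": []}
route = shortest_path(follows, "Ada", "Alan")
print(londoners, route)
```

The document lookup reaches into a hierarchy without a join, and the traversal follows relationships hop by hop, which is awkward to express as a fixed number of table joins.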
The course begins with an introduction to NoSQL, and then delves into the specifics of document, wide-column, and graph databases. Learn key details for performing data preparation, exploration, and extraction for each type of NoSQL database. Review case studies that show how to use various NoSQL databases with popular data science tools, including the document database MongoDB, the wide-column database Cassandra, and the graph database Neo4j.
- NoSQL compared to traditional relational databases
- Performing common data science tasks
- Preparing data with document databases
- Manipulating data in NoSQL
- Preparing, exploring, extracting, and model building
- Working with document, wide-column, and graph databases
- Reviewing case studies using MongoDB, Cassandra, and Neo4j