Understand that—unlike relational databases that are modeled based on rules of normalization—tables in Cassandra are modeled to answer specific queries.
[Instructor] Queries drive data model design in Cassandra. If you're familiar with data modeling for relational databases, you're probably used to thinking first about the entities and then about their relationships. For example, if you're designing a database to collect data on the performance and utilization of servers, you'd probably start by listing entities like the physical servers, the virtual machines that run on those servers, as well as applications, and the processes that implement those applications.
These entities have relationships. Virtual machines run on physical servers. Applications execute in processes, and there can be multiple instances of processes running the same application. When we design data models in Cassandra, we don't start with entities. We start with the queries we want to run. What is it that we want to report on? In our example, we are concerned about application performance, so we might want a report on servers that are running with high CPU utilization.
In a normalized model, we would have a table for servers and a table for server metrics, such as this. In Cassandra, we don't necessarily use multiple tables. We often use a single table. In this case, let's call it servers by CPU utilization. This table would contain attributes like the server name, operating system data, and then physical machine characteristics like the memory size and storage size, as well as configuration parameters like the host name and IP address.
But our table would also have data about performance metrics at different points in time. So this would include things like a timestamp, which identifies the time period, as well as measures of CPU utilization, the number of processes that we're running, IO operations, and the free memory at some point in time. Now this may seem unusual to a relational data modeler. We're mixing information about a server entity along with many different sets of performance metrics.
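To make that row shape concrete, here is a minimal Python sketch of what one row of such a table might hold and how the report question could be answered from it. The class name, field names, and sample values are all hypothetical, chosen only for illustration. The sketch models the shape of the data, not how Cassandra stores, partitions, or filters it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ServerCpuRow:
    """One row of a hypothetical servers-by-CPU-utilization table."""
    # Server attributes, repeated on every row for the same server.
    server_name: str
    operating_system: str
    memory_gb: int
    storage_gb: int
    host_name: str
    ip_address: str
    # Performance metrics for one point in time.
    sampled_at: datetime
    cpu_utilization_pct: float
    process_count: int
    io_operations: int
    free_memory_gb: float


def high_cpu_servers(rows, threshold_pct):
    """Return the sorted names of servers with a sample at or above the threshold."""
    return sorted({r.server_name for r in rows if r.cpu_utilization_pct >= threshold_pct})


# Illustrative sample data; every value here is made up.
rows = [
    ServerCpuRow("web-01", "Linux", 64, 500, "web-01.example.com", "10.0.0.11",
                 datetime(2024, 1, 1, 12, 0), 91.5, 212, 1800, 3.2),
    ServerCpuRow("web-01", "Linux", 64, 500, "web-01.example.com", "10.0.0.11",
                 datetime(2024, 1, 1, 12, 5), 40.0, 198, 950, 20.1),
    ServerCpuRow("db-01", "Linux", 128, 2000, "db-01.example.com", "10.0.0.21",
                 datetime(2024, 1, 1, 12, 0), 95.0, 87, 5200, 8.4),
    ServerCpuRow("cache-01", "Linux", 32, 100, "cache-01.example.com", "10.0.0.31",
                 datetime(2024, 1, 1, 12, 0), 35.0, 45, 300, 12.0),
]

print(high_cpu_servers(rows, 90.0))
```

Notice that both web-01 rows carry the same operating system, memory, and address values. That repetition is the mixing of entity data and metrics described above.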
Shouldn't we use a one-to-many relationship? Not in Cassandra. Cassandra is optimized for fast writes and fast reads over very large volumes of data. To achieve these levels of performance, Cassandra does things differently than relational databases. Two big differences you'll notice right away are that there are no joins and that there's a lot of duplication of data. Joins can be time- and resource-consuming operations. Cassandra avoids them by duplicating data in the rows of a table.
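The trade between joining at read time and duplicating at write time can be sketched in plain Python. This is an illustration only, using in-memory dictionaries and lists. The names and values are hypothetical, and nothing here calls Cassandra or a relational database.

```python
# Normalized shape: two collections linked by server_id.
servers = {
    1: {"server_name": "web-01", "ip_address": "10.0.0.11"},
    2: {"server_name": "db-01", "ip_address": "10.0.0.21"},
}
server_metrics = [
    {"server_id": 1, "cpu_utilization_pct": 91.5},
    {"server_id": 2, "cpu_utilization_pct": 42.0},
    {"server_id": 1, "cpu_utilization_pct": 88.0},
]


def report_with_join(threshold_pct):
    """Relational style: look up each metric's server at read time."""
    return [
        (servers[m["server_id"]]["server_name"], m["cpu_utilization_pct"])
        for m in server_metrics
        if m["cpu_utilization_pct"] >= threshold_pct
    ]


def denormalize():
    """Duplicate the server attributes onto every metric row at write time."""
    return [
        {**servers[m["server_id"]], "cpu_utilization_pct": m["cpu_utilization_pct"]}
        for m in server_metrics
    ]


servers_by_cpu_utilization = denormalize()


def report_without_join(threshold_pct):
    """Denormalized style: every row already carries what the report needs."""
    return [
        (r["server_name"], r["cpu_utilization_pct"])
        for r in servers_by_cpu_utilization
        if r["cpu_utilization_pct"] >= threshold_pct
    ]


print(report_with_join(85.0))
print(report_without_join(85.0))
```

Both reports return the same answer. The difference is where the work happens: the first does a lookup for every metric when the report runs, while the second pays up front by storing the web-01 address twice.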
This goes against the best practices for relational data modeling. That's okay. Relational data modeling rules were designed to reduce the chance of introducing data anomalies, like reporting an outdated address for a customer. When working with Cassandra, we accept extra storage space and the risk of some data anomalies in exchange for higher performance with big data. You might wonder, wouldn't we want to reduce the amount of storage we use when working with big data? After all, duplicating data at large scales can lead to very large databases.
This is true. We don't want to waste space, but more importantly, we want to be able to respond to queries quickly. Cassandra is a good choice for a database when your top priority is being able to write and read data quickly. Now, this does not mean we waste space with Cassandra. We should use data types that are sufficient for what we need, but not more. If you're storing a large number of data points that could be stored in a tinyint, which uses one byte, don't use a bigint, which uses eight bytes.
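As a rough back-of-the-envelope check on that advice, the sketch below compares the raw value sizes for a hypothetical one billion data points. It counts only the bytes of the values themselves and ignores per-cell overhead, compression, and replication, so treat the totals as an illustration, not a sizing estimate. The helper names are made up for this example.

```python
import struct

# Standard sizes of a 1-byte and an 8-byte signed integer,
# matching the tinyint and bigint sizes mentioned above.
TINYINT_BYTES = struct.calcsize("<b")
BIGINT_BYTES = struct.calcsize("<q")


def raw_value_bytes(value_count, bytes_per_value):
    """Bytes needed for the values alone, with no storage overhead."""
    return value_count * bytes_per_value


def fits_in_tinyint(value):
    """True if the value is within the range of an 8-bit signed integer."""
    return -128 <= value <= 127


data_points = 1_000_000_000  # hypothetical volume
tinyint_total = raw_value_bytes(data_points, TINYINT_BYTES)
bigint_total = raw_value_bytes(data_points, BIGINT_BYTES)

print(f"tinyint: {tinyint_total:,} bytes")
print(f"bigint:  {bigint_total:,} bytes")
print(f"saved:   {bigint_total - tinyint_total:,} bytes")
```

A whole-number CPU percentage between 0 and 100 passes the `fits_in_tinyint` check, so a measure like that is a candidate for the smaller type.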
Cassandra is also flexible enough to allow some useful optimizations. For example, instead of storing an entire row each time a new set of server metrics comes in, we can store those metrics in a new column in an existing row. We'll get into the details of how this works later. Well, that concludes our brief introduction to how queries drive data model design in Cassandra.
- Cassandra architecture
- Keyspaces, tables, and columns
- Installing Java and Cassandra
- CQL data types
- Designing Cassandra tables
- Tuning tables to optimize queries
- When to use secondary indexes and materialized views
- Physical data modeling and distributing data
- Cassandra architecture and its impact on data modeling