Learn how to estimate the size of partitions and when to refactor partitions.
- [Instructor] Now let's turn our attention to estimating data size. We'll first look at some formulas. Once we have a physical model, we can estimate the size of the data we'll be storing. We'll base these estimates on three building blocks: column data, row data, and index data. In addition to the space taken up by each column's data type, for example, an integer takes up four bytes, a timestamp takes eight bytes, and a UUID takes 16 bytes, there is overhead associated with columns, rows, and indexes.
To determine the storage needed for a column, we need to account for three things: the column name, because Cassandra stores column names along with data; the column value, whose size is determined by the data type; and the overhead, usually 15 bytes, but as many as 23 if a counter field or a TTL is being used. Text and VARCHAR columns are a little more difficult to estimate because their values vary in size. A good rule of thumb is to use the average length of the text that appears in that column, plus one byte for each of those text or VARCHAR values.
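The per-column arithmetic just described can be sketched in a few lines of Python. This is a minimal sketch, not part of the course; the function name and example column names are illustrative, while the byte sizes and the 15-byte (or 23-byte) overhead figures are the ones quoted above.

```python
# Per-column storage estimate, as described in the transcript:
# column size = column name size + column value size + overhead,
# where overhead is 15 bytes, or 23 if a counter or TTL is used.

TYPE_SIZES = {"int": 4, "timestamp": 8, "uuid": 16}  # bytes, per the transcript

def column_size(name: str, value_bytes: int, has_counter_or_ttl: bool = False) -> int:
    overhead = 23 if has_counter_or_ttl else 15
    return len(name) + value_bytes + overhead

# A hypothetical int column named "user_id": 7 (name) + 4 (value) + 15 = 26 bytes
print(column_size("user_id", TYPE_SIZES["int"]))  # 26

# A hypothetical text column: average text length plus one byte per value
avg_text_len = 40
print(column_size("bio", avg_text_len + 1))  # 3 + 41 + 15 = 59
```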
For a description of data type sizes and other details about Cassandra data types, see the documentation at the URL on the screen. Determining the row size has its own challenges as well. Not all rows need to store all columns, so the size of a row is the sum of the sizes of its stored columns, plus a bit of space for row overhead. Each row has an overhead of 23 bytes in Cassandra. So we can sum the stored column sizes and add 23 bytes to get an estimate of row size.
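The worst-case row estimate, where every row is assumed to store every column, reduces to a sum plus the 23-byte row overhead. A minimal sketch, with hypothetical column sizes assumed to have been computed with the per-column formula above:

```python
# Upper-bound row size: assume every row stores every column,
# sum the column sizes, and add the 23-byte per-row overhead
# quoted in the transcript.

ROW_OVERHEAD = 23  # bytes per row

def row_size_upper_bound(column_sizes: list[int]) -> int:
    return sum(column_sizes) + ROW_OVERHEAD

# e.g. three columns estimated at 26, 27, and 59 bytes
print(row_size_upper_bound([26, 27, 59]))  # 135
```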
You can get an upper bound on row size by assuming all rows will have all columns; this is the worst-case scenario from a storage perspective. You can also estimate a likely average row size using this formula. First, we determine each column's size based on its data type. We then estimate the percentage of rows that will have a value in that column, and estimate the number of rows in the table. From those values, we calculate the expected column storage, which is simply the column size multiplied by the column use percentage.
Expected row storage is the sum of all the expected column storage values, plus 23 bytes for row overhead. The table size is then the expected row storage times the row count. We also want to factor in index size. Each table has to keep an index of the primary key, so the size of the primary key column determines the size of the index, and each index entry requires 32 bytes of overhead. Those are the building blocks for estimating storage size.
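Putting the pieces together, the whole expected-size calculation can be sketched as one function. This is a minimal sketch under the figures quoted in the transcript (23-byte row overhead, 32-byte index entry overhead); the table layout, column sizes, and use percentages below are hypothetical.

```python
# Expected table size, as described in the transcript:
# 1. weight each column's size by the fraction of rows expected to hold a value,
# 2. sum into an expected row size and add the 23-byte row overhead,
# 3. multiply by the row count for the data size,
# 4. add the primary-key index: (key column size + 32 bytes) per entry.

ROW_OVERHEAD = 23          # bytes per row
INDEX_ENTRY_OVERHEAD = 32  # bytes per index entry

def table_size_estimate(columns, row_count, pk_size):
    """columns: list of (column_size_bytes, use_percentage between 0 and 1)."""
    expected_row = sum(size * pct for size, pct in columns) + ROW_OVERHEAD
    data_size = expected_row * row_count
    index_size = (pk_size + INDEX_ENTRY_OVERHEAD) * row_count
    return data_size + index_size

# Hypothetical table: a UUID key column present in every row,
# and a text column present in about 80% of rows.
cols = [(16 + 2 + 15, 1.0),  # "id" uuid column: 16 value + 2 name + 15 overhead
        (59, 0.8)]           # text column, present in 80% of rows
print(table_size_estimate(cols, 1_000_000, 16))  # roughly 151 MB
```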