One of the issues we constantly come across is users trying to run COUNT() queries in Cassandra. More often than not, the question they ask is "why does it time out?". Why indeed? If you're a DBA coming from the relational (RDBMS) world, counting the records in a table is something you do regularly. The answer to the question is simple enough, but it requires understanding the fundamentals of distributed architectures and how Cassandra works. Let us begin...

Cassandra stores data using a log-structured merge-tree (LSM tree). Briefly, this means that writes (in Cassandra these are inserts, updates and deletes) are written to disk at some interval in a sequential manner, like a log file. These files, known as SSTables, are immutable, meaning they do not change once written to disk. This avoids a read-before-write, making writes very fast.

Consider a simple table with multiple columns. In the normal course of a day, a new partition is inserted into the table with values for 2 of its columns. At some point, the memtable gets flushed to disk as an SSTable. This newly inserted partition is the equivalent of one record.

Some time later, another column in the same partition is inserted into the table. This too counts as one record. After a few seconds, yet another column in the same partition is inserted, then the memtable is flushed to disk. Since these 2 fragments of the same partition were in the memtable at the same time, they are merged together before getting written to an SSTable, so they only "count" as one record.
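
The write path so far can be sketched with a toy model in Python. To be clear, this is a hypothetical illustration, not Cassandra's actual internals: the write/flush functions and the "user:1" partition key are made up for the example.

```python
# Toy model of the LSM write path: writes land in an in-memory memtable,
# which is periodically flushed to disk as an immutable SSTable.
memtable = {}    # partition key -> {column: value}
sstables = []    # flushed, immutable SSTables

def write(pk, column, value):
    # Writes never read existing data on disk; they just go to the memtable.
    memtable.setdefault(pk, {})[column] = value

def flush():
    # Flush the memtable as a new SSTable; it is never modified afterwards.
    global memtable
    sstables.append(memtable)
    memtable = {}

write("user:1", "name", "Alice")           # new partition, 2 columns
write("user:1", "email", "a@example.com")
flush()                                    # SSTable 1: one record

write("user:1", "city", "Austin")          # two more fragments, both in the
write("user:1", "phone", "555-0100")       # memtable at the same time...
flush()                                    # ...merged into one record in SSTable 2

print(len(sstables), [len(s) for s in sstables])  # -> 2 [1, 1]
```

Each flush produced a single record, even though the second SSTable holds two separate column writes for the same partition.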

Later that day, one of the existing columns in the same partition gets updated with a new value. Since Cassandra does not perform a read before it writes to disk, it does not know whether the column for the partition already exists in other SSTables, so the update is simply inserted as another record which eventually makes it to yet another SSTable (yes, updates in Cassandra are just inserts under-the-hood, aka "upserts").

In this example, 3 records exist in 3 different SSTables which are in fact fragments of the same single partition. But Cassandra does not know that these 3 records are for the same partition until it reads that partition. And really, this is what COUNT() does -- it reads all the partitions across all the SSTables (and memtables) across all the nodes in the Cassandra ring in order to come up with the result.
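
Continuing the toy model (again, an illustration rather than real Cassandra code), here is what COUNT() is forced to do with those fragments:

```python
# Three on-disk records, all fragments of the same logical partition:
sstables = [
    {"user:1": {"name": "Alice", "email": "a@example.com"}},  # initial insert
    {"user:1": {"city": "Austin", "phone": "555-0100"}},      # later columns
    {"user:1": {"email": "alice@example.com"}},               # update (upsert)
]
memtable = {}

# COUNT() cannot just sum records per SSTable; it has to read every SSTable
# (and the memtable) and merge fragments by partition key before counting.
distinct = set()
for source in sstables + [memtable]:
    distinct.update(source.keys())

print(len(distinct))  # -> 1: three records, one logical partition
```

And that merge has to happen for every partition on every node in the ring.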

On a small table in a small cluster, sure, running a count query is not a big deal. But imagine if you had hundreds of SSTables totalling over 500 GB per node on a cluster with dozens [if not hundreds] of nodes. See how such an innocent query becomes so expensive? It has to perform a full table scan across your cluster, which is why your query times out before it has a chance to complete.

To complicate matters, consider that while the count is in progress, partitions are constantly being created and updated -- this is big data after all, and you're using Cassandra because you have a scale problem. Even if you were lucky enough to get a result, is that count valid? You might as well be counting stars. By the time you've counted half the stars in the night sky, chances are your count is obsolete because the universe is not static -- new stars are born and old stars die all the time. Similarly, your table is not static -- new partitions are created, existing partitions are updated, and some are even deleted.

"Can I use a counter?", you ask. Maybe. If your use case is such that you only write a partition once and never overwrite it again, ever. If your use case involves overwrites then you have no way of working out whether it is the first write or not so your counter becomes useless. You might come up with some sort of algorithm and store a count somehow but in all likelihood, your "counter" solution won't scale since it will probably involve (a) a read-before-write, (b) an additional write for every operation, or (c) both.

You could instead get an estimate from nodetool cfstats, but that's exactly what you would get -- an estimate, because it suffers from the same problem. When Cassandra flushes a memtable to disk, it knows how many keys are in that SSTable and updates the table stats accordingly [1]. While we're here, be aware that deletes are also inserts under-the-hood, i.e. a tombstone is inserted into the table, so it too is counted in cfstats. The count only gets decremented after the tombstones are compacted away once gc_grace_seconds has passed.
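
To see why per-SSTable key counts only yield an estimate, consider this small sketch (a simplified model, as noted in [1], not the stats machinery Cassandra actually uses):

```python
# Each SSTable knows its own key count at flush time, but summing those
# counts overcounts any partition whose fragments span multiple SSTables.
sstables = [
    {"user:1": {"name": "Alice"}},                # flush 1
    {"user:1": {"email": "alice@example.com"},    # flush 2: same partition again
     "user:2": {"name": "Bob"}},
]

estimate = sum(len(s) for s in sstables)          # what the stats can add up
actual = len({k for s in sstables for k in s})    # what a full merge would find

print(estimate, actual)  # -> 3 2
```

Getting the exact number would require merging the keys across SSTables -- in other words, the same expensive full read that COUNT() performs.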

Finally, if you really must do a count, do it with DSE Analytics or DSE Search -- but those are topics for another blog post. I'll get around to it sometime after I finish counting stars, maybe...

[1] I've oversimplified how the table stats get updated to keep this digestible for the intended audience. In reality, algorithms such as HyperLogLog are used to improve the accuracy of the aggregated values.