Cleaning Up Tombstones in DataStax DSE and Apache Cassandra

In Cassandra, data isn’t deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput, and avoids reads-before-writes. It uses sstables, which are immutable once written. So, a delete is actually an update, and updates are actually inserts (into new sstables). A “tombstone” marker is written to indicate that the data is now (logically) deleted. (See here for a good writeup about tombstones from The Last Pickle.)
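To make the "a delete is really an insert" idea concrete, here is a minimal toy model in Python. It is only a sketch: `ToyTable`, `Cell`, and the one-sstable-per-write behaviour are made up for illustration and are not how Cassandra is actually implemented (real writes go through a memtable first, for example).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None means this cell is a tombstone
    timestamp: int        # the newest timestamp wins on read


class ToyTable:
    """Hypothetical append-only table: a list of immutable 'sstables' (dicts)."""

    def __init__(self) -> None:
        self.sstables: List[Dict[str, Cell]] = []
        self._clock = 0

    def _flush(self, key: str, value: Optional[str]) -> None:
        # Every write lands in a brand-new sstable; nothing is changed in place.
        self._clock += 1
        self.sstables.append({key: Cell(value, self._clock)})

    def insert(self, key: str, value: str) -> None:
        self._flush(key, value)

    def delete(self, key: str) -> None:
        # A delete is just another write: a tombstone marker with a timestamp.
        self._flush(key, None)

    def read(self, key: str) -> Optional[str]:
        # Collect every version of the key and let the newest one win.
        versions = [s[key] for s in self.sstables if key in s]
        if not versions:
            return None
        newest = max(versions, key=lambda c: c.timestamp)
        return None if newest.value is None else newest.value
```

After an `insert` followed by a `delete`, a read returns nothing, but both the old value and the tombstone are still sitting on "disk" in separate sstables. That leftover data is what the rest of this post is about.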

Heavy deletes can lead not only to extra disk space usage, but also to slower reads, since a read might have to scan through all those tombstones to find any “live” data. And this doesn’t only affect deletes. For the same underlying reasons, TTLs and, in some cases, NULLs also cause tombstones. (See here for a nice writeup about this from OpenCredo.)

In an RDBMS, deletes aren’t immediately cleared either; the rows are also marked as logically deleted. After the data is no longer needed for rollback segments or multiversion concurrency control (MVCC), the physical blocks can be overwritten with new data. Or, you run something like VACUUM for PostgreSQL.

In Cassandra, as mentioned, data is written to immutable sstables, so those approaches won’t work. Instead, tombstones get cleaned up over time during compactions (the process that cleans up out-of-date records by writing new sstables to replace the older ones). However, how quickly this happens depends on many things: compaction strategy, sstable sizes, gc_grace_seconds, whether the relevant rows are in one sstable or spread out over multiple sstables, etc. (If a row is spread out over multiple sstables, a compaction can’t drop the tombstone unless it includes all of those sstables; otherwise the data left over in the others would be seen as live again.)
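The "spread over multiple sstables" problem is easier to see in code. Below is a simplified Python sketch of the purge decision. The `compact` function, the tuple layout, and the use of one number as both write timestamp and deletion time are assumptions for illustration; the real logic in Cassandra is more involved.

```python
from typing import Dict, List, Optional, Tuple

# A cell is (value, write_timestamp_in_seconds); value None marks a tombstone.
Cell = Tuple[Optional[str], int]
SSTable = Dict[str, Cell]


def compact(selected: List[SSTable], others: List[SSTable],
            now: int, gc_grace_seconds: int) -> SSTable:
    """Merge `selected` into one new sstable, keeping the newest cell per key.

    A tombstone is dropped only if it is older than gc_grace_seconds AND no
    sstable outside this compaction still holds older data for the same key.
    """
    merged: SSTable = {}
    for table in selected:
        for key, cell in table.items():
            if key not in merged or cell[1] > merged[key][1]:
                merged[key] = cell

    result: SSTable = {}
    for key, (value, ts) in merged.items():
        if value is None:
            expired = now - ts >= gc_grace_seconds
            shadows_outside = any(key in t and t[key][1] < ts for t in others)
            if expired and not shadows_outside:
                continue  # safe to purge the tombstone
        result[key] = (value, ts)
    return result
```

With `old = {"k": ("v1", 100)}` and `tomb = {"k": (None, 200)}`, compacting only `tomb` has to keep the tombstone, because dropping it would let `v1` in `old` show up as live again. Compacting both together, once gc_grace_seconds has passed, removes the data and the tombstone.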

People often ask how they can clean up unwanted tombstones, especially if they’re getting tombstone warnings or errors in their logs. Here are some ways:

1. We can tune compaction to be more aggressive using compaction subproperties, then sit back and wait again. Most commonly, we adjust tombstone_compaction_interval and tombstone_threshold. These control how soon after creation an sstable is considered for tombstone compaction, and what ratio of droppable tombstones in an sstable triggers one. But this can still take time, for the reasons given above. (Also, there are different kinds of tombstones, namely cell, row and range tombstones, and in some cases range tombstones don’t get counted.) So we still have to wait for compaction to do its job, but this can make it a bit more aggressive.

2. Or, in some cases customers use forced major compactions to get immediate results, but this can lead to issues down the road: we end up with one big sstable that won’t get compacted for some time, which again means wasted disk space and performance problems. (E.g., for STCS, we need to wait for 3 other sstables of about the same size as the new, huge sstable we created, and that time may never come if the sstable is large enough.) In that case, they might then need to use sstablesplit to split up the large resulting sstable, but that is a bit of a tricky process in itself, since you have to decide how many sstables of what size to split the big sstable into. (E.g., if you go too small, you’re just adding lots of extra compaction to your node, as these get compacted up the levels again. Too big and you run into the same big-sstable issue down the road. And you also have to stop DSE on the node before running it.)

3. Or, a hack for smaller tables is to alter the table to use a different compaction strategy, then alter it back (forcing all sstables to be rewritten), but this isn’t really feasible for larger tables, as it ties up the table, involves schema changes (which have their own risks), and adds a good amount of load to every node at once.

4. Now there’s a newer and better way! Recently, it was pointed out to me that DSE 5.1 (and Apache Cassandra 3.10) offers a new nodetool command, ‘nodetool garbagecollect’, introduced in JIRA ticket CASSANDRA-7019. This was added for just this situation, and is designed to clean up a table’s droppable tombstones.
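For options 1 and 4, here is a rough Python sketch that builds the commands involved and mimics the threshold check. The function names, keyspace, and table names are hypothetical, and the eligibility check is a simplification of what the compaction strategy does. Note that an ALTER TABLE sets the whole compaction map, so the strategy class has to be repeated along with any other subproperties you want to keep.

```python
from typing import List


def is_tombstone_compaction_candidate(droppable_ratio: float,
                                      sstable_age_seconds: int,
                                      tombstone_threshold: float = 0.2,
                                      tombstone_compaction_interval: int = 86400
                                      ) -> bool:
    # Simplified: an sstable is considered once it is older than the interval
    # and its estimated droppable-tombstone ratio exceeds the threshold.
    # 0.2 and 86400 (one day) are the documented defaults.
    return (sstable_age_seconds > tombstone_compaction_interval
            and droppable_ratio > tombstone_threshold)


def alter_tombstone_compaction_cql(keyspace: str, table: str,
                                   strategy_class: str,
                                   tombstone_threshold: float,
                                   tombstone_compaction_interval: int) -> str:
    # Option 1: CQL that makes tombstone compaction more aggressive.
    return (
        f"ALTER TABLE {keyspace}.{table} WITH compaction = {{"
        f"'class': '{strategy_class}', "
        f"'tombstone_threshold': '{tombstone_threshold}', "
        f"'tombstone_compaction_interval': '{tombstone_compaction_interval}'"
        f"}};"
    )


def garbagecollect_command(keyspace: str, table: str) -> List[str]:
    # Option 4: argument list for running garbagecollect on one table.
    # Run `nodetool help garbagecollect` to see the options your version has.
    return ["nodetool", "garbagecollect", keyspace, table]


if __name__ == "__main__":
    print(alter_tombstone_compaction_cql(
        "my_ks", "my_table", "SizeTieredCompactionStrategy", 0.1, 3600))
    print(" ".join(garbagecollect_command("my_ks", "my_table")))
```

The first command goes into cqlsh and applies cluster-wide as a schema change. The garbagecollect command is run per node, like other nodetool operations.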

nodetool garbagecollect looks to be a very useful addition, but it doesn’t seem to be widely known or used. I haven’t played with it much yet, so if you have any good or bad experiences with it, we’d love to hear them! In a followup post, we’ll go over technical details of its implementation and usage, so stay tuned!