Dude! Where’s my data?

Overview

One of the most common problems we see in support is data going “missing”. It's not quite as scary as it might sound (databases are designed not to “lose” data). The thing to remember is that Cassandra is a distributed system: writes can come into any node in the cluster and end up on any other node. Factors such as client location, network speed, network routing, token distribution, node placement and cluster time sync all influence exactly when a write reaches the cluster and where it ends up. An in-flight write in a distributed system might be in a number of places at any point in time. If I send write 1 and write 2 into my cluster, it could very well be that write 2 gets there first, just as a stream of TCP packets may arrive at their destination network interface in a different order from the one in which they were sent. So how does this affect writes, and why should you care? Well, in a lot of cases a database will be handling transactions of some sort. If your data model is idempotent that's ideal, but not every user can design their application that way.

Let's consider an account balance. If I am updating the actual balance amount, the order in which those updates arrive begins to matter; we can't just “fire and forget”. Of course, our data model could store a timestamp alongside each value so that we only ever read the latest one, which is preferable, but not every customer can design a model like this. When the order of updates matters, techniques like client-side timestamping are commonly used.
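As a quick sketch of client-side timestamping (using the account table we'll set up below, and an arbitrary example timestamp), the client supplies the write time itself in microseconds since the epoch, and the write carrying the highest timestamp wins regardless of arrival order:

cqlsh> insert into data.account (anum, balance) VALUES ( 990, 200) USING TIMESTAMP 1502355560000000;

We'll use this same USING TIMESTAMP mechanism later on to simulate a misbehaving client.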

However, even with the best will in the world, things go wrong. This blog shows how we in support help our customers determine where this “missing” data might be and how to make sense of such situations. Note: we'll use a very simplistic data model here that would probably have no practical use; remember, we are simply illustrating how to find this data, not how to avoid bad design.
 

Setup

For this experiment we used a small cluster of 4 nodes with single tokens, injected a few rudimentary rows of data and then manually tweaked some of the timestamps to show how things can easily go amiss.

Cluster

Here's our cluster. Simple: all nodes in the same rack, single tokens, running DSE with a Cassandra workload. Pretty much a default testing setup.

$ nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  172.31.27.13   126.14 KB  1       ?       2f39baa5-6cbe-4450-aa23-57715a0b7d98  rack1
UN  172.31.25.103  153.69 KB  1       ?       95063d9f-caa7-4629-b388-b6b76640ed90  rack1
UN  172.31.25.22   125.9 KB   1       ?       f98d8989-a7ad-44a0-a305-76adfe0127a4  rack1
UN  172.31.23.3    170.26 KB  1       ?       edf28049-4d7f-4b2d-9e20-5b49d407b54e  rack1

Table

Here's our table: all defaults, two columns, no compound or clustering keys.

CREATE TABLE data.account (
    anum int PRIMARY KEY,
    balance int
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Data

Here’s our initial test data. Nothing else to say here!

cqlsh> select * from data.account ;

 anum | balance
------+---------
 7700 |     100
  880 |     100
  990 |     100

(3 rows)
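(For reference, these rows were loaded with nothing more exotic than plain inserts along these lines:)

cqlsh> insert into data.account (anum, balance) VALUES ( 7700, 100);
cqlsh> insert into data.account (anum, balance) VALUES ( 880, 100);
cqlsh> insert into data.account (anum, balance) VALUES ( 990, 100);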

 

Scenarios

Here we look at some typical manifestations of these kinds of problems.

Update but new value not seen

This is a common scenario: the upstream application has updated the value, but it's not showing in the cluster!

Let's update some of the accounts:

cqlsh> insert into data.account (anum, balance) VALUES ( 990, 200);
cqlsh> select * from data.account ;

 anum | balance
------+---------
 7700 |     100
  880 |     100
  990 |     200

(3 rows)

That's all good. However, let's look at the write timestamps on the updated values:

cqlsh> select anum, balance,  writetime(balance) from data.account ;

 anum | balance | writetime(balance)
------+---------+--------------------
 7700 |     100 |   1502355447590958
  880 |     100 |   1502355442251444
  990 |     200 |   1502355558298234

(3 rows)

Note the times are in microseconds:

$ date -d @1502355447590958
Sun Dec  4 00:09:18 UTC 47609718

So we need to trim off 6 digits:

$ date -d @1502355447
Thu Aug 10 08:57:27 UTC 2017

(that's more like it!)
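If you'd rather not trim digits by hand, you can let the shell do the division for you (assuming GNU date and bash arithmetic):

$ date -d @$((1502355447590958 / 1000000))
Thu Aug 10 08:57:27 UTC 2017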

Now we're going to simulate something. We're using cqlsh here, but this could be any upstream application: we write our data with a timestamp that is a little ahead of the current time. (This could easily happen in a real system if the clocks are out of sync, the client's time is wrong, etc.)

cqlsh> insert into data.account (anum, balance) VALUES ( 880, 200) USING TIMESTAMP 1502356880123456;
cqlsh> select * from data.account ;

 anum | balance
------+---------
 7700 |     100
  880 |     200
  990 |     200

(3 rows)
cqlsh> insert into data.account (anum, balance) VALUES ( 880, 300);
cqlsh> select * from data.account ;

 anum | balance
------+---------
 7700 |     100
  880 |     200
  990 |     200

(3 rows)

Wait a minute, we just wrote 300 into that account balance! Where has it gone?

Ok, so let's start by finding out which nodes have this key.

Note that we can check writetime at this point too, and that might be all we need, but here I'm showing how to trace right back to the sstables.
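For example, a quick query scoped to our key would already show something like this suspicious writetime:

cqlsh> select anum, balance, writetime(balance) from data.account where anum = 880;

 anum | balance | writetime(balance)
------+---------+--------------------
  880 |     200 |   1502356880123456

(1 rows)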

$ nodetool getendpoints data account 880
172.31.25.103
172.31.23.3

Let's get onto one of these nodes and check out the data.

First we flush (using the keyspace here, but we could also specify the table, as shown below):

automaton@ip-172-31-23-3$ nodetool flush data
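(If we only wanted to flush this one table, the equivalent would be:)

automaton@ip-172-31-23-3$ nodetool flush data account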

Now let's find the sstable

$ nodetool getsstables data account 880
/var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-1-Data.db

 

Ok great! Now let's dump the contents of this sstable. Note that in a lot of cases you'll have to become root and then sudo as the cassandra user because of permissions:

$ sudo su
# sudo -u cassandra sstable2json /var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-1-Data.db
[
{"key": "880",
 "cells": [["","",1502356880123456],
           ["balance","200",1502356880123456]]}
]

Ok, now let's check the other node for completeness:

automaton@ip-172-31-25-103:~$ nodetool flush data

$ nodetool getsstables data account 880
/var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-1-Data.db

$ sudo su
# sudo -u cassandra sstable2json /var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-1-Data.db
[
{"key": "7700",
"cells": [["","",1502355447590958],
           ["balance","100",1502355447590958]]},
{"key": "880",
"cells": [["","",1502356880123456],
           ["balance","200",1502356880123456]]}
]

So we can see that the stored writetime is ahead of the time we made our update (it was only around 09:00 when we inserted the 300). The upstream client must have written with a future timestamp!

$ date -d @1502356880
Thu Aug 10 09:21:20 UTC 2017
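As an aside, if you did need that 300 to stick, one option (a sketch, not necessarily a recommendation) is to re-issue the write with an explicit timestamp higher than the rogue one, since Cassandra keeps the cell with the highest timestamp:

cqlsh> insert into data.account (anum, balance) VALUES ( 880, 300) USING TIMESTAMP 1502356880123457;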

Updates but nothing showing

We see the same kind of thing with tombstones too, but this time it's much stranger because the user will see no data at all. Let's use the same table, but remove the 7700 account: it had the wrong number, it should have been 770!

We delete the wrong account first (this bit is fine):

cqlsh> delete from data.account where anum = 7700;
cqlsh> select * from data.account ;

 anum | balance
------+---------
  880 |     200
  990 |     200

(2 rows)

Now let's simulate a programming error in the application: for some reason the application sends out a delete for the key before it creates the key, and for some reason it's not using a client-side timestamp, so the delete defaults to coordinator time on the cluster.

cqlsh> delete from data.account where anum = 770;

Now our application does its normal run of setting up the account and inserts the initial opening balance with a client-side timestamp:

cqlsh> insert into data.account (anum, balance ) VALUES ( 770, 500) using timestamp 1502350880123456 ;
cqlsh> select * from data.account ;

 anum | balance
------+---------
  880 |     200
  990 |     200

(2 rows)

Wait! Our account isn’t there?!

So let's use the same tools again:

automaton@ip-172-31-25-103:~$ nodetool flush data
$ nodetool getsstables data account 770
/var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-2-Data.db
$ sudo su
# sudo -u cassandra sstable2json /var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-2-Data.db
[
{"key": "770",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1502357456028973,"localDeletionTime":1502357456}},
 "cells": []},
{"key": "7700",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1502357411301404,"localDeletionTime":1502357411}},
 "cells": []}
]

automaton@ip-172-31-25-22:~$ nodetool flush data
$ nodetool getsstables data account 770
/var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-1-Data.db
$ sudo su
# sudo -u cassandra sstable2json /var/lib/cassandra/data/data/account-d3a4afe17da911e79b17671ad47f157b/data-account-ka-1-Data.db
[
{"key": "770",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1502357456028973,"localDeletionTime":1502357456}},
 "cells": []},
{"key": "7700",
 "metadata": {"deletionInfo": {"markedForDeleteAt":1502357411301404,"localDeletionTime":1502357411}},
 "cells": []},
{"key": "990",
 "cells": [["","",1502355558298234],
           ["balance","200",1502355558298234]]}
]

 

Ah ha! There we have it: a tombstone for our account record. Its markedForDeleteAt was set with coordinator time when we issued the delete, while our insert carried a client-supplied timestamp from well before that, so the tombstone shadows the new row.
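Converting the tombstone's deletion time and our insert's client-supplied timestamp to dates makes the shadowing obvious:

$ date -d @1502357456
Thu Aug 10 09:30:56 UTC 2017

$ date -d @1502350880
Thu Aug 10 07:41:20 UTC 2017

The tombstone (09:30) is newer than the insert (07:41), so the delete wins and the row stays hidden.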

Conclusion

I hope these examples have given an overview of how behaviour that might initially look strange in a DSE cluster can in fact have a plausible explanation. Using writetime in cqlsh along with the nodetool get* commands can really help a user drill down on their data and understand how Cassandra handles data updates across the cluster.