About this Short Course

KillrChat is an advanced hands-on exercise designed to show how to model data with Cassandra on a real world application.

Reference Application
Apache Cassandra
Data Modeling

This posting was created by Duy Hai Doan, Apache Cassandra Evangelist at DataStax.

 

A Technical Stack

For the database, the choice was obviously Apache Cassandra™. To take care of the repository layer, we used the Datastax Java Driver with the Object Mapper module and Achilles framework to provide some support for Unit testing and embedded Cassandra server management.

For the front-end, we choose AngularJS for several reasons:

  • it is easy to learn

  • it is very extensible thanks to the directive system

  • the framework focuses on developer productivity, with a declarative programming style

  • technical resources and tutorial are quite abundant on the web

  • last but not least, there is a nice integration module for Twitter Bootstrap

We pulled an integration of AngularJS with Twitter Bootstrap, called UI-Bootstrap. For the glue between the front-end interface and the database, we opted for a classical Spring stack.

  • Spring Boot with an embedded Jetty server for the application skeleton

  • Spring Boot Security to take care of the security part (at least account creation + login/logout)

  • Spring Boot Web for the HTTP REST communication with the front-end

  • Spring Boot Messaging as the broker for the Web Socket server-side

Some interesting notes about the Web Socket feature. Spring Boot proposed a very nicely packaged Web Socket architecture with SockJS on the client-side and Spring Messaging at the back-end. Indeed, a Web Socket can be seen as an abstraction of a publish/subscribe broker. Many clients can subscribe to the same chat room to receive updates on new chat messages. The Spring team did a good job of extracting the broker part from their very mature Spring Integration project to create a more lightweight Spring Messaging module. So in a nutshell, from top to bottom, below is the technology stack:

B Technical Architecture

On a single node, the architecture would resemble this:

 

In this mode, on the back-end side, everything can be embedded in the Tomcat/Jetty server, including an in-memory broker for the Web Socket and an embedded Cassandra.

 

On a full-fledged production deployment, the architecture would be:

To cope with the high traffic, the number of instances of Tomcat/Jetty servers can be increased. Since the back-end application is stateless (there will be still some session-stickiness handled by the load balancer), we can throw in new boxes to support the load.

On the broker side (necessary for the Web Socket publish/subscribe pattern), we need to use a dedicated infrastructure. A solution like RabbitMQ/ZeroMQ can be a good fit. If you need extreme scalability/fast response time, Apache Kafka could be also a good choice. All you’ll need to implement is a connector between this broker and the Spring Messaging abstraction on each back-end server

For the database, a full Cassandra cluster with horizontal scaling-out is sufficient to cope with any increase in term of traffic.

C The Data Model

KillrChat data model focus on 3 main components:

  • users

  • chat rooms

  • chat messages

 

1) Users management

For a [successful] chat application, the number of users will grow fast. To be able to scale out with this growth, we need to choose a partition key that will scale out nicely. A good candidate is the user login.

 

 

The users table is quite simple and straightforward:

CREATE TABLE killrchat.users(

login text,

pass text, //password is not allowed as column name because reserved word

lastname text,

firstname text,

bio text,

email text,

chat_rooms set,

PRIMARY KEY(login));
 

 

The corresponding Java mapping for this table:

@Table(keyspace = KEYSPACE, name = USERS)

public class UserEntity {

 

   @PartitionKey

   private String login;

 

   @NotEmpty

   @Column

   private String pass;

 

   @Column

   private String firstname;

 

   @Column

   private String lastname;

 

   @Column

   private String email;

 

   @Column

   private String bio;

 

   @Column(name = "chat_rooms")

   private Set chatRooms = new HashSet<>();

 

   //Omitted getters & setters

 

Please notice the columns chatRooms in the table users. This columns will store a list of chat rooms to which the current user has registered. We use a set to store all the chat room ids instead of using a separated table. Clearly, collections in CQL are suited for this use case where the cardinality of the set is pretty reduce (it does not make sense for an user to register to 106 chat rooms). Furthermore, since the firstname and lastname of an user is immutable (unless your application decides to allow people changing their name), we do not have to take care of update scenarios.

 

To avoid multiple users creating an account with the same login (partition key), we rely on Cassandra LightWeight Transaction. The operation is expensive but the trade-off is still acceptable with regards to the chat usage lifecycle. After all you only create account once.
 

INSERT INTO killrchat.users(login,...) VALUES(‘jdoe’,...)IF NOT EXISTS;

 

2) Chat rooms data model

To scale the chat rooms number nicely, we apply the same approach as with the users. Let’s use room name as partition key.

 

The chat_rooms table skeleton is defined below:

CREATE TABLE killrchat.chat_rooms( room_name text,

room_name text,

creation_date timestamp,

banner text,

creator ???,

creator_login text,

participants ???,

PRIMARY KEY(room_name));

 

Whenever an user enters a chat room, we should load:

  • the room details

  • the room creator details

  • the list of all current participants in the room

Room details are pretty easy to store. For room creator, we can either persist only the user login and issue an additional select to fetch creator details or de-normalize. The same logic applies to the list of participants, either persisting only participants login and for each of them, issue a SELECT to load his/her details or de-normalize.

 

Clearly, persisting only user and participants login will raise the N+1 SELECT issue, so the choice of de-normalizing is obvious. For this, we create an user-defined type user and re-use it in the chat_rooms table

CREATE TYPE killrchat.user(

login text,

firstname text,

lastname text);

 

CREATE TABLE killrchat.chat_rooms(

...

creator frozen,

...

participants set>,

PRIMARY KEY(room_name));

 

The corresponding Java mapping for this table:

@Table(keyspace = KEYSPACE, name = CHATROOMS)

public class ChatRoomEntity {

   @PartitionKey

   @Column(name = "room_name")

   private String roomName;

 

   @Column

   @Frozen

   private LightUserModel creator;

 

   @Column(name = "creator_login")

   private String creatorLogin;

 

   @Column(name = "creation_date")

   private Date creationDate;

 

   @Column

   private String banner;

 

   @Column

   @Frozen("set>")

   private Set participants = new HashSet<>();

 

Again, as for users, chat room creation will use LightWeight Transaction to guarantee unicity constraint.

 

3) Chat rooms participants management

There are some tricky scenarios with chat rooms participants to handle, especially in a fully distributed architecture not using global lock.

 

Indeed, this architecture raises some interesting questions about concurrency issues:

  • what happens if a participant just joins a room when it is deleted ? How can we remove this room name from this participant chat rooms list ?

  • what happens if a participant just leaves a room when it is deleted ? Same question as above

  • how can we guarantee that only the room creator can delete his own rooms and not someone else ?

 

a. Participant joining room

Let’s focus first on the participant joining a “just deleted” room scenario. Naively, to add a participant to a room, we can issue this CQL statement:

UPDATE killrchat.chat_rooms

SET participants = participants + {...}

WHERE room_name = ‘games’;

 

In that case, with a concurrent room deletion by the creator, depending on the ordering of both statements, we can have data corruption as shown below:

The creator just deletes the chat room “games” (e.g. creating tombstones on all columns) and right after that, we update the participant list, thus just adding an extra column with the new participant details. The final result is that the chat room “games” still exists but with only 1 participant and no other information (creator = null, banner = null, creation_date = null, …).
 

To prevent this data corruption, we rely again on LightWeight Transaction.

UPDATE killrchat.chat_rooms

SET participants = participants + {...}

WHERE room_name = ‘games’ IF EXISTS;

 

The IF EXISTS clauses will reject the update if the room has been already removed.

 

b. Participant leaving room

What’s about a participant leaving a room right after it is deleted ? Can we again rely on LightWeight Transaction to handle it properly ? The answer is Yes, but it’s not necessary. Indeed, removing in CQL means creating tombstone markers. So no matter in which order we remove data (removing a room first and then removing a participant or removing a participant first and then the chat room), the result is identical, we’ll have tombstone columns. This particular scenario does not require LightWeight Transaction  because no data corruption is possible.

 

c. Removing a chat room

Last but not least, removing the entire chat room. For this, there are 2 strong requirements:

  • only the creator can remove his own rooms

  • we must also, upon chat room deletion, remove this chat room name from the chat room list of all its current participants

 

The appropriate CQL statement for a safe chat rooms removal is

DELETE killrchat.chat_rooms

WHERE room_name = ‘games

IF creator_login =

AND participants = {...};

 

Again, we use LightWeight Transaction to enforce some invariants.

 

The IF creator_login = … condition ensures that only a creator can delete his own room. How can we inject the creator_login value into this query ? By fetching it from the Spring Security context! Indeed the Spring Security authentication process gives us a strong guarantee about an user identity, no cheating/hacking is possible and we can safely use this security context login to compare with the actual creator_login stored in Cassandra.

 

The second condition about participants will ensure that when deleting a room, we do not miss any new entering participants. Indeed, upon room deletion we must update each participant room list to remove the current room. Any participant that concurrently joins the room when it is being deleted may not be taken into account, unless we rely on LightWeight Transaction.

 

Please note that the room deletion and the current participants’ room list update is not atomic, e.g. there might be a tiny time frame where the room is already deleted but one participant room list not updated yet.

 

Using LightWeight Transaction has an impact on performance but considering the frequency at which people create and join rooms compared to the time spent on chatting, it is worthwhile.

 

4) Chat messages management

The last data model is about chat messages. A simple and naive approach could be:

 

CREATE TABLE killrchat.chat_room_messages(

room_name text,

message_id timeuuid,

content text,

author frozen, // denormalization system_message boolean,

PRIMARY KEY((room_name), message_id))

WITH CLUSTERING ORDER BY (message_id DESC);

The table has room_name as partition key and message_id as clustering column. The clustering column is sorted in reverse order to fetch the latest chat messages first. We also use user-defined type to persist the message author detail.

 

The above data model is very convenient for chat message display. To retrieve the last 50 messages:

SELECT * FROM killrchat.chat_room_messages

WHERE room_name = ‘games

LIMIT 50;

 

To retrieve the previous page of 50 message, starting from the last fetched message_id:

SELECT * FROM killrchat.chat_room_messages

WHERE room_name = ‘games

AND message_id < last_message_id

LIMIT 50;

 

There is still a caveat with this design, the chat room messages can not scale to billions, unless you set a fixed TTL on each message.

 

Indeed, with a very popular chat room, having thousands of participants, the message count can grow very fast over time and reach the technical limit of 2.109 physical columns for the partition.

 

A remedy is to split the same partition between many bucket (base on date). Depending on the traffic, you may want to create a bucket by day/week/month/year…

CREATE TABLE killrchat.chat_room_messages(

room_name text,

time_bucket int, // format yyyyMMdd

message_id timeuuid,

content text,

author frozen, // denormalization system_message boolean,

PRIMARY KEY((room_name,time_bucket), message_id))

WITH CLUSTERING ORDER BY (message_id DESC);

 

This design allows to scale with the message count but makes querying the table more difficult. While it was trivial to fetch the previous page of message with a simple SELECT in the previous data model, you’ll have to take into account the time_bucket to switch partition when necessary.

 

We can devise a helper table to store time_bucket information for each chat room:

CREATE TABLE killrchat.messages_bucket(

room_name text,

time_bucket int, // format yyyyMMdd

PRIMARY KEY((room_name), time_bucket))

WITH CLUSTERING ORDER BY (time_bucket DESC);

 

The content of this table, very small, can be put on cache for fast access. Whenever a participant creates a message, we update an application-managed weak HashMap>. We flush this map regularly to Cassandra to update the message_bucket table. This map acts as a buffer to avoid hammering Cassandra with frequent mutations.