Write Techniques

DSE Version: 6.0


Using Apache Cassandra at the core of DataStax Enterprise, write operations are designed to be as efficient as possible. But there are some techniques that can be used in terms of data modelling for dealing with specific challenges. We discuss this further in this unit, with a focus on batches and lightweight transactions.


As we continue along this discussion of data modeling with DataStax Enterprise, you might have noticed that there hasn't really been any significant data modeling topics in regards to writes. Using Apache Cassandra as the core, write operations in DSE are designed to be as efficient as possible. As such, there really isn't too much you need to plan ahead of time with your write workloads. However, there are some techniques that can be useful for solving specific challenges, which we'll be covering in this video.

One specific challenge that we run into when data modeling in DataStax Enterprise is how we can maintain data consistency across duplicated data. With the usage of denormalized tables, it's possible attributes for the same entity to appear across multiple tables. In this case, how can we make sure that the data stays the same? It's possible that differents parts of your application may read the same data from different tables, and you wouldn't want the data to be different, would you?

Therefore it's important that your data stays consistent. How does that happen normally? In a simple model, let's say you have some sort of entity whose attributes are in multiple tables. When you are adding or updating an entity, you would then use INSERT or UPDATE statements to update all of those tables. This should work fine in most situations, but what happens if one of those INSERT or UPDATE statements happen to fail? That means that the data may no longer be consistent across all of the tables used in your application.

You'll find that because of these type of situations, you'll need some way of tracking if your writes fail, and have an appropriate way to handle these errors when they occur. This could mean building application logic to handle that.

There's also a way that DataStax Enterprise can handle these type of writes as well, using something called a batch… specifically a logged batch. Similar to batches in relational databases, a logged batch is combining the write operations written in the batch and sending them all at once to a coordinator node, rather than executing the operations one by one. However once the batch is accepted by the coordinator, DSE will manage its execution and ensures that all of the write operations will eventually be successfully completed. This is one useful way that can help divert the burden of managing data consistency across tables to the database itself.

In our example here we have two tables in our killrvideo application, a videos table and a videos by title table. You can also see that we have certain duplicated columns, such as the title, uploaded timestamp, description, and more.

A simple example of adding a new video to these tables then would be by using two insert statements, one for each of these tables. You can see that the video_id of 1 should be the same for both of these insert statements, and we'll assume that the rest of the columns all have the same value.

When making changes to a video, we'll need to make the same changes to both of the tables. For the videos table, we can use an UPDATE to modify the column with the new value. In this case, the title of the video is being changed from Jaw to Jaws. With the videos_by_title table, the procedure is a bit different since the column we're changing, title, is actually part of the primary key as you can see in our Chebotko diagram. If we were to do an update in this case, the write operation would either create a new row with the title column value of Jaws, or update an existing one with same title and video id. The original row with the title Jaw and video id of 1 would still exist. It is because of this we'll actually need to do two operations, an INSERT so that there is a row with the updated information, and then a DELETE to remove the original row in videos_by_title. Although it doesn't show it, you'll need to add values for all the columns that you have data for in the INSERT operation.

Note that the statements shown here would be executed individually. It is possible that one statement could succeed and the other fail, meeting we have an inconsistency for the data for the video in the videos and videos_by_title table. In this kind of situation, the application would need to handle the error in some way so that the tables stay consistent.

With logged batches, you would need to worry about individual write operations failing. All we need to do here is use the syntax for a logged batch, starting with BEGIN BATCH, writing the operations that should be included in the batch, and then finishing it with the keywords APPLY BATCH. In this case, the write operations are all sent together to a coordinator node. If the batch is accepted by the coordinator, then it is then up to database to ensure that the statements in the batch are all executed successfully.

When a client sends a batch to a coordinator node, the first thing that the node does is to save the batch into a special log, both on that coordinator node as well as one several replica nodes. This ensures that if the coordinator fails during execution of the batch, it can replay the batch again when the coordinator is available to do so. If for some reason the coordinator is not able to replay the batch, then one of the replicas will take over the responsibilities of the executing the batch instead.

Once saved in the batch log, the coordinator will then attempt to execute the write operations. The batch operation only succeeds once all of the write operations succeeds, with the changes either applied or hints saved for the replica nodes.

This is a very useful technique to help manage data consistency in DataStax Enterprise, and makes it possible to have ACID-like properties. However, what the batch cannot do is allow for batch isolation. Ultimately the write operations are still executed individually meaning that there may be a window in which it is possible to read both old and new data while the batch is running.

There are some caveats to be aware of when using batches. For one, it is not necessarily a good way to bulk load data, as it is used in relational databases. Logged batches actually require more work to be done compared to normal writes, so it will always be slower. There is also an unlogged batch that can be used in DataStax Enterprise, but even then if you're creating huge batches, this may be generating a lot of work on coordinator nodes causes resource contention and possibly performance bottlenecks. This isn't to say that batches cannot be used as a way to bulk load data, but it only makes sense in specific scenarios. This may be a topic that can visit again once you become more familiar with the way DSE works.

Another important point is that batches do not necessarily execute and complete in the order written. Generally when you execute multiple statements, each write may have a slightly different timestamp depending on fast and close together those statements were executed. With batches, all of the write operations uses the same timestamp. Therefore there isn't a concept of ordering. This will cause issues if you're thinking about using batches to make changes to the same row multiple times. Let's say that you are inserting a row, and then modifying some columns, then deleting the row, and finally inserting a new row with the same primary key , all in the same batch. Since all of these writes will be executed with the same timestamp, you may not get the result you may expect when you try to read the row.

One other useful feature that's available in DataStax Enterprise is lightweight transactions. This is a special type of write operation that is performed by first checking some sort of condition in the table, and then executing that write only if that condition is met. This can be used with any of the write operations, INSERT, UPDATE or DELETE. This essentially is a way to perform an ACID transaction, though it works with one write operation at a time. It is more expensive than a normal write, which basically means it takes longer to complete due to the additional internal operations that also need to be done in addition to the write itself. In the diagram here, we are showing a lightweight transaction for an INSERT operation, which is identified with the keywords IF NOT EXISTS at the end of the statement. When executing, the request goes a coordinator node and then routed to the corresponding replica node. The replica will check if there is already an existing partition with the same primary key. If there is an existing partition, then the replica node will respond saying that the partition exists, without attempting the write. However, if there is no partition then the replica will go ahead and write the partition.

Here we'll show you two different lightweight transactions, one using an INSERT statement and the other using an UPDATE.

The first one is an insert statement where we are trying to create a new user in our users table. In the Chebotko diagram, you can see that the primary key is the user_id column. Our actual INSERT statement then is adding this new user with the user_id pmcfadin, and includes the first name, last name, etc. Again, note the keywords IF NOT EXISTS, which makes this a lightweight transaction instead of a normal insert.

What this allows us to do is to make sure that there isn't already a pmcfadin user in our table already, and avoid accidentally overwriting that user's data if he already does exist.

Now assuming that the user creation succeeded, let's take a look at an example of an update statement using a lightweight transaction. One possibility is with a procedure to allow users to reset their password if they forget it. The procedure may be that if a user requests for their password to be reset, then they need to use a reset token that will be provided through some secure channel. The lightweight transaction here will be helpful since it will allow us to check the value of a specific column, and only execute the update if the column matches the value provided in the lightweight transaction.

Therefore this first update statement is setting up a reset password token, made up as a UUID for the user name pmcfadin.

The actual update statement will then set the password only if the value stored in the reset_token column matches what is provided here in this lightweight transaction, which is identified by that IF keyword. Any column can be checked for the lightweight transaction, with the limitation being that it must be a column in the same row that's being updated. In this way, we can enforce that the password is changed only with the appropriate reset password token.

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.