Data Model Migration

DSE Version: 6.0


How do we migrate an existing data model to a new one when using DataStax Enterprise? In this unit, we will explore the different ways it can be done and go over some important points to keep in mind while doing so.


Although nobody really wants to change their data model, it is something that comes up from time to time. When it does, one common issue people run into is how to migrate their data from the existing data model into the new one. In this video, we'll explore different ways it can be done and review some of the important points to keep in mind when doing so.

Earlier, when we talked about designing the primary key for a table, there was a lot of emphasis on the idea that it cannot be changed. Once you've created a table with a primary key and started loading data, it's not simple to change it afterwards. However, when you come to the end of the road, there may be no alternative but to use a new primary key. What that means is that you'll need to create a completely new table with the new primary key and then migrate your data into that table. Assuming you complete that without any issues, you can either keep the old table or drop it, depending on your use case.
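As a sketch of what that looks like in CQL, suppose a table of user comments needs to be re-keyed so it can be queried by user and video together. The keyspace, table, and column names here are invented for illustration:

```sql
-- Hypothetical original table, partitioned by user only.
CREATE TABLE killrvideo.user_comments (
    userid    uuid,
    commentid timeuuid,
    comment   text,
    PRIMARY KEY ((userid), commentid)
);

-- The new query pattern needs the video as well, so a new table with a
-- new primary key is created, and the data is migrated into it.
CREATE TABLE killrvideo.user_comments_by_video (
    userid    uuid,
    videoid   uuid,
    commentid timeuuid,
    comment   text,
    PRIMARY KEY ((userid, videoid), commentid)
);
```

Note that the old table is left in place until the migration is verified.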

When creating tables, it's important to understand the effects of changing your schema. Often, a large data model can be stored entirely in a CQL file as a set of CREATE TABLE statements, which can then be executed easily. Be careful if you intend to make data model changes using a script, though. Cassandra handles schema changes separately from normal write operations; in particular, the schema is propagated to every node in the cluster. With sufficiently large clusters, this can be problematic if you try to execute schema changes as quickly as possible. There is no guaranteed order in which the schema propagates, so it is possible to run into race conditions that leave the schemas on different nodes out of sync, and an out-of-sync schema can cause all sorts of problems.

You can use the following as a guideline for safely making changes to your table schemas. First, execute only one CREATE or DROP statement at a time. Wait long enough for the schema change to propagate completely to every node in the cluster before moving on. You can use the nodetool describecluster command to check the schema version reported by each node and make sure they all match before issuing the next CREATE or DROP statement, and so on.
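That check can easily be scripted. The sketch below parses the "Schema versions" section that nodetool describecluster typically prints (one line per schema version, listing the nodes on that version); the exact output format can vary between versions, so treat this as a starting point:

```python
import re

def schema_in_agreement(describecluster_output: str) -> bool:
    """Return True if `nodetool describecluster` reports a single schema version.

    Assumes the usual output shape, where the 'Schema versions:' section
    lists one line per version, e.g.:
        86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]
    """
    in_section = False
    versions = []
    for line in describecluster_output.splitlines():
        if "Schema versions:" in line:
            in_section = True
            continue
        if in_section:
            # A version line is a UUID followed by the list of node addresses.
            match = re.match(r"\s*([0-9a-f-]{36}):\s*\[", line)
            if match:
                versions.append(match.group(1))
    return len(versions) == 1

# One version shared by all nodes means the change has fully propagated.
healthy = """Cluster Information:
\tName: Test Cluster
\tSchema versions:
\t\t86afa796-d883-3932-aa73-6b017cef0d19: [10.0.0.1, 10.0.0.2]
"""
```

A script could poll this after each CREATE or DROP and only continue once it returns True.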

With a new set of tables in place, you'll need to move your data somehow from the old table to the new one. Fortunately, DataStax Enterprise offers several options that allow you to do so. In particular, there is the DSE Analytics component, which makes use of the integrated Apache Spark platform. Since it allows you to perform analytics across your entire cluster, it is the perfect candidate for migrating data across tables, especially since the ETL operations needed to migrate the data are generally straightforward and shouldn't require any complicated coding.

If it's just the primary key being changed, migrating your data may be as simple as extracting and loading using the cassandraTable and saveToCassandra functions in Spark. All you need to do is specify the old table name and the new table name in the appropriate functions. Even if the tables have changed significantly, however, Spark will let you transform the old data so that it matches how it needs to be loaded into the new table. That would require getting more in-depth with Spark than I'd want to right now, but feel free to check out our DS320 course on DSE Analytics, which covers not only reading and writing data to tables, but also the different ways you can transform and manipulate that data.
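Whatever tool runs the ETL, the transform step itself is usually just a per-row function mapping the old layout to the new one. Here is a hypothetical sketch in plain Python (the table schemas and column names are invented): the old table kept all of a sensor's readings in one ever-growing partition, and the new table adds a per-day bucket to the partition key to cap partition size. In Spark, this is the kind of function you would map over the rows between the extract and the load.

```python
from datetime import datetime, timezone

def transform_row(old_row: dict) -> dict:
    """Reshape a row from the old table's layout into the new table's layout.

    The new table's partition key includes a 'day' bucket derived from the
    reading timestamp, so each partition holds at most one day of readings.
    """
    ts = old_row["reading_time"]          # a datetime in the old row
    return {
        "sensor_id": old_row["sensor_id"],
        "day": ts.date().isoformat(),     # new bucketing column, e.g. "2024-05-01"
        "reading_time": ts,
        "value": old_row["value"],
    }

old = {
    "sensor_id": "s-42",
    "reading_time": datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc),
    "value": 21.5,
}
new = transform_row(old)
```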

Another consideration is live data. What happens if you're trying to migrate to the new data model while keeping your application online and available? That means new updates are being made to the base table while the ETL process is running, so it's possible the new table will miss out on that new data. To handle this, it's important to plan ahead. Change the application so that it writes to both your base table and your new table. With any new updates applied to both tables, you can go ahead with the ETL while read requests still query the base table. Once it completes, the data in both tables should be exactly the same. You can then have your application switch over to using only the new table for both reads and writes. Finally, you have the option of dropping the old table if it's not still being used in other parts of your application. If you want to stay cautious, though, you can continue writing to the old table for some period of time, until you know for sure that the new table is working as expected.
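The dual-write step can be sketched as follows. This is a minimal illustration, not production code: FakeSession stands in for a DataStax driver session so the sketch runs anywhere, and the table and column names are hypothetical.

```python
class FakeSession:
    """Stand-in for a DataStax driver session, recording executed statements."""
    def __init__(self):
        self.executed = []

    def execute(self, statement, params=None):
        self.executed.append(statement)

def write_comment(session, comment: dict) -> None:
    # During migration, every application write goes to BOTH tables, so the
    # new table never misses updates made while the ETL is running.
    session.execute(
        "INSERT INTO user_comments (userid, commentid, comment) "
        "VALUES (%s, %s, %s)",
        (comment["userid"], comment["commentid"], comment["comment"]),
    )
    session.execute(
        "INSERT INTO user_comments_by_video (userid, videoid, commentid, comment) "
        "VALUES (%s, %s, %s, %s)",
        (comment["userid"], comment["videoid"], comment["commentid"],
         comment["comment"]),
    )

session = FakeSession()
write_comment(session, {"userid": "u1", "videoid": "v9",
                        "commentid": "c3", "comment": "nice video"})
```

Reads keep going to the old table until the ETL completes; only then does the application switch reads (and eventually writes) over to the new table.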

Although Spark is one of the most convenient ways to migrate your data, there are some alternatives. They may not necessarily be better, but in some cases they may make more sense. For one, you can use the cqlsh COPY command to export your data to a CSV file. With the data saved as CSV, you can transform it as needed for your new data model and then use COPY again to load it back into the database. Another way is to use the DataStax drivers to query the data, transform it as necessary on the application side, and then write it back into the new table. Yeah… maybe it's better to just stick with Spark if you can.
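The COPY round trip looks something like this inside cqlsh, again with hypothetical keyspace, table, and file names:

```sql
-- Export the old table to a CSV file.
COPY killrvideo.user_comments (userid, commentid, comment)
  TO 'user_comments.csv' WITH HEADER = true;

-- ...transform the CSV offline to match the new schema...

-- Import the transformed CSV into the new table.
COPY killrvideo.user_comments_by_video (userid, videoid, commentid, comment)
  FROM 'user_comments_new.csv' WITH HEADER = true;
```

Keep in mind that COPY is intended for small to moderate amounts of data; for large tables, Spark or a bulk loader is a better fit.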

So I'm saying you may need to change the primary key for a table, but what's the justification? Obviously, we wouldn't want to change anything that's working. However, if something is causing a problem, it's better to fix it sooner rather than later. Are certain partitions growing too large? Did new requirements get passed down for your application? Keep in mind that even if you do need to create a new table with a different primary key, the original table may still be used in other parts of the application, so don't drop it just yet.

With the new tables you create, don't forget to go through the analysis and validation steps again to make sure your new data model does not contain easily avoidable problems.
