MySQL to Cassandra Migrations
For 15+ years, Oracle’s MySQL has been a de facto infrastructure piece in web applications, enjoying wide adoption. This is for good reason: MySQL provides a solid relational database that enables companies to build systems that perform well in many use cases. Yet, even its strongest supporters admit that it is not architected to tackle the new wave of big data applications. Modern businesses that need to manage big data use cases are turning to Apache Cassandra to replace MySQL.
Migrating from MySQL to Cassandra: General Info
Is Cassandra Right for Your Application?
A new class of databases (sometimes referred to as “NoSQL”) have been developed and designed with 18+ years worth of lessons learned from traditional relational databases such as MySQL. Cassandra (and other distributed or “NoSQL” databases) aim to make the “right” tradeoffs to
ultimately deliver a database that provides the scalability, redundancy, and performance needed in todays applications. Although MySQL may have performed well for you in the past, new business requirements and/or the need to both scale and improve the reliability of your application might mean that MySQL is no longer the correct fit.
Before committing any further time towards a MySQL to Cassandra migration, ask yourself:
“Is MySQL currently preventing development of new features or providing acceptable uptime, reliability, and scalability for my users?”
“No”: Not only should you not migrate to Cassandra, but also you most likely should not be considering a migration to any alternative database. Migrating an application to a new database is a very difficult, time consuming, and error-prone process.
“Yes”: Then hopefully you’ve found a helpful resource to help guide and plan your migration from MySQL to Cassandra. There are many databases
available, all with their various advantages, disadvantages and tradeoffs. This article is not an attempt to portray Cassandra as a perfect solution; in fact, Cassandra’s tradeoffs, advantages, and disadvantages will be highlighted. Hopefully this will help you make a decision that is both informed and educated; not one motivated by marketing hype or change for the sake of change.
Don’t try to shove a square peg in a round hole!
- Cassandra is not a relational database.
- Cassandra is not a 100%/“drop-in” replacement for MySQL.
- Simply migrating existing code to Cassandra without modifying and rethinking your existing data model will not result in perfect uptime or fix performance bottlenecks for your application. In fact, it might make things worse.
The following overview of Cassandra terminology provides descriptions and their MySQL equivalent. The goal is to introduce the most basic terms and concepts required to get a basic understanding of Cassandra. To read more on the key terms and architecture of Cassandra you can find more detail in the Cassandra architecture documentation or for a higher level overview visit the “What is Cassandra” page on Planet Cassandra.
At a very high level, Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as aring. Nodes generally run on commodity hardware. Each Cassandra node in the cluster is responsible for and assigned a token range (which is essentially a range of hashes defined by a partitioner, which defaults to Murmur3Partitioner in Cassandra v1.2+). By default this hash range is defined with a maximum number of possible hash values ranging from 0 to 2^127-1.
Each update or addition of data contains a unique row key (also known as a primary key). The primary key is hashed to determine a replica (or node) responsible for a token range inclusive of a given row key. The data is then stored in the clustern times (where n is defined by the keyspace’s replication factor), or once on each replica responsible a given query’s row key. All nodes in Cassandra are peers and a client’s read or write request can be sent to any node in the cluster, regardless of whether or not that node actually contains and is responsible for the requested data. There is no concept of a master or slave, and nodes dynamically learn about each other and the state and health of other nodes thru the gossip protocol. A node that receives a client query is referred to as the coordinator for the client operation; it facilitates communication between all replica nodes responsible for the query (contacting at least n replica nodes to satisfy the query’s consistency level) and prepares and returns a result to the client.
Clients may interface with Cassandra for reads and writes via either the native binary protocol or Thrift. CQL queries can be made over both transports. As a general recommendation, if you are just getting started with Cassandra you should stick to the native binary protocol and CQL and ignore Thrift.
When a client performs a read or write request, the coordinator node contacts the number of required replicas to satisfy the consistency level included with each request. For example, if a read request is processed using QUORUM consistency, and the keyspace was created with a “replication factor” of 3, 2 of the 3 replicas for the requested data would be contacted, their results merged, and a single result returned to the client. With write requests, the coordinator node will send a write requests with all mutated columns to all replica nodes for a given row key.
When an update is processed – also known as a mutation — an entry is first added to the commit log, which ensures durability of the transaction. Next, it is also added to the memtable. A memtable is a bounded in memory write-back cache that contains recent writes which have not yet been flushed to an SSTable (a permanent, immutable, and serialized on disk copy of the tables data).
When updates cause a memtable to reach it’s configured maximum in-memory size, the memtable is flushed to an immutable SSTable, persisting the data from the memtable permanently on disk while making room for future updates. In the event of a crash or node failure, events are replayed from the commit log, which prevents the loss of any data from memtables that had not been flushed to disk prior to an unexpected event such as a power outage or crash.
Distributed logic and designs will inevitably cause an increase in complexity in application logic. When done right however, the rewards are obvious and easy to appreciate. Operationally, while it might be possible to get away with a single non-sharded MySQL instance installed via apt-get/emerge/yum/etc., operations with Cassandra need to be taken seriously to achieve desired performance and uptime of the cluster. Or, if you currently shard data across multiple MySQL instances, knowing that Cassandra deals with sharding and replication for you might be a huge benefit and upsell for Cassandra. But, unfortunately there is no such thing as a free lunch. For example, although Cassandra will remove all of your homegrown database abstraction and sharding code, you ultimately ended simply moving that logic from your code to Cassandra. Luckily, given the number of people and corporations of all sizes using Cassandra in production combined with an engaged and involved community, it’s fair to assume and argue that Cassandra’s equivalent of your MySQL sharding code will be better than your old homegrown solution.
Creating a thoughtful and conscious data model in Cassandra from the very beginning is very important. A bad data model can easily ruin and erase any of the benefits you want by migrating to Cassandra in the first place. With MySQL, the lack of a thoughtful or poor data model can frequently be worked around and accommodated thanks to the various relational database features (for example, the use of complex JOINS).
While these MySQL queries might be slow and expensive, given enough time and resources it’s possible to get the exact desired result from the dataset. With Cassandra, it is much harder to retroactively “fix” a poor data model. First, the lack of JOINS in Cassandra removes complex reads as a hacked solution to a bad data model. Additionally, thanks to the power and architecture of Cassandra, it becomes very easy to store more rows and data than imaginable with MySQL. With increased amounts of data stored, comes an increased complexity in successfully getting the exact data needed within the given performance boundaries required by your application. A SELECT query containing only 30 rows will return quickly and predictably. Performing a query over 5 million rows requires processing significantly more IO. Just as more data in MySQL made complex JOINS more difficult, accommodating for a Cassandra data model that requires the iteration over multiple nodes and rows will be slow, inefficient, and most likely not work at all. Obviously, faster database responses are always better in any application; so don’t let your data model be the cause of slow database latency in your application!
Denormalization is the concept that a data model should be designed so that a given query can be served from the results from one row and query. Instead of doing multiple reads from multiple tables and rows to gather all the required data for a response, instead modify your application logic to insert the required data multiple times into every row that might need it in the future. This way, all required data can be available in just one read which prevents multiple lookups.
There are lots of options to tweak in Cassandra. Much like turning the treble, bass, and volume nobs of your car’s sound system all to 11 won’t sound very good to your ears, it’s easy to do more harm than good when “optimizing” Cassandra and it’s many nobs and dials.
Options such as key cache and row cache are two great examples. In a MySQL world, much of the configuration tuning is spent on optimizing the various amounts of cache allocated. In the Cassandra world, these settings actually tend to decrease node and cluster stability. Cassandra is written in Java, and thus it must operate within the limitations of Java. One of the biggest considerations is Garbage Collection and the maximum size of the heap possible without running into large garbage collection related issues, which will crater the performance of Cassandra. As of JDK7 with CMS (the default in Cassandra 1.2.x and 2.0.x) the maximum recommended size of the heap is 8GB. This 8GB must be shared between all of the various Cassandra components. 2GB allocated to the key cache will (obviously) put another 2GB of pressure on the heap. Caches are an optimization not a requirement, so allocating more memory to caches should be considered as part of the big picture. If you can allocate the full 8GB to Cassandra, a suggestion would be to start with allocating no more than 768MB to the key cache (key_cache_size_in_mb) and 0MB to the row cache (row_cache_size_in_mb).
Another example is multithreaded_compaction. While this might seem like an obvious option to enable, in most cases leaving this option disabled can actually improve overall cluster stability and performance. In many cases, less is more.
Migration Plan Considerations
Sometimes the most difficult component of a migration is not in writing a set of reliable scripts to read from MySQL and insert into Cassandra, but trivial coding mistakes that can cause significant data discrepancies between the MySQL and Cassandra versions of the data.
Because migrating from MySQL to Cassandra will most likely require a change in your data model, the logic required to “convert” your relational MySQL data to it’s de-normalized form is the hardest part of the migration and certainly has the biggest risk.
Treat your migration scripts and logic not as one-off instances, but production quality code that can be run in any order, at any time. Mistakes in migration logic that result in an inconsistent version of the migrated data in Cassandra most likely will have a much greater impact than other dataset migration related bugs.
Regardless of your migration strategy, in almost all cases you will have to perform an initial bulk import of your existing MySQL data into Cassandra. While it might be tempting to simply iterate over every MySQL result and then insert that result one mutation at a time into Cassandra, a more efficient way is to use the Cassandra Bulk Loader. At a high level, the Bulk Loader requires you to create a CSV file containing all of the rows and columns that need to be loaded into Cassandra. Using the Java class SSTableSimpleUnsortedWriter, you can create an SSTable from your CSV file, which can then be loaded directly into Cassandra using SSTableloader.
For more details and code samples reference the article at http://www.datastax.com/dev/blog/bulk-loading
Sync Data Method:
When migrating to Cassandra and choosing a new data model might significantly increase your database workload. Alternatively, you might still need a live dataset in MySQL after the initial migration for legacy scripts that have not yet been migrated to use Cassandra.
Syncing from MySQL to Cassandra
In some cases it might not be practicable to add Cassandra to a legacy application. In this case it might be necessary to have an external process sync data from MySQL to Cassandra while running both new and old logic in parallel.
Add a timestamp column to the MySQL table to be synced. With each update to MySQL also update the timestamp with the last updated time. At a scheduled interval then do a SELECT query from all MySQL shards where the last updated timestamp is greater than or equal to the time your last sync started.
Syncing from Cassandra back to MySQL
Some data models will be hard to sync from Cassandra back to MySQL (for example time series data). However, rows containing more de-normalized
“metadata”-like information can be synced.
What won’t work: Creating a sync script that executes via cron every n minutes and attempts to do a SELECT * FROM TABLE from Cassandra (and
then update and insert all of those records into MySQL) is a recipe for failure. Inherent to Cassandra’s design is that data is sharded across multiple nodes by a hash of it’s key. Performing a SELECT * query is a Cassandra anti-pattern and should be avoided. Iterating through every key across all nodes and returning a single paged dataset is both inefficient and impractical.
Implement a queue that your application additionally writes to when it modifies a row in Cassandra. Have a script consume from this queue and de-duplicate the modified keys on a time interval and then bulk insert updates into MySQL.
If the data can be updated less frequently into MySQL, you could write a Hadoop Map/Reduce job that iterates over the column families that you need to sync. This solution gives a practicable and reproducible way to iterate through all keys in a column family. Using this approach as an additional sanity option to resolve missed updates from other incremental sync options.
Another option if you can afford a greater delay in the delta between updates from Cassandra back to MySQL is to use a tool such as SSTable2JSON to dump a column families SSTables into a JSON format, which can then be parsed and then used to update MySQL. This is a pretty heavy-handed method. Additionally, you’ll have to write logic to ensure you dump the SSTables from all nodes to get the entire column family.
Write Twice and Forget Method:
If you are able to modify your existing application to also interface with Cassandra, you can initially start your migration by writing database updates twice, once to MySQL and an additional time to Cassandra. Once you have all new updates being written to both MySQL and Cassandra, you can run a migration script that pages through all your existing MySQL data and inserts those records into Cassandra.
Initially, you might want to implement this second write to Cassandra as a completely non-blocking, write and forget, operation. If you experience initial issues during your Cassandra deployment, make sure not to impact your existing application when Cassandra is down.
Once you are satisfied with the fire-and-forget writes, you can slowly modify your application logic to start performing reads from Cassandra instead of MySQL. Thanks to the dual writes, if you run into issues, simply revert back to doing reads from MySQL.
Use Cases and Migration Resources
AOL migrated their article index, in use for several AOL technologies form MySQL. The result was an 8X increase in writes, and considering the move to Cassandra as a “big win”.
Coursera was experiencing unexpected downtime, due to the RDBMS’ single point of failure. In addition, Cassandra has enabled Coursera to become more dynamic; introducing their over 9 million users to an always available, on-demand course system.
Mahalo’s search technology was forced to move off of MySQL to Cassandra as their primary data store in order to realize lower costs and higher performance and scalability.
Pantheon Systems, offering a platform for Drupal websites in the cloud, migrated to Cassandra primarily for greater scalability and ease of use.
Scoop.it’s content curation publishing platform experienced the limitations of MySQL for handling their data growth and moved to Apache Cassandra for scalability and requirement of no downtime.
Ampush’s migration from MySQL to Cassandra due to their increase in data volume, high availability and performance requirements which only Cassandra could satisfy.
Barracudna Networks were not able to monitor customer threats in real-time with MySQL and went to Cassandra for the scalability and availability benefits.
Hailo has leveraged Cassandra to build one of the most successful startups in European history. This presentation looks at how Hailo grew from a simple MySQL-backed infrastructure to a resilient Cassandra-backed system running in three data centers globally.
Ooyala chose Apache Cassandra for its elastic scalability and high performance – especially when their MySQL environment was not meeting customer service levels – to help their customers take a more strategic approach when delivering a digital video experience.
AppsSavvy’s targeted advertising delivery solution moved from MySQL to Cassandra for increased scalability and performance under load.
Zoosk’s persistent notification system was moved off of MySQL and onto Apache Cassandra because it is a superior database for their high volume of writes of time series data.
Agentis Energy had to move to Cassandra once the scale of their data became unmanageable on MySQL as they now store over 15 billion records of time series usage energy usage data.
Whitepaper: Why Migrate From MySQL to Cassandra? By Robin Schumacher
This whitepaper discusses the ‘why’ and ‘how’ to migrate from MySQL to Cassandra as well as what a good migration candidate looks like.
Hindsight is 20/20: MySQL to Cassandra. This webinar offers a brief intro to how Barracuda Networks uses Cassandra and the ways in which they are replacing their MySQL infrastructure, with Cassandra including lessons learned. A slideshare from this presentation is available as well: Hindsight is 20/20: MySQL to Cassandra
5 lessons learned by Zoosk for moving persistent notifications from MySQL to Apache Cassandra in order to support very high volumes of write while minimizing write latency.
Michael Kjellman is a San Francisco based Software Engineer. Michael works across multiple products, technologies, and languages. He primarily works on Barracuda’s spam infrastructure and web filter classification data. Follow him on Twitter at@mkjellman.