MongoDB to Cassandra Migrations
MongoDB is a document oriented database from the company of the same name, formerly known as 10gen. It’s a schemaless database, meaning that you aren’t required to define tables or fields up front like you would a traditional RDBMS. Instead of tables, MongoDB uses the term collections. Like an RDBMS, it offers many options for indexing, such as geospatial, full text, and unique fields within a collection. It also offers a very flexible query model, but without JOIN’s like a RDBMS.
The database you choose depends on several factors, the type of data you store, the types of queries you do, your operations requirements and infrastructure, and the tools available for it. Opscenter and nodetool provide cluster wide monitoring and node level administration respectively. There are mature cassandra drivers for pretty much every widely used language, and there is a very active community that is accessible to newcomers.
While every database has some administrative work, especially as you scale it, Cassandra has the benefit of only a single server type, which simplifies the overhead on ops. Additionally, since the node configuration is homogeneous, understanding what happens when you query Cassandra is very easy to get your head around once you learn the basic concepts.
If you need your data to be replicated across multiple datacenters, or it’s likely that you will need to do this in the future, you would benefit from the multi-datacenter support that comes baked in to Cassandra. All you need to do is make Cassandra aware of the other datacenter, and it will begin replicating your data between them. Using query-level consistency, you can have you queries talk only to the local datacenter, or query all of them. Additionally, keyspaces can be set up to automatically partition data between racks in a datacenter.
Adding and removing capacity from Cassandra cluster’s is very simple. You simply boot up a new machine and tell Cassandra where the other nodes are, and it takes care of the rest. The new node will stream data from the other nodes and start processing requests. Removing a node is just as easy, you tell a node to remove itself from the ring, and it will stream it’s data to the other nodes and stop processing requests. There is no additional configuration required.
Another area Cassandra shines is in it’s predictable query performance. Writes are super fast, since all you’re doing is appending to the end of a log file. Reads also have good performance if you model your data correctly. Since Cassandra stores it’s rows sorted on disk, taking slices of data typically only takes one disk seek. This brings a huge gain in performance when compared to doing a seek for each object.
Because of the sequential ordering of data on disk, Cassandra is extremely efficient when it comes to working with time series data, since each datapoint is stored on disk sequentially. Taking arbitrary slices of time periods is very fast, and read time does not grow linearly with the number of records fetched. Reading the data points for a single day, or several months will have similar response times.
Since Cassandra uses a write-ahead log only during writes, writes are extremely fast, since each write is simply appended to a file. Cassandra doesn’t bother overwriting old data in SSTables. The mutations applied to a table are periodically flushed to disk, and reconciled during reads.
MongoDB and Cassandra have a common goal – high scalability – but they take very different architectural approaches to get there. In this section we’ll cover each of the different aspects of the two architectures and highlight how the two databases differ.
Mongo has a schemaless data model, allowing for wildy flexible through heterogeneous documents. A single collection can store a hundred thousand documents and all of them could have a different structure. Whether this benefits you will be determined by your application’s needs. In practice, it’s uncommon to be on this extreme. The analog to MongoDB’s documents in Cassandra is called a row.
With Cassandra 1.2 and 2.0, working with tables created via CQL is the preferred and encouraged way to store data. CQL looks similar to SQL, in that it’s used for defining schema as well as querying. Fields added via CQL are optional, but if they are provided they must match the type of the field. Unlike most RDBMS, altering the schema does not require waiting for all rows to be updated, the alter is applied immediately. In this regard, Cassandra enforces a more structured schema than a MongoDB document. Schema definitions are straightforward, and look like this:
And you can think of the rows in a CQL3 table looking something like this:
Both databases allow for complex fields like sets, lists, and maps. In the case of Cassandra, the data type stored in the collections must be defined ahead of time.
One way that Cassandra deviates from Mongo is that it offers much more control on how it’s data is laid out. Consider a scenario where we are interested in laying out large quantities of data that are related, like a friend’s list. Storing this in MongoDB can be a bit tricky – it’s not great at storing lists that are continuously growing. If you don’t store the friends in a single document, you end up risking pulling data from several different locations on disk (or on different servers) which can slow down your entire application. Under heavy load this will impact other queries being performed concurrently.
In Cassandra, we have available to us a concept called “wide rows”. Cassandra is actually built on top of a data structure called an SSTable, which allows for many columns in a row. If you define a primary key with more than one component, Cassandra will keep all the records together that share the first key. For example, in the below table, for a given user_id, all the friend_id’s will be stored together:
This enables us to have a lot of flexibility with our data model, and predictable performance even when dealing with large data sets.
Data in MongoDB is fully replicated in a replica set. Replica sets consist of a primary and multiple secondaries. When a master fails, the remaining replicas elect a new master. Replica sets are assembled into a cluster.
Cassandra has no concept of master, slave, or replica sets. Instead, it uses a hash ring to determine data ownership. A Cassandra cluster has a replication factor, which is the number of nodes a piece of data lives on. When storing any data, a hash function applied to the primary key of the row, which determines which node is responsible for the key (similar to a shard key in MongoDB).
DataStax offers good documentation explaining replication strategies.
Along with a flexible data model, MongoDB offers a ton of options for querying. In practice, some of the query flexibility can result in performance problems with large datasets and should be avoided. Specifically, any query which results in every shard being queried should be avoided in a production environment.
Instead of offering flexible queries, Cassandra users are encourage to model their data in such a way as to minimize the total number of queries through more careful planning and denormalization. In practice this should be done with any database that will have serious performance requirements in production. Cassandra offers basic secondary indexes but for the best performance it’s recommended to model your data as to use them infrequently.
This may seem counter intuitive – but the reality is that a small investment in planning up front to properly model your data will pay off by allowing you to avoid multiple refactorings as your database grows in size and in usage.
The issue of scalability is complex, since different teams have different requirements. For some, allowing a database to continue working despite hardware failure is a priority. For others, easily adding nodes to a running cluster to distribute your data and workload is most important.
Both MongoDB and Cassandra have options to allow for certain levels of database failure. Cassandra offers it’s replication factor to determine how many copies of a piece of data are stored, and MongoDB controls this via the number of machines in each replica set.
Adding nodes to a Cassandra cluster is as trivial as starting a new Cassandra box, and including some of the other machines in the cluster so it can join and “bootstrap”. This is the process by which Cassandra joins the ring and accepts data from the other machines.
To add capacity to a MongoDB cluster, you must first create a replica set, and then tell the cluster to add a new shard.
Due to it’s nature of only having one type of node, Cassandra is much easier to manage operationally. If you’re using a tool like Puppet or Chef (or Salt!), you only need to worry about a single machine.
The files that Cassandra writes to disk are immutable – they are never changed once they are written. They are periodically merged as an optimization.
Backups are trivial due to the nature of SSTables being immutable. If you’re working in an environment like AWS, you can simply copy any created SSTables into an S3 bucket.
Backups in Mongo are more complicated because the files are constantly being modified. As a result, you will need to take backups using disk based snapshots, or with dedicated backup servers.
The migration process is laid out into some best practices.
Migrate One Component at a Time
To reduce the complexity, it’s a good idea to break up any application migration into small parts. Try to migrate many small parts of your application one at a time, instead of one large migration. This has a few benefits. First, you have many projects, each with a small scope. This means that for each individual migration, you have a much smaller set of potential bugs that will occur. Second, if you start with some of the smaller and less complex parts of your application, by the time you get to the central parts, you will have a lot more experience under your belt, that will make avoiding problems easier.
The first step of a migration, is to translate your Mongo document structure into a Cassandra schema. Your Mongo docs will not translate 1-1 into Cassandra tables. Best practices for Cassandra data modeling are beyond the scope of this article, however, here are some patterns you should think about.
Tables that are read much more frequently than written should use leveled compaction. In short, leveled compaction minimizes the number of SSTables that a given row is stored on, and therefore, reduces the number of seeks you need to do when querying.
Objects that are typically queried using a mongo index, where multiple objects are returned, are usually good candidates for clustering keys, where you have multiple CQL rows grouped under a single primary key. When you group objects together like this, they are all stored together on disk, meaning you only need to do one seek per sstable. A benefit to structuring your tables this way, is that the number of disk seeks to read multiple rows remains constant.
Since Cassandra doesn’t have any concept of foreign keys or dbrefs, you’ll need to replace any dbrefs in your Mongo document. There are multiple ways to do this, each with it’s own strengths and weaknesses.
- In the best case, you can make the referenced objects part of your cassandra schema. This is good in cases where the referenced object is uniquely related to the parent object.
- Even if the referenced object is referenced by more than one ‘parent’ object, it may be worthwhile to duplicate the data onto each parent object, if the referenced object is read frequently. This approach has the downside of complicating the logic for updating the referenced object, since you’ll need to update the data on each of the rows it’s duplicated onto.
- If the referenced object is not needed on every read of the parent object, storing it’s id(s) on the Cassandra table, and performing a lookup on the application side is a good approach.
While your goal should be to replace as many of your Mongo queries with Cassandra queries as possible, that’s not always practical, especially if you’re doing any sort of complex querying on with Mongo (and if you’re reading this, you probably are). In those cases, it’s worth considering using a search application like Solr or ElasticSearch to supplement Cassandra. They are very good at that type of flexible querying.
Using time uuids as clustering keys is a powerful way to have both a unique id for a row, and have it sorted by a timestamp. If you’ve been using unique ids that are not uuid1s for Mongo objects, it may be worth looking into switching them to uuid1s. Unfortunately, there is a random component to uuid1s that usually makes this impractical. However, if you dig into your language’s uuid library, you can rework it to create a uuid1 from the object’s timestamp, and use the bytes from the unique id in place of the random bytes usually used. This is an advanced technique, which requires testing to make sure you’re doing everything correctly, but depending on your querying patterns, may be worth it.
The 2 branch migration strategy allow you to migrate your application seamlessly, while people are using it, without any downtime. Aside from the obvious benefits of no downtime, it’s also much less stressful, since you don’t have to worry about the migration being done by a certain time, you don’t need to do things like run migrations in the middle of the night, you can spend time checking the correctness of your migration script.
Make a branch off of your main development branch. In this branch, create the tables for the component you’re going to be migrating. Next, every time there you write to Mongo for the component you’re migrating, mirror the writes into the appropriate cassandra tables. Finally, write a migration script that iterates over all existing Mongo objects in the relevant collections, and writes them into Cassandra.
When you deploy the writes branch into production, your application will start writing data into both cassandra and Mongo. This is a good time to check the data being written to your table. Make sure that everything is being written as expected. If you messed anything up, it’s real easy to roll back, write some more tests, fix the problem, and try again. Once you’ve spot checked your data and everything looks good, run your migration script.
Before deploying your writes branch, you should write a reads branch. The reads branch is a branch off of the writes branch, and simply removes any reads or writes to Mongo for the component you’re migrating, and switches them over to Cassandra exclusively. In theory, this branch sounds simple, but in practice, takes a lot longer than the writes branch, since you’re replacing a bunch of (read) queries, and there may be small differences in the ways the 2 databases return data which your application doesn’t like. Hopefully your test coverage is good.
We strongly recommend finishing your reads branch before deploying your writes branch, for a few reasons. First, your reads branch validates the data modeling decisions made in the writes branch. If you’ve forgotten something, it’s a lot easier to just add it to your writes branch and merge it into your reads branch than it is to have to run yet another migration script, or worse, totally change the way your doing things.
Second, the less time your application is writing to two databases, the less chance there is for data inconsistency to occur. For instance, if your Cassandra is briefly unreachable due to a network problem, you’ll have data that exists in Mongo, but not Cassandra. In general, it’s best to minimize the time your application is doing weird stuff. You typically want to deploy writes, verify correctness, run your migration script, and then deploy reads shortly afterwards.
Use Cases and Migration Resources
The following offer real-world use cases where businesses have chosen to move off of MongoDB to Cassandra as well as extra migration resources to provide extra information to assist a migration.
Starting out with MongoDB to power Adobe’s Behance activity feed they met a scale wall; electing to migrate to Apache Cassandra for its linear horizontal scaling. Not only did this improve performance and fault-tolerance, but they cut their overall database costs by 70%.
Quandl’s collaboratively-curated portal to over 8 million time-series datasets is powered by Apache Cassandra. After evaluating numerous NoSQL platforms including MongoDB, Cassandra was deployed based on availability, scalability, write and read performance, failover tolerance, and strong user community.
Social Artisan’s path to Cassandra for their lead identification and content analysis platform was a rough one. Initially set up on MongoDB and then HBase they ran into scalability and management issues on each platform until finally deploying Apache Cassandra in their datacenter on 72 nodes.
Burt’s intelligence and analytics platform requires very high volume write capacity and rapid offloading of aged data. Cassandra was able to provide the speed and efficiency required on both fronts.
RelateIQ’s relationship management software was faced with a MongoDB bottleneck; they elected to migrate to Cassandra for a highly-available, cost effective scaling solution.
Shift offers a marketing communication platform that handles high velocity and high volume data which challenged their previous infrastructure that relied on MongoDB. They were able to provide many operational advantages with Cassandra and made the switch.
Shodan is a search engine that allows users to find specific types of devices that are accessible online and determine, among other things, how secure they are. Shodan turned to Cassandra after migrations from MySQL to MongoDB failed to be sustainable options due to the large data volumes being collected.
QAFE’s unique application framework offerings needed to find a large scale storage solution with predictable scalability, high availability and robustness of a mission-critical application. After reviewing options such as MongoDB and HBase, they went with Apache Cassandra for linear scalability and no single point of failure capabilities.
SiQueries offers their customers SaaS and for data analysis and visualization and chose Apache Cassandra over MongoDB to store their time series data thanks to superior scalability, read/write performance under massive load and manageability.
Gaming developer ecosystem Unity’s ad platform had trouble scaling MongoDB. Switching to Apache Cassandra their horizontal scaling problems went away, alongside a decrease in latency.
V2i’s solutions for solving problems related to structural dynamics relies on sensors monitoring physical structures at customer sites and had to steer away from MongoDB in favor of Cassandra in order to support their very write intensive application.
Cassandra embodies in its core the resilience and availability FullContact needed to continue serving their enterprise and internal customers even in the face of transient outages. Originally running their distributed search platform on MongoDB; FullContact migrated to Cassandra to handle their growing scaling requirements.
FullContact provides a suite of cloud-based contact management solutions for businesses, developers, and individuals. Their engineering team put together a blog post that outlines their migration off of MongoDB to Apache Cassandra.
Robert Thanh Parker , CEO and Founder at Rekko goes in depth on their Cassandra migration from MongoDB, yielding insane (10x) performance improvements.
SHIFT discusses their migration from MongoDB to Cassandra, as well as no downtime during your database migration. Topics will include reasons behind choosing to move to Cassandra, their zero downtime migration strategy, data modeling patterns, and the benefits of using CQL3.
From MongoDB to Cassandra: Architectural Lessons. This webinar covers more detail about SHIFT’s move from MongoDB to Cassandra including differences in their architectures and why Cassandra was a better choice for their application platform and analytics tool.
Patrick McFadin makes the case that Cassandra is the best NoSQL database option vs MongoDB. The architecture of Cassandra and its availability, consistency tuning and scalability make it a clear choice.
Blake is a Software Engineer at DataStax, a member of Tinkerpop and one of the primary authors of CQLengine, the python Cassandra CQL3 object mapper. You can follow Blake at @blakeeggleston.
Jon Haddad is an Apache Cassandra Technical Evangelist at DataStax. He is a Cassandra MVP and is also one of the maintainers of CQLengine. Jon has spent the last decade at various startups in the LA area. You can follow Jon at@rustyrazorblade.