Course Introduction

Welcome to DataStax Academy. My name's Patrick McFadin, and this is Cassandra Core Concepts. In this course we have a few objectives.

The first one is we're gonna explain this thing called KillrVideo and understand its domain. This is the example application that you'll be using throughout this course.

Next, we're gonna install and start Cassandra. Well, duh, because this is the Cassandra course, right? So yes, you're gonna have a running Cassandra node. We're gonna create tables and store and retrieve data, which is pretty important for a database, and by the end of this you're gonna know how to do that on that running Cassandra node you've got going.

And we're gonna understand the Cassandra data model. Now, this won't be an in-depth data modeling class, but you're gonna understand enough to be a little dangerous and to make it work for you.

You're also gonna understand Cassandra's architecture, and I think this is one of the most interesting parts of Cassandra. The architecture is very elegant, and it's central to how everything works; understanding it is really important if you wanna excel at using Apache Cassandra.

KillrVideo Introduction

Hi there, I'm Patrick McFadin. Let's talk about KillrVideo. Now, what is this? This is an example application, the one we'll be using for all our examples on data modeling, et cetera. It is a fully developed example, and I'm going to show you a little bit about it.

So imagine, if you will, KillrVideo is a website, and it's part of a company, and it could be your website. It's a video sharing website. Now, you may have seen a couple of these on the internet, but this is ours, and it actually does exist. It sits up on killrvideo.com, and we have lots of different things we can work with here to show you good examples of how to use Apache Cassandra. It's a pretty normal-looking production website. So there are things like users; there are videos, which are objects that the users own; comments on videos, which is metadata; and playback data, which is time series data. All of this combined gives you a pretty comprehensive view of how to use Apache Cassandra.

Here are some of the problems. These are probably some of the things you're faced with. For instance, scalability. Who knows, our website could be a big one, and we could just need a lot of scale. Think about YouTube. Have you ever heard the stories about how they were hours away from running out of space? Well, what if you're like that too? Wouldn't it be nice to just add more nodes to your system and keep trucking?

Then reliability. I think at this point, if you're down at any point, you're probably dead, so reliability is critical. When you're offline, your users are going to run to another site, so we want to make sure KillrVideo is online all the time.

And ease of use. We have to make it so that users want to use it, so we have to give them the features they'll actually use, and those have to be enabled by Apache Cassandra.

Now, I may have built this several times in my life using a relational database. Of course I did, because I've been doing this for a long time. You have a lot of choices there, and they're probably pre-installed in the operating system that you're using. But look at the problems that we're trying to get away from. A single point of failure is a big one: I have one database and a whole bunch of app servers. Well, what if that one database goes down? I'm completely down. And if I tried to shard, well, that solves a scaling thing, great, but it means that now I have little tiny single points of failure everywhere. Or if I use a replication scheme, that keeps copies of my data, but you still have failover, and those reliability issues are really big. And if I want to keep data across multiple data centers, that's important too.

This is where we're sitting in the 21st century. If you are not in two data centers or more, you're probably going to have some downtime, and you can't compete directly with the people that are. Especially with video solutions like this, there's going to be a lot of competition, so give yourself a leg up from the beginning.

Why Apache Cassandra? Well, there's no single point of failure in this system, and this is what's going to give you the best scaling and the best reliability out of any database that's out there right now. So we have good linear scaling. We have always-on, which means that if you take a node out, or you update a node and have to reboot it, your system is not offline, and this is what we want for our KillrVideo website. A hundred percent uptime. Awesome.

That's what I want all the time. More importantly, I know I have users all around the world, so I need a replication scheme that's not only going to give me good reliability, meaning that if I lose a data center I'm still online, but also puts the data closer to my clients. If I have users in Europe or in India, and I want to look at them and say, hey, you're going to have the best experience possible, it's not going to be that way if they have to go all the way back to the US. The latency between the US and India is around two hundred milliseconds. I'm not going to put that in front of them. I want to give those users a great experience by putting the data closer to them.

Quick Wins: Install and Start Cassandra

I'm Patrick McFadin, and let's talk about installing and starting Apache Cassandra. Two steps: install and start. Okay, well, those are the easy parts, right? But installing, let's walk through that.

First of all, you have some options, and those options come down to packages and where you get them.

The first one is the DataStax distribution of Apache Cassandra. This is a community version taken from the Apache website and packaged up as things like an RPM or a DEB file; it's free for development and production, but it's just Apache Cassandra.

The next is DataStax Enterprise. Now DataStax Enterprise ships with many parts, including OpsCenter, DevCenter, and drivers. DevCenter and the drivers are free to use, of course, and the download itself is free for development use. You have to get a license for production deployments, but what you're getting with DataStax Enterprise is a lot more: search, analytics, and now graph.

Open source Cassandra is if you want to go directly to the Apache website, and this is a straight download from the source, where it's all built. Keep in mind, of course, that DataStax does not own Cassandra. It is an Apache project, and that's where it lives; that's its home. The simplest way to do this is a tarball install, and we'll walk through it. You just download and extract that tar.gz file, and it extracts an entire directory tree of all the stuff you need to run Apache Cassandra. The configuration is the first stop, but all the defaults in the cassandra.yaml will be fine for a normal, basic installation; there is nothing to worry about in there. To start Cassandra, you go into the bin directory and type cassandra. It will start just fine, and you're going to look in the log file for a state jump to NORMAL. Now this is perfect for running on, say, a laptop for development. We're going to get into a lot more configuration about how to set up for production environments, but for now, this should get the job done.
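Here's roughly what that looks like on the command line. This is a minimal sketch; the exact version number and mirror URL are assumptions, so grab the current release from the Apache site.

    # Download and extract the tarball (version and URL are illustrative)
    curl -OL https://archive.apache.org/dist/cassandra/3.11.4/apache-cassandra-3.11.4-bin.tar.gz
    tar -xzf apache-cassandra-3.11.4-bin.tar.gz
    cd apache-cassandra-3.11.4

    # Start Cassandra in the foreground; defaults in conf/cassandra.yaml are fine for a laptop
    bin/cassandra -f

    # In another terminal, watch for the node to announce it's ready
    grep -i "state jump to normal" logs/system.log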

Quick Wins: CQL

I'm Patrick McFadin, and let's talk about CQL, or Cassandra Query Language. So, first of all, CQL has some really basic parts to it, and CQL itself is really the DDL and the query language for Apache Cassandra. So this is where you create things like keyspaces and tables, and there are some core datatypes we're going to look at.

So Cassandra Query Language, as I mentioned, that's what we call CQL, is very similar to SQL, but don't get confused. You can't do things like joins, and other things like group-bys and aggregates that SQL can do, but it does look very similar, and it gives you a very familiar syntax to work from when you're working with this query language. So doing a SELECT gives you back columns. You have a FROM and a WHERE where you can provide predicates to narrow that data down. So it makes it easy, as a programmer, to start working with Cassandra immediately and be productive.

First stop in the data model, of course, is keyspaces, and keyspaces are a very simple concept. They are a container, similar to a database schema or a database in, say, Oracle or MySQL, but the important thing about a keyspace is that its primary purpose is storing the replication information. This is where you put in how many copies of your data you want and where you want them. We will get into this much later in a different module, but just know this is where that is stored.

There is a USE command. Once you've created keyspaces, this is how you switch between them. Think of them as logical containers for all your tables.
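As a quick illustration, here's what creating and switching to a keyspace looks like in CQL. The keyspace name and replication settings are just examples; SimpleStrategy with one replica is only sensible for a single development node.

    CREATE KEYSPACE killrvideo
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

    USE killrvideo;

Speaking of tables, here's one now.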

So, a table is probably where you're going to spend most of your time, because this is what's going to be storing your data. Tables contain data, and they have columns and a primary key, and the primary key gives you the uniqueness of that record. There are some very basic datatypes included in those table create statements, for instance text, which is always UTF8, not ASCII, and integer and timestamp. These are there for a simple reason. First, to marshal your data, to make sure your data is set correctly: if you say integer, that means you can't put in a string, and if it's a timestamp, you can't put in anything but a timestamp. But it also gives a sort order inside the storage engine, and that's another module we'll dig into quite in depth.
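Here's a minimal sketch of a table create statement using those basic types; the table and column names are hypothetical, not the actual KillrVideo schema.

    CREATE TABLE users (
      id      int,        -- marshaled as an integer: a string will be rejected
      name    text,       -- text is always UTF8
      state   text,
      created timestamp,
      PRIMARY KEY (id)    -- the primary key makes each record unique
    );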

There are very specific types that aren't really used in a lot of other databases, but are really critical for how Apache Cassandra works: the UUID and TIMEUUID. Now, the UUID is a universally unique identifier. If you're using Microsoft technologies, that's also known as a GUID, and I think it's funny, because the GUID, which is pretty similar to the UUID, is Microsoft's attempt at saying, oh, this isn't quite universal, it's only global. Fine. But it's supposed to be a unique number, and it's a 128-bit number, so it's going to be really big.

Now, I have an example here on the screen that shows what it should look like. If you look at a TIMEUUID, it looks really similar to a UUID, but it has a slightly different function. What it does is take that 128-bit number, replace the first 64 bits with a timestamp, and use the rest as the unique part of that record. Why is that useful? Well, if I have two of the exact same timestamps, I want to make sure they're unique, so they can be stored separately. TIMEUUIDs give you no collisions on the same timestamp, and that is really critical when you don't want to lose data, something we talk a lot about in data modeling.

So, why are we using UUID and TIMEUUID? Well, it's very simple: in a distributed system, it's almost impossible to generate a sequence. In a relational database, I'd probably use a sequence to generate an ID number or something like that, but in a distributed system that doesn't work. So we use a TIMEUUID, putting a timestamp and a unique part together, or just a UUID proper, and you get a guaranteed unique number that's used as an ID number. And if you look at the KillrVideo examples, yes, UUID is used all over.
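As a sketch of how that looks in practice, CQL can generate these values server-side. The videos table here is hypothetical; uuid() makes a random UUID and now() makes a TIMEUUID from the current time.

    CREATE TABLE videos (
      video_id uuid,      -- 128-bit universally unique identifier
      added    timeuuid,  -- timestamp plus unique bits: no collisions on the same timestamp
      title    text,
      PRIMARY KEY (video_id)
    );

    INSERT INTO videos (video_id, added, title)
    VALUES (uuid(), now(), 'My first video');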

Now, for the course and for the modules that we show you, we usually replace the UUID with an INT, simply because that number is a lot easier to look at. I mean, a UUID is not super friendly. It's great for computers, terrible for humans. So, just know that we're going to use integers instead of UUIDs in some cases.

The first stop in CQL, and one of the commands you need to know, is of course getting data into the system. This is the INSERT command. Doesn't it look familiar? It's very familiar, I would hope, if you've ever worked with a database. You say INSERT, here are the fields I want to put in, here are some values. Now, the important thing here is that you have to include the primary key. And that's it.
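A minimal example against the hypothetical users table from earlier; note the primary key column, id, is included.

    INSERT INTO users (id, name, state)
    VALUES (1, 'Ted', 'TX');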

The SELECT syntax is very similar to relational syntax; however, it doesn't allow for complex WHERE clauses or things such as joins. But you can do a SELECT star, you can select some columns, you can put in a user ID equals this really large, ugly UUID. Don't worry, we'll use more INTs later on, but all of this looks really familiar and is easy for a programmer to work with. The COPY command is included in the CQL shell. This is for the import and export of CSV data. So, a very simple command with very few parameters, but it is very effective at bringing in data, say static data or test data, and outputting CQL tables directly to a CSV file.
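Here's a sketch of both, again using the hypothetical users table; COPY runs inside the cqlsh shell.

    SELECT * FROM users;
    SELECT name, state FROM users WHERE id = 1;

    COPY users (id, name, state) TO 'users.csv';    -- export a table to CSV
    COPY users (id, name, state) FROM 'users.csv';  -- import CSV data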

There is a lot more to CQL and data modeling. We just covered the very, very basics. We have an entire course called DS220 Data Modeling, which goes through everything in aching detail, and I recommend you take that after you do this course so you have a complete picture. But this is where you really want to start if you're building applications with Cassandra. Start with this course and move on to data modeling, and you'll have a great shot at being successful on your next project.

Cassandra Data Model: Partitions

I'm Patrick McFadin. Let's talk about partitions. What kind of partitions? Well, partitions of data. And how best to start? Well, of course, probably with what you're most familiar with: relational data.

So here's a relational table. It's a table of data that sits on disk. As we insert data in there, we're gonna have a one, a two, a three, et cetera, building up a lot of data. So as we insert data into the table, everything's in insert order, by ID. That's how relational databases work. Now that's on a single system, a single disk. Great. However, things start to fall apart when you start sharding and doing things like that.

So Cassandra does it a very different way. What we're gonna do is take that same type of data, the same data, actually, in this case, and break it up into parts. And we're gonna call those partitions. Partitions are an ordering scheme on disk; they allow you to take certain sets of data and order them. Now if you look at the order we've created here, these are partitioned by state. So we have a partition, say, for Texas, for New York, and for California. And each one of those is colocated. Now they can be on a single system or multiple systems, but they're a grouping. A partition is a grouping of data.

The partitioner's job is to take a partition key and create a token. So when we take Texas, for instance, and send it through the partitioner, you'll get a 24. New York gets you 58, and California gets you 83. So when we're done, we've translated each partition key into a token value that falls inside one of those token ranges, and each one of these is assigned to a different partition of data.

How do we determine our partitioning scheme? Well, we put it in a primary key. The first value in a primary key is the partition key. And if we're only including state, well, that's not quite unique enough. That means that when we insert a bunch of records, if they're all Texas, then at the end we're just gonna have one record. So we can add something: say, ID. And the ID, because we know that that's unique, we'll just take that and add it to the primary key, like so. And then we'll have a unique primary key for each record. In this case, in the primary key, state will be what's known as the partition key, and ID is this new concept called a clustering column.
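In CQL, that primary key looks like this; the table name is illustrative. The inner parentheses mark state as the partition key, and id is the clustering column.

    CREATE TABLE users_by_state (
      state text,
      id    int,
      name  text,
      PRIMARY KEY ((state), id)  -- partition key: state; clustering column: id
    );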

So how do partitions relate to how data is stored in our cluster? This is the next step in knowing where our data is. Each partition, because it has a partition key, hashes to a token, and that token fits inside a range of tokens that belongs to a node. Each partition is probably gonna live on a separate node. Now it's random, so we don't know which node it is, but in our example, they're gonna be on three different nodes.

Cassandra Data Model: Clustering Columns

Hi, I'm Patrick McFadin and this is clustering columns. Now, I feel like this is one of the most important concepts to understand for both data modeling and understanding how Apache Cassandra works. We'll start out back here with our partition. Just one partition of data.

In this case, state is equal to Texas. Now, if you look at our example about how we set up the data model, the user table had state as a partition key. Going through the partitioner, you get a token value of 83. So, that means, now, all of our data is co-located together. That's all the data for people from the State of Texas. So we've created this grouping.

So far we just have a simple partition key, and if you look at what we've done here, we're gonna add in the state and the city. Now the city part is gonna be our sorting order. So the state is the partition key, and the city is the clustering column. The clustering columns are the very first thing after the partition key and everything after that as well. You'll notice we put parentheses or brackets around state to indicate that that's the partition key. Now we have clustering columns to enhance our data model, and that's gonna add sorting to the data as it sits on disk.

So we've added city, so what does that look like? Now we've reordered our data. We start with Texas, and if you look at the state, that's all gonna be the same, but the cities are different, and now we've alpha-ordered them by UTF8 sorting order. I'm gonna change that to name, just for demonstrative purposes. You'll see, oh wow, that's great, that's in alpha order as well, ascending. But I don't think that's a very good thing to sort with alone, so I'm gonna add city first, then name. What that's gonna do is create a sort order in a chain: the cities are sorted, and then within each city the names are sorted. And then finally I add ID for uniqueness. So even if two records collide on state, city, and name, that collision's gonna be avoided, because we have unique IDs. So how do you work with that?
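Here's what that chained sort order looks like as a table definition, revising the earlier illustrative table.

    CREATE TABLE users_by_state (
      state text,
      city  text,
      name  text,
      id    int,
      PRIMARY KEY ((state), city, name, id)  -- sorted by city, then name; id guarantees uniqueness
    );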

So, clustering columns are, of course, part of the primary key; they create the order, and they also create uniqueness. But now you have to query the data in a special way, and that querying has some very specific rules. First of all, you always have to provide a partition key, cause that gives you your data locality: where in this cluster is my data? The partition key gives you that.

Now, you're sitting at a single server, so let's look at the data that's sitting on disk. Well, when you do a query with a clustering column, you can use an equality, or a greater-than or less-than inequality query. If you do that, you have to include the equalities first, then the inequality. These are very simple rules. If you do more advanced indexing on the data, which we'll get to in another module, then of course you can use other types of queries, but for a basic Cassandra data model, this is how you operate with it.
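A few sketches of those rules against our illustrative users_by_state table; the values are made up.

    -- Always provide the partition key:
    SELECT * FROM users_by_state WHERE state = 'TX';

    -- Equality on earlier clustering columns first, then an inequality:
    SELECT * FROM users_by_state
     WHERE state = 'TX' AND city = 'Austin' AND name > 'M';

    -- Not allowed with this model: an inequality on city combined with an equality on name.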

What if you need to change the ordering of the data? Now, in our example, we had city, name, ID, and those are all great in the order that they're in, but if you want to change that, you can do it inside your data model. We are given a clause when we create the table, "clustering order by", and that gives us some really neat flexibility.

Now, the default sort order for any of these is ascending. If we want to flip that around to descending, we can do that in our data model, as shown below. What does that give us? We do the sorting on disk before we query it, so the data is in the natural order that we want, and we can control that, and that makes it much more efficient. You could change the order later if you need to, using an ORDER BY, but this eliminates that step by storing the data in the order that you want. What this does is open up a lot of interesting use cases and data models. We won't cover all of those in this course, but you will see them in a lot of places. The KillrVideo data model uses this quite often, and you'll see it in there.
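For instance, a hypothetical variant of our table that stores cities descending while keeping the rest ascending:

    CREATE TABLE users_by_state (
      state text,
      city  text,
      name  text,
      id    int,
      PRIMARY KEY ((state), city, name, id)
    ) WITH CLUSTERING ORDER BY (city DESC, name ASC, id ASC);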

Finally, ALLOW FILTERING. Now, if you're feeling the need to really stress out your cluster and do some very bad things, you have a command that allows you to do that. Of course, I'm joking around a little bit, but ALLOW FILTERING is only there for very specific things. So what does it do? Well, if you don't include a partition key, that means you now have to look at the entire cluster. If you think a full table scan is bad, try a full cluster scan. It will not be good, and if you have a large cluster, it'll be even worse. ALLOW FILTERING basically drops any chance of CQL stopping you from doing something stupid; you've just pulled out the safety stop. So I would say: don't ever use it. If you find a use case for it, think twice before you do. Really, ALLOW FILTERING is an option, but it should be a thing that makes you say, maybe I should use something like Spark instead of CQL. When are you going to run into it? I will say early on in your experience, you're probably gonna try to do a SELECT star from some table. If you have more than 10,000 records in there, an error will be produced saying, hey, if you want more than that, use ALLOW FILTERING. Yes, that's right, part of the error message does tell you to use this. I'm telling you: don't, because that's not a proper use of Cassandra.
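Just so you recognize it, here's the shape of such a query against our illustrative table; it forces a scan because no partition key is given.

    -- Hits every node in the cluster; think twice (or reach for something like Spark):
    SELECT * FROM users_by_state WHERE name = 'Ted' ALLOW FILTERING;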

Application Connectivity: Drivers

Hi, I'm Patrick McFadin, and let's talk about drivers and connecting your application to Apache Cassandra. So, first things first, drivers. Why are they here? Cassandra the database would be pretty bored if you didn't talk to it, and let's face it, this is why you're probably here. You're a developer, or maybe an architect, maybe an operator. But, hey developers, I'm talking to you this time because when you connect a driver to Cassandra, you're gonna learn a little bit about how it works and why it works. And importantly, some of the nuance that makes the driver for Apache Cassandra different from a lot of other database drivers.

Now DataStax, as a company, has been open sourcing drivers for a variety of languages for a long time: Java, Python, C#, C++, Node.js, Ruby, PHP, and I'm sure there are more coming. That's important for us at DataStax because we want to create some commonality. The API is very similar across multiple languages, and this is what we try to hold together with our drivers. By the API, I mean that whenever you use something like Python or Java, there should be some similarities. For instance, when you connect to a cluster, you get a cluster object. You just connect to the cluster; you're not connecting to a single server. We want you to think in terms of a cluster. And if you've worked heavily with relational databases with, say, a JDBC driver, or a driver that's specific to a relational database, you know you have to think about things like which servers are online, load balancing, and how to do connection pools.

Well, our drivers are meant to abstract all that away and make it a lot easier, and a lot more fun, to just work with the data, instead of having to think about every database server involved in the whole thing. We also include a lot of policies, and those policies are really critical for running Apache Cassandra and using the driver with it, but these are all common across every driver. The same semantics across every single driver. So how do you do this? Well, we're going to use Python in this case, 'cause it's a pretty simple and easy-to-use language. Those of you out there who do not use Python, I'm sure you'll be able to follow along.

So, we create a cluster object. Now, as I mentioned before, when you say create a cluster, that's just a single thing. That means I'm gonna connect to this one entity, which could be multiple servers, five servers, two servers, what have you, but it's just a single cluster as far as your application is concerned. When we connect to the cluster, we get a session, and that session is what we're gonna work with inside of the driver. The session is what manages all those connections to the cluster, to all the different servers in there. This is really the critical part of the session object: it's maintaining connections to the cluster itself and keeping track of things that are happening. It listens to the cluster state that's spread by gossip, so when things are happening in your cluster, well, your driver can react to that.
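Here's a minimal sketch using the DataStax Python driver; the contact point addresses and keyspace name are made up, and the driver discovers the rest of the cluster on its own.

    from cassandra.cluster import Cluster

    # Contact points are just an entry point into the cluster
    cluster = Cluster(['10.0.0.1', '10.0.0.2'])

    # The session manages connections to all the nodes
    session = cluster.connect('killrvideo')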

Let's look at a really simple case. For instance, whenever you add or delete a node from the system. Well, are you going to have to go around and change configuration files and reupload those to every single app server? No, I want my driver to manage that for me, and that's what this does. It's listening for changes in the cluster, it will respond to those, it's also pretty intelligent about who's faster, who's slower, who's having trouble. So, it gives you a very consistent, but very powerful, performance characteristic inside the driver itself.

So, when we insert, let's say, a user right now, we can do a session execute. Now, what node does that go to? Not really sure, and from the driver's standpoint, that's all happening under the covers. I, as a programmer, don't get to choose that, 'cause that's not really something I need to know, and I don't have to worry about it. I'm just worried about running an insert command. What the driver will do is look at the partition key, use what's called a token aware policy to compute the right token, find out what node is responsible for that data, and avoid all that costly coordination.

Now, when I do a select, I can just select any of the data in the system, and it will give me the right information based on a partition key. But look how friendly this is. I can do a select, and what do I get back? I get a result set, which gives me rows and columns, and I can iterate over those, and that is really cool, 'cause that's what I want as a programmer. Now, this has been a very rudimentary view of the drivers. There are much more advanced ways to use the driver. For instance, we're using strings here; if you're trying to avoid a CQL injection attack, and yes, those could potentially exist, you'll want to use some of the more advanced features such as a query builder. But just keep in mind, this is a start. There is more to learn.
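A sketch of both calls with the Python driver, against our hypothetical users table; binding parameters separately, rather than concatenating strings, is also the easy first defense against injection.

    # The token-aware policy routes this straight to a replica for the partition key
    session.execute(
        "INSERT INTO users (id, name, state) VALUES (%s, %s, %s)",
        (1, 'Ted', 'TX'))

    # The result set iterates as rows with named columns
    rows = session.execute("SELECT name, state FROM users WHERE id = %s", (1,))
    for row in rows:
        print(row.name, row.state)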

Distributed Architecture: Node

I'm Patrick McFadin, let's talk about the node. It all starts with this, the lowly node. But I love that lowly node; it's part of a bigger system. Now of course, Cassandra is a clustered, distributed system, but a node is pretty important. It has a lot to do. So on this node, we're running on a server or a VM, and there's a JVM running a Java process. That Java process is Cassandra. Cassandra is written in Java, and that JVM is running on the node, all by itself. Now that node can run in the cloud, it can run in your on-premise data center, and it can run on a variety of disks. We always recommend local storage. If you have direct-attached storage, that'll work too. But what we don't recommend, ever, is running it on a SAN.

Using a SAN is a way for Apache Cassandra to fall down in a heap, probably make you cry, and hopefully not lose you your job, but maybe. Certain versions of EBS on Amazon are now supported, because they've moved on to the next generation. Just as a rule of thumb: if your disk has an ethernet cable, it's probably the wrong choice.

So what does a node do? Well, a node is responsible for the data that it stores. Cassandra is a distributed hash table, and a partition of that data sits on the node, so it can write data, it can read data, and all of those things happen on that single node. But if you look at the larger system, which is a cluster, each one of those nodes fulfills its own part of the problem. Much like ants in a forest: if they all work together, they can move an entire tree. Now how much can that single node do? Well, typically we think about three to five thousand transactions per second, reads and writes, per core that you have on that node. So if you have a lot of cores on there, you can get a lot of transactions going. And how much data? About one to three terabytes, SSD or rotational disk. Most times you can't find rotational disks anymore anyway, and that's great; we always recommend SSDs because they're so fast. If you want more on that node, there are ways to do it, but that takes a little bit more configuration and some more knowledge, and those are covered in a completely different course. Just as a rule of thumb: one to three terabytes per node. So how do we manage this system?

Well, we have a tool for the node, and it's called nodetool. Get it? So nodetool runs on the node and lets you do some very specific things for the node, but it also has some cluster stuff in there too. Like what kind of stuff? Well, for instance, you can do a nodetool info, and that shows you all the information about the running node itself, just that single node; it'll give you JVM statistics, that sort of thing. We have nodetool status, which is a clustering concept, because it'll show you the status of the single node plus all the other nodes in the cluster and how they relate. Now keep in mind, this is the current state of the cluster as that node sees it. So this is a short introduction to what a node is. And again, this is a very small piece of the big picture. The big picture is a cluster, and that's another module we're gonna talk about, but I want you to understand that small piece, the node. That node is so important because it does all the work. The next thing you should look at is how a cluster works, because the node participates in that.
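For reference, those two commands look like this, run from a shell on the node:

    nodetool info    # stats for this one node: uptime, load, JVM heap, and so on
    nodetool status  # this node's view of the whole cluster: up/down state, load, rack, DC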

Distributed Architecture: Ring

I'm Patrick McFadin, let's talk about the ring. Now this is a Cassandra ring, which means the cluster. And this is the most important part of how Cassandra works, because of course Cassandra is a clustering system. It doesn't like to be a single system. That's what you call a laptop.

Now why wouldn't you run it on a single node? Well, of course, this is about scale. And scaling is a hard problem. If you're running on a single node, you've got to be able to scale that single node. That means buying a bigger server. Oh boy! You can buy a really big server, but really, at the limit, you're going to find yourself with a six-U server and nowhere to put it. So how does this work? Well, as the pressure increases on the single node, what happens? It falls apart. I've run plenty of these servers in production, and when it falls apart, it's pretty epic. So how do we spread the load? We add a lot of nodes. And this is how Cassandra scales. You need more scale, you add more nodes. That's pretty cool, especially whenever you're out of capacity and you can just add more. Now if you're in a cloud environment, this is really handy, because then you're just renting more servers.

So how does this cluster work? Where does the data go? When we have this question, I have data, say, partition 59, where does that go? Well, you can write that data to any node in the cluster. Now, that's totally cool, because every node can act as a coordinator. And a coordinator's a very important role for each node. But how does it work?

Well, each node is gonna be assigned a different range of data. Those are called token ranges. And that range of data says: this data belongs to this node down here. So the coordinator can take that data and say: hey client, I've got this, I'm going to send it to the proper node. That node gets the data, and the acknowledgement gets sent back to the client.
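Here's a toy sketch of that routing idea in Python; this is not Cassandra's real partitioner math, just the 0-to-99 example ranges from this module.

    # Illustrative token ranges: which node owns which slice of the ring
    ranges = {(0, 33): 'node1', (34, 66): 'node2', (67, 99): 'node3'}

    def owner(token):
        """Return the node whose range contains this token."""
        for (low, high), node in ranges.items():
            if low <= token <= high:
                return node

    print(owner(59))  # -> 'node2': any coordinator forwards partition 59 there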

So how much data are you going to store in this cluster? We just used a range of a hundred values. That's not enough, because when you're storing a lot of data, you need a lot of values. So, a hundred? That's just an example for this video. The reality is we use the range from negative two to the sixty-third power (-2^63) all the way around to two to the sixty-third power minus one (2^63 - 1). What is that? I can't even pronounce a number that big, but it is huge. And it guarantees you have enough room to store about as much data as you're ever going to have, and definitely as much as I'm going to have, within our careers. So we'll retire nicely and still have plenty of room to store data.

Now, those token values are spread across the ring. How does it distribute them? Well, the partitioner is important in this game, because the partitioner is the one that decides how your data is distributed. If we didn't do it right, we'd wind up with hot spot problems: if you had all these lumps of data, what's gonna happen is some nodes are going to get fatter and fatter and then eventually burn up, and the other two nodes are going to be pretty happy and wondering where all their data is. Bad plan, let's not do that. Let's use a proper partitioner. So in this case, the proper partitioner that we use, Murmur3 or MD5, is going to do a nice distribution of that data in a random fashion, but it's also an even distribution.

So joining a cluster. This is the coolest part of Cassandra when it comes to cluster operations, because it doesn't mean downtime. That is so cool. When you do this the first time, you will be blown away. So what happens when a Cassandra node joins the ring? Well, the first thing it does is gossip out to those seed nodes: hey, I'm new here in the neighborhood, what do I need to know? The other nodes are gonna recalculate the token ranges, figure out where that new node is gonna fit, and start streaming their data down to it, so that new node will start receiving data. Now, there are four states for each node: joining, leaving, up, and down. This one will be joining, and until it's fully joined into the system it's not ready for reads; it's still receiving data. But this is a pretty critical part of operations, because you can join a node to the cluster online.

Now you would think, with all this data flying around, you need to have a better plan in your drivers. And you're right, because the drivers are a part of the game. When a driver connects to a Cassandra cluster, it gets access to all this information. All those token ranges are not a big mystery; they're part of the system itself. So when the driver connects to a node, it says: hey, what's going on in the cluster? Every single node has its token ranges, and the driver is gonna accept all that information and hold it locally. So it's aware of which token ranges belong to which node and which replicas. Why is it important for the client to know what's going on in the cluster when it comes to token ranges? This is really the cool part of the driver: because it is aware of these ranges, it says, oh, this node has this token range, I will send data directly to it.

So one of the policies we have is TokenAwarePolicy, which understands the token ranges and knows which node has what and which replicas store that data as well. What does that do? Well, if you recall, I mentioned the coordinator. Now the coordinator is not so important, because we can go directly to the nodes that have the data and bypass the coordination phase. It's just more efficient; it eliminates a hop for your data. Now there are other policies as well, like RoundRobin, which do rely on coordination, and of course the DCAwareRoundRobinPolicy makes sure that data stays in the local data center. Cassandra, being topology-aware, knows that there are data centers; well, so should your driver.
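In the Python driver, wiring those policies together looks roughly like this; the address and data center name are illustrative.

    from cassandra.cluster import Cluster
    from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

    # Route requests straight to a replica, preferring nodes in the local DC
    cluster = Cluster(
        ['10.0.0.1'],
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='DC1')))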

So what does this mean for your scaling? Again, going back to the single server that gets completely hammered with load: that's bad, and I used to live in that world. In operations, if I had a server that was getting overloaded, my only choice was to try to shed some of the load or get a bigger server, and I didn't want to buy a bigger server, because that means I'd have to bring it down and move all the data. Just a bad idea. If you go back to what Cassandra was built on, the Dynamo paper, this was part of it: when we add more servers to the system, it increases the amount of scale and also the amount of density you have in your cluster.

So the graph that you see in front of you is a demonstration of how that load should look. As you add more servers, you get more capacity. Now, this is a read/write/modify load, and whether you do a read-heavy, write-heavy, or balanced read/write workload, all of those are pretty much the same when it comes to scaling with Cassandra. That means you add more nodes, you get more capacity, no matter what the workload. That is a really cool thing when you run out of capacity. Being able to add more servers as the solution, online, without any downtime? That will make you a hero. Now, for years and years we've been talking about horizontal versus vertical scaling, and both are still important. We still need to be able to scale vertically, or else we're always leaning on horizontal; we want good efficiency per node too.

So if you look at this graph right here, you will see that, yeah, a couple of nodes can do quite a bit, but four can do twice as much, and so on. So if you look at the size of clusters that are created in production environments, a ten node cluster, a hundred node cluster, or a thousand node cluster, they're sized to match the load that's being put against them. And this is a really cool feature of Cassandra: being able to scale as you need it.

We can also do the inverse. If we scaled up for a certain workload and need to dial it back, think of a Black Friday situation or something like that, great, no problem. We can decommission nodes and shrink the cluster back down. And that is an important thing. You can have tons and tons of servers out there, but when you're done using them and you don't need them anymore, scaling back saves you money in the long term.

Distributed Architecture: Peer-to-peer

Hi. I'm Patrick McFadin. Let's talk about Peer-to-Peer as it relates to Apache Cassandra. Now, this is the standard setup for relational databases: the client-server model, and this is what we've been using for years. The way replication works with any relational database is that you have the leader node, and that's responsible for everything: inserts, updates, reads. But then you have read replicas, or just replicas, behind it. Those are set up at a little bit of a lag, and they're not the same stature as the leader node, because they're just copies at that point. And of course, what do we do with that? We shard it.

Friends don't let friends shard. Come on, this is the 21st century. We don't have to do this. What kind of problems can this create? We have our data spread out all over the place. Each shard could be, say, a certain segment of your customers, or a certain section of data for, say, IoT, whatever. But it's broken up. What happens now? Well, we have to route our data. That means, probably, in your application you have to write some code to figure out where your data is. Guess what? That also means no more joins, no more aggregations, no more group-bys, because your data's spread out all over the place.

This is the worst part. Whenever I go to that leader that has the data, if it goes down for any reason, that's pretty bad, 'cause now we have a situation where we have to wait for what? Failover. That's time, and time is money. We can't afford failovers, because that's gonna be minutes, or seconds, long seconds, ticking away while your customers are sitting there on their app wondering where you are. Finally, another leader takes over. Great, but you just lost something in that process. And what happens in the situation where we have nodes that can't see each other anymore? The leaders can't see the followers, and it's chaos. What do we call that? A split-brain problem.

A split brain is just as bad as it sounds. Those nodes can't see each other anymore. And when you're in a highly consistent system, this doesn't work. You can't read data from those nodes behind there, the followers, because they need to be consistent with the leader, and they can't be. What happens? You're down. Guess what? Now leaders get re-elected, you have two sets of leaders, you have a lot of chaos. Dogs and cats, living together. Don't let this happen to you.

So how does this work in a Peer-to-Peer system? This is probably what you're looking for anyway; if you're looking at Apache Cassandra, you want this. How does it work? Simple: we have replicas of data inside of our cluster, and those replicas are independent; it's Peer-to-Peer. Now, of course, there's coordination with that, but you can do some really neat things. When you write data to the coordinator, it writes asynchronously to every single node in that replica set, no problem. Now your data's set. But what happens when you write from the other side? Well, that coordinator will also write to those exact same three replicas. Perfect!

Now this opens up a great situation. What if we split this thing right down the middle? That's a problem, right? No! Cassandra manages this automatically. It is not a failure event. Each side that can be seen by the client is still online. As long as you can write to a replica, you're gonna be online. Now this depends, of course, on the situation: you can dial it up so you're intolerant of the situation, or dial it down so you're completely tolerant of it. That's up to you as a programmer, as an architect. But in this situation, you can manage it. So how does it work? Well, if I write data to a node on one side of the partition, it will coordinate it and write to the replica that's online, the one it can see. That's important. From the other side, it can see two of the replicas, so it will write to those as well. This is managed by consistency level. Consistency level is talked about in a different module, but that is what's going on here.

This situation is something you can manage with your Cassandra cluster, and it keeps you online when bad things happen, like, I don't know, a backhoe going through a cable. Amazon could shut down a region, or you could lose a single node. How do you plan for that? You use a system that can deal with it.

Distributed Architecture: Vnodes

I'm Patrick McFadin, let's talk about Vnodes, which is short for virtual nodes. What are we talking about here? Start out with just a regular token range assignment. When we assign a token to a node, that sets the range of data it will be responsible for. Now, if you look at this diagram, you can notice there's a little bit of lopsidedness to it. Two of the nodes have just a short range, and one of the nodes has a lot, and that's gonna be a problem. That's what we would call a hot spot in our cluster.

When I assign the data to that node, you'll see in the graph there's a lot of it on one node; that yellow node seems to be loaded up a little high. The blue and the green node, not so much, and that's gonna create a problem later on when I write data to my cluster: even though the partitioner is writing it out in an even fashion, that one node is gonna get more than the others. Now that could be more disk space, more load on the server, what have you. And if I was to add a new node to the system, well, I'd have to make sure I bifurcate that range perfectly, so when it inserts itself into the cluster, all the range of data that belongs to it gets streamed from the single node that has so much, and as it comes online, that node says: here you go, here's all your data. Great, now we have a nice even looking ring. But that's an operations challenge. It means that as an operator, I have to pick token ranges that are proper for the cluster and make sure they're distributed right, and that can be a bit of a challenge, and not always easy, especially if I have three nodes and I just need to add one. Even if they were all distributed evenly, now I have to figure out where the halfway point is, or I'll create a hot spot. Not an easy thing to do.

So let's do this again with virtual nodes. Now in this case, here we have our token range individually assigned to each node, but virtual nodes allow us to create individual smaller ranges per node, and it breaks up these ranges across the cluster. Now, each single node doesn't have all the data. By default, pre-3.0, it is 256 ranges per node. Now, past 3.0, it's much less, and it's configurable by the user.

So if you look at the distribution of data inside the system, it will match up with those smaller ranges. Now when I add that new node to the system, it's gonna be able to take ranges of data from every node, so this is a much more even way to get that data. If you look at the way it distributes the data: here comes the data after I add the node, and the cluster says, I'm gonna re-partition my data. All the other nodes say: here, here's some data from me, load up, good to go. That node is able to come online faster, and it also gives us the ability to add a single node to a lopsided cluster, or one node at a time, which is a lot easier to make work as an operator. If you have to assign and make up the tokens yourself, that can be a lot more work, and sometimes you don't get it right. With virtual nodes, it's much easier.

So, with Vnodes, whenever you add nodes to the cluster, everything stays even, and as I said before, the default is 256 Vnodes, but that's pre-3.0. In Cassandra 3.0 and beyond, we have a different algorithm to assign tokens, so a lot fewer are actually needed, and those token ranges are always automatic. That's really the key here: you don't have to come up with those ranges by yourself, because I'll tell you, you're not gonna get it right every time, or you're gonna have to build some really interesting algorithms yourself to make it work. So this is what you should use as a default in most cases.

Now, the configuration is simple. We have a cassandra.yaml setting for how many tokens you want to set up. Again, if you are setting up a brand new cluster and you're using something past 3.0, you should look at the documentation for the details, but this has to be set up first, or you're gonna be assigning a single token to that node, which you don't really want to do. A value greater than one turns on Vnodes, and that's what we're looking for.
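In cassandra.yaml, it's just this one line; 256 matches the old pre-3.0 default, and newer versions do well with a much smaller number.

    # cassandra.yaml: any value greater than 1 turns on Vnodes
    num_tokens: 256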

Distributed Architecture: Gossip

I'm Patrick McFadin, and I wanna gossip. Well, I'm talking about gossip, and not just people talking about each other, though of course that's what we think about when we think of gossip. Gossip is an important part of Cassandra, and as in most situations, gossip can be a little tricky; with human-to-human gossip, things get bad. What's funny about our gossip is that it keeps our cluster working well, and that seems a little counterintuitive, so I'm gonna try to explain it.

When we gossip in a cluster, it is based on a very established protocol. This is one of those things that has come from nature, and it's great: nature has done a really good job of figuring out this problem. The problem is: if I have one thing that needs to tell everything, how does the news spread? Think of a virus, Ebola or your common flu. One person gets it and the whole office gets it, but it's not because that one person went to every single person and gave them the flu. This is how Cassandra does it as well. One node will tell its information to another node, and then that node's job is to tell other nodes, and it keeps spreading around until eventually every single node has the correct information. Every node gets the information, but it's not synchronized, and it doesn't mean that one node has to tell everyone. If you're talking about large clusters, thousand-node clusters, that's gonna keep your traffic to a minimum, and that's really important, because when you're talking that much, things can get out of control really fast. That's why a gossip protocol is very important for cluster state.

What I've listed here is how a gossip partner is chosen. This doesn't happen constantly; it happens about every second or every few seconds, and each node picks just one of the other nodes to gossip with, not every single node. Just like a disease as it spreads through a community, it goes from carrier to person to person to person, and it doesn't mean that every person has to touch patient zero. That is the propagation, and that propagation takes time to coalesce: when it finally settles into a steady state, every node will have heard from every other node. It is very probabilistic. When that data gets spread around, there is no absolute guarantee that it will reach the last node in the cluster within a deterministic amount of time, but you know it will eventually get to that state. The inside of this whole thing is fairly intricate, and I'm gonna try to explain to you how it works from node to node. It is a very reliable way to get things done in a large cluster, so follow along with me if you can.

First of all, let's talk about what is being gossiped. This is cluster metadata: the state of the cluster. Gossip is transmitting state, what's happening in this big cluster, from node to node to node. Think about things like a node going offline or coming under high load; it's important for other nodes to know that, and this keeps the state of the whole entire cluster in a nice healthy way. So in a single exchange, what information is being sent? We have a few things.

First of all is the endpoint state: what is the health of this node? The heartbeat state consists of two things, the generation from when it booted and a timestamp, and that is a unique marker for that node's state. Part of the state is the status. In this case it's normal, meaning it's online and ready for reads and writes; it can also be joining or leaving, two other possible states in the system. We also get location information: where is the server? This one is in a particular data center and a particular rack. And what schema version it has; that can be pretty critical for schema management, because if you update your schema on one node, it has to propagate so that it's the same schema across every node, and this is what carries the current state of that schema. After schema is the load. Load shows how much disk pressure there is on the system, basically how much is being stored on this particular node. Severity shows the pressure on the system from the IO standpoint, because every node in Cassandra is constrained by IO first, and this is a good indication of the health of the node. There are other things in there as well; we won't go through everything, but think of it as the state of this node: that's what's gonna be in this gossip packet.

Let's look at a single gossip exchange inside of a cluster, say when node A talks to node B. This is what A is gonna be sending: the list of state for all the nodes it knows about. First, we're gonna compare A to B, from left to right. In the sync packet, A puts in all the information it has about what it currently sees. Each node, from left to right here, has a current state as it sees it, and we're gonna make sure they're up-to-date. That's where the generation and heartbeat come in really handy, because that's how we know the current state of a node at a current timestamp. This is what we call a digest. The first entry in here: this node's IP address, here is the heartbeat, and here is the timestamp. Second node, same thing; third node, same thing. One for every node it currently knows about. That's sent to the other node.

From there, the receiving node does a comparison. It starts at the top and says: oh, okay, this first one doesn't match; the other node has newer information, so I need an update on that one. The second one: my current state is newer, so I need to send that back. All right, we can do that. The third one is lower as well, so we're gonna send that back, and I need an update on that too. We've walked through every single node in the system, got the current state, and figured out what needs updating, and that gets acknowledged back to the other node. At that point, the first node takes that payload and goes through the entries it knows it needs to update. It says: you have new information, I have new information, let's make a list. In this case, there are two entries that need updating on the second node, so that acknowledgement gets put together, and then for the third part, it accepts the new information from the other node: you have newer information than me, I will accept it based on that. Finally, the first node sends the data back to the second node with the newer information, and both get updated. Now everyone is on the same page. This is the basic interaction, and this is how it all works.
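Here's a toy sketch of that digest comparison in Python; the addresses and numbers are made up, and real gossip carries this over a three-step exchange, but the rule is the same: the higher (generation, heartbeat) pair wins.

    # Each node's view: address -> (generation, heartbeat)
    mine   = {'10.0.0.1': (1, 420), '10.0.0.2': (3, 17), '10.0.0.3': (2, 99)}
    theirs = {'10.0.0.1': (1, 433), '10.0.0.2': (3, 12), '10.0.0.3': (2, 99)}

    for node, version in theirs.items():
        if version > mine[node]:    # they know something newer: accept it
            mine[node] = version
        elif version < mine[node]:  # we are newer: we'd reply with our state
            print('send our newer state for', node)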

This may bring up a really important question in your mind: how much network traffic does this actually create? I think there has been a common misconception for a long time that gossip generates enormous amounts of traffic, because it's shooting state all over the ring. No, not so much, because if you're doing this at a fairly long interval, say every second or so, then each node is just communicating with a couple of other nodes every once in a while. It's not a constant stream of data; it's not flying everywhere. In a large cluster, that state may take a while to finally coalesce, but it doesn't cause an enormous amount of traffic, and that is the important thing to understand: this is not going to create a need for better networking gear. Gossip is not your problem.

It's the streaming of data, the actual data, that is more of a network problem. Way more. Where you get a very clear misconception is when you see a new node joining: the gossip goes out to the seed nodes, and then you see a lot of traffic. That traffic is not gossip traffic, it's data. Of course, when you bootstrap a new node, that gossip has to happen, since that's what transmits the state around, but it's very short. Once that's done, the larger part of the traffic is gonna be the data going into the new node, and that's what you see. Don't confuse that with gossip; it's just data, and it's probably gonna be a lot if you have a lot of data on each node. I hope that explains gossip in a clear way.

Distributed Architecture: Snitch

I'm Patrick McFadin. I have 99 problems, and a snitch ain't one. Okay, of course I'm joking, but this is really about the snitch, and Cassandra does have a snitch. Now, it has nothing to do with Harry Potter. I know, that's the other joke, right? But this is really an amazing thing that keeps things together in your cluster: it determines where each piece of your data should go. When we talk about the topology awareness of Cassandra, this is where all the magic is, inside the snitch. And there's not just one. The snitch is a concept, and there are implementations of that concept, and we're gonna walk through some of them.

So first, we have these main groups of snitches I want to talk about. We have the regular ones, the ones we would normally see in any environment, but then there are also some very specialized ones for cloud-based deployments; those cloud-based deployments need some very special handling, and you'll see in a minute why. They all have to do the same thing, and that's determining where your data is, so in different environments you can mix and match for what you need. We'll start with the simple snitch, and it is the default snitch. I'll tell you, this is not the one you put in production. It doesn't give you the kind of configuration and flexibility you need for a production deployment. Great for your laptop, but that's about it. What it really does is just say: oh, here are five nodes, spread the data out evenly; or ten nodes, spread it out evenly. The problem is, if you have nodes stretching over multiple data centers, it'll still just spread the data out evenly, so you get this kind of lopsidedness around the world. Doesn't work very well.

If you want to be more deterministic about where your node placement is, the property file snitch was the first stop on this road, and it's one of the original snitches. Its job is to take a property file, the cassandra-topology.properties file, and determine where each node exists inside a certain topology. In that file, you have to list every single node in your cluster: the IP address, the data center, and the rack. Now, as I say that, you're probably thinking, wait, that's a lot. If I have a big cluster, that could really hurt. What if I update my system? What if I add more nodes? Well, good luck, because you have to go update that file every single time. Okay, I'm really selling this, right? No, don't use the property file snitch if you can help it. It's really old-school. It is an operations nightmare. It's still there because people have been using it, but there's a better way to do it.
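Just to make that pain concrete, here's roughly what such a file looks like (the addresses are made up for illustration; every node in the cluster needs a line like this, plus an optional default):

    # cassandra-topology.properties, used by the property file snitch
    # <node IP>=<data center>:<rack>
    175.56.12.105=DC1:RAC1
    175.50.17.200=DC1:RAC2
    120.53.24.101=DC2:RAC1
    # fallback for any node not listed above
    default=DC1:RAC1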

That's the gossiping property file snitch. This is the one you probably want to use in most cases. Now, I say that because it gives you the best flexibility, but it's also really deterministic: I can say exactly where I want this node to be inside my topology. Here's the feature: one file, the cassandra-rackdc.properties file, and inside there, you put the data center this node belongs to and the rack it belongs to. How does it work? Simple: when that node is added to the cluster, it gossips out that information. It says, here I am. If you look at the module about gossip, you'll see that that's part of the state. When it sends that out to the cluster, the rest of the cluster just takes that information and says, thank you very much. That means I don't have to update property files everywhere when I add a new node; only the single node joining the system carries that information. Great. That is so much easier to work with from an operations standpoint.
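And here's roughly all that file has to say, per node (the data center and rack names are whatever your naming scheme uses; these are illustrative):

    # cassandra-rackdc.properties, used by the gossiping property file snitch
    # only this node's own location; gossip spreads it to the rest of the cluster
    dc=DC1
    rack=RAC1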

Some of the older snitches aren't as useful anymore, for instance the rack inferring snitch. This snitch was fairly useful, but it tried to work out where your node is based on the IP address, and it involved a little bit of magic, some voodoo. You had to kill a chicken. It didn't work very well. But if you had a system that was very, very specific about IP addresses, for instance, this octet is the data center, this octet is the rack, and this octet is the node, great, works awesome. As you can tell, that is probably not the ideal way to run a system, which is why it's kind of faded away over time. The gossiping property file snitch is much easier in this case, 'cause then you can say exactly where you want the node to be, and it doesn't depend on some IP address scheme you have set up.

The cloud-based snitches are there for what you think. For cloud. Anyone use a cloud? No? Everybody, good. So, when you're in a cloud, you kinda get some information out of that environment. Now, it's not 100 percent, but it started out with the Ec2Snitch, and there were some very interesting things you could glean from the IP address about where a node was. That was somewhat deterministic, but not 100 percent. It was really good for deploying automatically in EC2, and then the Ec2MultiRegionSnitch came along after, because the Ec2Snitch didn't deal with multiple regions, meaning multiple data centers. Google, not wanting to be left out, now has a GoogleCloudSnitch. There's one for CloudStack out there as well, and there are others coming for different clouds. These are tailored for very specific environments, so they should be used only for that.

The dynamic snitch. This is not one that you would actually set up; all of the snitches use the dynamic snitch under the covers. Now, why? Because the dynamic snitch's job is to make sure that the nodes and all their information are used effectively. Like, who's naughty, who's nice. It's kind of like the Santa Claus of snitches. All those other snitches are wrapped by the dynamic snitch, so when it comes to performance, say one node is carrying a higher load than the others, it can be more selective about how requests get routed around the nodes. If one node is having trouble, maybe it's under an extreme amount of load, we'll do a read off another replica, because that one's gonna be a little bit faster since it isn't under as much load. And it sorts out who's under a lot of load based on the gossip being sent around. This is turned on by default for all the snitches.

So configuring snitches is pretty simple. You put this information directly in your cassandra.yaml file. If you want to change it, you make that change, restart the node, and then run repair on every node. Why the repair? Because the snitch determines where data lives in your cluster, changing it may mean that data needs to be rearranged, and repair streams the data to the right nodes. That's why you need that extra step.
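As a sketch, the setting itself is a single line in that yaml file (the class shown is just the gossiping property file snitch discussed above):

    # cassandra.yaml
    endpoint_snitch: GossipingPropertyFileSnitch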

Replication/Consistency: Replication

Hi, I'm Patrick McFadin. Let's talk about replication with Apache Cassandra. So, let's start out with the most simple way of setting up a cluster, which is probably not the right way, and that's with a replication factor of 1. This isn't really replication at all; it's just that each node has its own set of data. Of course, each node has a different set of data, so if we lose a node, we don't lose all of our data, but this is pretty much the equivalent of sharding your data, and it isn't a good idea. If you do this, you are gonna lose data if one of those nodes goes down. Not a good way to keep things going. Now, if you look at each one of the nodes, they have a color. That color corresponds to the range of data they own. Of course, this is not the full partitioner range, but in this example we're gonna use an imaginary range of zero to 100.

So, looking at these nodes, you'll see that each one's range starts at the previous node and goes up to the node that has the color. This is the example we're gonna use for this replication discussion. Now, when we write data into the system, the data still gets spread around; that's the partitioner at work. When the data at token 59 comes in, the coordinator says, hmm, where does that fit inside the ring? Well, it looks at the ranges and says, ah, that belongs over here, you belong in the purple node, so it moves the data down there. Great, but again, this is not the best way to operate if you're trying to be resilient. Let's move this up a notch.

Let's move this to a replication factor of two. Okay, that's actually a pretty good stance, not the best, but what it means is each node now stores data for itself and its neighbor. So, if you look at the top here, we have a green range and a blue range. That is gonna give us a little better resilience. Now we can lose one of those nodes and still have a copy of the data somewhere else. Here's what happens whenever you write data into the system: again, here comes that data at token 59. Well, now the coordinator is not gonna write to just one node, but to two nodes, asynchronously. And if you use drivers that are token aware, they'll write directly to those nodes.

All right, let's up it again. This is the replication factor I would recommend for production, and that's a replication factor of three. This means each node stores not only its own data and its neighbor's data, but its neighbor's neighbor's data, so now there are three copies of the data spread across the ring. This is a good strategy for a lot of reasons, and it has to do mostly with consistency levels, but it also gives you really strong resilience to failure. Nodes fail all the time, and with a replication factor of three, you can survive that. So, what happens when we write data? Here comes our token 59, it comes into the coordinator, and the coordinator writes out asynchronously to the three nodes that are the primary and the two replicas. Now we have it everywhere.
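To make that placement logic concrete, here's a toy sketch in Python, assuming the imaginary zero-to-100 ring and made-up node colors from these diagrams: the partitioner finds the owning node, and the replication factor just walks clockwise from there.

    from bisect import bisect_left

    # (token, node): each node owns the range after the previous token, up to its own
    ring = [(25, "green"), (50, "blue"), (75, "purple"), (100, "orange")]

    def replicas_for(token, rf):
        tokens = [t for t, _ in ring]
        owner = bisect_left(tokens, token) % len(ring)   # first node with token >= ours
        # walk clockwise around the ring for the remaining rf - 1 replicas
        return [ring[(owner + i) % len(ring)][1] for i in range(rf)]

    print(replicas_for(59, 1))   # ['purple'] -> token 59 lands on the purple node
    print(replicas_for(59, 3))   # ['purple', 'orange', 'green'] -> plus the two neighbors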

So, now we have a very well replicated ring. Why is this important? Well, bad things happen to good nodes. Shark attack? Could happen. I see this on Stack Overflow all the time: oh, my node got attacked by a shark. Well, that's a bummer; good thing you had a replication factor of three, you're okay. This is why we do it. Now, of course you don't get attacked by sharks, but I'll tell you what will attack: tornadoes, electrical storms, fires, people pulling out hard drives at the wrong time, pulling power cables, Amazon rebooting your servers for you. This all happens all the time, so just be ready for it.

So, here's a request that comes in at a node failure moment. Let's say two of those nodes were in a rack that had a power plug pulled. Uh oh, now what? If we were using a relational system, that server being offline means we're offline, and that's not the best way to stay online, so let's see how Cassandra handles this. Remember that data at token 59? Whenever we ask the cluster for that data, the coordinator looks around and says, well, two of the replicas are offline, but one of them is online, so it gets the data from that node, and you're still online; no one knows the difference. That is really the key to survival. Being able to survive this kind of outage on two of your servers, and still have your cluster online and your data available, is a golden moment, and this is what lets you sleep at night.

So, we've talked about replication inside a single data center, but as I say often, you have a 0% chance of 100% uptime if you're in one data center. And why? Because data centers fail all the time. I don't care if you're in a cloud, where a whole region can go offline, or on premises, where bad things happen: power goes out, fires, these all happen. So being in multiple data centers gives you the maximum chance for uptime. I'm gonna set this up with the network topology strategy. Again, network topology strategy is what I recommend for production environments. Why? Because it makes your cluster topology aware. I've set up two data centers, a west and an east. Now, if you look at the token ranges, I have a zero at the top on the west, but a different range setup on the east; still, these are both complete ranges of data, zero to 100 on each side. What I've done is bifurcate the ranges on the East Coast, but it's still a complete range of data. Now we have a copy of the data in our West Coast data center and a copy in our East Coast data center. But how does it look from a full replication strategy?

This is where we can use a setup inside the keyspace. The keyspace holds replication information, and what I can specify, per data center, is how many copies of the data I want. In this case I have two in the west and three in the east, and that's great for my use case 'cause I only wanted two and three, but it gives me that control. If I didn't want any data in the West Coast data center, I could just set it to zero there. This is a really good scheme for things like data sovereignty. So, when I write data into the West Coast data center, it asynchronously copies locally, but also sends one copy to the other data center asynchronously, and the coordinator there replicates the data locally. Now I have three copies in the East Coast data center. I have a complete distribution of my data, and I can survive multiple problems. I can lose a node in either data center: still online. I can lose an entire data center: still online. Awesome. This keeps me gainfully employed and keeps my boss out of my hair, 'cause this is gonna keep my apps online and my users happy. Everyone is cool.
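In CQL, that per-data-center setup looks roughly like this: two copies in the west, three in the east (the keyspace name is just our example app, and the data center names have to match what your snitch reports):

    CREATE KEYSPACE killrvideo
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'west': 2,
        'east': 3
      };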

Replication/Consistency: Consistency

I'm Patrick McFadin. Let's talk about consistency of your data with Apache Cassandra. Let's start out with the basics here, and why consistency is important. We have some choices to make, and in distributed systems there are a few. You're probably used to ACID compliance, A-C-I-D, which is what relational databases are built around. This is a different angle: this is about consistency, availability, and partition tolerance, and picking two. Sometimes those are very hard choices when we're trying to keep applications online. ACID means you have very, very strong consistency guarantees, along with other things such as atomicity, and that's important to you, but it's only really useful in a single system. When we start moving data around a large system of multiple nodes, we have to start considering things like partitions. Strong consistency means that when I set my data, it's on every single replica, everywhere the data should be. That probably sounds really important in your mind, and we will get to how we can tune that. Availability is where we are today with 21st century applications. Availability is paramount: you have to be online all the time, or your competitor will beat you.

And then finally, partition tolerance. Partition tolerance is network partition tolerance: one group of nodes can't see another group of nodes. How do you handle that? If we were in a client-server model, say with some replicas and a leader node, that doesn't work. If everything is split in two and the halves can't see each other, you just get chaos. Most times the default mode is to shut everything down until you figure it out. If you had a database that could deal with this, you might be better off, because partitions in a network happen all the time.

Where does Apache Cassandra fit in all this? Right here, in availability and partition tolerance. That is what it favors, and the reason is that it's built for cloud applications. Knowing how the cloud can be, where you can lose nodes really quickly or have network events, or add nodes, or delete them, you should be able to tolerate all that and not lose the data. AP is a strong place to be when you're in a highly dynamic environment. Now what about consistency? Does that mean that Cassandra's not consistent at all? No, that's not what it means. What it means is that we get to choose, and tune, the consistency we need for our application.

Let's look at some of the consistency levels that are available in Apache Cassandra. When a client writes data to or reads data from the cluster, the client gets to choose a consistency level. And this is paramount: every time a client reads or writes, it uses a consistency level that you specify. So in the case where we have three replicas, the data goes to the coordinator, and that data is still written out to all of them asynchronously, but if we specify a consistency level of ONE, we only need one node to reply. That's fine, we're good with that, and we get the fastest node in that case. Now what if we choose something like, oh, I dunno, QUORUM? QUORUM means I want 51% of those replicas, a majority, to respond to my request. So that means one more node has to respond.
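Here's a sketch of what that per-request choice looks like from a client, using the DataStax Python driver (the keyspace, table, and address are illustrative):

    import uuid
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(['127.0.0.1']).connect('killrvideo')

    # ONE: only the fastest replica has to answer
    fast = SimpleStatement("SELECT * FROM videos WHERE videoid = %s",
                           consistency_level=ConsistencyLevel.ONE)

    # QUORUM: a majority of the replicas has to answer
    safe = SimpleStatement("SELECT * FROM videos WHERE videoid = %s",
                           consistency_level=ConsistencyLevel.QUORUM)

    rows = session.execute(safe, [uuid.uuid4()])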

Now here's the thing about replication, and this is where it becomes really important. If I have a replication factor of three, a quorum is two nodes. If I only have a replication factor of two, a quorum is still two nodes, which pretty much means all my nodes have to be online or I can't satisfy that consistency level. That doesn't keep your application online at all. But with a replication factor of three, operating at QUORUM, I can still lose a node. I could even reboot it. Let's say I need to do a security patch: okay, fine, go ahead and reboot that node, it's not gonna break your consistency level. Now let's look at the strongest consistency level you can ask for, and that is ALL.

Now, I'm just gonna start out by saying this is not what I recommend in any production environment. But what does it even mean? When I ask for ALL, I'm expecting every single replica to respond to me. That could be great if you're looking for really strong consistency, not great if you're looking for resiliency, because if I lose a single node in my cluster, I have requests that can't be satisfied at the ALL consistency level. So it's a little too tight in my opinion, and it probably isn't gonna get you what you want anyway.

Let's look at a situation where we are asking for strong consistency. That means I can read the data that I wrote into the system with some assurance. Starting out, we write some data into the system. That goes to the coordinator and asynchronously copies to all the replicas. Great. Now if we do that at consistency level ALL, that means every single node has acknowledged the write to disk, and it gets acknowledged back to the client. So this is the case where we get exactly what we want: every single node has the data. What happens when we go to read that data? Well, we can use a read consistency of ONE, because we know that every single node has a copy of that data and has acknowledged it, which is pretty critical. So when that data comes back to the coordinator, it's going to be the correct data. In this configuration, I know that if I write data into the system, I will always get to read that data back. Every node has the data, and it's committed to disk.

Let's look at a different scenario: writing at consistency level QUORUM. Now this is a good balance, and I think this is where you'll probably be in your production environment. When I write the data, it goes to the coordinator, and the coordinator copies the data to the replicas. Consistency level QUORUM guarantees strong consistency there, because we're gonna get two nodes out of the three; that third node could be offline. We know that two nodes have committed that write to disk, which is very critical. So what happens when we do a read? When we read at QUORUM, we're gonna get an overlap one way or the other, because we read from two nodes. We know that at least one of those is gonna have the consistent set of data, and we'll get a consistent read back. This is a good balance. If we use consistency level QUORUM on our reads and writes, we're gonna get strong consistency across the board, except now we also have some failure resistance. If I lose one of these nodes during the whole process, we can still keep going.
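The arithmetic behind that overlap is worth a quick sketch: if the nodes that acknowledged the write plus the nodes consulted on the read add up to more than the replication factor, at least one node in the read saw the write (a simplified model, ignoring failures mid-flight):

    def quorum(rf):
        return rf // 2 + 1        # majority: 2 of 3, 3 of 5, ...

    def overlaps(write_acks, read_nodes, rf):
        return write_acks + read_nodes > rf

    rf = 3
    print(quorum(rf))                               # 2
    print(overlaps(quorum(rf), quorum(rf), rf))     # True:  QUORUM + QUORUM, 2 + 2 > 3
    print(overlaps(rf, 1, rf))                      # True:  write ALL, read ONE
    print(overlaps(1, 1, rf))                       # False: ONE + ONE can read stale data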

Let's look at a situation where you don't have strong consistency. Again, consistency level ONE is fast because we're waiting on only one node. Except if I write to this first node, that gets acknowledged back to the client as a write, but if I try to read the data immediately, I may read from a node that hasn't quite gotten the copy yet. There's a timing issue, a just-in-time problem, where I have the potential of getting inconsistent data. That's what we should expect, potentially, with ONE. It doesn't always happen; I just shouldn't be surprised when it does. But again, this is tunable consistency. I can do ONE, I can do QUORUM, I can do ALL. QUORUM seems like a good pick there, but maybe I'm writing log data and I don't really care if I get an inconsistent read, because it's something that may resolve in five seconds and I'll get it the next time around. There are plenty of reasons to do this. What this does give you is the maximum amount of uptime, because you could lose two nodes now and still be online. That's really cool when it happens.

What about across data centers? Boy, this is an interesting problem to solve, and one of the things that I absolutely love about Apache Cassandra is how it manages data across multiple data centers. Let's look at our simple QUORUM problem. If we were to do a QUORUM across two data centers, you have to get a quorum across all those replicas, which means that when you ask for a QUORUM write, it has to show that it committed the data on a majority of nodes, including waiting for an acknowledgement from across the data center link. That can take a while. If you're in the United States using an east coast and a west coast data center, that's 70 or 80 milliseconds of latency that's just gonna hurt. So what we have available, if you want to stick to the local data center, is LOCAL_QUORUM. That's a great option. LOCAL_QUORUM means I will only ask for a quorum of the replicas in the local data center. It will still asynchronously copy to the other data center; it doesn't mean only keep my data in this local data center, it just changes the acknowledgement piece. This gives you a good balance of speed and consistency, and keeps your data centers in line.

So we've discussed some of the consistency settings, but not all of them. There are several to pick from, and some of them are not smart choices, some of them are really smart. Let me explain a few. The least consistent setting you can have is one called ANY. I'm gonna tell you right now, if I find anybody in production using ANY, I am gonna laugh at you so hard, because that is a really poor choice, and that's because it requires no replicas to be online at all. It just stores a hint and lets you keep going. Not the best plan at all. You should really stick with at least ONE; the minimum is ONE, I would never go below ONE. Now ALL is another one of those: well, it's there, I just don't expect to see it in a production environment. Why? Because it erases any fault tolerance at all in Cassandra. Don't do it. It looks very enticing, but it will not make your database an ACID-compliant database. What it creates is really strong consistency that can't handle any failure at all. So ONE, TWO, THREE, QUORUM, LOCAL_ONE, LOCAL_QUORUM, all of those balance consistency and uptime. Use those.

Now that you understand all the different consistency settings and what they actually do, you should mix and match them in your application for your use case. If you're doing something like writing log data, using a full QUORUM is probably not the best plan; ONE is a good plan. If you're handling something like user data and you wanna make sure it's strongly consistent, QUORUM is a good choice for that use case. So I've given you some of these different settings and what they mean; use them wisely.

So just some general rules of thumb about consistency settings. The higher the consistency level you ask for, the less chance of stale data, but you're piling on more latency and giving up resilience. Those are the scales you're going to tip. It depends on your use case and what you're using it for, but again, use it wisely and be successful.

Replication/Consistency: Hinted Handoff

I'm Patrick McFadin. Let's talk about hinted handoff in Apache Cassandra. So, failed writes. These happen all the time, because failures happen all the time, and you should be ready for that. If you look at the Dynamo paper, which was really the beginning of where Cassandra started, it was about handling failures, because in a large-scale system, just by percentages, little things are gonna fail all over. You should really be ready for that, and when you are, everything's gonna be a little easier for you. How does it work in Cassandra, that failed write scenario? Say a node's offline. Well, okay. What happens when we go to write data? Data comes into the coordinator, which says, hey, I'm going to put this into the cluster and write it out to the replicas. Oh, but one's down. Maybe you rebooted it, or you're patching it, or someone pulled a power plug, whatever. That node being down shouldn't be a situation that goes unhandled. So what happens is, the data for the node that is down is stored locally on the coordinator. Think of it like your neighbor holding your mail while you're on vacation: I'll just hang onto this for a little while, while you're gone.

Now, the node comes back online and everybody's happy, but what we have here is an inconsistent state. That node has old data on it until that write is satisfied. So when it comes back online, it's like, hey neighbor, I'm back from my vacation, what did I miss? That's when the data comes from the coordinator: now it can deliver the hints to that node. The node takes the data, everything's happy, everything's back online. Think about this in a situation where you need to reboot a server. If I need to reboot a server, I don't want to have to worry about a missed write or inconsistent data afterwards. So when I do it, I can count on the hints to make this work.

So that hinted handoff I just talked about, what does it have to do with consistency levels? I think I mentioned before, a consistency level of ANY means that the stored hint is good enough. That means you can have zero replicas online. I don't recommend that, because I think ANY just makes you feel safe: if you were to lose the node with the hint on it, you lose the data. That's a potential data loss situation. I don't like ANY at all. Now, when you write at a level of ONE, at least one replica must successfully take that write; a hint does not suffice in that case. If you can't write to at least one node, the write is rejected. You're down, out. And that's good. I want to know if I have no nodes left online to write data to, because that's a pretty important situation to take care of.

So how do I set this up? The hints themselves are stored on the coordinator node, and there are some defaults in play here. For instance, we store hints for a default of three hours. After three hours it stops, because if you think about it, if you've got a node that's gonna be offline for, say, a week, you do not want the coordinator doing an infinite amount of double duty while it waits for the neighbor to come back and take its mail. So we have a cap, and that cap is there to limit the amount of stress you're putting on that coordinator. After that, we'd have to do what's called a repair, and we'll cover that in a different module, but for right now just know that that's the limit. If you don't want to do hinted handoff, you can disable it completely, but I don't recommend that. It's still a pretty important thing; without it, you have to run repairs constantly.
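Those knobs live in cassandra.yaml; as a sketch, the defaults described above look roughly like this (the window value is three hours expressed in milliseconds):

    # cassandra.yaml
    hinted_handoff_enabled: true
    max_hint_window_in_ms: 10800000   # 3 hours, then stop storing hints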

Now I have to mention that pre-3.0, hints were actually stored in Cassandra itself, in a data table. As of 3.0, hints are stored as files on disk, which is way more efficient. So there's a difference there. Hinted handoff can create load on the system, but post-3.0 it's a lot less.

Replication/Consistency: Read-repair

Hi, I'm Patrick McFadin. Let's talk about read repair with Apache Cassandra. Read repair is an interestingly-named operation, and it's not so much that things are broken; it's about entropy and anti-entropy operations. If you've studied any thermodynamics, you know entropy happens in the universe. Things just degrade over time: mountains collapse, things get warmer, colder. It's just gonna happen, and your data is the same way. Bits can get out of sync, things can go bad, nodes can die. Entropy occurs in any system. So what we're doing here is assuming that it will happen and then finding a way through it. Repair is the process of making sure your data is consistent across all replicas, and it's there because we know entropy will happen.

Let's look at a normal read operation and how it works. Say we're gonna satisfy a consistency level of ALL on this read. When the read goes to the coordinator, we're going to tell every single replica, "I need your data." What's really happening under the covers is that the request goes out asynchronously. The fastest node returns the actual data. The other two nodes return what's called a digest, which is really just a checksum summary that says, "Okay, here's your data as I see it," but in digest form, instead of having to return the entire data set. When it all comes back, the coordinator compares the checksums: "Is my data the same checksum? Oh good, okay, everybody has the same data, awesome." That gets sent back to the user.

Now, what happens if we have an inconsistency? Same read request, at a consistency level of ALL. Every replica receives the request. But look what happens now: I have different digests, so when the data goes back to the coordinator, those differing digest values go back as well. Now we have some thinking to do, and this is where the coordinator has to make some decisions. The digests are all different, so whose data is proper? Good question. This is when we fall back to the timestamp. Now that we know we have an inconsistency, let's look at the timestamps. The timestamp of the data we got back is 135. For the digests that didn't match, we need to go back and compare the actual data and the timestamps there as well. So we go back to the other two nodes and ask, "What do you have for a timestamp?" Look what we got: different answers, 122, 159, and 135. We want the newest timestamp. So the coordinator brings those in: aha, 159 is the newest. Now it returns that to the client, here's your data, and it asynchronously copies it out to the two nodes that had inconsistent data. Thereby, all the data is now consistent and proper: they all have the same timestamp and the same digest.

So what we just saw is a situation where we have visibility across all the data. That works for ALL or QUORUM, but what if we're using something like a read consistency level of ONE? You have a really high chance of getting an inconsistent read and having no idea there's an inconsistency inside your cluster. Read repair chance is there to create a probabilistic safety net around that. By default, 10% of your reads can generate what's called a read repair. When the read is satisfied at ONE, in the background it also does a check: is this data consistent with the rest of the replicas? If it's wrong, it resyncs the data and makes everything right, so the next read will be consistent. This is really important for giving yourself the best chance at consistent data without putting a lot of extra load on the system, or having to use QUORUM all the time, which may not be the right choice for your application.

Now, what if you know the data's inconsistent? And that's possible. If you recall, when we talked about hinted handoffs, hints are stored for three hours by default, and after that, the cluster stops storing hints for the node that's offline. So when that node comes back online, let's say after a day, sure, it's gonna get three hours of data, but now it has a whole lot of inconsistency to make up. Repair is something that runs manually; it's something the operator does, with the nodetool command. nodetool repair, with its various options, will go out and make sure that every data replica is consistent. This is something that has to happen regularly anyway. Read repair chance, sure, that happens in the background, great, automatic. But repair is something you want to run on your data sets at least once within a GC grace period, which defaults to 10 days. And why? Because it just keeps your data consistent all the time. It eliminates the need to rely on read repair, and it definitely gives you a better chance of consistent data all the time, even at a read consistency of ONE.
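As a sketch, that operator step is a single command per node (the keyspace argument is optional, and the name is just our example app):

    nodetool repair killrvideo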

Internal Architecture: Write-Path

I'm Patrick McFadin, let's talk about the Write Path in Apache Cassandra. Probably one of my favorite parts of Cassandra internals, and why? Because it is so simple. And I'll tell ya, simple is awesome in my book. Complex things are very difficult to work with, because they fail in complex ways. This is something where you can reason in your mind about how it works, and that will really help with things like data modeling in the long run. If you try to figure out a relational database's write path, good luck. I tried a long time ago. I got most of it, but it sure was hard to figure out, 'cause there's just a lot going on there. I'm gonna explain how Cassandra does it, and hopefully you'll find it a lot easier.

So let's talk about a single write. When we write data into a node, we give it a partition key. That gives it the locality: which node in my cluster am I going to write to? But what happens inside the node? What is going on whenever we write data? When the data arrives, it goes into the server process. Now, that server process has a couple of things going here: the server is running in a JVM, which consumes RAM, and we have a hard disk that's gonna store our data. So as the data comes in, it gets written in two places. First, it gets written to the commit log. The commit log is an append-only log on disk. This is the first stop, and this is why Cassandra writes durably: a durable write on disk in the write path. The next stop is what's called a mem-table, and that is in RAM. This does not mean that Cassandra's an in-memory database, not at all; a mem-table is a representation of the table data in memory. And this is the second stop. Once it goes to the commit log and the mem-table, that write is acknowledged back to the client.

We are done, but your data's not done; your data still has a life. As more data is added to the system, same process, it just keeps adding more and more, and as that data builds up in memory, we have to do something with it. As we're writing data, you'll notice that the mem-table is adhering to the data model you've created. If you wanted a certain sort order by the clustering columns, it's holding to that. So this partition is keyed by state, Texas here, and it's sorting by city. Great, now it's all sorted in memory. Since you're writing data into memory in a JVM process, you know there's not an infinite amount of memory there. Eventually you're gonna have to clean out that heap space, because if you blow it out, you're gonna be doing things like garbage collection.

Cassandra has a process for this called a flush. When you flush the mem-table to disk, that creates what's called an SS-table, and the SS-table is the second stop for durability: first the commit log, now the SS-table. Once it's written to disk, it's written in the same sorted order it had in memory. Everything's ready to go for a read. But now we don't need the commit log anymore, so the commit log segment is removed, and the data is durable in the SS-table. Now when that node comes back online, instead of re-reading the commit log to make sure everything's consistent, it can go straight to the SS-table and get the data. What I should point out here is that the SS-table is 100 percent immutable. You will never, ever write to that data file again. And that's pretty unique.
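Here's a toy sketch of the whole path in Python: append to the commit log first, update the sorted mem-table, acknowledge, and flush to an immutable SS-table when memory fills up (the threshold is tiny just for the demo):

    FLUSH_THRESHOLD = 4
    commit_log, memtable, sstables = [], {}, []

    def write(partition_key, clustering_key, value, timestamp):
        # stop 1: durable, append-only commit log on disk
        commit_log.append((partition_key, clustering_key, value, timestamp))
        # stop 2: sorted representation in RAM
        memtable.setdefault(partition_key, {})[clustering_key] = (value, timestamp)
        # the write is acknowledged back to the client here
        if sum(len(rows) for rows in memtable.values()) >= FLUSH_THRESHOLD:
            flush()

    def flush():
        # rows leave memory already sorted, so the SS-table is sorted on disk
        sstables.append({pk: dict(sorted(rows.items()))
                         for pk, rows in memtable.items()})   # immutable from here on
        memtable.clear()
        del commit_log[:]   # the flushed segment is no longer needed

    write("TX", "Austin", "row data", 100)
    write("TX", "Dallas", "row data", 101)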

Relational databases do not do that. Relational databases are all about random reads and random writes to a data file; that's how they block out data on disk. SS-tables store just the data you've given the node. If you give it a gig of data, it will store a gig of data on disk; it won't create a large data file that's mostly empty. This is called sparse storage. If you are running a server with spinning disks, we recommend that you put the commit log on one disk and the data directory on another disk. This keeps head contention to a minimum: the commit log writes sequentially on its own disk, while the data disk does its long sequential reads and writes on the SS-tables, and the two never contend for the same disk head. That keeps your writes moving along quickly, and your reads too, because the disk head isn't jumping all over the place like it would if it also had to serve the commit log.

So now that we understand the commit log, let's look at how SS-tables work with your data. Here comes a bunch more data into the system. Look, it's all nicely sorted. These are all data points that are still in the same partition, Texas, so they're sorted in memory by city, and when we get to a certain point in memory and it has to flush out the heap, well, it goes to an SS-table again. Now you have two SS-tables with the same partition. This doesn't get too bad until you have many SS-tables with the same partition, and it starts reducing your read speed. So there's a process called compaction that will blend these together, which we'll cover completely in a different module.

If you'll notice, we have this user called Lone Node, and they moved: they went from Snyder to Dallas. That new record, where they moved to Dallas, is sitting in the second SS-table. What? How do we manage this inside Cassandra? Well, this is where we get into compaction, but for right now, what I want you to know is that the way we resolve these records is by timestamp. The newer record is there, and when we look at the two timestamps, that's how it resolves. This is what we call last write wins, and it's the resolution protocol for Cassandra data.
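As a sketch, the resolution rule is just a timestamp comparison (the timestamps here are made up):

    # Last write wins: the record with the newer write timestamp survives.
    def resolve(a, b):
        return a if a["timestamp"] >= b["timestamp"] else b

    old = {"user": "Lone Node", "city": "Snyder", "timestamp": 100}
    new = {"user": "Lone Node", "city": "Dallas", "timestamp": 250}
    print(resolve(old, new))   # the Dallas record wins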

Internal Architecture: Read-Path

Hi, I'm Patrick McFadin, let's talk about the Read Path in Apache Cassandra. Now, this is a little more complicated than the Write Path. The Write Path, kinda simple, right? MemTables, CommitLogs, simple. Well, the Read Path has to be a little more complicated because there's a lot more at play here. What we have in play are the MemTable and the SSTables, of which there could be more than one. You're not just gonna read from the MemTable and be done with it; this is not an in-memory database. We have to combine things. So we have the MemTable component that has to be considered: that's the current data. Of course, that data is also stored in the CommitLog, but it's not readable there; the CommitLog is only there for failure recovery. The MemTable holds the readable representation of that data, and it has a timestamp.

We also have the SSTables, which are the long-term storage of our data, after the MemTable gets flushed out. So we have lots of representations of our data, and a single partition of data could span every bit of this. There is logic involved in correlating all these pieces and making them one data set for you when you do a select. So, putting these together, let's see how it works.

Reading a MemTable. Inside our MemTable we have some partitions of data, and we ask for a particular one. Where's 58? Well, it's right here. Now, it reads all of that out. Okay, that's great, but we may still have some data that's missing. And where is that? It's the data in the SSTables on disk. So let's take a look at the disk. An SSTable, of course, has sequential data that was written out from the MemTable; that's the durable storage. Recall that this is an immutable write: once it's written, it's done. Now you're just doing reads, and most of those reads are sequential reads.

We can do random reads, but let's see how this works. We're breaking up this SSTable into the different partitions stored inside of it, and if you'll notice, we've got 58 in there. It's in a really big file, so it could take a long time to find that data if we just started at the beginning. That's not the most efficient. Say you had a gigabyte file: you don't wanna read from the beginning to find something all the way at the end. Fast disks are great, but they're not that fast, and when a read requires going to disk, that disk speed is really critical. So we're gonna build a table called the partition index. The partition index says this partition starts at this offset in the file. So if I'm looking for our 58, it says, "Oh, it's right here," and it can grab that data immediately by advancing the pointer on the disk, zeemp! Right to the spot, boom. You got your data, everything's good. You didn't have to read the entire SSTable to get it. Now, that SSTable can be really large or small, but you still have a partition index that shows every single partition with its offset.

What if you had a really, really big SSTable with a lot of partitions in it? Now that's a lot of data to read just in the partition index, and the partition index turns into the bottleneck. Over time, we've learned this in Apache Cassandra: people store a lot of data on these nodes. What we put in front of that problem is a summary index. This is the first index we hit. So now that we have a summary index and our partition index working together, let's look at another example. I need partition number 36. Great, where is that? Well, it's inside of this range right here, which, from the summary index, gets us to the right chunk of the partition index. That shortcuts the lookup in the partition index, and from there we can find the exact offset in the SSTable and pull the data back for the merge sort with the MemTable data. So that covers all of the indexing portions of this.
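Here's a toy sketch of that two-level lookup, with made-up keys and byte offsets:

    from bisect import bisect_right

    # partition index: sorted (partition key, byte offset in the SSTable file)
    partition_index = [(7, 0), (13, 4096), (36, 9200), (58, 15360), (92, 20480)]
    # summary index: every 2nd partition key, with its position in the partition index
    summary_keys = [7, 36, 92]
    summary_positions = [0, 2, 4]

    def find_offset(key):
        # 1) summary index (in memory): which chunk of the partition index to read
        chunk = bisect_right(summary_keys, key) - 1
        start = summary_positions[chunk]
        # 2) read only that chunk of the partition index to get the exact offset
        for k, offset in partition_index[start:start + 2]:
            if k == key:
                return offset
        return None   # not in this SSTable

    print(find_offset(36))   # 9200 -> advance the disk pointer straight there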

But what if we wanna get as fast as possible? The key cache sits in front of all of this. If you have a partition that's being constantly accessed, this is a super shortcut. It avoids all of those indexes with an in-memory cache that says, "Oh, you're looking for partition number 36? Okay, no problem, I know exactly where it is." We don't even have to go through those other steps. If you can get something this fast, great, because that means less time figuring out where it is. That's an extra hop on disk you don't have to do. You get your data right back, awesome. Those are all of the indexes involved.

There's one more data structure that's really critical, and it's meant to keep you from having to scan a lot of files. This is called a Bloom filter. When we ask for partition number 36, the Bloom filter is there to answer one question with two potential answers: it is absolutely not in this SSTable, or it could be. And that's the probability part; that's what Bloom filters are supposed to do. It doesn't give you an absolute yes or no, it gives you a no and a maybe. But that helps out a lot. If I have a thousand SSTables, I wanna eliminate the 999 files it's not in and find the one it is in. We're narrowing down our choices and eliminating disk seeks. So if partition 36 might exist in this SSTable, the filter passes the request on through, and then we go through the summary index and the partition index to find the data on disk. Now take 48, which, as you can see, is not in this SSTable; the Bloom filter is gonna be there, and when we ask for 48, it says, "No, it's not in this SSTable," and that's absolute. What if we asked for something like, oh, I don't know, 74? Is it in there? Is it not? Well, it's a maybe, so it still has to go through the procedure, and if it's not there, that's what's known as a false positive. This happens very rarely, and when it does, you can dial it back; this is tunable. Not something you normally do with Apache Cassandra, but I just wanted to let you know it's there.
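Here's a miniature Bloom filter in the same spirit: no false negatives, rare false positives (real implementations size the bit array and hash count far more carefully):

    import hashlib

    SIZE = 64
    bits = [False] * SIZE

    def _positions(key):
        digest = hashlib.sha256(str(key).encode()).digest()
        return [digest[i] % SIZE for i in range(3)]   # three hash positions

    def add(key):
        for p in _positions(key):
            bits[p] = True

    def might_contain(key):
        # any unset bit is a definitive "absolutely not in this SSTable"
        return all(bits[p] for p in _positions(key))

    for partition in (7, 13, 36, 58, 92):
        add(partition)

    print(might_contain(36))   # True -> go check the indexes
    print(might_contain(48))   # False (barring a rare false positive) -> skip this file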

So as you can see, this is just a little more complicated than the Write Path, but the Read Path really has some important features. Because you do have data, and a lot of it, sitting on disk, and some in memory, you need a way to resolve all of that. So the Read Path is really important, and when you do data modeling, you have to think about it. This also brings up a really good point about compaction: compaction makes a lot of this easier as well, and we'll discuss that in another module.

Internal Architecture: Compaction

Hi, I'm Patrick McFadin. Let's talk about compaction in Apache Cassandra. So, compaction. Probably the most derided thing in Apache Cassandra, because people don't understand it; they generally associate it with some sort of voodoo evil. It's not that. It's something you can understand, and it's very important. Think of it as delayed I/O in your system: it creates a specific load you need to understand, and it is very important for your read speeds.

This is the SSTable and all of the files associated with it. These five files are associated together with a set of data: the SSTable stores the data; the summary index and partition index show where the data is in the SSTable; a key cache will, hopefully, get you there faster; and a Bloom filter gives you that probabilistic data structure that says it's absolutely not here or it might be here. Keep in mind that this is Cassandra pre-3.0; after 3.0 there are more files added to this, but for a discussion of compaction this will be fine. As you build out your SSTables, and again, these are immutable files, you're gonna accumulate files over time, a lot of files, and if you're not really concerned about disk space, then what's the big deal, right? Well, compaction isn't just about cleaning up disk space. It's also housekeeping for your data. Old data needs to be deleted, data records need to be merged together, et cetera. So as these files build up, we need to do some housekeeping. How does it work?

Let's walk through a compaction process between two SSTables with the same partition. We have two files, and each of them has a Texas partition. Some of the names look the same, some are different, and there are different timestamps. This is what we have to walk through, and this is what compaction does as an algorithm: a merge sort in memory. We take the partitions from the two different SSTables and read through them. The first one we read is record number one: we have two records for Johnny, and one of them is newer than the other, so we take the newer one. Record number two, here's Betsy. That record got deleted, the tombstone is newer, and it's past GC grace, so that record goes away. Now we get to number three. It was Nichole, now it's Norman, but look at the timestamp: again, it's newer, so we merge those two records together and Norman wins. Record number four is an old record, still in there; it just merges in. Record number five is Sam, and that one was newly deleted. That happens, but it's not outside of GC grace yet, so it's gonna stay there until after GC grace and then it will be deleted. And we have a new record, Henrie; he comes in and he's merged. So now we have a brand new partition of data, all cleaned up. Old records are gone, new records are updated, everything's nice and clean. What happens on disk is that the two old SSTables get deleted and this merged partition is written out to a brand new SSTable.
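Here's a toy sketch of that merge, mirroring the walkthrough above (the timestamps, the GC grace numbers, and the name for record four are all made up):

    GC_GRACE = 50   # pretend units; real Cassandra uses gc_grace_seconds
    NOW = 200

    # record id -> (name, write timestamp, is_tombstone)
    old_sstable = {1: ("Johnny", 100, False), 2: ("Betsy", 90, False),
                   3: ("Nichole", 95, False), 4: ("Oliver", 60, False),
                   5: ("Sam", 80, False)}
    new_sstable = {1: ("Johnny", 150, False), 2: ("Betsy", 120, True),
                   3: ("Norman", 160, False), 5: ("Sam", 190, True),
                   6: ("Henrie", 170, False)}

    def compact(a, b):
        merged = {}
        for rid in sorted(a.keys() | b.keys()):
            candidates = [r for r in (a.get(rid), b.get(rid)) if r is not None]
            winner = max(candidates, key=lambda r: r[1])   # last write wins
            name, ts, deleted = winner
            if deleted and NOW - ts > GC_GRACE:
                continue           # tombstone past GC grace: purge it completely
            merged[rid] = winner   # fresh tombstones stay until GC grace passes
        return merged

    print(compact(old_sstable, new_sstable))
    # Johnny keeps the newer record, Betsy is purged, Norman wins over Nichole,
    # record four merges in untouched, Sam's fresh tombstone stays, Henrie is added.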

So, let's take an alternative look at this, from the SSTable level rather than the data level. You'll see there's a group of partitions in one SSTable and another group of partitions in another SSTable. Let's see what happens. We create a brand new SSTable. We bring down one partition; it doesn't exist in the other file, so it just copies over. We have a partition 7 in both files and a bunch of tombstones in the first one, so we clean it up and it compacts down a little bit. Then we have partition 13. It moves down; lots of tombstones in that one too, and it cleans up nicely. Then partition 18, again, more cleanup. Partition 21, not so much. Then we move down 36, which cleans up. Partition 58 actually merges together, because there's not a lot of overlap. And then finally, 84 and 92: those partitions get dropped completely. GC grace passed a long time ago; it is time for those records to just disappear. This is a cleanup operation, and what we see is all that old data eventually getting cleaned out.

Now, compaction does all of this housekeeping, and it's super important. Let me wrap up by stressing why. If you have your data all in one SSTable, then when you do a read, you get a very focused read: you don't have to go across multiple SSTables. Every extra SSTable is another seek, more partition indexes, more partition summaries, more Bloom filters, more key cache that has to get involved. The fewer SSTables holding that partition of data, the better. So keeping up with compaction in the background is just normal housekeeping for Cassandra. Let it run; you don't have to do anything special. By letting it run, you're getting a much more efficient system, and that's what you want.

Course Conclusion

Okay, you are done listening to me talk, and congratulations, you got through all of it. You had to listen to me for this entire time, but I'm glad you did because you've learned quite a bit here.

So this is DS201, Core Concepts, but there are other courses you can take, and I always recommend doing Data Modeling immediately after this. Not just because data modeling is near and dear to my heart, but because it's the cornerstone of making a great Cassandra application. Understanding data modeling will make or break your application. What you've learned here is really important, but what you get out of Data Modeling will extend it and make you really ready to put things into production.

Now, there are other courses you should think about taking. Operations and Performance Tuning is really critical when you're trying to get the most out of your Cassandra nodes, and then you can extend that to a real data platform with DataStax Enterprise. We have Search, we have Analytics, and coming soon, even Graph. So use all of this to enhance your career and make a lot of money doing what you love to do, which is write cool code and make killer applications. Go out there and show us what you've got.