Data Modeling with Buckets

DSE Version: 6.7

Intro

Exercises

Slides

Resources

When data modeling for Apache Cassandratm, sometimes an organic data model may have some performance problems. In these cases, it is useful to understand how to use buckets to help improve your performance. This unit should make you a bucket expert!

No write up.

Exercise: Data Modeling with Buckets

In this exercise, you will:

  • Understand what buckets are

  • See how to use them

  • Create a table you will eventually use to display recent videos

When data modeling for Apache Cassandratm, sometimes an organic data model may have some performance problems. In these cases, it is useful to understand how to use buckets to help improve your performance.

This unit explains what buckets are and how you can use them to improve performance of certain types of Cassandra data models.

Steps

1) Take a look at the KillrVideo landing page. Make sure your KillrVideo app is running (if you need to start it, use the blue dropdown). Then, go to the home page of the app.

http://<YOUR IP ADDRESS>:3000

Notice there is an area that should display the latest videos but instead tries to use an uncanny Star Wars trick to make you feel that nothing is amiss (you may need to hard-refresh the page CMD+SHIFT+R for Mac/Chrome or CTRL+F5 Windows/Chrome):

 

When a user submits a new video, the KillrVideo app needs to show that video on the landing page. This means the app needs to insert the video into an additional "videos" table that will support a different query. We'll name this new table, latest_videos. This table looks something like this (except it's missing some parts you will fill in):

CREATE TABLE killrvideo.latest_videos (
    /* YOU FILL THIS IN SECTION */
    added_date TIMESTAMP,
    videoid UUID,
    name TEXT,
    preview_image_location TEXT,
    userid UUID,
    PRIMARY KEY ( /* YOU FILL THIS IN SECTION TOO */)
);

Notice that this table has many of the same columns as the videos table. The goal is to query videos by the latest added_date.

Discuss the next few steps (2-5) with your neighbor...

You are trying to figure out how to fill in the missing pieces of the CREATE TABLE statement above.

2) What are we querying the latest_videos table by?

Hint

The goal is to grab the latest videos. In a relational model, we would order the videos in descending order of their added_date and pick off the top N videos (N would be around ten to twenty). But in Cassandra, we prepare our tables for the queries up front. This means we would order the video rows on disk by their added_date. However in Cassandra, we handle ordering via clustering columns (the columns that come after the partition key within the primary key). Hmmmm. What should the partition key be?

3) How should we partition latest_videos table?

Hint

Strictly speaking, we don't need a partition key for this query because we only want the top N videos. However, Cassandra requires it. So, we will artificially partition the videos. We call these artificial partitions buckets. Ideally, a bucket would contain just those videos you would want to display, like a single-day's worth. But for this exercise, we will place all of the latest videos into a single bucket (INT), with a bucket partition key value of zero.

A second approach would use several buckets, one per minute, day, hour, etc. whatever granularity you deem necessary. Then you would group the latest videos into buckets separated by your chosen time slice. If you needed to pull the latest N videos, your application could start at now and walk backwards in time pulling individual buckets until it had N videos to show. This second bucketing approach (i.e., using time as the partition key) would be better both, in terms of cluster performance, and for rolling time-windows. But, we want to keep this exercise simple. So we will use the simpler integer partition key approach.

To support the integer partition approach, you will have to add the bucket column to the table.

CREATE TABLE killrvideo.latest_videos (
    bucket INT,
    added_date TIMESTAMP,
    videoid UUID,
    name text,
    preview_image_location TEXT,
    userid UUID,
    PRIMARY KEY ((bucket), /* WHAT ELSE??? */)
);

4) What is the ordering criteria?

Hint

As we mentioned earlier, the ordering criteria is added_date, in descending order. We must add it as a clustering column so Cassandra orders by that column within our bucket. Add the WITH CLUSTERING ORDER BY criteria to our latest_videos data model to override the default ascending ordering:

CREATE TABLE killrvideo.latest_videos (
    bucket INT,
    added_date TIMESTAMP,
    videoid UUID,
    name TEXT,
    preview_image_location TEXT,
    userid UUID,
    PRIMARY KEY ((bucket), added_date, /* WHAT ELSE??? */)
) WITH CLUSTERING ORDER BY (added_date DESC);

5) Do we need to add any other columns to the primary key for uniqueness?

Hint

Absolutely. Look at the PRIMARY KEY as it currently stands. The combination of bucket and added_date will not be unique. Unless we add further columns to our PRIMARY KEY, two videos added to the site at the exact same time would upsert each other. Yi yi yi. That sounds like something Steve would do. Be smarter than Steve (admittedly a low-bar). What other column(s) could we add to the PRIMARY KEY to get uniqueness?

Ah, good job. Jamie knew you could be smarter than Steve. The videoid alone is enough to provide uniqueness. We will add it onto the end of our PRIMARY KEY.

CREATE TABLE killrvideo.latest_videos (
    bucket INT,
    added_date TIMESTAMP,
    videoid UUID,
    name TEXT,
    preview_image_location TEXT,
    userid UUID,
    PRIMARY KEY ((bucket), added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC);

In a relational model, the PRIMARY KEY serves a single purpose: provide uniqueness. Remember, however, that the PRIMARY KEY in an Apache Cassandra™ table serves two purposes: 1) service the query, and 2) provide uniqueness. Notice in this module that we modeled for the query first, and then the data. This is because Cassandra stores the data query-ready instead of coalescing the results at query time.

You may also be wondering what would happen as more and more videos come in...wouldn't the partition become excessively large? YES! Your application would have to occasionally drop older videos from the table. Remember, we previously mentioned improved partitioning techniques (e.g. by month, by week, by day, etc.).

6) Create the table. Now that we understand what the latest_videos table looks like and why, use cqlsh to create this table.

Need the command to launch cqlsh? Click here.
Launch cqlsh with the following (remember the password is KVPassword):
cqlsh --username KVUser --cqlshrc /projects/creds/cqlshrc
CREATE TABLE killrvideo.latest_videos (
    bucket INT,
    added_date TIMESTAMP,
    videoid UUID,
    name TEXT,
    preview_image_location TEXT,
    userid UUID,
    PRIMARY KEY ((bucket), added_date, videoid)
) WITH CLUSTERING ORDER BY (added_date DESC);

7) Fill that bucket! Let's kick the tires on our new table and fill the bucket. We will use bucket 0. Execute the following INSERT commands. Notice the added_date values are out of order.

INSERT INTO killrvideo.latest_videos (bucket, videoid, added_date, name)
VALUES (0, a24788e1-db2c-473d-8d7a-3916d0cb8c79, '2019-05-02', 'Best of Grumpy Cat 2014 || HAPPY NEW YEAR!!');
INSERT INTO killrvideo.latest_videos (bucket, videoid, added_date, name)
VALUES (0, b5876a73-b898-4515-a042-f9012ccd1eeb, '2019-05-04', 'Funniest Dogs and Cats - Awesome Funny Pet Animals Life Videos');
INSERT INTO killrvideo.latest_videos (bucket, videoid, added_date, name)
VALUES (0, bb76c5e7-666f-4876-b565-33eaa175dd42, '2019-05-03', 'Funny crazy cat videos  - Compilation 2016');
INSERT INTO killrvideo.latest_videos (bucket, videoid, added_date, name)
VALUES (0, 83201bd6-c336-4770-a281-237a6a1c5ff0, '2019-05-01', 'CAN YOU HOLD YOUR LAUGH? - Crazy CATS at their best - Funny and Cute!');
INSERT INTO killrvideo.latest_videos (bucket, videoid, added_date, name)
VALUES (0, cf0ee7e0-0d7e-400d-b548-7549deb1d210, '2019-05-05', 'The funniest and most humorous cat videos ever! - Funny cat compilation');

8) View that sorted data! Execute the following command to view the contents of your latest_videos table.

SELECT * FROM killrvideo.latest_videos;

Notice that Cassandra returns the rows in descending order of added_date. That means the most recently added videos will appear first. Remember, this is a basic Hello World introduction to bucketing. We could (and should) get fancier to handle more videos as they are added over time. You can learn more details on DataStax Academy.

END OF EXERCISE

No FAQs.

Partitioning with buckets: use buckets to break data into chunks for implementing composite partition keys.

Comments are closed.