About this Short Course

The use of time series data for business analysis is not a new movement. What is new is the ability to collect and analyze massive volumes of data in sequence at extremely high velocity to get the clearest picture to predict and forecast future market changes, user behavior, environmental conditions, resource consumption, health trends and much, much more.

Apache Cassandra
DataStax Enterprise
Data Modeling

Why Apache Cassandra™ for Time Series Data?

Apache Cassandra is a superior NoSQL database platform for these types of Big Data challenges. Cassandra’s data model is an excellent fit for handling data in sequence regardless of datatype or size. When writing data to Cassandra, data is sorted and written sequentially to disk. When retrieving data by row key and then by range, you get a fast and efficient access pattern due to minimal disk seeks – time series data is an excellent fit for this type of pattern. Apache Cassandra allows businesses to identify meaningful characteristics in their time series data as fast as possible to make clear decisions about expected future outcomes.  

Time Series Data Modeling with Apache Cassandra™

The following time series data model examples are based on a use case where a weather station is creating temperature data every minute. You will see how using the row key and sequence can be a powerful data modeling tool.  

Time Series Pattern 1- Single Device per Row

The simplest model for storing time series data is creating a wide row of data for each source. In this first example, we will use the weather station ID as the row key. The timestamp of the reading will be the column name and the temperature the column value (figure 1). Since each column is dynamic, our row will grow as needed to accommodate the data. We will also get the built-in sorting of Cassandra to keep everything in order.

Time Series Data on Cassandra - Single device per row

CREATE TABLE temperature (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time)
);

 

Now we can insert a few data points for our weather station:

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:01:00′,’72F’);

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:02:00′,’73F’);

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:03:00′,’73F’);

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:04:00′,’74F’);

 

A simple query looking for all data on a single weather station:

SELECT event_time,temperature
FROM temperature
WHERE weatherstation_id=’1234ABCD’;

 

A range query looking for data between two dates. This is also known as a slice since it will read a sequence of data from disk:

SELECT temperature
FROM temperature
WHERE weatherstation_id=’1234ABCD’
AND event_time > ’2013-04-03 07:01:00′
AND event_time < ’2013-04-03 07:04:00′;

 

Time Series Pattern 2 - Partitioning to Limit Row Size

In some cases, the amount of data gathered for a single device isn’t practical to fit onto a single row. Cassandra can store up to 2 billion columns per row, but if we’re storing data every millisecond you wouldn’t even get a month’s worth of data. The solution is to use a pattern called row partitioning by adding data to the row key to limit the amount of columns you get per device. Using data already available in the event, we can use the date portion of the timestamp and add that to the weather station id. This will give us a row per day, per weather station, and an easy way to find the data. (figure 2)  

CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id,date),event_time)
);

Note the (weatherstation_id,date) portion. When we do that in the PRMARY KEY definition, the key will be compounded with the two elements. Now when we insert data, the key will group all weather data for a single day on a single row.

INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03′,’2013-04-03 07:01:00′,’72F’);

INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03′,’2013-04-03 07:02:00′,’73F’);

INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-04′,’2013-04-04 07:01:00′,’73F’);

INSERT INTO temperature_by_day(weatherstation_id,date,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-04′,’2013-04-04 07:02:00′,’74F’);

 

To get all the weather data for a single day, we can query using both elements of the key:

SELECT *
FROM temperature_by_day
WHERE weatherstation_id="1234ABCD"
AND date="2013-04-03";

 

Time Series Pattern 3 - Reverse Order Timeseries with Expiring Columns

Another common pattern with time series data is rolling storage. Imagine we are using this data for a dashboard application and we only want to show the last 10 temperature readings. Older data is no longer useful, so can be purged eventually. With many other databases, you would have to setup a background job to clean out older data. With Cassandra, we can take advantage of a feature called expiring columns to have our data quietly disappear after a set amount of seconds. (figure 3)

Time Series Data on Cassandra - Reverse order timeseries with expiring columns

CREATE TABLE latest_temperatures (
weatherstation_id text,
event_time timestamp,
temperature text,
PRIMARY KEY (weatherstation_id,event_time),
) WITH CLUSTERING ORDER BY (event_time DESC);

 

Now when we insert data. Note the TTL of 20 which means the data will expire in 20 seconds:

INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:03:00′,’72F’) USING TTL 20;

INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:02:00′,’73F’) USING TTL 20;

INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:01:00′,’73F’) USING TTL 20;

INSERT INTO latest_temperatures(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:04:00′,’74F’) USING TTL 20;

 

As soon as you insert the data, start selecting all rows over and over. Eventually, you will see all the data disappear. This is an example of the TTL period expiring. Imagine what kind of interesting things you could do with your application data model using these.  

 

In Conclusion

Time series is one of the most compelling data models for Cassandra. It’s a natural fit for the big table data model and scales well under a variety of variations. Many production use cases are similar to the examples above. For those users, the problem of storing data at machine generating speeds and still keeping it organized in a useful manner is no longer a challenge. Hopefully, this getting started content will get your creative juices flowing.  

 

Time Series Data Modeling Resources

• Cassandra is an excellent database platform for advanced time series data.

• See a presentation by the head of operations at Librato how Librato's metrics platform relies on Cassandra as its sole data storage platform for time-series data.