DSE Version: 6.0

Read Techniques

Video

In this unit, we will take a look at some alternative ways to perform reads in Apache Cassandra or DataStax Enterprise. For example, in some cases using an index may be helpful. Or, if certain requirements or queries change, it may be possible to accommodate those changes without completely overhauling your data model.

Transcript: 

In this video, we'll take a look at some alternative ways to perform reads. Although the data modeling methodology we have shown stresses making sure that your tables are modeled so that your known queries work, there are times when it may make more sense to just use an index. Or, if certain requirements or queries change, it may be possible to accommodate those changes without having to completely overhaul your data model.

The first way we can perform reads is through the use of a secondary index. This is something that needs to be manually defined by the user, but it allows DataStax Enterprise to create and maintain an index that can then be used to retrieve data more quickly and efficiently, especially for columns that are not part of the primary key. Since the index is a separate data structure, it does not have any effect on the structure of the table itself. There are some limitations, however: a secondary index cannot be created on a counter column or on a column declared as static.
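
A secondary index is created with a single CQL statement. Here is a minimal sketch; the keyspace, table, and column names below are illustrative assumptions, not taken from the course schema.

-- Hypothetical keyspace and table used throughout these sketches.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE demo.videos (
  video_id uuid PRIMARY KEY,
  title    text,
  category text
);

-- Secondary index on a regular (non-primary-key) column.
CREATE INDEX videos_by_category_idx ON demo.videos (category);

-- The index now allows filtering on category.
SELECT video_id, title FROM demo.videos WHERE category = 'Data Modeling';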

As mentioned, the secondary index is a separate data structure that exists alongside the actual data. Each node has its own local index, which contains index entries for the partitions stored locally on that node; collectively, these local indexes make up the secondary index. Since the secondary index is divided into local indexes, a query generally needs to go to every node to check each local index and make sure the entire secondary index is consulted. This tends to be a relatively expensive query. The most efficient use case for secondary indexes is querying for rows inside a specific partition, meaning that the partition key is provided in the query. The query can then go directly to the node that has the data and only needs to check the local index on that node to retrieve the appropriate rows.

As far as how it works, the client sends a query to a coordinator node. If the query involves a column that has a secondary index and no partition key is provided, the coordinator has to send the request to all nodes. When each node receives the request, it searches its local index to find the required partitions and rows, and returns the results to the coordinator. Once the coordinator has received a response from every node, it combines the results and returns them to the client. Note that "all nodes" here is the general idea when there is no replication. When replication is used, you do not need all of the nodes to retrieve the entire result set, so only a subset of the nodes is contacted, whatever number is needed to cover the entire token range.
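
Using the hypothetical demo keyspace from the sketch above, the two query shapes look like this; the comments_by_video table here is also an assumption for illustration.

-- A multi-row partition: video_id is the partition key, comment_id clusters within it.
CREATE TABLE demo.comments_by_video (
  video_id   uuid,
  comment_id timeuuid,
  author     text,
  comment    text,
  PRIMARY KEY ((video_id), comment_id)
);

CREATE INDEX comments_by_author_idx ON demo.comments_by_video (author);

-- No partition key: the coordinator fans out across the cluster and each node
-- consults its local index.
SELECT * FROM demo.comments_by_video WHERE author = 'cassandra_fan';

-- Partition key provided: only the node(s) owning that partition are contacted,
-- and only their local index is checked.
SELECT * FROM demo.comments_by_video
 WHERE video_id = 245e8024-14bd-11e5-9743-8238356b7e32
   AND author = 'cassandra_fan';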

Here we have some guidelines on when to use and when not to use secondary indexes. Small datasets are fine, or whenever you can provide a partition key to make secondary index usage more efficient. Secondary indexes don't work with counter columns, and they don't make sense for columns that are frequently updated or deleted, since every change also means additional work to maintain the secondary index in the background.

A more complex topic is deciding whether to use a secondary index depending on the cardinality of the column. Cardinality refers to the number of unique values that can be found for the column: a small number of unique values means low cardinality, and a large number of unique values means high cardinality. With low cardinality, having fewer possible values for the column generally means that more rows are returned, which makes it worth checking all of the nodes to retrieve the query results. With high cardinality, having a large number of unique values makes it more probable that there are fewer rows with any given value. With fewer rows, you may be checking all of the nodes for a few rows that could have been found on one or two nodes. That makes the query fairly inefficient, and there are other ways this type of query could be done instead.
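
To make the contrast concrete, here is a hedged sketch on top of the hypothetical demo keyspace; the users table, its columns, and the cardinality characterizations are assumptions for illustration.

-- A users table keyed by user_id (column names are assumptions based on the transcript).
CREATE TABLE demo.users (
  user_id    uuid PRIMARY KEY,
  first_name text,
  last_name  text,
  country    text,
  email      text
);

-- country has few distinct values (low cardinality): each value matches many rows,
-- so the fan-out to every node at least returns a substantial result set.
CREATE INDEX users_by_country_idx ON demo.users (country);

-- email is effectively unique (high cardinality): a cluster-wide fan-out typically
-- finds a single row on a single node, so a secondary index is a poor fit here.
-- A materialized view, covered next, is usually the better choice.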

This leads us to materialized views, a different type of index that is more suitable for high cardinality data. Basically a materialized view is a separate table built from the original one, with the same data, but with a different primary key. The new primary key, particularly the partition key, would be set to the column that you want to search on.

Since materialized views are created from a base table, there is a requirement that the primary key for the materialized view must include the primary key from the base table, so that there is a 1-to-1 correlation between the rows in the base table and the materialized view. Only one new column can be added to the primary key, with the restriction that it is not a static column.

In this example, we are making a materialized view based on a users table, which has a primary key using the user_id. In order to be able to search for users by their email address, we would need to create the materialized view with the email column as part of the new primary key.

From the CREATE MATERIALIZED VIEW example, the AS SELECT part determines which columns we want to include in the materialized view. The FROM specifies the base table, and the WHERE must specify that every primary key column is not null. Since it's possible that the base users table has rows without an email, those rows should not actually be included in the materialized view. Finally, for the primary key of the materialized view, we have to make sure that it includes the original primary key of the base table, which is user_id, along with the column we want to search on, which is email.
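
Based on that description, the statement would look roughly like the following sketch; it reuses the hypothetical demo.users table from earlier, and the view name user_by_email comes from the transcript.

CREATE MATERIALIZED VIEW demo.user_by_email AS
  SELECT user_id, first_name, last_name, email
    FROM demo.users
   WHERE email IS NOT NULL
     AND user_id IS NOT NULL
  PRIMARY KEY ((email), user_id);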

Once the materialized view is created, you can treat it just like any other table in your query. Here we are trying to retrieve the first name, last name, and email of the user in the materialized view user_by_email where the email is iluvbigdata@datastax.com. The query returns the row as normal, as can be seen below.
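
A sketch of that query, using the column names assumed above:

SELECT first_name, last_name, email
  FROM demo.user_by_email
 WHERE email = 'iluvbigdata@datastax.com';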

There are a few things to watch out for with materialized views. You cannot actually write to a materialized view; it is updated only when the base table is updated. The updates from the base table to the materialized view are also asynchronous, meaning that there is always some sort of delay when updating. Since the partition key for the base table is different from that of the materialized view, the node where the base table is updated is not necessarily the same node where the materialized view is updated. In edge conditions the delay in the update may be more noticeable. Finally, note that read repairs are not a two-way street for materialized views. A read repair triggered when reading from the base table will repair both the base table and the materialized view. However, a read repair triggered when reading from the materialized view will only repair the materialized view itself.

Data aggregation is another type of "read" that is commonly used in databases, and DataStax Enterprise does support some of the most common aggregates. In DSE 6, this means SUM, AVG, COUNT, MIN, and MAX are all supported. Note that prior to DSE 5.0, only COUNT was available. Keep in mind that these aggregates are calculated at query time, meaning that DSE goes through the result set to calculate these aggregates instead of just pulling statistics from an index. This should be suitable for many use cases, but if you're looking to get the count of the number of rows in a table, this would not be the best way to do so.
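
As an illustration, here is a hedged sketch against the hypothetical comments_by_video table from earlier; the point is that DSE scans the matching rows at query time to compute the result.

-- Restricted to one partition: only that partition's rows are scanned at query time.
SELECT COUNT(*)
  FROM demo.comments_by_video
 WHERE video_id = 245e8024-14bd-11e5-9743-8238356b7e32;

-- SUM, AVG, MIN, and MAX work the same way on the appropriate column types.

-- Unrestricted: every row in the table is walked at query time, which is why this
-- is not a good way to count the rows of a large table.
SELECT COUNT(*) FROM demo.comments_by_video;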

Fortunately, there are several techniques that can be used to get the aggregates you want. Some can be done purely through the Cassandra Core part of DataStax Enterprise. One is using lightweight transactions to check and replace aggregates as rows get added or updated. Another is to make use of the counter type to implement certain aggregates such as count or sum.
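
As a rough sketch of the lightweight-transaction idea (the video_stats table and max_rating column are hypothetical, not from the course): a conditional UPDATE checks the stored aggregate before replacing it.

-- Hypothetical table holding a running aggregate per video.
CREATE TABLE demo.video_stats (
  video_id   uuid PRIMARY KEY,
  max_rating int
);

-- Seed the row so the condition below has something to compare against.
INSERT INTO demo.video_stats (video_id, max_rating)
VALUES (245e8024-14bd-11e5-9743-8238356b7e32, 0);

-- Check-and-replace with a lightweight transaction: the new value is only written
-- if it is still larger than what is stored. The [applied] flag in the response
-- tells the application whether the update won; if not, it can re-read and retry.
UPDATE demo.video_stats
   SET max_rating = 5
 WHERE video_id = 245e8024-14bd-11e5-9743-8238356b7e32
    IF max_rating < 5;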

Otherwise, aggregates can also be done outside of Cassandra, either as part of the application logic, or by making use of DataStax Enterprise's Analytics or Search features.

Here's an example of using the counter type to implement aggregations for count and sum. Let's say that we want to be able to track the ratings given by users for a specific video. In our ratings-by-video table, we create two counters, num_ratings and sum_ratings. Then, whenever a new rating is submitted, we increment the num_ratings counter by 1 and the sum_ratings counter by the rating given by the user. There are some caveats to this technique, though. One is that updating a counter is slower than a normal write, since it does a read before write in the background. The other is that incrementing a counter is not an idempotent operation. What that means is that if an update to a counter does not return a response, there is no way to retry the operation such that we know the counter reflects the true intended value. In this case the value stored is a very good approximation, but not necessarily accurate.
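
A sketch of that table and update, with names taken from the transcript (the exact schema used in the course may differ):

-- In a counter table, every non-primary-key column must be a counter.
CREATE TABLE demo.ratings_by_video (
  video_id    uuid PRIMARY KEY,
  num_ratings counter,
  sum_ratings counter
);

-- A user submits a rating of 4 for this video: bump the count by 1 and the sum by 4.
UPDATE demo.ratings_by_video
   SET num_ratings = num_ratings + 1,
       sum_ratings = sum_ratings + 4
 WHERE video_id = 245e8024-14bd-11e5-9743-8238356b7e32;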

To perform an aggregate in an application, you are basically performing calculations on the client side. One common way is by retrieving the data needed for the aggregate calculation from the database, doing the calculation, and then storing it back in the database.

Using our counters example from the previous slide, we can make use of the num_ratings and sum_ratings counters to get the average. We just query the values of the num_ratings and sum_ratings columns and then perform the calculation on the application side. If we want, we can also store that average as an avg_rating column in the videos table.
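
In CQL terms, the flow would look roughly like this sketch; the division happens in the application, and the avg_rating column added to the hypothetical demo.videos table is an assumption.

-- 1) Read the counters for the video.
SELECT num_ratings, sum_ratings
  FROM demo.ratings_by_video
 WHERE video_id = 245e8024-14bd-11e5-9743-8238356b7e32;

-- 2) The application computes avg_rating = sum_ratings / num_ratings client-side.

-- 3) Optionally store the result back for cheap reads later.
ALTER TABLE demo.videos ADD avg_rating float;

UPDATE demo.videos
   SET avg_rating = 4.2
 WHERE video_id = 245e8024-14bd-11e5-9743-8238356b7e32;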
