Read Techniques

DSE Version: 6.0

Video

Exercises

In this unit, we will be taking a look at some alternate ways to perform reads in Apache Cassandra or DataStax Enterprise. For example, in some cases using an index may be helpful. Or if certain requirements or queries change, it may be possible to enable these changes without completely overhauling your data model. 

Transcript: 

In this video, we'll be taking a look at some alternate ways to perform reads. Although the data modeling methodology we have shown stresses making sure that your tables are modeled so that your known queries work, there are times when it may make more sense to simply use an index. Or if certain requirements or queries change, it may be possible to accommodate those changes without having to completely overhaul your data model.

The first technique is the secondary index. A secondary index must be manually defined by the user, but it allows DataStax Enterprise to create and maintain an index that can then be used to retrieve data more quickly and efficiently, especially for columns that are not part of the primary key. Because the index is a separate data structure, it has no effect on the structure of the table itself. There are some limitations, however: a secondary index cannot be created on a counter column or on a column declared as static.

As mentioned, the secondary index is a separate data structure that exists alongside the actual data. Each node has its own local index, which contains index entries for the partitions stored locally on that node; collectively, these local indexes make up the secondary index. Because the secondary index is divided into local indexes, a query generally needs to go to every node so that each local index is checked and the entire secondary index is consulted. This tends to be a relatively expensive query. The most efficient use case for a secondary index is querying for rows inside a specific partition, meaning that the partition key is provided in the query. This allows direct access to the node that has the data, and only the local index on that node needs to be checked to retrieve the appropriate rows.
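As a minimal sketch of that efficient case, suppose a hypothetical comments_by_video table is partitioned by video_id and has a secondary index on a commenter column (the table, column, and index names here are illustrative assumptions, not part of the KillrVideo schema):

CREATE INDEX commenter_idx ON comments_by_video (commenter);

-- Efficient: the partition key is provided, so only the local index on the
-- replicas for that partition needs to be consulted.
SELECT * FROM comments_by_video
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47
    AND commenter = 'hsimpson';

-- Expensive: no partition key, so the coordinator may need to contact every node.
SELECT * FROM comments_by_video
  WHERE commenter = 'hsimpson';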

As far as how it works, the client sends a query to a coordinator node. If the query involves a column that has a secondary index, and no partition key is provided, then the coordinator has to send the request to all nodes. When each node receives the request, it searches its local index to find the required partitions and rows, and returns the results to the coordinator. Once the coordinator has received a response from every node, it combines the results and returns them to the client. Note that "all nodes" here is the general idea when there is no replication. When replication is being used, you do not need all of the nodes to retrieve the entire result set, so only a subset of the nodes is contacted, whatever number is needed to cover the entire token range.

Here we have some guidelines on when to use and when not to use secondary indexes. Small datasets are fine, as is any case where you can provide a partition key to make secondary index usage more efficient. Secondary indexes don't work with counter columns, and they don't make sense for columns that are frequently updated or deleted, since maintaining the secondary index in the background is additional work as well.

A more complex consideration when deciding whether to use a secondary index is the cardinality of the column. Cardinality refers to the number of unique values found in the column: a small number of unique values is low cardinality, and a large number of unique values is high cardinality. With low cardinality, having fewer possible values for the column generally means that more rows are returned, which makes it worth checking all of the nodes to retrieve the query results. With high cardinality, a large number of unique values makes it more probable that only a few rows match any given value. With so few matching rows, you may be checking all of the nodes for a handful of rows that could have been found on one or two nodes. That is quite inefficient, and there are other ways this type of query could be done instead.

This leads us to materialized views, a different type of index that is more suitable for high cardinality data. Basically a materialized view is a separate table built from the original one, with the same data, but with a different primary key. The new primary key, particularly the partition key, would be set to the column that you want to search on.

Since materialized views are created from a base table, there is a requirement that the primary key for the materialized view must include the primary key from the base table, so that there is a 1-to-1 correlation between the rows in the base table and the materialized view. Only one new column can be added to the primary key, with the restriction that it is not a static column.

In this example, we are making a materialized view based on a users table, which has a primary key using the user_id. In order to be able to search for users by their email address, we would need to create the materialized view with the email column as part of the new primary key.

From the CREATE MATERIALIZED VIEW example, the AS SELECT part determines which columns we want to include in the materialized view. The FROM clause specifies the base table, and the WHERE clause must specify that all primary key columns are not null. Since it is possible that the base users table has rows without an email, those rows should not actually be included in the materialized view. Finally, for the primary key of the materialized view, we have to make sure that it includes the original primary key of the base table, which is user_id, along with the column we want to search on, which is email.
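A minimal sketch of what that statement could look like, assuming the base users table also has first_name and last_name columns (the exact column list would follow the slide):

CREATE MATERIALIZED VIEW user_by_email
  AS SELECT user_id, first_name, last_name, email
  FROM users
  WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);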

Once the materialized view is created, you can treat it just like any other table in your queries. Here we are trying to retrieve the first name, last name, and email of the user in the materialized view user_by_email where the email is iluvbigdata@datastax.com. The matching row is returned as normal.
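That query would look something like the following (the first_name and last_name column names are assumptions based on the description above):

SELECT first_name, last_name, email
  FROM user_by_email
  WHERE email = 'iluvbigdata@datastax.com';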

There are a few things to watch out for with materialized views. You cannot actually write to a materialized view; it is updated only when the base table is updated. Updates from the base table to the materialized view are also asynchronous, meaning that there is always some delay before the view reflects the change. Since the partition key for the base table is different from that of the materialized view, the node where the base table is updated is not necessarily the same node where the materialized view is updated, and in edge conditions the delay may be more noticeable. Finally, note that read repairs are not a two-way street for materialized views. A read repair triggered when reading from the base table will repair both the base table and the materialized view, but a read repair triggered when reading from the materialized view will only repair the materialized view itself.

Data aggregation is another type of "read" that is commonly used in databases, and DataStax Enterprise supports some of the most common aggregates. In DSE 6, this means SUM, AVG, COUNT, MIN, and MAX are all supported. Note that prior to DSE 5.0, only COUNT was available. Keep in mind that these aggregates are calculated at query time, meaning that DSE goes through the result set to calculate them rather than pulling statistics from an index. This is suitable for many use cases, but if you are looking to get the count of all rows in a table, it would not be the best way to do so.
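For example, a built-in aggregate works well when the read is bounded to a single partition. Here is a sketch assuming a hypothetical rating_events_by_video table that is partitioned by video_id and stores one row per submitted rating:

-- Aggregates are computed over the rows the query actually reads, so keeping
-- the read within one partition keeps the work bounded.
SELECT COUNT(*), AVG(rating), MIN(rating), MAX(rating)
  FROM rating_events_by_video
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;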

Fortunately, there are several techniques that can possibly get you the aggregates you want. Some can be done purely through the Cassandra Core part of DataStax Enterprise. One is using lightweight transactions to check and replace aggregates as rows get added or updated, as sketched below. Another is to make use of the counter type to implement certain aggregates such as count or sum.
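Here is a minimal sketch of the lightweight transaction approach, assuming a hypothetical video_stats table with a regular int column total_views (lightweight transactions cannot be used on counter columns):

-- Read the current value...
SELECT total_views FROM video_stats
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;

-- ...then conditionally replace it. The IF clause makes this a lightweight
-- transaction: the write only applies if the stored value is still the one
-- we read, which protects against concurrent updates.
UPDATE video_stats
  SET total_views = 43
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47
  IF total_views = 42;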

Otherwise, aggregates can also be computed outside of Cassandra, either as part of the application logic or by making use of DataStax Enterprise's Analytics or Search features.

Here's an example of using the counter type to implement aggregations for count and sum. Let's say that we want to be able to track the ratings given by users for a specific video. In our ratings_by_video table, we create two counters, num_ratings and sum_ratings. Then, whenever a new rating is submitted, we increment the num_ratings counter by 1 and the sum_ratings counter by the rating given by the user. There are some caveats to this technique, though. One is that updating a counter is slower than a normal write, since it does a read before the write in the background. The other is that incrementing a counter is not an idempotent operation: if an update to a counter does not return a response, there is no safe way to retry the operation, so we cannot be certain the counter reflects the true intended value. In this case the value stored is a very good approximation, but not necessarily exact.
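In CQL, that could look something like the following sketch (the exact table definition is an assumption based on the description above):

CREATE TABLE ratings_by_video (
  video_id    uuid PRIMARY KEY,
  num_ratings counter,
  sum_ratings counter
);

-- A user submits a rating of 4 for this video:
UPDATE ratings_by_video
  SET num_ratings = num_ratings + 1,
      sum_ratings = sum_ratings + 4
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;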

To perform an aggregate in an application, you are basically performing calculations on the client side. One common way is by retrieving the data needed for the aggregate calculation from the database, doing the calculation, and then storing it back in the database.

Using our counters example from the previous slide, we can make use of the num_ratings and sum_ratings counters to get the average. We just query the values of the num_ratings and sum_ratings columns and then perform the calculation on the application side. If we want, we can also store that average as an avg_rating column in the videos table, as sketched below.
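In CQL terms, that flow could look like this, with the division itself happening in the application (the avg_rating column on the videos table is an assumption from the description above):

-- 1. Read the counters for the video.
SELECT num_ratings, sum_ratings
  FROM ratings_by_video
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;

-- 2. In the application: avg_rating = sum_ratings / num_ratings

-- 3. Optionally write the computed average back to the videos table.
UPDATE videos
  SET avg_rating = 4.2
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;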


Exercise 5.1: Creating Secondary Indexes

In this exercise, you will:

  • Create a secondary index on a table

Background

Based on surveys, focus groups, and other user feedback, KillrVideo has learned that users want to be able to determine an actor's name based on the video the actor appeared in and the name of the character that actor played. Sometimes they also want to determine the actor's name based on the character's name alone, because they don't recall the name of the movie. Although secondary indexes aren't generally a best practice, you have been studying the topic and know that this is one instance where a secondary index could make sense.

Steps

  1.  Navigate to '/home/ubuntu/labwork/final'.
  2.  Launch 'cqlsh' and switch to 'killr_video' keyspace.
  3.  Create a secondary index on the actors_by_video table using the following syntax:

CREATE INDEX [index name] ON [table name]([field name])

  4.  Query the table as follows to validate your work (a sketch follows these steps if you get stuck):
  • Search for all rows where the video_id = 87c645e8-0ef2-11e5-98f3-8438355b7e3a AND the character is named Kelly La Fonda.
  • Search for all rows where the character name is George McFly.
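If you get stuck, here is a sketch of what the index and validation queries might look like. The character_name column name and the index name are assumptions; check the actual actors_by_video schema in the killr_video keyspace:

CREATE INDEX character_name_idx ON actors_by_video (character_name);

SELECT * FROM actors_by_video
  WHERE video_id = 87c645e8-0ef2-11e5-98f3-8438355b7e3a
    AND character_name = 'Kelly La Fonda';

SELECT * FROM actors_by_video
  WHERE character_name = 'George McFly';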

Exercise 5.2: Creating Materialized Views

In this exercise, you will:

  • Create a materialized view for an existing table.
  • Execute a search to validate the design of the materialized view.

Scenario

You have been asked to extend the data model so that the application enables users to see a list of videos uploaded by a specific user. Although the videos table contains both the video details and the user that uploaded each video, the primary key only contains the video id. You have read up on materialized views and know you can use a materialized view to fulfill this requirement.

Steps

  1. Navigate to '/home/ubuntu/labwork/final'.
  2. Launch 'cqlsh' and switch to 'killr_video' keyspace.
  3. Create the materialized view per the requirements in the scenario above.
  4. Validate the design of your materialized view using user_id 723f6f5f-3658-4449-90d0-4391d63e50a8. You should retrieve 13 rows. (A hedged sketch of one possible view follows these steps.)
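As a hedged sketch only, a materialized view keyed by the uploader might look something like the following. The view name is an assumption, and the column list should include whatever video details the application needs along with the full primary key of the base videos table:

CREATE MATERIALIZED VIEW videos_by_user
  AS SELECT *
  FROM videos
  WHERE user_id IS NOT NULL AND video_id IS NOT NULL
  PRIMARY KEY (user_id, video_id);

SELECT * FROM videos_by_user
  WHERE user_id = 723f6f5f-3658-4449-90d0-4391d63e50a8;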

Exercise 5.3: Implementing Aggregation In Your Data Model

In this exercise, you will:

  • Design a logical data model that will be able to store and retrieve aggregated values

Background

During a peer review of an intern’s work, your team found that the intern created a schema with the assumption that the queries could use aggregate functions.

It's back to the drawing board, but this time with you working with the hapless intern to come up with a proper design.

  • Requirement #1: KillrVideo must be able to count the number of views for each video in a particular month and year. The count does not have to be 100% accurate.
  • Requirement #2: KillrVideo must be able to keep track of the total number of uploaded videos, the combined duration (in seconds) of all of these videos, and the average duration (in seconds) of an uploaded video. These statistics must be stored per day. They do not have to be 100% accurate.

A logical table schema for the Video entity has already been created and looks like:



Steps

  1. Review the above requirement #1.
  2.  The intern made a table that logs an entry whenever a user views a video. The original query would have then used the COUNT aggregate to retrieve the number of views for a particular video.
    • As far as Cassandra is concerned, what is inefficient about the intern’s design?
  3.  Come up with an optimal table schema that meets this requirement using Cassandra (a hedged sketch follows these steps if you need a nudge).
    • What would be the query that can retrieve the daily count of the number of views for a video for a particular year and month?
    • Is it possible to retrieve the all-time number of views for a video? If possible, what would be the query?
    • Is there anything that needs to be done from the application side?
    • How would your design increment a video’s view count? Provide the queries and/or write statements needed to do this.
  4. The intern, being ever helpful, suggested that it would be useful to also display the top 10 videos for each month, based on the number of views.
    • Would it be possible to query this information in Cassandra? Why or why not?
    • If not, is there a way to do this outside of Cassandra using your schema?
  5. Review the above requirement #2.
  6. Modify the existing Videos table and/or come up with your own table schema that will meet these requirements.
    • What is the query to retrieve the total number of videos uploaded on a specific day?
    • What is the query to retrieve the combined duration of all videos uploaded on a specific day?
    • What is the query to get the average duration of all videos uploaded on a specific day?
    • Explain how these values are updated whenever a new video is uploaded.
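If you need a nudge on requirement #1, here is one possible direction using the counter technique from the video. Every name here is an assumption, and this is not necessarily the intended solution:

CREATE TABLE video_views_by_day (
  video_id uuid,
  year     int,
  month    int,
  day      int,
  views    counter,
  PRIMARY KEY ((video_id, year, month), day)
);

-- Record a view:
UPDATE video_views_by_day
  SET views = views + 1
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47
    AND year = 2018 AND month = 5 AND day = 14;

-- Daily view counts for a video in a particular year and month:
SELECT day, views FROM video_views_by_day
  WHERE video_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47
    AND year = 2018 AND month = 5;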

Exercise 5.4: Using the Cast Function

In this exercise, you will:

  • Generate a list of videos along with their release date.

Background

KillrVideo wants to analyze how long films in certain genres remain popular, from the release date until the present. You've been asked to provide a solution that will enable analysts to view the title along with the release date in coordinated universal time (UTC) format.

Steps

  1. First consider the following questions regarding the killr_video keyspace:
    • What table would you use? Why?
    • What filter will you have to apply?
    • How will you filter on the column you decided to use?
  2. Launch 'cqlsh'.
  3. Switch to the 'killr_video' keyspace.
  4. Write the required CQL code to satisfy the requirements and validate your work (a hedged sketch follows).
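As a hedged sketch only, assuming the table you choose stores the title in a title column and the release date in a release_date column of type date (check the actual schema), the CAST function could be used along these lines:

SELECT title, CAST(release_date AS timestamp) AS release_utc
  FROM videos;

-- Apply whatever filter you decided on in step 1 to narrow the result set.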