Analysis And Validation

DSE Version: 6.0


Now that we have implemented a working data model in Apache Cassandra or DataStax Enterprise, it is time to make sure that the data model works properly and performs the way we want it to. In this unit, you will learn about analysis and validation of your data model.


Welcome to the Optimization and Tuning module! At this point, assuming that you've been following along in order, you have an understanding of the data modeling methodology leading up to the physical data model. This means that you have envisioned, designed, and even implemented a working data model to run in DataStax Enterprise. Congratulations! Hold up, though — before you start celebrating and asking your manager for some time off, let's first talk about making sure that the data model works properly. Sure, if you followed the data modeling methodology, then the resulting data model is supposed to work, right? That is true, but nothing is more definitive than actually running your application and database with the data model. That's when it's possible to discover whether the data model performs the way you want it to. Sometimes it works fine, and often… it can do better. What this means is that you're not quite done with data modeling just yet. That's where tuning and optimization come in. Long after you've designed the data model and have your application running in production, you may continue to make changes and tweaks that help ensure your data model works as efficiently as possible.

The next couple of topics we'll be discussing have to do with ways we can optimize a data model. The one we'll be talking about in this video is analysis and validation of the data model. Remember, just because you followed the data modeling methodology up to this point and have your CREATE TABLE statements doesn't mean that everything is going to work perfectly. We'll go over and analyze several aspects of the data model that will help keep it from blowing up in our faces later.

If there are any issues with the data model, we can then address them using techniques we'll take a look at a bit later. This includes techniques to optimize the data model for write workloads and read workloads. We'll also examine techniques to optimize your table keys and columns to ensure that data is distributed as evenly as possible, avoiding hotspots on nodes and oversized partitions.

At this point we're pretty much in the home stretch. You've seen the diagram time after time, so you know that we aren't done just yet, but now that we've reached optimization and tuning, we're putting the last finishing touches on our data model.

What we have here is something like a checklist of questions to consider regarding your completed data model. Following the methodology up to this point guarantees that your data is available and can be queried the way your application needs, but there are considerations, not specific to the data or application workflow itself, that have not been taken into account until now. Although they are presented here, these are important considerations that should be reviewed from time to time even after your application is in production and the data model has been in use.

The first consideration is whether you are using natural or surrogate keys for your table. Both natural and surrogate keys work fine to uniquely identify your rows, but when evaluating a potential partition key, they do not have the same characteristics. The critical point is with using a natural key as the partition key: the size of the partition becomes directly related to your data. This can be bad if there are no controls on your data, because partitions can then grow much larger than what is recommended for optimal reads. For example, we've previously talked about a videos_by_user table that contains information about videos uploaded by different users. The partition key for this table would be user_id, which happens to be a natural key in this case. What would be the consequence if one user uploads as many videos as they possibly can? For each new video, another row is added to the partition identified by that user's user_id, and the partition continues to grow. Eventually the partition may grow large enough to become a hotspot when queried. Surrogate keys don't have this problem, since they are generated for each new row; if you use one as a partition key, a partition will usually contain only one row. A bit later we'll take a look at some optimization techniques to split or merge a partition when using a natural key.
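To make the natural-key concern concrete, here is a sketch of a videos_by_user table and one common way to bound its partitions. The column names and the year bucket are illustrative assumptions, not the course's exact schema:

```sql
-- Natural key as partition key: the partition for a user grows
-- without bound as that user uploads more videos.
CREATE TABLE videos_by_user (
    user_id     UUID,
    added_date  TIMESTAMP,
    video_id    TIMEUUID,
    name        TEXT,
    PRIMARY KEY ((user_id), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id DESC);

-- One mitigation: add a bucket (here, a year) to the partition key,
-- so a prolific uploader's rows are split across several partitions.
CREATE TABLE videos_by_user_bucketed (
    user_id     UUID,
    year        INT,
    added_date  TIMESTAMP,
    video_id    TIMEUUID,
    name        TEXT,
    PRIMARY KEY ((user_id, year), added_date, video_id)
) WITH CLUSTERING ORDER BY (added_date DESC, video_id DESC);
```

The trade-off of bucketing is that queries spanning several years must now read several partitions, which is the kind of splitting technique covered later in the module.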

Data integrity should always be a priority when handling data, and having a good primary key is necessary to avoid unintended upserts in DataStax Enterprise. Since that is covered by our mapping rules, we don't need to worry about it here. However, there are other scenarios where we do need to consider the possibility of write conflicts, primarily race conditions. The most common type of race condition occurs when doing a read before a write, where the written data depends on what was read. Let's think about the registration system for new users… what happens when a new user wants to create an account? He or she selects a username, and in the background the application needs to check whether that username is already being used by someone else. There is a potential race condition if two users simultaneously try to register the same username: if both checks happen at the same time and both users are told that the username is available, who actually gets it?
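One standard way to close this read-before-write gap in Cassandra and DSE is a lightweight transaction, which makes the existence check and the write a single atomic operation. The users table below is an illustrative sketch, not the course schema:

```sql
-- Hypothetical users table keyed by the username being claimed.
CREATE TABLE users (
    username  TEXT PRIMARY KEY,
    email     TEXT,
    created   TIMESTAMP
);

-- IF NOT EXISTS turns the insert into a lightweight transaction:
-- of two concurrent registrations for 'alice', exactly one will
-- come back with [applied] = true; the other sees the existing row.
INSERT INTO users (username, email, created)
VALUES ('alice', 'alice@example.com', toTimestamp(now()))
IF NOT EXISTS;
```

Lightweight transactions use a Paxos round under the hood and are noticeably more expensive than plain writes, so reserve them for cases like this where correctness genuinely requires them.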

Data types are also an important consideration. On one end, we don't want to use a data type that is not accurate enough; otherwise we may end up with a loss of precision or even wrong data. If you're familiar with the Year 2038 problem, that is an integer overflow issue caused by storing timestamps in a signed 32-bit integer, which can only represent dates up to the year 2038 before wrapping around to 1901. That doesn't mean you should mindlessly use the largest data type either. Remember, going from float to double, or from int to bigint, effectively doubles the amount of space needed for those values.
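As a sketch of these trade-offs in CQL, consider a hypothetical sensor table (the table and columns are illustrative, not from the course):

```sql
CREATE TABLE sensor_readings (
    sensor_id  UUID,
    reading_ts TIMESTAMP,  -- 64-bit epoch milliseconds; not subject to the
                           -- 32-bit Year 2038 rollover
    temp_c     FLOAT,      -- 4 bytes; enough for ~7 significant digits
    total_wh   BIGINT,     -- 8 bytes; INT (4 bytes) tops out near 2.1 billion
    price      DECIMAL,    -- arbitrary precision; avoids binary rounding
                           -- surprises for monetary values
    PRIMARY KEY ((sensor_id), reading_ts)
);
```

The point is to match each column's type to the range and precision the data actually needs, rather than defaulting everything to the widest type available.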

Similar to the first consideration, there is a point where partitions become large enough to have an impact on performance. Identify where they exist, and avoid creating them.

Data duplication is one of the more radical concepts of data modeling in DataStax Enterprise. When it makes sense, you'll want to denormalize your data into separate tables to optimize reads, which results in duplicate data across multiple tables. Duplicating data two or three times across tables sounds fine, but you should set a limit on how much data you are willing to duplicate for the sake of performance. Does it make sense to store 10TB of data in your cluster when the original data is only 100GB, for example?
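As an illustration of how duplication multiplies, suppose the same video row is denormalized into two query-driven tables (the tables shown are assumptions for the sake of the example):

```sql
-- One table per query: the same video's name and metadata appear in both.
CREATE TABLE videos_by_tag (
    tag       TEXT,
    video_id  TIMEUUID,
    name      TEXT,
    user_id   UUID,
    PRIMARY KEY ((tag), video_id)
);

CREATE TABLE latest_videos (
    bucket    TEXT,       -- e.g. a day bucket for "most recent" queries
    video_id  TIMEUUID,
    name      TEXT,
    user_id   UUID,
    PRIMARY KEY ((bucket), video_id)
) WITH CLUSTERING ORDER BY (video_id DESC);
-- A video tagged 20 ways is written once per tag in videos_by_tag plus
-- once in latest_videos; multiply by the replication factor to estimate
-- the true on-disk cost of the duplication.
```

Doing this arithmetic up front, per table and per replication factor, is how you decide whether a given denormalization is worth its storage cost.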

The general recommendation is to avoid client-side joins whenever possible. After all, DataStax Enterprise doesn't allow joins in queries, so there must be a good reason for that, right? Nevertheless, there will always be cases where someone truly believes they understand the effects of using client-side joins in their data model. There is always going to be a price, so make sure you know what it is if you're going to pay it. What's the latency of these client-side joins? How much memory is used to hold intermediate results before the join completes? How much data is transferred from the database to the application? And will these costs increase as the amount of data grows?

Without getting into the consistency topics we've discussed in the foundations class, you also need to be mindful of possible data consistency anomalies within your data model, particularly if you are duplicating data across multiple tables. If you query an entity in one table, and then query the same entity in another table with duplicate data, will the values actually be the same? What if an update for that entity succeeds in one table but fails in another? What can you do to ensure that all of this duplicate data is kept as consistent as possible?
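One common technique for keeping duplicated tables in step is a logged batch, which guarantees that either all of the grouped writes eventually succeed or none are left applied. The table shapes and key values here are illustrative:

```sql
-- A logged batch gives atomicity across the denormalized tables:
-- the name update cannot land in one table and be lost in the other.
-- Note it does NOT provide isolation: readers may briefly see the
-- change in one table before the other.
BEGIN BATCH
  UPDATE videos_by_user SET name = 'New Title'
    WHERE user_id  = 10000000-0000-0000-0000-000000000001
      AND video_id = 1be43390-9fe4-11ee-8c90-0242ac120002;
  UPDATE videos_by_tag SET name = 'New Title'
    WHERE tag      = 'cassandra'
      AND video_id = 1be43390-9fe4-11ee-8c90-0242ac120002;
APPLY BATCH;
```

Logged batches carry coordinator overhead, so they are best reserved for exactly this case: small groups of writes that must stay mutually consistent, not as a bulk-loading mechanism.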

Finally, what about transactions and aggregations? Do you really need them, or can the application or business process be changed to accommodate how your data model works in DataStax Enterprise? Can UDFs or UDAs be written to perform the aggregations you need?
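Where a server-side aggregation genuinely is needed, a UDF/UDA pair can often provide it. Below is a sketch of an average-style aggregate over a double column, following the standard CQL pattern (the names avg_state, avg_final, and avg_rating are made up for this example, and user-defined functions must be enabled in the server configuration before this will run):

```sql
-- State function: accumulates (count, sum) in a tuple per row.
CREATE FUNCTION avg_state(state tuple<int, double>, val double)
  CALLED ON NULL INPUT
  RETURNS tuple<int, double>
  LANGUAGE java AS '
    if (val != null) {
      state.setInt(0, state.getInt(0) + 1);
      state.setDouble(1, state.getDouble(1) + val);
    }
    return state;';

-- Final function: turns (count, sum) into the average.
CREATE FUNCTION avg_final(state tuple<int, double>)
  CALLED ON NULL INPUT
  RETURNS double
  LANGUAGE java AS '
    if (state.getInt(0) == 0) return null;
    return state.getDouble(1) / state.getInt(0);';

-- The aggregate ties the two together with an initial (0, 0) state.
CREATE AGGREGATE avg_rating(double)
  SFUNC avg_state
  STYPE tuple<int, double>
  FINALFUNC avg_final
  INITCOND (0, 0);
```

Keep in mind that a UDA still runs across every row the query touches, so it is a convenience for in-partition aggregation, not a substitute for precomputing aggregates in their own table.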

If you are thinking about this for your own domain and application and now realize that you have some concerns about the data model, that's good! It's always better to be aware of these issues and to address them sooner rather than later. As I've said before, we'll be taking a look at some techniques to help address these concerns in other videos.

You may have worked on your data model for a long time, going through the data modeling methodology, analysis and validation, and all of this optimization and tuning, and now your application is running in production and works great. However, that does not mean your data model will never have to change. Database performance may start to degrade, additional requirements may emerge for your application and data model, or the characteristics of your data may change. Regardless, it's always a good idea to revisit the data model, review these validation questions, and determine whether there are reasons to change or improve it.
