Data Model Anti-Patterns

DSE Version: 6.0


In this unit, we will be taking a look at some of the anti-patterns you should watch out for when data modeling with DataStax Enterprise.


By this time you've probably listened to both myself and others go on and on about what you should be doing when it comes to data modeling. Are you sick of this just yet? Well, let's take a quick break from all of that and take a look at the opposite of what you should be doing. In this video, we'll be taking a look at some of the anti-patterns found in DataStax Enterprise.

What is an anti-pattern? According to the man who coined the term back in 1995, it is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive. Doesn't sound good right? In fact, these anti-patterns are exactly the kind of things you'll want to avoid as much as possible as you continue with you data modeling in DSE.

With anti-patterns, let's talk about one of the biggest anti-patterns… in terms of the amount of data processed, that is. That is doing queries that scan across an entire table. These might be certain queries where you are trying to read from the entire table for some reason. For example, getting the total count of the number of rows. This may be ok as a one-off query that is run occasionally, but definitely not something you should be running all the time.

Another anti-pattern is the use of IN statements to support getting specific results in a query. This is where you are trying to get back all of the results where a column matches any one of the values provided. The problem with these queries is that they generally cause a lot of overhead as they run. If the application is constantly running IN queries, this can lead to performance bottlenecks on those coordinator nodes and possibly even node failure in extreme cases. If there is a query that requires the use of IN to get back the appropriate results, there is also a better data model that can be used to have those queries run much more efficiently.

Queries that require a read before a write also tend to be an anti-pattern. When doing a normal read before a write, that puts you at risk of race conditions that may have unintended effects. When using lightweight transactions to mitigate that, you are performing a much slower write operation. Theoretically lightweight transactions are at least four times slower than a normal write operation, and should only be used when absolutely needed or when performance doesn't matter.

Making use of the ALLOW FILTERING override can enable a lot of different queries that you would not be able to execute otherwise. However, there is always a reason why this override is needed and that is because it's an expensive operation that needs to scan through each row one-by-one and filter out the few that are needed. If you are using this often, it probably makes more sense to create a table that supports the query you need instead. Of course, you are restricting the amount of data that's being filtered, for example the query is only filtering a single partition, then that's perfectly fine.

One very common anti-pattern that's seen at the table level is excessive use of secondary indexes. Hopefully by watching the videos you understand that DataStax Enterprise approaches data modeling quite a bit different from relational databases, and creating a secondary index for each column you have in the table is not the answer to just making queries work.  Each secondary index created does cause overhead to maintain it, and will accumulate in a way that is not scalable.

A not too common anti-pattern seen is in the use of non-frozen collections, particularly in UDTs or when nested within another collection. These types of collections have a huge performance hit, and should be avoided by using a frozen collection when possible.

It's also interesting to see that strings are often used to represent dates and timestamps. Although it is able to represent the value, it may not be as easy to work with as the native datatypes. Since there are rules set up to represent a timestamp, such as specifying a timezone, it's much easier to keep this type of data consistent across different geographies as well as to manipulate them.

At the keyspace level, there are several anti-patterns that can pop up.

One of them is creating too many keyspace, but more importantly, too many tables. This can cause flushing problems.

One common anti-pattern seen is the improper use of TTLs and deletes, which causes tombstones to build up in a partition and become a huge bottleneck in read performance. This is typically caused by an improper data model that is designed to frequently delete small parts of a partition, therefore creating these tombstones much faster than then can be compacted or evicted.

Finally there has been some situations where read performance degrade significantly enough, due to improper data modeling or use of anti-patterns, where the read operation times out before responses are received from nodes. An anti-pattern here is increasing the length of the time before a timeout, instead of finding out what is causing such slow performance and improving on it. In this case you are merely masking the problem and not really solving it.

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.