Troubleshooting Slow Queries

DSE Version: 6.0

Video

Transcript: 

I'm Joe Chu, and this is Troubleshooting Slow Queries. We try to avoid talking too much about operations in this course, but when it comes to slow queries, I think every developer will want to know whether a query is running slow, and why. In this video, we'll go through several different methods to troubleshoot slow queries and isolate which part of a query is causing performance to drop.

To start, it is important to remember that pure CQL queries and search queries do not perform the same. There is a misconception that search queries can somehow be faster because they use indexes. You should also not expect search queries to consistently run at the same performance level as plain Cassandra queries. Referring back to the read path, a search query already performs extra steps, one of which is reading the row data from Cassandra.

That is partly why DSE Search is described as near real-time, though what near real-time means can change depending on who you ask. Fortunately, it should never be as slow as a batch job or a complex join query, but it may not necessarily return results without a noticeable delay either. The important thing is whether or not it meets your expectations and SLAs; if it does, then whatever it's called doesn't really matter.

There are a number of external factors that can influence the actual speed of a search query: the number of CPU cores, disk speeds, data cached in memory, network bandwidth and latency, configuration and settings, noisy neighbors, etc. If query performance isn’t to your satisfaction, there is almost always something you can blame.

So just how fast are search queries? Without getting into the external factors again, it can be said that they typically run anywhere from tens of milliseconds up to about two seconds on average. Queries that take longer than a few seconds to complete are slow enough to warrant further investigation, if needed.

One of the staple tools for troubleshooting is the CQL trace that is saved during the execution of a CQL query. Although traces are primarily meant to follow the various parts of the read and write path for Cassandra queries, there are also traces related to DSE Search, and these can be used to approximate the amount of time spent going through the search index, and of course how long the CQL query took to run overall.

In the tiny example trace that you see here, there is an obvious trace activity that indicates work done by DSE Search. This is the line that says "Processed response from shard 172.31.28.32:8609/solr/killrvideo.videos", etc. The source_elapsed value for this is 71470, which indicates that it took about 71 milliseconds for the index to be read and the search results returned. The later traces are actually retrieving row data based on the unique ids provided by the search result, and the entire request completed at 121537, or about 121 milliseconds. This illustrates that the majority of the time spent for the search query was spent reading the index.

CQL tracing is available from all of the major CQL clients, including cqlsh, DataStax Studio, and the DataStax drivers. Instructions for enabling tracing may vary, but in cqlsh this can be easily done by executing the command TRACING ON.
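As a minimal sketch of the workflow in cqlsh (the killrvideo.videos table and the Terminator search term come from the examples in this video; the LIMIT is just illustrative):

    -- Turn on tracing; the trace is printed after each query's results
    TRACING ON;

    SELECT * FROM killrvideo.videos
      WHERE solr_query = '{"q": "title:Terminator"}'
      LIMIT 10;

    -- Turn tracing back off when done
    TRACING OFF;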

Another great troubleshooting tool can be found in DSE Search: the query name parameter that can be added to a search query. This can only be done with a JSON search query, but when used, DSE Search will collect and aggregate certain metrics for all executed queries that use the same query.name.

In the example below, we have a search query that searches for documents with the term Terminator in the title field and has its query name set to Ahnold.
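A sketch of what that query might look like, assuming the killrvideo.videos table from the earlier trace example:

    -- JSON search query; query.name tags this query for metrics aggregation
    SELECT * FROM killrvideo.videos
      WHERE solr_query = '{"q": "title:Terminator", "query.name": "Ahnold"}';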

The query metrics are exposed through the JMX QueryMetrics MBean, which we'll take a look at next.

To read the metrics that are made available by the QueryMetrics MBean, you can use any JMX-compatible client, such as JConsole or VisualVM, both of which can be found in the Java Development Kit, or JDK. DataStax Enterprise also has a special function built into the nodetool command-line utility, called sjk, or Swiss Java Knife, that can read MBeans.

Accessing the MBean requires a specific domain, com.datastax.bdp, and a key-property list that includes the key type set to search, index set to the name of the search index, and name set to QueryMetrics.

The examples show two different uses of the nodetool sjk mx command, which accesses the MBean. The first uses the -mg flag to retrieve statistics that are available as an attribute.

The second example uses the -mc flag to invoke an MBean method.
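As a rough sketch, the two forms might look like the following; the index name killrvideo.videos is carried over from the earlier examples, and the attribute and operation names are placeholders to fill in from the MBean itself:

    # Read an MBean attribute (-mg) from the QueryMetrics MBean
    nodetool sjk mx -mg \
        -b "com.datastax.bdp:type=search,index=killrvideo.videos,name=QueryMetrics" \
        -f <attribute-name>

    # Invoke an MBean operation (-mc), naming it with -op and passing arguments with -a
    nodetool sjk mx -mc \
        -b "com.datastax.bdp:type=search,index=killrvideo.videos,name=QueryMetrics" \
        -op <operation-name> -a <arguments>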

There are several MBean methods that provide useful performance metrics, including latency counts, latency percentiles, and average latency. The latency values provided are in microseconds, and can also be split up by the individual read path phases.

There are four phases that are saved. The first one is COORDINATE, which is the total amount of time spent by the node to distribute the query and to gather and process results from shards, when it has the role of the coordinator for a search query. This value is computed only on nodes that coordinate queries, and it is always 0 if there is only one search node.

Another phase is EXECUTE, which is the time spent by the node to read the index for a search query.

The RETRIEVE phase represents the time spent by a node to retrieve data from Cassandra when using the HTTP API for search queries. If using CQL, no metrics will be saved.

Finally the ENQUEUE phase is the time spent waiting for a thread to execute the search query.

Not listed here is a TOTAL phase, which contains the total server-side latency for a search query, and is the sum of the COORDINATE and ENQUEUE latencies.

The latency counts can be retrieved with an MBean operation called getRecordedLatencyCount, which requires two arguments: the name of the phase you want the latency count for, and the query name that was used when executing the search query.

The example below shows using the nodetool sjk mx command to get the latency count for the EXECUTE phase and for search queries with the query name Ahnold.
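A sketch of that command, under the same assumption about the index name:

    # Number of recorded EXECUTE-phase latencies for queries named Ahnold
    nodetool sjk mx -mc \
        -b "com.datastax.bdp:type=search,index=killrvideo.videos,name=QueryMetrics" \
        -op getRecordedLatencyCount -a EXECUTE Ahnold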

In this case there were 50 query executions that were recorded with this query name.

The MBean operation getLatencyPercentile can be used to retrieve the highest latency time for a specific percentile of the search queries executed. The arguments for this operation are again the phase and the query name, but also the percentile rank, which is an integer representing the percentage.

Following our previous example, we are now trying to get the highest latency among 90% of the queries executed, which we found previously to have a sample size of 50. In this command, we are still getting the latency for the EXECUTE phase.
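A sketch of that command; the argument order follows the description above (phase, query name, then percentile rank):

    # 90th-percentile EXECUTE-phase latency (microseconds) for queries named Ahnold
    nodetool sjk mx -mc \
        -b "com.datastax.bdp:type=search,index=killrvideo.videos,name=QueryMetrics" \
        -op getLatencyPercentile -a EXECUTE Ahnold 90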

Here we see that 90% of the executed search queries were able to complete the EXECUTE phase within 3075 microseconds, or about 3 milliseconds.

The last MBean operation that we'll be covering here is the getAverageLatency operation. This operation retrieves the mean latency across all of the recorded query executions, again taking the phase and query name as arguments.

Our example again is using the EXECUTE phase and the Ahnold query name, and returns an average latency of 527 microseconds.
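Again as a sketch, with the same assumed index name:

    # Mean EXECUTE-phase latency (microseconds) across recorded Ahnold queries
    nodetool sjk mx -mc \
        -b "com.datastax.bdp:type=search,index=killrvideo.videos,name=QueryMetrics" \
        -op getAverageLatency -a EXECUTE Ahnold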

Again, the result may seem surprisingly low, but this is only one phase out of several for a search query.

A great way to proactively monitor search query performance is with the CQL slow log that's available in the DSE Performance Service. With the CQL slow log, DSE records all CQL queries that run longer than a set threshold in the dse_perf.node_slow_log table in Cassandra.

The slowest queries for each node can also be kept in memory, and reviewed using the dsetool command line tool.
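As a sketch, you can look at what has been written to the slow log table directly:

    -- Recent slow queries recorded in the performance keyspace
    SELECT * FROM dse_perf.node_slow_log LIMIT 10;

The in-memory list is read through dsetool; the subcommand name below is an assumption from memory, so verify it against the dsetool documentation for your version:

    # Show the slowest queries kept in memory on this node (assumed subcommand name)
    dsetool perf cqlslowlog recent_slowest_queries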

The screenshot shows a snippet of the information logged about the slow query, but the log also saves the CQL command that ran slow, which user ran the query, the keyspace and table involved, and the ID of the session traces for the query.

Making changes to the CQL slow log, including enabling the log, is done in the dse.yaml file. Among the changes you can make are the threshold that determines whether a query gets logged, the minimum number of samples needed before queries are logged, the amount of time a record stays in the log table, whether the log is only kept in memory or also saved to the database, and the number of entries to keep in memory. Note that for the threshold setting, any value greater than 1 is assumed to be in milliseconds, while a value less than 1 is a percentile. For example, if the threshold is set to 0.9, that means any query that runs slower than 90% of the queries will be logged.
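A rough sketch of the corresponding block in dse.yaml; the option names below follow the settings just described, but treat the exact keys and values as assumptions to check against the documentation for your DSE version:

    # CQL slow log settings (values are illustrative)
    cql_slow_log_options:
        enabled: true
        threshold: 200.0           # > 1 means milliseconds; < 1 means a percentile (e.g. 0.9)
        minimum_samples: 100       # samples needed before a percentile threshold is applied
        ttl_seconds: 259200        # how long a record stays in dse_perf.node_slow_log
        skip_writing_to_db: false  # true keeps the slow log in memory only
        num_slowest_queries: 5     # number of entries kept in memory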

Another useful feature is the ability to temporarily change the CQL slow log settings using the dsetool perf cqlslowlog command. These are all the same settings you can change in dse.yaml, but the changes only apply until the next time DSE starts up on the node, when it reverts back to the settings in dse.yaml.
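For example, a couple of sketches of such temporary changes; the argument forms are assumptions to verify against the dsetool documentation:

    # Temporarily lower the slow-query threshold to 100 ms on this node
    dsetool perf cqlslowlog 100

    # Temporarily disable, then re-enable, the CQL slow log on this node
    dsetool perf cqlslowlog disable
    dsetool perf cqlslowlog enable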

Another useful log from the DSE Performance Service is the Solr slow sub-query log. This is a search-specific log that keeps track of sub-queries running on the local node that take longer than the specified threshold. Although it doesn't completely cover the execution of a search query, the ability to drill down into the individual parts of the read path on the search side can be useful for finding performance bottlenecks. When a slow sub-query triggers logging, information about that sub-query is saved for a set amount of time in the database under the dse_perf.solr_slow_sub_query_log table. Only a snippet of the information logged about the sub-query is shown in this screenshot.

The elapsed_millis column shown here contains the amount of time the sub-query needed to execute, and there are also additional columns, not shown here, that break down the execution time between the prepare phase and the process phase of the read path. Other potentially helpful information about the sub-query is also logged: the number of documents found, the IP of the coordinator for the sub-query, and the parameters used for the sub-query.
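As a sketch, recent entries can be inspected directly; since only some of the column names appear in this video, the query simply selects everything:

    -- Recent slow Solr sub-queries recorded in the performance keyspace
    SELECT * FROM dse_perf.solr_slow_sub_query_log LIMIT 10;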

Like the other log we just saw, the Solr slow sub-query log is configured in the dse.yaml file. There are fewer settings here, but you are able to tune the amount of time needed to trigger logging, how long logs stay in the database, and how many concurrent writes can be made to the log.
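A sketch of the corresponding dse.yaml block; again, treat the exact keys and values as assumptions to check against the documentation for your DSE version:

    # Solr slow sub-query log settings (values are illustrative)
    solr_slow_sub_query_log_options:
        enabled: true
        threshold_ms: 3000    # sub-queries slower than this are logged
        ttl_seconds: 604800   # how long entries stay in dse_perf.solr_slow_sub_query_log
        async_writers: 1      # number of concurrent writers to the log table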

The dsetool perf solrslowlog command can also make temporary changes to the Solr slow sub-query log settings, such as temporarily enabling or disabling logging or changing the threshold value. Again, the changes stay in effect only until the next time DSE starts up, when the node reverts back to the settings in dse.yaml.
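For example, assuming the subcommand forms mirror those of the CQL slow log command:

    # Temporarily enable the Solr slow sub-query log and lower its threshold to 1000 ms
    dsetool perf solrslowlog enable
    dsetool perf solrslowlog 1000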

There are even more performance objects available with the DSE Performance Service, as shown in this table. All of the logs and performance objects are disabled by default, and you should not necessarily enable every one of them. If you do notice performance issues, take a look and see which of these might be useful for troubleshooting, and temporarily enable them to collect statistics.

Refer to the DataStax documentation for more details about each of these performance objects.

We are at the end of another video, and the reward is another exercise to do. Go and do this exercise before continuing on.

