Troubleshooting DataStax Enterprise

I have been on the support team at DataStax for over 5 years now. Before that, I was a support engineer at another software company for almost 8 years. In that time, I’ve had a lot of opportunity to hone my troubleshooting skills, for software in general and for DataStax Enterprise in particular. In this blog post, I’d like to share some of the tips and techniques I’ve picked up along the way.

Troubleshooting Process

These are the general steps that I follow for diagnosing a DSE issue:

  1. Determine which nodes have problems
  2. Examine bottlenecks
  3. Find and understand errors
  4. Ask what changed
  5. Determine root cause
  6. Take corrective action

In this blog post, I will focus on giving you the tools for steps 1-4. The last two steps are the most difficult part of the process, but luckily that’s what we’re here to help you with at DataStax Support. When you open a ticket, we will be able to assist much more quickly and effectively if you have done the groundwork on steps 1-4 beforehand. The process is often iterative, and some of the steps have to be repeated as more information is uncovered.

What Changed?

Ah, the most important and underappreciated question in all tech support!  Asking what changed provides a frame of reference that helps you focus your investigation.

This question needs to be asked in a very broad sense, because in a complex system, all the parts exert influence on each other, and an effect observed in one part of the system can be due to a cause in a completely different part.  The following are just some of the potential areas of change you should consider:

  • Settings
  • Application code
  • Read/write load
  • Data volume
  • Hardware changes
  • Network bandwidth and latency
  • Software versions:
    • DSE components
    • Linux Kernel
    • Java Virtual Machine
    • Client Drivers

The ability to answer this question depends on good operating hygiene. The following practices should be strictly followed, especially in a production environment:  

  • Change one thing at a time.
  • Log everything you do, and when you did it (see the sketch after this list).
  • Use configuration management.
  • Always test major changes in a non-production environment.
  • Compare with another environment.
  • Keep historic metrics.
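A minimal sketch of that change log habit (the file path and format here are just an illustration, not a DSE convention): append a timestamped note every time you touch the cluster.

$ echo "$(date -u +'%Y-%m-%d %H:%M UTC') node1: raised concurrent_compactors from 2 to 4" >> ~/cluster-change-log.txt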

Tools of the Trade

In this section, I’ll discuss the many tools available to make troubleshooting DSE easier. I will provide general information about the tools here. I will leave discussion of using the tools to diagnose specific problems to later posts.

OpsCenter Metrics

Keeping historical metrics is critical so that you know what is normal for your system and can spot anomalies. OpsCenter is a great option for capturing metrics. It will capture the raw metrics from each DSE node, perform aggregation and time-series rollups, and allow you to graph the metrics side-by-side in a dashboard.  The OpsCenter documentation contains more information on available metrics.

 

Starting in 6.0, agents can also be configured to forward metrics to Graphite so that you can view the OpsCenter metrics together with other metrics you’re (hopefully) capturing for your application. For more information, refer to the section of the OpsCenter user guide on how to configure OpsCenter Graphite Integration.

 

[Screenshot: OpsCenter metrics dashboard]

OpsCenter Diagnostic Tarball

If you’ve interacted with DataStax Support before, chances are you’ve been asked for a diagnostic tarball from OpsCenter.  The diagnostic tarball allows you to quickly and easily capture troubleshooting info from every node in your cluster in one shot. Here is just some of the information captured:

  • Logs
    • system.log
    • debug.log
    • Spark logs
    • OpsCenter logs
  • Configuration files
    • cassandra.yaml
    • cassandra-env.sh
    • dse.yaml
    • OpsCenter conf/yaml files
  • Cassandra schema
  • Nodetool commands
    • status
    • tablestats
    • tpstats
    • describecluster
    • netstats
    • And more…
  • OS and hardware metrics

Even if you’re not working with DataStax Support, the tarball can be a very useful tool, because it’s much easier than logging into every node, running multiple commands and capturing their output, and downloading various log and configuration files from each one.
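As a rough illustration (the tarball name and internal layout below are approximate and vary by OpsCenter version), you can unpack the tarball and browse the per-node data directly:

$ tar xzf mycluster-diagnostics.tar.gz
$ ls mycluster-diagnostics/nodes/                  # one directory per node in the cluster
$ less mycluster-diagnostics/nodes/10.200.177.196/logs/cassandra/system.log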

Look for more in-depth coverage of the OpsCenter tarball in an upcoming blog post!

Accessing the Tarball

The tarball can be downloaded using the help menu in the upper right corner of the OpsCenter web UI.

[Screenshot: downloading the diagnostic tarball from the Help menu in the OpsCenter web UI]

Cassandra Logs

The Cassandra system.log is one of the most important tools for diagnosing any Cassandra problem.  

Location

By default, the logs are located in /var/log/cassandra. system.log is the main log, and debug.log is a secondary log that contains more detailed troubleshooting information. Since 5.0, many of the more verbose messages have been moved from INFO to DEBUG level and now appear only in the debug.log.
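For example, to follow the main log in real time, or to pull warnings and errors out of both logs (paths assume the default locations):

$ tail -f /var/log/cassandra/system.log
$ grep -E 'WARN|ERROR' /var/log/cassandra/system.log /var/log/cassandra/debug.log | less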

Log Format

The first part of each logged message follows a common format, shown below. After the common portion comes the message itself, whose format varies by the type of message logged.

Level   Thread Type & ID           Date & Time               Source File             Line No.
INFO    [CompactionExecutor:155]   2015-02-13 02:18:40,986   CompactionTask.java     :287
WARN    [GossipTasks:1]            2015-02-17 19:47:37,331   Gossiper.java           :648
ERROR   [AntiEntropySessions:1]    2015-02-17 20:32:11,959   CassandraDaemon.java    :199
DEBUG   [OptionalTasks:1]          2015-02-20 11:29:14,056   ColumnFamilyStore.java  :298

Logging Level

The logging levels indicate the severity of the message.  The levels, in decreasing order of severity and ascending order of verbosity, are FATAL, ERROR, WARN, INFO, DEBUG, and TRACE.

Thread Type and ID

The thread type provides a good indicator about what kind of internal process is logging the message (e.g., compaction, gossip, repair, etc.). These names will oftentimes correspond to a thread pool name in the output of nodetool tpstats.  

When correlating related messages, pay attention to the thread ID to make sure the messages came from the same thread. There can be multiple threads of a particular type running concurrently, and the messages from each thread can be interleaved. For example, if looking at compaction begin and end messages, make sure that the thread type and ID are the same on both.
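For example, to pull every message from a single compaction thread so you can match up its begin and end messages (the thread name here is taken from the sample log lines above):

$ grep 'CompactionExecutor:155' /var/log/cassandra/system.log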

Date and Time

Date and time is especially important for correlating events between different nodes, or between a node and a client. For example, if you see a message on one node indicating that another node has gone down momentarily, you might want to look for a garbage collection message in the other node’s log around the same time to see if the node went down because of a stop-the-world GC pause. The timestamp can also be used to calculate the duration of an event by subtracting the timestamp on the begin message from the timestamp on the end message.
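For example, if a node was reported down around 19:47 (as in the Gossiper sample above), you might check that node’s own log for a GC pause reported in the same minute:

$ grep 'GCInspector' /var/log/cassandra/system.log | grep '2015-02-17 19:4'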

Source File and Line Number

This tells you the source file and line number that logged the message. If you don’t understand a message, it can be helpful to look at the Cassandra source and see the context surrounding that line. Keep in mind that line numbers can change between versions when code is added or removed from a class, so if you go code diving, make sure you’re looking at code for the same version.
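If the node runs a version based on open-source Apache Cassandra, one way to make sure you are reading the right code is to check out the matching tag from the Apache Cassandra repository (release tags follow the cassandra-<version> convention; the version below is just an example):

$ git clone https://github.com/apache/cassandra.git
$ cd cassandra
$ git checkout cassandra-2.1.13        # substitute the version your node reports at startup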

The filename shown here can sometimes be misleading. For example, if an exception occurs, the class logged here will be the class that caught the exception, not the class that threw it.  To find that information, you would need to look at the stack trace (more on that later).  

Changing Logging Level

Sometimes when troubleshooting a specific issue, it’s helpful to increase the logging level to get more details about what’s going on. Usually you will want to change the logging level temporarily rather than permanently in the configuration file. The nodetool setlogginglevel command lets you do this on the fly. The following command will enable TRACE logging for all Cassandra classes (except the ones overridden at a lower level):

$ nodetool setlogginglevel org.apache.cassandra TRACE

In order to avoid flooding the log with unwanted messages, you should usually increase the logging level only on the specific subsystem (package) you’re troubleshooting:

$ nodetool setlogginglevel org.apache.cassandra.gms TRACE  

If you know the message you’re looking for will come from a specific class, you can be even more specific and enable logging only for that class:

$ nodetool setlogginglevel org.apache.cassandra.service.GCInspector TRACE

Once you’re finished, you can reset logging to the defaults configured in logback.xml by running setlogginglevel without any parameters:

$ nodetool setlogginglevel                     

Finally, the nodetool getlogginglevels command will show you what logging levels you currently have configured:

$ nodetool getlogginglevels                       # show current levels

Logger Name                                        Log Level
ROOT                                                    INFO
DroppedAuditEventLogger                                 INFO
SLF4JAuditWriter                                        INFO
com.cryptsoft                                            OFF
com.datastax.bdp.search.solr.metrics.MetricsWriteEventListener     DEBUG
com.thinkaurelius.thrift                               ERROR
org.apache.cassandra                                   DEBUG
org.apache.lucene.index                                 INFO
org.apache.solr.core.CassandraSolrConfig                WARN
org.apache.solr.core.RequestHandlers                    WARN
org.apache.solr.core.SolrCore                           WARN
org.apache.solr.handler.component                       WARN
org.apache.solr.search.SolrIndexSearcher                WARN
org.apache.solr.update                                  WARN

Startup Messages

DSE logs quite a bit of useful information on startup. You can use this information to verify versions and configuration settings. Over the years I have seen many cases where the expected configuration settings were not picked up because a different file elsewhere on the classpath was loaded instead, or where a node was missed during an upgrade. It is always good to double-check.

What to look for…                                  What it tells you…
Loading DSE module                                 Node just restarted
DSE version: … / Cassandra version: …              Versions of major components
Loading settings from file: …                      Settings file locations
Node configuration: …                              Settings read from cassandra.yaml
JVM vendor/version: …                              JVM vendor and version
Heap size: … / Par Eden Space … /                  Heap settings for each generation
  Par Survivor Space … / CMS Old Gen … /
  CMS Perm Gen …
Classpath: …                                       Classpath (jar files and directories)
JNA mlockall successful                            JNA is installed (make sure it is!)
Starting listening for CQL clients on … /          Node is serving requests
  Listening for thrift clients…
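A quick way to pull most of these lines out of the log after a restart is a single grep over the patterns above (a sketch; adjust the patterns to your DSE version):

$ grep -E 'DSE version|Cassandra version|Loading settings|JVM vendor/version|Heap size|JNA|Starting listening' /var/log/cassandra/system.log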

Understanding Exceptions

Whenever an unexpected error occurs in DSE, as in all Java applications, the result is an exception. Interpreting them can be daunting, especially for non-Java programmers, so it helps to break the exceptions down into their individual components.  In the following discussion, I will highlight various parts of the stack trace using [square brackets].

At the top is the exception itself, which tells you generally what kind of error occurred. However, the exception by itself is missing crucial context. The wall of text after it, called a stack trace, provides that context. This shows you where in the code the exception occurred, with the innermost function at the top and the outermost function at the bottom.

[java.io.EOFException]
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:395)
   at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:356)
   at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
   at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:261)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:415)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:386)
   at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:309)
   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:266)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
   at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:246)
   at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:376)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

The first piece of useful information gleaned from the stack trace is the organization name. This tells you what organization is responsible for the code in question, which helps you determine whether it is part of Cassandra, DSE, or one of the many libraries they rely upon.

In this example, you can see the exception occurred in the Java standard I/O library (java.io), which in turn was called by Apache Cassandra code (org.apache.cassandra), which in turn was called by DataStax Enterprise code (com.datastax.bdp).

java.io.EOFException
   at [java].io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at [org.apache.cassandra].utils.ByteBufferUtil.read(ByteBufferUtil.java:395)
   at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:356)
   at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
   at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:261)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:415)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:386)
   at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:309)
   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:266)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
   at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:246)
   at [com.datastax.bdp].server.DseDaemon.setup(DseDaemon.java:376)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

After the organization name, you will usually find the name of a subsystem, which can give you an idea of the general part of the software that is having a problem. In this case, we can tell the error occurred in the io subsystem of the Java standard library, and that several Cassandra subsystems were involved (utils, service, cache, and db), along with the server subsystem of DSE.

java.io.EOFException
   at java.[io].DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at org.apache.cassandra.[utils].ByteBufferUtil.read(ByteBufferUtil.java:395)
   at org.apache.cassandra.[service].CacheService$KeyCacheSerializer.deserialize(CacheService.java:356)
   at org.apache.cassandra.[cache].AutoSavingCache.loadSaved(AutoSavingCache.java:119)
   at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:261)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:415)
   at org.apache.cassandra.[db].ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:386)
   at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:309)
   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:266)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
   at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:246)
   at com.datastax.bdp.[server].DseDaemon.setup(DseDaemon.java:376)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

After the subsystem are the specific classes and methods in which a problem occurred. This is where you can start to understand specifics about what is going on.  

The highlighted terms below indicate that an end-of-file exception occurred while reading a data input stream into a byte buffer, when Cassandra was attempting to deserialize the key cache. This occurred while it was trying to load a saved cache from disk, during the initialization of a column family while checking the health of the system keyspace. All of this occurred while the DSE daemon was being set up (i.e., during startup).

Even if you are unfamiliar with the code, by looking at the names of the classes and methods you can begin to tell a story that will help you understand the problem.

[java.io.EOFException]
   at java.io.[DataInputStream.readFully](DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at org.apache.cassandra.utils.[ByteBufferUtil.read](ByteBufferUtil.java:395)
   at org.apache.cassandra.service.[CacheService$KeyCacheSerializer.deserialize](CacheService.java:356)
   at org.apache.cassandra.cache.[AutoSavingCache.loadSaved](AutoSavingCache.java:119)
   at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:261)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:415)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:386)
   at org.apache.cassandra.db.[Keyspace.initCf](Keyspace.java:309)
   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:266)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
   at org.apache.cassandra.db.[SystemKeyspace.checkHealth](SystemKeyspace.java:536)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:246)
   at com.datastax.bdp.server.[DseDaemon.setup](DseDaemon.java:376)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

Also notice that the stack trace includes the filename and line number where each method call was found. This can be helpful if you want to go code diving to try to figure out the problem. However, keep in mind that line numbers can, and often do, change between versions as code is added or removed. So if you look at the source, make sure it’s the same version that generated the exception.

java.io.EOFException
   at java.io.DataInputStream.readFully([DataInputStream.java:197])
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:395)
   at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:356)
   at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
   at org.apache.cassandra.db.ColumnFamilyStore.<init>([ColumnFamilyStore.java:261])
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:415)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:386)
   at org.apache.cassandra.db.Keyspace.initCf([Keyspace.java:309])
   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:266)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
   at org.apache.cassandra.db.SystemKeyspace.checkHealth([SystemKeyspace.java:536])
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:246)
   at com.datastax.bdp.server.DseDaemon.setup([DseDaemon.java:376])
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

Take notice of any nested exceptions, indicated by the phrase “Caused by”.  In Java, it’s possible for one piece of code to catch an exception and throw a new exception with the original exception nested inside it. Nested exceptions will often provide important clues about what caused the outer exception.

In this example, the Thrift transport exception was actually caused by a network socket exception.

org.apache.thrift.transport.TTransportException: …
   at org.apache.thrift.transport.TIOStreamTransport.read
   at com.datastax.bdp.transport.server.TPreviewableTransport.readUntilEof
   at com.datastax.bdp.transport.server.TPreviewableTransport.preview
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport.open
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport$...
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport$...
   at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run
   at java.util.concurrent.ThreadPoolExecutor.runWorker
   at java.util.concurrent.ThreadPoolExecutor$Worker.run
   at java.lang.Thread.run
[Caused by]: java.net.SocketException: Connection reset
   at java.net.SocketInputStream.read
   at java.net.SocketInputStream.read
   at java.io.BufferedInputStream.fill
   at java.io.BufferedInputStream.read1
   at java.io.BufferedInputStream.read
   at org.apache.thrift.transport.TIOStreamTransport.read
   ... 9 more

Java’s exception mechanism also allows the programmer to provide a free-form error message which can give additional details about what caused an exception. Look here for information such as specific file names, host names, or other details about an error.

In this example, the text provides more information about the socket exception, indicating that the network connection was reset.

org.apache.thrift.transport.TTransportException: …
   at org.apache.thrift.transport.TIOStreamTransport.read
   at com.datastax.bdp.transport.server.TPreviewableTransport.readUntilEof
   at com.datastax.bdp.transport.server.TPreviewableTransport.preview
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport.open
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport$...
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport$...
   at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run
   at java.util.concurrent.ThreadPoolExecutor.runWorker
   at java.util.concurrent.ThreadPoolExecutor$Worker.run
   at java.lang.Thread.run
Caused by: java.net.SocketException: [Connection reset]
   at java.net.SocketInputStream.read
   at java.net.SocketInputStream.read
   at java.io.BufferedInputStream.fill
   at java.io.BufferedInputStream.read1
   at java.io.BufferedInputStream.read
   at org.apache.thrift.transport.TIOStreamTransport.read
   ... 9 more

 

Tips for Googling Exceptions

Do

  • Use the exception name and several package+class+method names to provide the necessary context.
  • Use quotation marks around each package+class+method name. This forces Google to only return exact matches.
  • Use "site:" to limit the search to relevant web sites.
  • Narrow or broaden your search terms as necessary.
    • Add additional methods from the stack trace if you get back too many irrelevant results.
    • Try removing methods if you don’t get back any hits; just be sure to study the stack trace carefully and make sure it is really the same error.

Don’t

  • Include the source file + line number; this will prevent you from finding the same exception in different versions of the product that may have different line numbers.
  • Include specific numbers and strings, such as file names, table names, or IP addresses. This will prevent you from finding related exceptions, because these details will differ for the person who encountered the exception before you.

From the exceptions introduced above, here are some examples of search terms you might use when googling for the exception.  Be sure to enclose each individual search term in quotation marks.

Example 1:

[java.io.EOFException]
   at [java.io.DataInputStream.readFully](DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:395)
   at [org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize](CacheService.java:356)
   at [org.apache.cassandra.cache.AutoSavingCache.loadSaved](AutoSavingCache.java:119)
   at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:261)
   at [org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore](ColumnFamilyStore.java:415)
   at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:386)
   at [org.apache.cassandra.db.Keyspace.initCf](Keyspace.java:309)
   at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:266)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
   at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
   at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
   at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:246)
   at [com.datastax.bdp.server.DseDaemon.setup](DseDaemon.java:376)
   at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
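Putting these tips together, the bracketed terms from Example 1 might be combined into a query like the one below (just a starting point; add or drop terms to narrow or broaden the results):

"java.io.EOFException" "org.apache.cassandra.cache.AutoSavingCache.loadSaved" "org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize"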

Example 2:

[org.apache.thrift.transport.TTransportException]: …
   at org.apache.thrift.transport.TIOStreamTransport.read
   at [com.datastax.bdp.transport.server.TPreviewableTransport.readUntilEof]
   at com.datastax.bdp.transport.server.TPreviewableTransport.preview
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport.open
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport$...
   at com.datastax.bdp.transport.server.TNegotiatingServerTransport$...
   at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run
   at java.util.concurrent.ThreadPoolExecutor.runWorker
   at java.util.concurrent.ThreadPoolExecutor$Worker.run
   at java.lang.Thread.run
Caused by: [java.net.SocketException: Connection reset]
   at java.net.SocketInputStream.read
   at java.net.SocketInputStream.read
   at java.io.BufferedInputStream.fill
   at java.io.BufferedInputStream.read1
   at java.io.BufferedInputStream.read
   at [org.apache.thrift.transport.TIOStreamTransport.read]
   ... 9 more
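Similarly, a starting query for Example 2 might be:

"org.apache.thrift.transport.TTransportException" "com.datastax.bdp.transport.server.TPreviewableTransport.readUntilEof" "java.net.SocketException: Connection reset"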

 

Other Considerations

Stack traces caused by assertions (AssertionError) can be very difficult to interpret properly. These traces may lead to the wrong conclusions unless you have a deep understanding of the code.  If in doubt, check with DataStax support.

Nodetool Commands

Nodetool is the primary command-line administration interface for Cassandra, and provides a wealth of useful information for troubleshooting issues. I will discuss individual commands in the context of solving specific problems in a future post, but for now here is a list of the most useful commands.

Command             What it tells you…
status / ring       Overall cluster status
info                Status, memory usage, and caches for a single node
tpstats             Statistics about each thread pool on a single node
tablestats          Summary statistics for all tables and keyspaces on a single node
tablehistograms     Detailed statistics for a specific table on the local node
proxyhistograms     Latency statistics for requests coordinated by the local node
netstats            Network activity: streams, read repair, and in-flight commands
compactionstats     Compactions pending and in progress
compactionhistory   Historical compaction information
describecluster     Basic cluster information and schema versions
gcinfo              Garbage collection statistics
sjk ttop            Top threads by CPU utilization or allocation rate
sjk stcap / ssa     Capture and analyze thread dumps
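As a rough first pass when you land on a suspect node, you might run a handful of these in sequence and save the output so you can compare it against a later run:

$ nodetool status            # is every node up?
$ nodetool info              # uptime, heap, and cache usage for this node
$ nodetool tpstats           # any pending, blocked, or dropped tasks?
$ nodetool compactionstats   # is compaction keeping up?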

Linux Monitoring Tools

Linux provides many tools for monitoring resource usage. Here are some of the most useful.

Command        What it tells you…
top            CPU utilization and memory use per process
top -H         CPU utilization per thread (memory use is still per process)
df             Free disk space
iostat -xd     I/O bandwidth utilization
free -h        Memory and cache usage
netstat -an    Network connections established
iftop          Network bandwidth utilization
dstat -lrvn    Most of the above in a nice format
sar            Most of the above, with history!
iperf          Check network bandwidth
lsof           Show files and sockets opened by a process
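For example, a quick resource check on a node might look like the following (the sampling intervals and counts are arbitrary; adjust to taste):

$ df -h                      # any data or commitlog volume close to full?
$ free -h                    # memory pressure and page cache size
$ iostat -xd 5 3             # disk utilization and wait times, sampled 3 times at 5-second intervals
$ top -H                     # which threads are using the most CPU right now?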

Al’s Cassandra Tuning Guide is a great resource that discusses many of the above tools and more in the context of tuning Cassandra.

Java Monitoring Tools

Java provides a wealth of tools to monitor the internal state of the JVM. Here are a few of the most useful ones.

Command           What it tells you…
jstack -l         Status and stack trace of each thread
jmap -histo       Types of objects on the heap (optionally only live objects)
jmap -heap        Size and usage of each Java heap generation
jstat -gccause    Causes of GC activity
jmap -dump        Take a heap dump for further analysis
MemoryAnalyzer    Post-mortem heap-dump analysis
Flight Recorder   Built-in profiler in the JVM; will change your life!*

* Flight Recorder requires a license from Oracle for use in production environments.
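A minimal sketch of capturing thread and heap information from a running DSE/Cassandra process (the pgrep pattern is an assumption; substitute the actual process ID if it does not match on your system):

$ pid=$(pgrep -f CassandraDaemon)
$ jstack -l $pid > threads-$(date +%s).txt       # save a thread dump for later analysis
$ jmap -histo:live $pid | head -n 20             # top object types on the heap (live objects only)
$ jstat -gccause $pid 1000 10                    # GC activity and causes, sampled every second, 10 times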

Overall Status

The nodetool status command shows the overall status of all the nodes in the cluster from the perspective of the node on which it is run. This is an important first step for deciding which nodes to investigate further. The first character of the status column (U or D) indicates whether the node is up or down, and the second indicates its state (Normal, Leaving, Joining, or Moving).

$ nodetool status

Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens       Owns    Host ID                               Rack
UN  10.200.177.196  3.45 GB    1            ?       108af27a-43d8-4814-b617-f8f93ba2bb0e  rack1
UN  10.200.177.197  3.45 GB    1            ?       432bc964-3cd3-4784-9ab7-d7a4a9e063b6  rack1
UN  10.200.177.198  3.45 GB    1            ?       3c467f89-7cce-485f-bb16-dd782c9a84ec  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

In a large cluster, hunting for the nodes that are down can be time-consuming, so pipe the output of nodetool into grep:

$ nodetool status | grep DN

Alternatively, the OpsCenter ring view offers an at-a-glance cluster status in your web browser. The size, color, and shape of each node indicate data volume, token alignment, and load.

[Screenshot: OpsCenter ring view]

Hovering over a node provides more details about its status, and clicking on it provides even more.

[Screenshot: node detail pop-up in the OpsCenter ring view]

Alerts

OpsCenter also provides a proactive way to monitor the status of nodes. You can set up alerts for events such as a node going down, metrics exceeding a limit, data balance issues, and more. OpsCenter supports alerts via email, HTTP API, or SNMP. The HTTP API allows you to integrate with many common services such as HipChat, Slack, Zendesk, or PagerDuty.

Refer to the OpsCenter documentation for instructions on setting up alerts.

Then access the alerts feature through the OpsCenter UI as follows:

 

[Screenshot: the Alerts feature in the OpsCenter UI]

Next Time, on Troubleshooting DSE

That’s all we have time for in this blog post. I hope that this has given you an understanding of the troubleshooting process and the tools you have at your disposal to help you. In future posts, I will discuss how to use these tools in the context of diagnosing specific types of problems.