The Diagnostic Tarball Gold Mine | Apache Cassandra and DataStax Enterprise - DataStax Academy

I like to use the analogy of gold mining when thinking about obtaining information from a diagnostic tarball. That's mainly because a diagnostic tarball is a gold mine of information, and this blog article will hopefully provide some useful insight into what's available. So grab your hard hat, pickaxe and torch, and get ready to strike it rich with information. Rather than making one long list, I've broken the article down into these sections:

General information
OpsCenter and agent information
Node information
Example Scenarios
Conclusion

I've laid it out this way so that if, for example, you have a problem with OpsCenter, you can just refer to the 'OpsCenter and agent information' section without reading through the whole blog again. Think mining again - you don't need to blast through the opening with TNT every time you dig further into another section of the gold mine.

General information

To generate a diagnostic tarball you'll need to install OpsCenter, then click Help > Diagnostics and select Download. This will create a file called diagnostics.tar.gz, and when you untar it you'll see this file and these folders:

cluster_info.json
> nodes
> opscenterd
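
If you'd rather unpack the tarball from a script than by hand, here is a minimal Python sketch. The function name and the destination folder are my own choices for illustration, and it assumes the entries listed above sit at the top level of the archive (some versions may wrap them in a parent folder). Only extract tarballs you trust.

```python
import tarfile
from pathlib import Path


def extract_diagnostics(tarball_path, dest_dir):
    """Unpack the tarball into dest_dir and return the sorted top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball_path, "r:gz") as tar:
        tar.extractall(dest)
    # Listing the destination shows what you have to dig through.
    return sorted(p.name for p in dest.iterdir())
```

Calling extract_diagnostics("diagnostics.tar.gz", "diag") into an empty folder should return the file and folder names listed above.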

A good place to start digging is to have a look at the cluster_info.json and I find this information particularly useful:

bdp_version - This is the version of DSE.
Cassandra version
Cluster cores
RAM
Cluster OS and version
DC count / Keyspace count / Column family count
OpsCenter version
OpsCenter OS and version
OpsCenter server RAM

If the values for CPU and RAM are below recommended production settings there is a good chance the cluster will not be as performant as an appropriately specced cluster. Time for another good analogy - if the size of your shovel is too small you'll be digging for a long time.
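
To pull just the fields you care about out of cluster_info.json, something like the sketch below can help. The only key name I've taken from the list above is the DSE version key; the other key names vary between versions, so treat the `wanted` tuple as a placeholder and print `sorted(info)` to see what your tarball actually contains. The helper names are hypothetical.

```python
import json


def load_cluster_info(path):
    """Read cluster_info.json into a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def pick_fields(info, wanted=("bdp_version",)):
    """Return the wanted top-level keys that are present, plus those that are missing."""
    found = {key: info[key] for key in wanted if key in info}
    missing = [key for key in wanted if key not in info]
    return found, missing
```

Reporting the missing keys rather than raising an error is deliberate: the content collected changes from version to version, so a missing key is more likely a version difference than a broken tarball.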

For information specific to a node I look at the node_info.json file:
> opscenterd > node_info.json

Some useful nuggets of information include:

IP and host name of the node
JVM version
Keyspace sizes
Load on the node

There are also some thread pool statistics that look like nodetool tpstats output.

OpsCenter and agent information

For OpsCenter server problems refer to the OpsCenter log file:

> opscenterd > opscenterd.log

For OpsCenter server and datastax-agent configuration issues you can check the settings in these files:

> opscenterd > conf.json
> opscenterd > clusters > cluster_name.conf

For OpsCenter repair service issues take a look at these files:

> opscenterd > repair_service.log
> opscenterd > repair_service.json

The repair_service.json shows time to completion and if the repair service is active. Contrary to popular belief the repair service doesn't fix anything. It just streams data to synchronise the data on your nodes. 

For OpsCenter datastax-agent issues refer to the agent.log on the nodes which can be found here:

> nodes > IP_of_node > logs > opsagent > agent.log

You can also take a look at OpsCenter's GC activity by referring to the GC log located here:

> opscenterd > gc.log.0.current

Node information

> nodes > IP_of_node

In the old days of mining, miners used to take canaries into the mine to detect noxious gases such as carbon monoxide. If the canary died it was a pretty good indicator that something wasn't right. The same analogy can be applied to the nodes folder. If all is well you should see one folder per node, each named after the node's IP. However, if you find files (rather than folders) named after node IPs, or no folders at all for some nodes, this is a good indicator of a potential problem. The missing diagnostic information could be the result of network problems, or you may need to restart the datastax-agent on the node. Another thing to note: if downloading the tarball times out, try increasing the value of the diagnostic_tarball_download_timeout option in the cluster_name.conf. Increasing the default value is recommended for DSE multi-instance clusters or for slower machines and connections.
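
The canary check is easy to automate. The sketch below walks the nodes folder and reports folders, stray files and any expected IPs with no folder. The function name and the `expected_ips` argument are hypothetical; you'd fill the latter in yourself, for example from the nodetool status output.

```python
from pathlib import Path


def check_nodes_dir(nodes_dir, expected_ips=()):
    """Report per-node folders, stray files, and expected nodes that have no folder."""
    entries = list(Path(nodes_dir).iterdir())
    folders = sorted(p.name for p in entries if p.is_dir())
    # A plain file where a folder should be suggests collection failed for that node.
    stray_files = sorted(p.name for p in entries if p.is_file())
    missing = sorted(set(expected_ips) - set(folders))
    return {"folders": folders, "stray_files": stray_files, "missing": missing}
```

Anything in `stray_files` or `missing` is a node worth checking for network problems or an agent that needs a restart.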

In the IP_of_node folder you'll find these files:

agent_version.json - Displays the datastax-agent version.
agent-metrics.json - Refer to this file if you have a problem obtaining metrics from this node.
blockdev_report - Shows filesystem information.
java_heap.json - Displays heap and non-heap memory used.
java_system_properties.json - Shows lots of useful Java information including Java version and class path.
machine-info.json - Displays architecture and memory on the node.
os-info.json - Shows OS and version.
process_limits - Refer to this file if you have an OS resource issue.

There is so much useful information in the nodes folder that it can be a little overwhelming. It's like entering a cave and realising you're in King Solomon's mines, complete with those crystal-filled geodes you see in tourist shops. The trouble is you really don't know what you're going to do with a geode when you bring it home. However, if you spend a little time understanding what gems lie inside, you can take advantage of your rather fortunate position.

> nodes > IP_of_node > cassandra-cli

You can ignore this folder because cassandra-cli is deprecated in later versions.

> nodes > IP_of_node > conf

agentaddress.yaml - Useful to confirm the stomp interface which is the IP for the OpsCenter server.
location.json - Displays the path to the cassandra.yaml and dse.yaml.

> nodes > IP_of_node > conf > cassandra

Contains the cassandra.yaml and cassandra-env.sh files.

> nodes > IP_of_node > conf > dse

Contains the dse.yaml and dse files. The dse file is created for package installs and can be used to confirm which services are active (Spark, Solr, Graph etc).

> nodes > IP_of_node > conf > solr

Counting the folders in the solr folder reveals the number of Solr cores.

> nodes > IP_of_node > conf > solr > solr_core_name

The folder contains the schema.xml and solrconfig.xml files.

> nodes > IP_of_node > conf > spark

Contains the Spark configuration, including the spark-defaults.conf and spark-env.sh files.

> nodes > IP_of_node > cqlsh

describe_cluster - Displays cluster name and partitioner used.
describe_schema - Shows the current keyspace and table schema.

> nodes > IP_of_node > dsetool

ring - Useful to determine which node is running the master process and status of the nodes.
sparkmaster - Will be updated to use 'dse client-tool' (OPSC-11798).

> nodes > IP_of_node > logs

Contains logs for Cassandra, OpsCenter agent, Solr and Spark.

> nodes > IP_of_node > logs > cassandra

system.log
output.log - Shows the log output from the last startup sequence.
debug.log
gremlin.log

> nodes > IP_of_node > logs > opsagent

agent.log - Useful log to interrogate for DataStax agent issues.

> nodes > IP_of_node > logs > solr

tomcat - Contains the catalina log files.

> nodes > IP_of_node > logs > spark

master - Contains the master.log.
spark-jobserver - Contains spark-jobserver log.
worker - Contains the worker.log.

> nodes > IP_of_node > nodetool

The nodetool folder is the biggest haul of them all. Even the US Federal Reserve would be envious of the gold bars of information found in this section. Fortunately for us, it's no Fort Knox and the information is easily accessible. 

cfstats - Shows statistics for tables. If you see a high sstable count this could indicate a compaction problem unless the table uses LCS which tends to have higher sstable counts. For space problems refer to space used and space used by snapshots. For any write or read performance issues refer to the respective write or read latency output. A node struggling to keep up with writes will show counts for 'Dropped Mutations'. The 'Compacted partition maximum bytes' output is also a good place to check for large partitions. Finally, the 'tombstones per slice' information is very useful to identify tables with a large number of tombstones.  
compactionhistory - Shows history of compaction operations. Useful to investigate compaction issues.
compactionstats - Provides statistics about a compaction. Useful to investigate compaction issues. 
describecluster - Shows the name, snitch, partitioner and schema version of a cluster. Can be used to identify schema disagreements when there are unreachable nodes.
getcompactionthroughput - Prints the throughput cap (in MB/s) for compaction in the system.
getstreamthroughput - Prints the MB/s outbound throughput limit for streaming in the system.
gossipinfo - Provides the gossip information for the cluster. Can be useful to confirm all the nodes can see each other.
info - Some useful node information including uptime, disk storage (load) information, heap and off heap memory used.
netstats - Shows network information about the host. High counts for read repair can indicate the repair process hasn't been run for a while. 
proxyhistograms - Provides a cumulative histogram of network statistics and shows the full read / write request latency recorded by the coordinator. The output can be compared to other nodes to determine if requests encounter a slow node.
ring - Shows node status and information about the ring. The output shows all tokens and can be quite big if you're using vnodes.
status - Displays the cluster layout including DC names, shows if nodes are up or down, load on disk, tokens, IDs and rack information.
statusbinary - Prints the status of native transport. Also shown in the output of nodetool info.
statusthrift - Prints the status of the Thrift server. Also shown in the output of nodetool info.
tpstats - Provides momentary usage statistics of thread pools. Similar output is written to the system log when longer GC pauses occur. This isn't always a problem, but it can help to indicate which thread pool is most active. Any thread pools with 'All time blocked' threads, a high number of 'pending' threads or dropped counts indicate a problem that should be investigated. The output can help to point you in the right direction. For example, if you found dropped counts for Mutation this would indicate a problem with writes. You could then refer to the 'nodetool cfstats' output and look for 'Dropped Mutations' which will show the table or tables that are dropping mutations. 
version - Displays the version of Cassandra on the node.
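
As a rough illustration of the tpstats checks described above, here is a small Python sketch that flags pools with pending or 'All time blocked' threads and any dropped message types. It assumes the usual two-section layout (a 'Pool Name' table followed by a 'Message type' table); column layouts differ between Cassandra versions, so treat the parsing as an assumption to verify against your own files. The function names and the threshold are mine.

```python
def parse_tpstats(text):
    """Parse 'nodetool tpstats' text into (pools, dropped) dictionaries."""
    pools, dropped = {}, {}
    section = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "dropped"
        elif section == "pools" and len(parts) >= 6:
            try:
                # Columns: Active, Pending, Completed, Blocked, All time blocked
                _active, pending, _completed, _blocked, all_time = map(int, parts[1:6])
            except ValueError:
                continue  # not a data row
            pools[parts[0]] = {"pending": pending, "all_time_blocked": all_time}
        elif section == "dropped" and len(parts) >= 2:
            try:
                dropped[parts[0]] = int(parts[1])
            except ValueError:
                continue  # not a data row
    return pools, dropped


def flag_problems(pools, dropped, pending_threshold=0):
    """List thread pools and message types that deserve a closer look."""
    flagged = [
        f"{name}: pending={s['pending']}, all time blocked={s['all_time_blocked']}"
        for name, s in pools.items()
        if s["pending"] > pending_threshold or s["all_time_blocked"] > 0
    ]
    flagged += [f"{name}: dropped={count}" for name, count in dropped.items() if count > 0]
    return flagged
```

Because the output is momentary, a flagged pool is a pointer for further digging rather than a verdict. Raise `pending_threshold` if a handful of pending tasks is normal for your workload.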

> nodes > IP_of_node > ntp

ntpstat - Shows network time synchronisation status.
ntptime - Reads kernel time variables.

NTP configuration is often overlooked and yet time synchronisation is pretty fundamental for a distributed database. 

> nodes > IP_of_node > os-metrics

cpu.json - Shows CPU activity. This can help determine if a high CPU issue is related to user, system or iowait activity.
disk_space.json - Shows disk usage information. Good place to confirm if the node is running low on disk space.
disk.json - Similar output to the linux iostat command that can help identify disk performance issues.
load_avg.json - A high load average can indicate the node is being overloaded. An idle node would have a load average of 0. Any running process either using or waiting for CPU cycles adds 1 to the load average.
memory.json - This is a breakdown of the memory used and includes used, free and cache. The output of nodetool info also provides memory usage.
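
Since each process using or waiting for CPU adds 1 to the load average, the number only means something relative to the core count. Here's a tiny sketch of that arithmetic. I'm assuming you read the load average and core count off the files yourself, as I haven't relied on the JSON layout here. The limit of 1.0 per core is a common rule of thumb, not something the tarball tells you.

```python
def load_per_core(load_average, cores):
    """Normalise a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def is_overloaded(load_average, cores, limit=1.0):
    """True when more work is queued per core than the chosen limit allows."""
    return load_per_core(load_average, cores) > limit
```

For example, a load average of 8 on a 4-core node is 2.0 per core, so on average there is as much work waiting as running.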

> nodes > IP_of_node > solr

index_size.json - Displays the size of the Solr index.

Example Scenarios

A node is reported as down in the cluster.

The first place to start would be to refer to the nodetool status output from any node in the cluster. If a node is down, the nodetool status output found in '> nodes > IP_of_node > nodetool > status' will show DN status for that node, for example:

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns    Host ID                               Rack
UN  127.0.0.1  47.66 KB   1       33.3%   aaa1b7c1-6049-4a08-ad3e-3697a0e30e10  rack1
UN  127.0.0.2  47.67 KB   1       33.3%   1848c369-4306-4874-afdf-5c1e95b8732e  rack1
DN  127.0.0.3  47.67 KB   1       33.3%   49578bf1-728f-438d-b1c1-d8dd644b6f7f  rack1

The output above shows node 127.0.0.3 is down. Once you've identified which node is down you can check the system.log for the node in '> nodes > 127.0.0.3 > logs > cassandra'. The system.log will hopefully contain more information to help you determine why the node went down. You could also run 'nodetool describecluster', as it performs an actual RPC call to all nodes, which can reveal unreachable nodes and any schema disagreements.

Poor performance on a node.

If a node is experiencing poor performance we need to determine if the node is CPU bound, running out of memory, experiencing an IO problem or something more obscure like running out of OS resources. 

For CPU issues, you can refer to the cluster_info.json to determine the number of cluster cores. If the node has sufficient CPU cores you can then check '> nodes > IP_of_node > os-metrics > cpu.json' to see if the CPU is pegged. 

For memory issues, as before, you can check the cluster_info.json or the '> nodes > IP_of_node > machine-info.json' to determine the amount of RAM available. Assuming there is sufficient memory you can refer to the output of 'nodetool info' to see how much heap and off heap memory is in use. You may be able to resolve the issue by tuning the JVM and increasing the amount of memory allocated to the heap. It is also worth checking the system.log for garbage collection (GC) activity. Long GC pauses or lots of GC activity would suggest the node is being overloaded and further investigation is needed to check what is running on the node. You could refer to the 'nodetool tpstats' output, which will show you which thread pools are most active.
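
To dig GC pauses out of a long system.log, a sketch like the one below can save some scrolling. It assumes GC messages are logged by GCInspector and include a duration followed by 'ms'. The exact wording differs between Cassandra versions, so check a few lines from your own log first. The 1000 ms threshold and the function name are arbitrary choices of mine.

```python
import re

# Assumption: GC lines mention GCInspector and report a pause as "<number>ms" or "<number> ms".
GC_LINE = re.compile(r"GCInspector.*?(\d+)\s*ms\b")


def long_gc_pauses(log_lines, threshold_ms=1000):
    """Return (pause_ms, line) pairs for GC pauses at or above the threshold."""
    pauses = []
    for line in log_lines:
        match = GC_LINE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            pauses.append((int(match.group(1)), line.rstrip()))
    return pauses
```

You can pass an open file straight in, for example `long_gc_pauses(open("system.log", encoding="utf-8", errors="replace"))`, and then compare the timestamps of the long pauses with whatever else was happening on the node.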

For IO problems, if you're experiencing high read or write latency and the 'nodetool tpstats' output shows read and/or write (mutation) threads have pending activity, this could indicate an IO problem. The 'nodetool tpstats' output is momentary, so checking the same output in the system.log for climbing and sustained pending threads can be useful. Furthermore, you can refer to the '> nodes > IP_of_node > os-metrics > disk.json' file for output similar to iostat, which can help to identify an IO issue.

Finally, if you're seeing messages in the system.log referring to the number of open files then you may have an OS resource issue. You can check the OS resources allocated to a node by referring to '> nodes > IP_of_node > process_limits'. For the latest OS resource limits, JVM settings and disk tuning advice refer to the DataStax recommended production settings for your version of DSE.  
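
If you want to check the open files limit across many nodes, here is a hedged sketch. It assumes process_limits has the same layout as the Linux /proc/<pid>/limits file (a 'Limit / Soft Limit / Hard Limit / Units' table); if your file looks different the parsing will need adjusting. The function names are hypothetical.

```python
import re


def parse_limits(text):
    """Parse a /proc/<pid>/limits style table into {name: {'soft': ..., 'hard': ...}}."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip the header and anything that isn't a table row
        limits[fields[0]] = {"soft": fields[1], "hard": fields[2]}
    return limits


def open_files_limit(text):
    """Return the soft 'Max open files' limit as a string, or None if absent."""
    entry = parse_limits(text).get("Max open files")
    return None if entry is None else entry["soft"]
```

Values are kept as strings because a limit can be 'unlimited'. Compare what you find with the recommended production settings for your version of DSE.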

Conclusion

This blog post is not all-encompassing, and even after nearly 4 years of DSE support I'm still finding new things in a diagnostic tarball. We tend not to document in detail what is collected in a diagnostic tarball because our developers continue to make improvements and the content collected changes from version to version. However, this blog article should help to address that problem by acting as a torch in a dark cave and enlightening (pun intended) readers about the information available in a diagnostic tarball.