Advancements to OpsCenter Amazon-S3 Backups

 

[Photo: “colo monkey”]
This is me circa 2007, setting up yet another datacenter and networking 300 servers, nicknamed “The Spartans”.

Tackling the 100TB beast!

For a long time we have all felt that the backup speed through OpsCenter to Amazon-S3 was “fine” or “to be expected with that much data.” Recently, however, we were put to the task of coming up with a better solution for much bigger data. Clusters are getting bigger, node counts are increasing, and big data is growing exponentially! We had to keep up with this trend and make backups work better, faster, BIGGER.

But how?

That was the million-dollar question: how do we accurately verify incremental backups and confirm consistency at a higher rate of data transfer?
Using a 4TB test cluster as a benchmark, the OpsCenter developers assessed the current rate of file transfer and began making code changes to the backup mechanism. We chose 4TB because it was a common size among our customers’ clusters, though we knew data would only grow larger as those clusters were used more. We started testing and hit a snag: the large files triggered an error:

ERROR StagingThread 2017-05-19 16:15:34,051 Got exception 'Maximum number of retries reached for sending snapshots/425fa20f-e7b7-44f7-9d65-3d9c5701d84c/sstables/73affcd06b18e61d886e2a76c9b25943-cid_prod_1-content_items-ka-15-Data.db to s3 bucket dse-backup-prod-meta-a (3 retries)' while sending opsagent.backups.entities.SSTable@ad0b51c6 to :7327d2e534564ff892d238541254fbcb 
clojure.lang.ExceptionInfo: throw+: {:type :opsagent.backups.destinations/unknown-error, :throwable #error { 
:cause "Resetting to invalid mark" 
:via 
[{:type com.amazonaws.ResetException 
:message "Failed to reset the request input stream; If the request involves an input stream, the maximum stream buffer size can be configured via request.getRequestClientOptions().setReadLimit(int)" 
:at [com.amazonaws.http.AmazonHttpClient$RequestExecutor resetRequestInputStream "AmazonHttpClient.java" 1305]} 
{:type java.io.IOException 
:message "Resetting to invalid mark" 
:at [java.io.BufferedInputStream reset "BufferedInputStream.java" 437]}] 
:trace 

OPSC-12278 Amazon-S3 backups fail with :message "Resetting to invalid mark"

The SSTable files were so large that uploads to Amazon-S3 retried too many times and errored out with “Resetting to invalid mark.” Here is the code snippet (from the AWS SDK’s AmazonHttpClient) that was causing our problem:

while (true) {
    checkInterrupted();
    if (originalContent instanceof BufferedInputStream && originalContent.markSupported()) {
        // Mark everytime for BufferedInputStream, since the marker could have been invalidated
        final int readLimit = requestConfig.getRequestClientOptions().getReadLimit();
        originalContent.mark(readLimit);
    }
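
The stack trace’s own suggestion is to raise the read limit so that mark()/reset() can cover a retry. Purely for context, that workaround looks roughly like the following with the AWS SDK for Java; the bucket, key, and file path are made up for illustration, and buffering that much data in memory is exactly why it was not the answer for multi-gigabyte SSTables.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;

public class ReadLimitWorkaround {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical SSTable path and bucket/key, purely for illustration.
        File sstable = new File("/var/lib/cassandra/data/example-Data.db");
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(sstable.length());

        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(sstable))) {
            PutObjectRequest request =
                new PutObjectRequest("example-backup-bucket", "snapshots/example-Data.db", in, metadata);

            // The mitigation suggested by the error message: allow mark()/reset()
            // to rewind up to this many bytes if the SDK has to retry the upload.
            // For multi-gigabyte SSTables this means buffering that much data in
            // memory, which is why we looked for a different fix.
            request.getRequestClientOptions().setReadLimit(128 * 1024 * 1024);

            s3.putObject(request);
        }
    }
}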

Something had to be done about this: if we were hitting it while testing large SSTable files, customers must have been hitting it as well. And they were!  We put together a code fix that substitutes the AWS SDK’s ResettableInputStream for the BufferedInputStream we had been using in outbound destinations.  We also made sure that this code path never falls into the “if block” above, so it is never subject to the troublesome mark() call. That solved the problem. However, we now had another problem to solve.
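
The agent code itself is Clojure, so the following is only a minimal Java sketch of the idea behind the fix, with hypothetical bucket, key, and file names: feed the upload from the SDK’s file-backed ResettableInputStream, which can rewind to the start of the file on a retry, instead of from a BufferedInputStream.

import com.amazonaws.internal.ResettableInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.File;

public class ResettableUploadSketch {
    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical SSTable path and bucket/key, purely for illustration.
        File sstable = new File("/var/lib/cassandra/data/example-Data.db");
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(sstable.length());

        // ResettableInputStream is file-backed, so the SDK can rewind it to the
        // beginning of the file on a retry regardless of the SSTable's size.
        // It is also not a BufferedInputStream, so the mark(readLimit) branch
        // shown in the snippet above is never taken.
        try (ResettableInputStream in = new ResettableInputStream(sstable)) {
            s3.putObject(new PutObjectRequest(
                "example-backup-bucket", "snapshots/example-Data.db", in, metadata));
        }
    }
}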

With OpsCenter 6.1.1 due to be released soon and already completing its final rounds of testing, it was impossible to get the fix into that release. A hotfix build was created and made available to whoever was facing this issue, and the full fix is included in the OpsCenter 6.1.2 release.

OK, with that behind us, where were we with the large SSTables and the data transfers?

Although we had already increased backup speeds by 20% at this point, we were not happy.  Incremental backups have to be faster to be effective.  An incremental backup verifies that the data already in the backup destination has not been modified between backup cycles.  OpsCenter handled this by taking an md5sum of the files on the local system and cross-referencing it with what is on the Amazon-S3 destination.  If they match, no backup is necessary; move along, these are not the droids we are looking for.  However, in this test case we were writing to every SSTable constantly, so essentially no file ever qualified for an incremental. Every backup we performed was another full backup.
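
To make that check concrete, here is a minimal Java sketch of the idea: compute the md5sum of the local SSTable and compare it to a digest for the copy already in Amazon-S3. This is not the OpsCenter agent’s actual code, and using the S3 ETag as the remote digest is an assumption that only holds for single-part, unencrypted uploads; the bucket, key, and path are likewise made up.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class IncrementalCheckSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical SSTable path and bucket/key, purely for illustration.
        Path sstable = Paths.get("/var/lib/cassandra/data/example-Data.db");
        String bucket = "example-backup-bucket";
        String key = "snapshots/example-Data.db";

        // md5sum of the local file.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(sstable), md5)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading the stream to the end feeds every byte through the digest.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        String localDigest = hex.toString();

        // Digest of the copy already in S3. Using the ETag is an assumption:
        // it only equals the file's MD5 for single-part, unencrypted uploads.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        boolean unchanged = s3.doesObjectExist(bucket, key)
            && localDigest.equalsIgnoreCase(s3.getObjectMetadata(bucket, key).getETag());

        System.out.println(unchanged
            ? "Unchanged since the last backup -- skip the upload."
            : "New or modified -- upload it again.");
    }
}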

We needed to bring back the ability for OpsCenter to check which SSTables are already compressed and uploaded to Amazon-S3 before uploading them again; that functionality had existed before but was removed due to issues with the way Amazon-S3 caches the results of existence checks.  The plan was to find a way to do this while working within Amazon-S3’s own guidelines.  To do that, we started using the AWS CLI to transfer the files. Since we are still testing this AWS CLI approach, it was added as a [labs] feature that must be enabled before use:

[labs]
use_s3_cli = True
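
The exact command the agent runs is not shown in this post, but the general idea of handing a transfer off to the AWS CLI from the JVM looks something like the sketch below; the aws s3 cp invocation, file path, and bucket are illustrative assumptions.

import java.io.IOException;
import java.util.Arrays;

public class AwsCliTransferSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical SSTable path and destination, purely for illustration.
        String localFile = "/var/lib/cassandra/data/example-Data.db";
        String destination = "s3://example-backup-bucket/snapshots/example-Data.db";

        // Hand the transfer off to the AWS CLI, which handles multipart uploads,
        // parallelism, and retries on its own.
        Process process = new ProcessBuilder(Arrays.asList("aws", "s3", "cp", localFile, destination))
            .inheritIO()
            .start();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("aws s3 cp exited with code " + exitCode);
        }
    }
}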

It should be noted that [labs] features in OpsCenter are experimental features still in development. As in the example above, a [labs] feature must be explicitly configured before it can be used. If you decide to try this one, or any other [labs] feature for that matter, be sure to send DataStax feedback so it can be improved. Be aware, however, that DataStax does not recommend use of, or guarantee the performance of, [labs] features in production, so you use this feature at your own risk until it graduates from labs to an official feature.

When we tested with this [labs] switch enabled, we saw an 85% throughput increase against the 4TB of data! We did it!  Incredible!  The increased speed and the use of the AWS CLI also opened up another opportunity for OpsCenter 6.1.3, where we will add better handling to resume from failures without failing the entire backup.  It was a long road and a hard-fought battle to get here. Now to tackle making spinning up a new cluster even faster!

Best regards, and happy backups!
Sean Fuller - US West Coast Support Team