Live Indexing

DSE Version: 6.0

Transcript: 

Hello! I'm Joe Chu and welcome to Live Indexing. Previously, we talked about the write path and the read path for DSE Search. Specifically, this was for the default process, which is called near real time indexing. However, for a while now DSE has also had an alternative process for indexing, which is what we call live indexing, or real-time indexing.

Live indexing is a feature that is found only in DataStax Enterprise, and has been around since version 4.7. That's at least 2 major releases and several minor releases, so it's been around for quite a while.

The main idea behind live indexing is to be able to search documents while changes are still in the RAM buffer, rather than having to wait for a commit and for a new index segment to be created. This should allow faster indexing throughput, because it is not necessary to soft commit frequently if the index is already readable, and that in turn reduces the flushes and merges that may take place in the background. Live indexing should also help to lower read latency, since Search will often be reading through the index directly in memory.

Let's first review what happens using near real time indexing.

When documents are updated, they are added to the in-memory RAM buffer.

A soft commit may be triggered by several different events, such as reaching the autoSoftCommit maxTime, and the updated documents will be flushed to disk as a new index segment.

As the number of segments increases, a merge can be triggered, which will combine multiple segments together into one larger segment.

Only the documents written in the index segments are searchable when retrieving results for a search query.

Now let's look at live indexing. Here we start off the same way, with document updates being indexed in the RAM buffer.

With live indexing, the changes to the index are readable while they're still in the RAM buffer.

Additional document updates will still be added to the RAM buffer and can be read from there. However, there is still a period of time where they may not be visible to new searches, but it can be much shorter compared to near real time indexing.

There are no soft commits with live indexing, since they are not needed for visibility. However, a hard commit can still be done instead, and occurs when the corresponding Cassandra memtable flushes, or if the RAM buffer threshold is reached.

Again, changes to the index will be added to the RAM buffer, and a search will read from both the RAM buffer and the index segments to retrieve results.

One other thing that is supported with live indexing is the use of off-heap memory for the RAM buffer, which can be enabled in the search index config. By being able to use memory that is not limited to the allocated heap, this should reduce the need for garbage collection and should also help improve performance.

There are several different ways to turn on live indexing, depending on what tool you're more comfortable with using. The simplest would be an index management command in CQL. If you want to create a new search index, you can do so with live indexing enabled by adding the config option realtime, set to true. Otherwise, an existing search index can have its configuration altered by using the configuration shortcut realtime and setting that to true.
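As a rough sketch, the two CQL approaches described here might look like the following. The keyspace and table name demo.users is a hypothetical placeholder and not something from this course, and the RELOAD step assumes that an altered configuration stays pending until the search index is reloaded.

```cql
-- Hypothetical table: demo.users. Substitute your own keyspace and table.

-- Option 1: create a new search index with live indexing enabled.
CREATE SEARCH INDEX ON demo.users WITH CONFIG { realtime:true };

-- Option 2: enable live indexing on an existing search index
-- by using the realtime configuration shortcut.
ALTER SEARCH INDEX CONFIG ON demo.users SET realtime = true;

-- Assumption: the altered config is only pending until the index is reloaded.
RELOAD SEARCH INDEX ON demo.users;
```

You would run only one of the two options, depending on whether the search index already exists.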

dsetool has been around the longest, and users may be more comfortable with using this command-line tool instead. If you would like to enable live indexing on a new search index, you will need to create a yaml file that contains the setting rt, and have that set to true. Afterwards the yaml file can be included as part of the coreoptions when creating a new search index with the dsetool command create_core. Note that you must also use the generateResources option to automatically build the solrconfig.xml. Otherwise if you're passing along your own solrconfig.xml file, dsetool will not be able to edit your file to include the rt setting.

Speaking of the solrconfig.xml, it is possible to add the setting to enable live indexing directly into the xml file. Under the indexConfig element, add the element rt, with the value true. Afterwards you can reload the search index to have the setting take effect.

Live indexing has the potential to significantly increase the read and write performance of DSE Search, but there are some recommendations that will help ensure that live indexing performs optimally.

We'll be going over these one by one, but the first that I'll mention is actually the last bullet point here, which is to limit the use of live indexing to only one search index in the cluster. Live indexing itself can be resource intensive, and enabling it for more than one search index may actually be detrimental for performance.

Another recommendation would be to lower the autoSoftCommit/maxTime value. The suggested value is 1000 milliseconds, which is actually the maximum value allowed when live indexing is turned on, and it can be set lower. With live indexing, this setting does not determine when a soft commit is performed, since we no longer need soft commits, but it does control when the view of the index in the RAM buffer is refreshed.

An example of changing this setting would be to use a CQL command like the one at the bottom here to change the auto soft commit time. Even with the maximum value of 1000 milliseconds, new changes to the index become readable much faster than with near real time indexing, which starts off at 10000 milliseconds, or 10 seconds, before your search results start to see new changes.
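For reference, a command of this kind might look like the sketch below. It is not the exact command from the slide. The table demo.users is a hypothetical placeholder, and the reload and describe steps are assumptions about how you might apply and then check the change in cqlsh.

```cql
-- Hypothetical table: demo.users. Substitute your own keyspace and table.

-- Set the auto soft commit time to 1000 milliseconds
-- by using the autoCommitTime configuration shortcut.
ALTER SEARCH INDEX CONFIG ON demo.users SET autoCommitTime = 1000;

-- Assumption: the altered config is only pending until the index is reloaded.
RELOAD SEARCH INDEX ON demo.users;

-- Optional check in cqlsh: show the config that is currently in use.
DESCRIBE ACTIVE SEARCH INDEX CONFIG ON demo.users;
```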

The next recommendation is to turn on off-heap allocation for the postings in the RAM buffer. As we mentioned a bit earlier, having the index written to and read from off-heap memory removes the limitations of heap memory usage, and can improve garbage collection times and prevent out-of-memory errors.

You can turn on off-heap allocation by executing the CQL command below, which modifies the search index configuration.

The final recommendation is to make sure you increase the size of the JVM heap as much as possible. With DSE Search, you usually will want to have at least 14GB of heap memory allocated for the JVM, if not more. A bigger heap will allow for a larger RAM buffer, and keeping index updates in the RAM buffer will decrease the number of background operations, such as flushes and merges, that occur.

The heap size can be changed in the cassandra-env.sh file, by uncommenting the settings MAX_HEAP_SIZE and HEAP_NEWSIZE, and then changing MAX_HEAP_SIZE to your desired value. If using the G1GC garbage collector, the value of the HEAP_NEWSIZE is not important and will actually be tuned automatically over time based on heap usage.

If you're upgrading from an earlier version of DSE, keep in mind that the setting ramBufferSizeMB is no longer used, as DSE will now adjust the size of the RAM buffer automatically. If you use a solrconfig from earlier versions with this setting, you may see the warning: Solr config, rambuffersizemb is not supported. There's no need to panic, as this setting will be ignored, but you may want to remove that setting from your solrconfig file when you're upgrading.

It's been a while since the last exercise, but we have some hands-on for you to do here. Open the notebook for the exercise live indexing, and get started!

And now, try out this exercise!
