Write Path

DSE Version: 6.0



Hi everybody! This is Joe Chu and welcome to the topic of this video, the DSE Search write path. In this video we'll be taking a look at how indexing takes place in Search, which will start with what happens when a write request is sent to the DSE cluster, and then following that request as it is sent to the replica node, and then indexed.

Typically, indexing does not happen unless there is a CQL write request that takes place, whether that is a INSERT, UPDATE or DELETE. The write request is still defined by the replication set for the keyspace, and the consistency level set for the operation. Depending on the number of replicas, the request will be forwarded to the appropriate nodes where the write is processed and the indexing follows afterwards. The indexing occurs independently across each of the Search nodes, with only the local data on a node being indexed. It's only when the write request has been processed and the indexing has completed would the write request be considered to be completed on the node.

The other way that indexing may occur would be in the event of a index rebuild, which can occur when a search index is created or reloaded, or triggered manually by a user. The rebuild will cause all nodes to read through their partitions and reindex that data, following the same path that we'll be seeing in just a few slides.

Here you can see an example of a DSE search cluster that is divided into two datacenters: a Search datacenter and a Cassandra datacenter. Based on the keyspace definition in the upper right, the data that will be written to the tables in this keyspace will have 2 replica nodes in the search datacenter, and three replica nodes in the Cassandra datacenter, for a total of five replicas.

Let's take a look at a write request that may take place, starting with a client that is connected to a node in the search datacenter. Without taking DSE Search into consideration, the write request will still follow the normal Cassandra write path, attempting to send the request to all five replica nodes asynchronously, and then waiting for enough responses to fulfill the requirements for the consistency level set for the write request.

Here the write request is forwarded to five different nodes, two in the search datacenter, and three in the cassandra datacenter. It doesn't matter that the write request originated in the search datacenter, the data for the write request will still be forwarded to the corresponding replica nodes in both of the datacenters.

Likewise, you can also have a client sending a write request to a node in the Cassandra datacenter.

The writes will still be forwarded to the five nodes across both datacenters. In fact, it makes sense to keep write requests to a datacenter meant for handling pure transactional writes and queries and can be scaled accordingly to handle that workload. Data can still be replicated to nodes on the search datacenter, and the datacenter can be scaled to handle the workload for search queries.

Now we get to the tasks that take place at the node level. Regardless of the number of replicas that the write request would be sent to, they all should do the same thing, although not necessarily at the same time.

The first thing that happens is that the write request itself is executed. That means that a row or partition is inserted, updated, or deleted.

Once that is completed, the secondary index hook will trigger the indexing for that data, and is divided up into two parts.

The first is the PREPARE phase which is when the partition data is read from the database before actually being indexed. We'll get into the reasoning behind why the data needs to be read again in just a bit.

The second, and final phase is the EXECUTE phase, which is when the search index is updated based on read data.

The write process and indexing is synchronous, so the write request would not be complete until the indexing is completed as well.

The write path has been modified in DSE 6, mainly due to the use of the new thread per core architecture, and does has some noticeable differences from previous versions. It used to be that indexing was asynchronous with writes, and therefore an indexing task would be placed in a queue following a write. While in the queue, it would be considered to be in the QUEUE phase.

Once the write has completed and the secondary index hook has triggered, the PREPARE phase starts and the indexing task is passed to a TPC thread.

The worker thread will re-read the corresponding row or partition that is being indexed from the database, and a Lucene document object is created from that data.

Afterwards the document object is then used during the EXECUTE phase.

The EXECUTE phase is where the document that was created using the CQL row data is actually indexed. This process that will be explained here is what is known as near real-time indexing. There is also an alternative process called real time indexing, which is explained in a later video.

Anyways, back to the EXECUTE phase. The UpdateHandler manages updating the index and writes the index changes into a RAM buffer in on-heap memory.  

Once in the RAM buffer, the EXECUTE phase, and the indexing task is completed.

As more writes occur on the node, the updated indexes are added to the RAM buffer. This will continue to occur until either a certain amount of time has passed, or a memory threshold is reached. Both of these conditions will cause a soft commit to occur, which will flush the indexes in the RAM buffer and write them into a new index segment on disk. It is only the entries stored in these index segments that are visible for searching. Segment files also have similar properties like Cassandra SSTables: they are immutable, and the complete search index is made up of the sum of all of the index segments.

The soft commit behavior can be tuned by the autoSoftCommit / MaxTime setting in the search index config, and determines how long indexes will accumulate in the ram buffer following an initial document added to the index in the ram buffer. The default value is ten thousand, and represents the number of milliseconds before a soft commit is automatically triggered. This setting can be tuned depending on if there are a lot of writes taking place and frequent soft commits slows down performance, which may indicate that the setting is too low. On the other hand, if the value is too high it may cause search results to not show recently added documents quickly enough, and can then be tuned lower.

Another setting is the autoSoftCommit / maxDocs setting, which determines the number of documents added since the last commit, before automatically triggering a soft commit. However, this is not very often used, with maxTime being the more useful setting.

It is also possible to issue a commit in a Search query, to ensure that all data and index changes are visible for the query. However, this is not recommended to do so in production though.

As more writes occur, updates are written to the RAM buffer, and then flushed to segments.

There is also a hard commit, which has the additional distinction of a fsync being done, which ensures that the index segments are written to disk at the OS level, and therefore durable. On the Cassandra side, a memtable flush will trigger a hard commit to ensure that both data and index are synced on disk. However, there is no longer a manually way to trigger a hard commit from the Search side.

Once there are a certain number of segments, a merge operation can take place and re-write several segments into a new segment. Like compactions in Cassandra, merge also has strategies that can be used to manage how segments are selected and the frequency of the operation. They are IO and CPU intensive, so one of the keys to understanding the write path is being able to avoid triggering extraneous merges from happening.

There is one other operation called optimize that can take place. This must be manually triggered, and will cause all of the index segments to merge into one large segment.

DSE Search also has its own commit log, which can be found for each search index on every search node. These do not the same function as commit logs in Cassandra, and are used primarily when a new node is bootstrapping data when it is joining the cluster.

Due to timing conditions, it is possible for writes to be sent to the node before the search index is ready to begin indexing. Therefore any new writes are added to the search commit log, and once the search index is enabled, the commit log will be replayed and the writes can be indexed then. Once data has finished bootstrapping on the node, the bootstrapped data is re-indexed separately from indexing occurring from new write requests.

There are some things to be aware of when it comes to write performance for indexing. One of these things is constant, continuous updates. As writes causes data to be indexed, background operations can have an impact on search query performance. If the write workload shows this kind of characteristic, make sure to monitor the commits and merges that occur and tune the configuration as needed to help prevent performance degradation.

The other pattern to watch out for are batch updates, particularly cause by events such as bulk loading data. Having too many writes taking place at once may cause them to start to time out and fail, as you may start to have resource contention from several different processes: including indexing, Cassandra compaction, and garbage collection.

When it comes to segment merging, it behaves quite similarly to Cassandra compactions, meaning that it can be very CPU and I/O intensive, and you want to reduce the number of these operations as much as possible.

One way to help control this is with the RAM buffer space threshold. You do not want to set this too low, as that may cause more frequent commits, which create more index segment and trigger merges to occur more often.

Also, avoid changing the merge settings in the search index config. Although these settings are available, some of them are not supported anymore. For the settings that do still work, we don't recommend changing them from their default values.

One last thing to be aware of when it comes to indexing is the brief data inconsistency that can occur between Cassandra data and search indexes. Specifically, there is a period of time between when a write request has completed and when that data may actually be returned in search results. Real time indexing, also known as live indexing, can drastically reduce the time before newly indexed data is searchable, and can minimize the period of inconsistency between data and index. Now that the process between writing data and indexing data is synchronous in DSE 6, that has actually eliminated some other sources of data inconsistency that were seen in earlier versions.

Also keep in mind that due to the distributed nature of the database, it is possible for the indexes across different nodes to be inconsistent for a period of time as well, due to the asynchronous nature of the write path. This can potentially cause indexes to be visible for searching sooner on some replica nodes and later on other replicas.

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.