Read Path

DSE Version: 6.0

Transcript: 

Hello, this is Joe Chu again, and welcome to the DSE Search read path. In a previous video, we took a look at the write path for DSE Search and the process by which data written to the database is indexed by Search.

In this video we will be looking at its complement, the read path: how a search request is accepted, how it is processed across various nodes, and how the index is searched to find the appropriate search results.

The read path begins with the CQL search query, which comes in one of two possible forms: either a CQL query that cannot be executed without the help of the search index, or a CQL query that explicitly uses the solr_query column to pass in parameters for the search. Like all other CQL queries, the search query can be sent to any Search node, which will then serve as the coordinator for that particular search request.
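To make those two forms concrete, here is a minimal sketch using the DataStax Java driver. The demo.products table, its search index on the name column, and the default contact point are all assumptions made up for illustration.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class SearchQueryForms {
    public static void main(String[] args) {
        // Assumes a DSE Search node reachable on the driver's default contact point,
        // and a hypothetical demo.products table with a search index on "name".
        try (CqlSession session = CqlSession.builder().build()) {

            // Form 1: the solr_query column explicitly carries Solr query syntax.
            ResultSet rs = session.execute(
                "SELECT id, name FROM demo.products WHERE solr_query = 'name:chair' LIMIT 10");
            for (Row row : rs) {
                System.out.println(row.getString("id") + " -> " + row.getString("name"));
            }

            // Form 2: a plain CQL predicate on a non-key column, which (without
            // ALLOW FILTERING) can only be served by the search index.
            session.execute("SELECT id, name FROM demo.products WHERE name = 'chair' LIMIT 10");
        }
    }
}
```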

It is still possible to send search requests using Solr's original HTTP API as well, but for the purposes of this video, we will focus exclusively on the search read path for CQL queries.

To help visualize this, we start off with the CQL query here, which is first processed by the Cassandra component. The query is parsed and a determination is made as to whether it is a search query or not. The use of the solr_query column makes it easy, with the value of the solr_query column simply being passed on to the DSE Search component. If the CQL query does not use the solr_query column, but the query itself does require the use of the search index, the appropriate search request is constructed from the CQL query and is also passed to DSE Search.

The first contact point in DSE Search is the SolrCore component, which is the component that represents the search index. Each search index has its own SolrCore, which is identified by the keyspace and the table.

The SolrCore also includes a component called the SearchHandler, which is responsible for coordinating the read request.

The SearchHandler will then pass the request to the different search components that are registered with it, something that is usually done in the search index config. The most important of these search components is the QueryComponent, which is the one that builds the search results based on the score calculated for each document. Other components may not be as visible, but one that you should recognize is the FacetComponent, a separate component that builds the facet results for a facet query.

Anyway, going back to the QueryComponent: this is the beginning of the PREPARE phase in the search read path, and is where the search query is parsed and a shard request is constructed. The shard request is basically a sub-query with the same search parameters as the original query, but targeting only a specific shard, or node. If searching in a text field, the search parameters may also go through the text analysis chain that is set for queries in the search index schema.
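To picture how the SearchHandler drives its registered components through these phases, here is a much-simplified sketch. The method names loosely mirror Solr's SearchComponent API (prepare and process), but the types below are stand-ins for illustration, not the actual Solr or DSE source.

```java
import java.util.List;

// Stand-in for the object that carries the parsed query, shard requests, and partial results.
class ResponseBuilder { }

// Simplified view of a registered search component, e.g. QueryComponent or FacetComponent.
interface SearchComponent {
    void prepare(ResponseBuilder rb);   // PREPARE: parse the query, set up shard requests
    void process(ResponseBuilder rb);   // PROCESS: run the distributed stages and build results
}

class SearchHandlerSketch {
    private final List<SearchComponent> components; // registered via the search index config

    SearchHandlerSketch(List<SearchComponent> components) {
        this.components = components;
    }

    void handleRequest(ResponseBuilder rb) {
        for (SearchComponent c : components) {
            c.prepare(rb);              // every component gets a PREPARE pass first
        }
        for (SearchComponent c : components) {
            c.process(rb);              // then a PROCESS pass produces the actual results
        }
    }
}
```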

Following the PREPARE phase is the PROCESS phase, which covers the rest of the distributed read path and is divided up into multiple stages. During this phase, the SearchHandler consults the shard router, which determines which nodes to send shard requests to. The shard router is like a Cassandra partitioner, but with a much harder job. Rather than calculating a token value and finding the node endpoints that own that token value, the shard router has to calculate a set of nodes that covers all of the shards, or token ranges, while minimizing the number of nodes that need to be contacted and taking the health of each node into consideration.

Once the set of shards is determined, the SearchHandler will then send a sub-query to each of the target nodes.

The number of Search nodes in the datacenter and the replication factor for the keyspace determine the number of sub-queries needed to search the indexes for the entire set of data in a table. For example, if there is only a single node, a search request would only need to go through the search index on that node. If there are three nodes, the complete token range may be divided between those nodes, and all of them would need to be queried to determine the complete set of results. However, if the replication factor is greater than one, there may be an overlap in the token ranges held by each of the nodes, which means it is possible that only two nodes, or even one node, would cover the entire token range, again, depending on the replication factor.

The goal of the shard router is to minimize the number of nodes that sub-queries need to be sent to. If there are multiple nodes that can be used to fulfill a certain token range, the shard router will follow a set of criteria to determine which node to use for the current search query.

The first criterion is to select only those nodes that are active and online, and not in a state where they may be joining the cluster and bootstrapping, or starting up and not yet having loaded the search index.

The second criterion is to avoid nodes that are in the middle of re-indexing data, or whose last indexing attempt failed.

The third criterion is that the node must be healthy, as determined by a formula that takes into account the node's uptime and any indications of write requests that have been dropped by the node.

The fourth criterion looks to select the node closest to the coordinator, based on being in the same datacenter and the same rack.

If there are still multiple nodes that meet all of these criteria, then a random node will be selected.

At this point, the nodes have been selected. Since only one node is used for each token range, the consistency level of DSE Search queries is equivalent to LOCAL_ONE.
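The selection criteria can be pictured with a hypothetical sketch like the one below. This is not DSE's actual shard router code, and the real router also minimizes the total number of nodes across all token ranges; the health threshold here is an arbitrary illustrative value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of picking one replica for a single token range.
class ReplicaSelectionSketch {

    static class Node {
        boolean online;                // criterion 1: active, not bootstrapping, index loaded
        boolean indexingHealthy;       // criterion 2: not reindexing, last indexing attempt succeeded
        double healthScore;            // criterion 3: derived from uptime and dropped writes
        boolean sameRackAsCoordinator; // criterion 4: proximity to the coordinator
    }

    static Node pickReplica(List<Node> replicasForRange, Random random) {
        List<Node> eligible = new ArrayList<>();
        for (Node n : replicasForRange) {
            if (n.online && n.indexingHealthy && n.healthScore >= 0.5) { // 0.5 is an arbitrary threshold
                eligible.add(n);
            }
        }
        if (eligible.isEmpty()) {
            return null; // no healthy replica can cover this token range
        }
        // Prefer replicas in the same rack as the coordinator.
        List<Node> close = new ArrayList<>();
        for (Node n : eligible) {
            if (n.sameRackAsCoordinator) {
                close.add(n);
            }
        }
        List<Node> pool = close.isEmpty() ? eligible : close;
        // If several replicas still qualify, pick one at random.
        return pool.get(random.nextInt(pool.size()));
    }
}
```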

The sub-queries are sent to the nodes selected by the shard router. Each node then compiles its search results based on the requested number of results, which by default is 10. In this example, we are requesting a limit of 2, so only the top 2 search results would be compiled by each of the nodes. The ranking is usually based on the relevance score calculated between the row and the search query, but can also be based on the sorted order of a certain field or set of fields.

Rather than actually retrieving the requested columns for the search query, only the unique key is included.

The results from each node are then sent back to the coordinator. The coordinator merges the results and then re-ranks them, keeping only the required number of results and discarding the rest. This is why only the unique key is retrieved and returned to the coordinator, since the process would be much less efficient if nodes also had to read and transmit all of the data for rows that might be discarded anyway.
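Here is an illustrative sketch of that merge step, assuming each shard returns only (unique key, score) pairs for its local top N; the class and field names are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the coordinator re-ranking the combined shard hits and keeping the top `limit`.
class MergeSketch {

    static class Hit {
        final String uniqueKey;
        final float score;

        Hit(String uniqueKey, float score) {
            this.uniqueKey = uniqueKey;
            this.score = score;
        }
    }

    static List<Hit> mergeShardResults(List<List<Hit>> perShardTopN, int limit) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> shardHits : perShardTopN) {
            all.addAll(shardHits);
        }
        // Re-rank by score, highest first; only the unique keys of the survivors are
        // handed back to Cassandra to fetch the requested columns.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return all.subList(0, Math.min(limit, all.size()));
    }
}
```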

Results are then passed back to Cassandra, which uses the unique keys in the results to retrieve the requested column data, following the Cassandra read path, and then returns the results to the client.

Let's take a look now at what happens at the node level. Similar to the original search query, the shard request, or sub-query, will also be handled by the SolrCore and its specific SearchHandler for the search index. The request is again passed along to all of the registered search components, like the QueryComponent, to process.

For our specific case here, the QueryComponent calls a SolrIndexSearcher, a schema-aware object that controls the searching that occurs at the Lucene level. The SolrIndexSearcher also maintains a set of caches for quick retrieval of results or values for frequently used queries.

The SolrIndexSearcher uses an instance of an IndexSearcher, which spawns SegmentReaders, one for each segment in the search index, to read the index segments. Once the index has been searched, the results are returned to the QueryComponent, which prepares the response that is then passed back to the SearchHandler to be returned to the coordinator.

Here we take a look at how the index segments are searched. Segments are what make up the Lucene index for a search index. Each segment is fully self-contained, but together they represent the entire collection of indexed documents.

The IndexSearcher searches for documents using the index segments that have been committed, which can be read from memory or from disk.

Note that document updates that are placed in the RAM buffer are not actually searchable yet, as they need to be committed into an index segment and then attached to a segment reader.

If commits or merges are taking place, the IndexSearcher can still run and search the index as it exists at that point in time. However, a new set of readers needs to be instantiated in order to read the latest updates. The old segment readers will wait until currently active searches are completed before being cleaned up.
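To ground this in the underlying Lucene API, here is a small plain-Lucene sketch of the per-node behavior just described: a reader over the committed segments, a top-N search, and a reopen step that picks up segments committed after the reader was opened. DSE wraps this in SolrIndexSearcher and manages the reader lifecycle itself; the index path and field names here are made up.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

public class LuceneSearchSketch {
    public static void main(String[] args) throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index")); // hypothetical index location
        DirectoryReader reader = DirectoryReader.open(dir);              // sees only committed segments
        IndexSearcher searcher = new IndexSearcher(reader);

        // Top 2 scoring documents for a simple term query, mirroring the LIMIT 2 example above.
        TopDocs top = searcher.search(new TermQuery(new Term("name", "chair")), 2);
        for (ScoreDoc sd : top.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            System.out.println(doc.get("id") + " score=" + sd.score);
        }

        // After a commit or merge, a new reader is needed to see the latest segments;
        // openIfChanged returns null if nothing has changed since `reader` was opened.
        DirectoryReader newer = DirectoryReader.openIfChanged(reader);
        if (newer != null) {
            reader.close();              // the old reader is released once its searches finish
            reader = newer;
            searcher = new IndexSearcher(reader);
        }

        reader.close();
        dir.close();
    }
}
```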

There are actually several different caches in DSE Search, most of which come from Apache Solr. However, many are either intentionally disabled or not recommended to be modified or tuned by users.

The exception to this is the filter cache. It's the most important cache, as it stores the document ids associated with previously run filter queries. If certain filter queries are used often, they'll remain cached. Results for those filter queries can be retrieved quickly and then serve as the source set of documents to search within for the regular search query.

The implementation of the filter cache in DSE Search is different from vanilla Solr, and revolves around a more useful per-segment filter cache instead of a global filter cache. This allows each filter cache to hold only the filter query results from the corresponding segment and minimizes changes to the cache. The memory usage of the filter cache can also be controlled globally, across the search index on a node. This is based on the highWaterMarkMB and lowWaterMarkMB settings in the search index configuration. If memory usage hits the high watermark, cache entries are evicted until memory usage drops back to the low watermark level.
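The watermark behavior can be pictured with a toy sketch like the one below. This is not DSE's actual filter cache implementation, just the eviction pattern it describes: once the tracked size crosses the high watermark, entries are evicted (oldest first here) until usage falls back to the low watermark.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of high/low watermark eviction; sizes are tracked in bytes.
class WatermarkCacheSketch<K> {
    private final long highWaterMarkBytes;
    private final long lowWaterMarkBytes;
    private long currentBytes = 0;
    private final LinkedHashMap<K, byte[]> entries = new LinkedHashMap<>(); // insertion order = eviction order

    WatermarkCacheSketch(long highWaterMarkBytes, long lowWaterMarkBytes) {
        this.highWaterMarkBytes = highWaterMarkBytes;
        this.lowWaterMarkBytes = lowWaterMarkBytes;
    }

    void put(K key, byte[] value) {
        byte[] previous = entries.put(key, value);
        if (previous != null) {
            currentBytes -= previous.length; // replacing an entry frees its old size
        }
        currentBytes += value.length;
        if (currentBytes > highWaterMarkBytes) {
            evictToLowWaterMark();
        }
    }

    private void evictToLowWaterMark() {
        Iterator<Map.Entry<K, byte[]>> it = entries.entrySet().iterator();
        while (currentBytes > lowWaterMarkBytes && it.hasNext()) {
            Map.Entry<K, byte[]> oldest = it.next();
            currentBytes -= oldest.getValue().length;
            it.remove();
        }
    }
}
```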

Well, that's the simplified version of the Search read path, which should be just enough to understand how the read path works without getting overwhelmed by details that you won't necessarily encounter when working with Search.

One thing to watch out for, as mentioned before, is that recent changes to the index will not be visible until they are committed, which means that search results may not reflect new writes and updates immediately.

Commits themselves require segments to be flushed to disk, which consumes CPU and I/O resources, as do segment merges. Both of these, if they start to occur frequently, may have an impact on query performance.

In a later video, we will take a look at live indexing, also called real-time indexing, which is an alternate way to read and write index segments that helps increase indexing throughput and reduce latency on index reads. Look forward to that!
