JSON Search Query

DSE Version: 6.0



I'm Jim Hatcher, and this is Advanced Search with DSE Search

Earlier in this course, we've learned how to use the solr_query field as a means of specifying search queries via CQL.

This is a simple way to integrate our search logic with CQL queries.  It lets us specify the basic query we want to run. However, there are more advanced things that can be done in Search, and to be able to harness these additional features, we need a more expressive interface.

For those occasions, we have the option of specifying our search logic in a more structured way using JSON notation.  

When we pass a JSON object into the solr_query field, CQL will automatically detect that we are doing so, and it will adjust to using that interface.

At a minimum, we need to make sure we pass the "q" parameter which is our basic query.
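For instance, a plain-string query and its JSON equivalent might look like this (the `movies` table and the `title` field here are hypothetical stand-ins for whatever your schema defines):

```
-- Plain-string form of solr_query:
SELECT * FROM movies WHERE solr_query = 'title:chisum';

-- JSON form; CQL detects the JSON object and switches to that interface.
-- "q" carries the same basic query:
SELECT * FROM movies WHERE solr_query = '{"q": "title:chisum"}';
```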

Some of the more advanced features that we can specify in our searches include: whether or not to enable the field cache, how to sort and paginate our search results, and what time zone to use for date-based math.

We will cover these options in the following segment.

Then, we will do a deeper dive on two very powerful and useful features: faceting and filter queries.

We know that when we're setting up the schema for our search indices that we can specify whether docValues should be enabled for each field being indexed.  Two common reasons for enabling docValues are to support sorting or faceting on that field.

However, it is possible to sort or facet on a field without docValues enabled by using the fieldCache on a search-by-search basis.

Using JSON notation, we can indicate to DSE Search that we want the field cache enabled for a given search operation.

However, there's a reason that the field cache is disabled by default -- namely, that it can result in memory spikes and out-of-memory exceptions -- and so we don't recommend using this option in your Production applications.  However, it can be a handy tool to keep in your bag of tricks during development or when you're doing some basic data exploration.

Here, you can see an example of enabling the field cache on a search.  We haven't yet covered how to indicate sorting and faceting options -- which are the two cases where enabling the field cache will benefit you -- but we'll get to those soon.
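As a sketch, the request might enable the field cache with a `useFieldCache` flag alongside a sort on a field that lacks docValues (the table and field names are hypothetical, and the exact parameter name should be checked against your DSE version's documentation):

```
-- Enable the field cache for this one search so we can sort on a
-- field without docValues (development / data exploration only):
SELECT * FROM movies
WHERE solr_query = '{"q": "*:*", "useFieldCache": true, "sort": "genre asc"}';
```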

We know that one of the big benefits of using a *Search* system to give us our query results is that in addition to effectively filtering the data, it also tells us the relevancy of our results in terms of a score.  In DSE Search, results are sorted by this relevancy score by *default*.

However, sometimes we need to get our results back in a different order.

To instruct DSE Search to sort differently, we specify those instructions in our JSON-based request using the "sort" parameter.

We need to indicate not only what field or fields to sort on, but also the direction in which to sort -- either ascending or descending.  For a field to be eligible to be sorted on, it needs to be included in the search index, have docValues enabled, and not be multi-valued or tokenized into multiple tokens.

Sometimes, you run into cases where you need a field to be tokenized *and* you need to be able to sort on it.  If you run into this situation, create a copyField in your schema where one copy of the field is tokenized and the other copy is not tokenized and has docValues enabled.  We cover copyField in other parts of this course.

Here is what our sort instruction looks like in our JSON request object.

In this example, we are telling Search to sort results *first* by the release_year field in ascending order -- *then* by the release_date field in descending order.
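Assuming a hypothetical `movies` table, the request described above might be written as:

```
-- Sort by release_year ascending, then release_date descending:
SELECT * FROM movies
WHERE solr_query = '{"q": "*:*", "sort": "release_year asc, release_date desc"}';
```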

Here is an example of what happens if you attempt to sort by a tokenized field -- in this case, "title."  You won't get an error, but the results will probably not be what you'd expect. If you need to sort on a tokenized field, apply our copyField technique.  You could create a non-tokenized, docValues-enabled field called title_sortable, for instance; and when you sort by *that* field, you'd get the output you need.
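A sketch of that workaround, sorting on the non-tokenized copy rather than the original field (the `movies` table and the `title_sortable` copy are illustrative):

```
-- Sorting on the tokenized "title" field gives surprising ordering;
-- sorting on its non-tokenized, docValues-enabled copy does not:
SELECT * FROM movies
WHERE solr_query = '{"q": "*:*", "sort": "title_sortable asc"}';
```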

In the DSE world, we're often dealing with Big Data volumes, and so it stands to reason that our searches can yield some big result sets.

There are plenty of times when it makes sense to only look at a slice or a *subset* of our results; and so, we need a way to page through our data.

When we're paging through our data, we often want to also specify how the data should be *sorted* -- so that our paged result sets will come back in a predictable and repeatable order.

In DSE Search, we have two flavors of paging: basic pagination and cursor-based paging.

Let's dig into each of those a little more.

Basic pagination, as its name suggests, is the *simplest* means of paging through results.

To use *basic* paging, we simply supply the *start* parameter in our search request.  This *start* value is the offset into the overall result set -- that is, how many leading results to skip.  We also typically use the CQL *LIMIT* clause to indicate the *size* of each page.

When using *basic* pagination, for the *first* page, we don't need to *specify* a start value; we just ask for the first three records.  However, on every *subsequent* request, we want to get an offset of the previous records, and so we *specify* the start parameter.

This is a simple and effective way of paging through data; *however*, that simplicity comes at the cost of *performance* as you page deeper and deeper into your results.  We'll talk about that in a minute.

But first, let's walk through an example using basic pagination.

We ask DSE Search to give us three records.  We get: Rio Lobo, Chisum, and Mackenna's Gold.

On our *next* request, we specify a start value of 3; and in our results, we get the next three movies.  And on the subsequent request, we specify a start value of *SIX* and get the final three movies.
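Put together, the three requests in this walkthrough might look like the following (the `movies` table is a hypothetical stand-in, and a sort is included so the page order is repeatable):

```
-- Page 1: no "start" needed; LIMIT sets the page size.
SELECT title FROM movies
WHERE solr_query = '{"q": "*:*", "sort": "title asc"}' LIMIT 3;

-- Page 2: skip the first three results.
SELECT title FROM movies
WHERE solr_query = '{"q": "*:*", "sort": "title asc", "start": 3}' LIMIT 3;

-- Page 3: skip the first six results.
SELECT title FROM movies
WHERE solr_query = '{"q": "*:*", "sort": "title asc", "start": 6}' LIMIT 3;
```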

When you're just peeking into your results a few levels *deep*, this approach is effective.  However, on each *subsequent* retrieval, the response times will be slightly slower -- because *internally*, for DSE to be able to get your nth page, it still has to retrieve all the previous pages before that -- even though it doesn't send them to you.

For cases where *deeper* paging is required, we have an alternative approach: *cursor*-based paging.  Let's look at that now.

In cursor-based paging, a cursor is employed which marks the place from which the last result was returned -- and this cursor mark can be used to resume the next operation.  This cursor represents application *state*, and so it has to involve another layer of the stack -- namely, the DataStax *drivers*.

To instruct DSE Search to employ *cursor-based* paging in the driver, we add a *paging* parameter to our request with the hard-coded value of *driver*.

Notice that when using cursor-based paging that we do not *NEED* to specify a "start" parameter; in fact, you'll get an error if you *do*.  We're letting the *driver* keep up with where its cursor is, so we don't want to give *con*flicting information by specifying our own offset in CQL.
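A minimal sketch of the cursor-based form, again against a hypothetical `movies` table:

```
-- Let the DataStax driver manage the cursor; note there is no "start"
-- parameter -- combining "start" with driver paging raises an error:
SELECT * FROM movies
WHERE solr_query = '{"q": "*:*", "paging": "driver"}';
```

The page size in this mode comes from the driver's fetch-size setting rather than from a CQL LIMIT.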

As an *alternative* to passing the paging parameter on a statement-by-statement basis, we can also enable it by default on *every* statement by adjusting a setting in the dse.yaml.  But, the default is to handle it per statement.

Here's a handy feature that you can use when working with date fields.

In Search, date values are stored in Coordinated Universal Time.  However, if you're working with application logic that assumes *another* time zone, it can become complex to have to constantly convert from the local time to *UTC* time on every interaction with the data layer.

Enter the *TZ* parameter.  With the TZ -- or time zone -- parameter, the local time zone can be specified, and we effectively push the work of handling all the time zone conversion logic down into the *data* layer where we don't have to worry about it.

The potential values that can be passed into the TZ parameter are any values supported by the Java *TimeZone* object.  There are *hundreds* to choose from, and you can get those from the online Java documentation. There's even a list on Wikipedia.

The *NOW* function complements the use of the TZ parameter; so, you'll often see them used together.

Here's an example where we want to see a list of movies released in the last *month*.  Using the *NOW* syntax, we're asking Search to give us everything between the *current* month and one month ago.

You'll notice that in the *first* query, we specify that we want to *reason* about our dates using New York City time; and we don't get any results.

In the *second* query, we do the exact same thing except that we ask for results relative to the time in *Kolkata*; and in this case, we *do* get a result back.
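The pair of queries described above might be written like this (the table and field names are hypothetical):

```
-- Reason about dates in New York time -- no results in this example:
SELECT title FROM movies
WHERE solr_query = '{"q": "release_date:[NOW-1MONTH TO NOW]", "TZ": "America/New_York"}';

-- The same query relative to Kolkata time -- this one returns a row:
SELECT title FROM movies
WHERE solr_query = '{"q": "release_date:[NOW-1MONTH TO NOW]", "TZ": "Asia/Kolkata"}';
```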

It turns out that the *time zone* difference from one side of the planet to the *other* created a big enough gap in this case that it affected our result set.  So, this is a good habit to get into when using date-based logic in searches.
