Faceting

DSE Version: 6.0

Transcript: 

Faceting is a really powerful feature supported by DSE Search.

Even if you've never heard the term *faceting* before, you've almost certainly used faceting if you've ever shopped online.

Faceting involves providing some metadata about our search results, in conjunction with the results themselves.

For example, if a user on the KillrVideo website searches for movies with the term Alien in their title, it can be very valuable to that user to get a list of the genres contained in those results.  Using these facets, the user can query again with a more narrowly scoped search -- which ultimately helps them get to the most relevant results more quickly.

DSE Search provides a mechanism to let us do this.  Let's check it out.

Faceted results can be calculated quickly since they can be built solely from the index.

Any kind of faceting we do returns buckets and counts that summarize the search results.  There are several types of faceting available; they differ in the way that they distribute the results into those buckets.

Let's explore the different types and get a sense of when each one can help us.

The most common type of faceting is field faceting.

Here's how it works.

The specified search query is executed against the killrvideo videos search index.  But since faceting was requested, DSE Search doesn't return the video records themselves; instead it calculates facets based on the video records.

Genres is the field by which we want to group; and the results are organized into groups based on the values they have for the genres field.
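To make the grouping concrete, here is a minimal Python sketch of what field faceting computes. It is not how DSE Search implements it (the real counts come from the index); the sample records and the `field_facet` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical search results; genres is treated as a multi-valued field.
sample_results = [
    {"title": "Video A", "genres": ["Horror", "Science Fiction"]},
    {"title": "Video B", "genres": ["Action", "Science Fiction"]},
    {"title": "Video C", "genres": ["Crime", "Science Fiction"]},
]

def field_facet(docs, field):
    """Count how many documents carry each distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        # Assumes a document does not repeat a value within the field.
        for value in doc.get(field, []):
            counts[value] += 1
    return dict(counts)

genre_counts = field_facet(sample_results, "genres")
print(genre_counts)
```

Notice that the video records themselves are not part of the output; only the buckets and their counts are.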

Query faceting operates similarly to field faceting.

But instead of organizing the facets strictly by the field values, it organizes them by the results of one or more query expressions.

Range faceting organizes its facets according to a given field and ranges of that field's values.

Interval faceting is similar to range faceting in that it organizes facets according to a given field, but it allows more flexibility in how the ranges are specified.

Finally, pivot faceting organizes facets according to multiple fields and the various combinations of those fields' values.

Regardless of which type of faceting is run, the faceted results are returned in a single JSON object.  But, the structure of the JSON object varies based on what type of faceting was specified.

Earlier, we talked about the fact that faceting is only supported on fields under two conditions: when the field has docValues enabled in its index schema -- OR --  when the fieldCache is enabled for the query.

When docValues is enabled in the index schema, DSE Search keeps a separate index structure that is highly efficient for reading out the values for faceting.  The performance of this structure stays consistent even when the size of the index or the size of the result set is very large.

Similar to a sorting example we examined earlier, it's possible to get some strange faceting results when faceting occurs against a field that is tokenized.

In this example, we see odd facet values, like "adventure" with its trailing "e" missing; and we see "science" and "fiction" as separate facets.

This is a situation to watch out for since no error is thrown, so we won't be able to detect it by monitoring our application's error logs.

Like our sorting example earlier, the way to address this situation is to use a copyField so that you have both a tokenized version of the field for searching and a non-tokenized version to facet against.
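Here is a rough Python sketch of why a tokenized field produces those odd facets. The tokenizer below is a simple lowercase-and-split stand-in, so it reproduces the "science" and "fiction" split but not the stemming that drops the trailing "e"; the sample values are made up.

```python
from collections import Counter

# Hypothetical genre values, one per matching video.
genre_values = ["Science Fiction", "Adventure", "Science Fiction"]

def facet_untokenized(values):
    """Facet on the whole field value, as a non-tokenized field would."""
    return dict(Counter(values))

def facet_tokenized(values):
    """Facet on individual tokens, as a tokenized text field would."""
    counts = Counter()
    for value in values:
        # Stand-in analyzer: lowercase and split on whitespace (no stemming).
        counts.update(set(value.lower().split()))
    return dict(counts)

whole_value_facets = facet_untokenized(genre_values)
token_facets = facet_tokenized(genre_values)
print(whole_value_facets)
print(token_facets)
```

Both calls succeed without any error, which is exactly why this problem is easy to miss.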

Let's talk about how we specify our faceting options using JSON notation.

Before we jump into the facet parameter, let me just call out the fact that the q parameter is required for faceting operations.  And, don't forget that the SELECT clause is ignored for faceting requests since we don't return any field values anyway.

For basic field faceting, we need to specify the field by which to facet.  In this case, we're faceting by the genres field.

The JSON output of field faceting is a map made up of the field values and their respective counts.
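As a rough sketch, a field faceting request could be assembled like this in Python. The shape shown (a JSON `solr_query` with a `facet` object naming the field) and the `killrvideo.videos` table name are assumptions based on this transcript, so check them against your own schema and DSE version before relying on them.

```python
import json

# Assumed request shape: q is required, and facet names the field to group by.
facet_request = {"q": "title:Alien", "facet": {"field": "genres"}}
solr_query = json.dumps(facet_request)

# Assumed CQL wrapper; the keyspace and table names are taken from the transcript.
cql = f"SELECT * FROM killrvideo.videos WHERE solr_query = '{solr_query}';"
print(cql)
```

Building the JSON with `json.dumps` rather than by hand avoids quoting mistakes inside the CQL string.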

To specify query faceting options, we need to use the query parameter, and we need to specify at least one query that should be run.

The JSON faceting output is a map keyed by our queries, with the result count for each query as its value.

It's common in query faceting to pass more than one query; to do so, we pass an array of queries.

The output contains counts for all the queries we requested.
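The idea behind query faceting can be sketched in a few lines of Python. Each query string below is only a label, and the lambda next to it is a hand-written stand-in for what that query would match; the records, query strings, and `query_facet` helper are all hypothetical.

```python
# Hypothetical search results.
sample_videos = [
    {"title": "Video A", "avg_rating": 8.1, "release_year": 1979},
    {"title": "Video B", "avg_rating": 7.4, "release_year": 1986},
    {"title": "Video C", "avg_rating": 6.2, "release_year": 2017},
]

# Each facet query label is paired with a predicate that mimics it.
facet_queries = {
    "avg_rating:[7 TO *]": lambda doc: doc["avg_rating"] >= 7,
    "release_year:[2000 TO *]": lambda doc: doc["release_year"] >= 2000,
}

def query_facet(docs, queries):
    """Return one count per query: the number of documents it matches."""
    return {
        label: sum(1 for doc in docs if matches(doc))
        for label, matches in queries.items()
    }

query_counts = query_facet(sample_videos, facet_queries)
print(query_counts)
```

A single document can be counted under several queries, since each query is evaluated independently.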

Sometimes, we want to bucket our results based on a numeric or date field that has non-discrete values.  Instead of returning a bucket for each distinct value, in these cases, it makes sense to bucket into ranges.  This is where range faceting comes into play.

We need to specify a few more parameters to make range faceting work.  In addition to the field we're grouping on, we also need to provide the upper and lower bounds of our data values; and we need to specify the interval between ranges.

If we're creating facets based on multiple fields, we have to use a slightly more verbose syntax.

Here's an example.  We want to bucket based on the avg_rating field. Then, we want to see buckets for values between FIVE and SEVEN, in increments of POINT FIVE.

In our faceting output, we get four buckets back, based on ranges of FIVE, FIVE AND A HALF, SIX, and SIX AND A HALF.
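To see where those four buckets come from, here is a small Python sketch of the bucketing arithmetic. It assumes each bucket includes its lower bound and excludes its upper bound; the ratings and the `range_facet` helper are made up for illustration.

```python
def range_facet(values, start, end, gap):
    """Count values into buckets [start, start+gap), [start+gap, start+2*gap), ..."""
    n_buckets = round((end - start) / gap)
    counts = {start + i * gap: 0 for i in range(n_buckets)}
    for value in values:
        if start <= value < end:
            # Index of the bucket whose lower bound is at or below the value.
            i = min(int((value - start) / gap), n_buckets - 1)
            counts[start + i * gap] += 1
    return counts

# Hypothetical avg_rating values from the search results.
ratings = [4.8, 5.2, 5.5, 5.9, 6.1, 6.7, 6.9, 7.4]

rating_buckets = range_facet(ratings, start=5.0, end=7.0, gap=0.5)
print(rating_buckets)
```

The buckets are keyed by their lower bounds (5.0, 5.5, 6.0, and 6.5), and values outside the requested bounds, such as 4.8 and 7.4, are not counted.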

We can bucket by multiple fields in the same request.  To do so, we specify our fields in an array and then specify our upper bounds, lower bounds, and interval between ranges per field.

The faceting output returned would show the buckets for avg_rating and then the buckets for release_year.

Here is an example using a date field instead of a numeric field.  Notice the format of the start and end values. And, notice the PLUS ONE MONTH syntax used to specify the interval between ranges.

The faceting output is similar to what we saw with *numeric* fields, but the buckets are keyed by dates.
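The same bucketing idea with a one-month gap can be sketched in Python as follows. The dates are made up, the ISO 8601 UTC key format is an assumption for this example, and the `add_month` helper only handles dates that fall on the first of a month.

```python
from datetime import datetime

def add_month(d):
    """Return the first day of the following month (expects d.day == 1)."""
    return d.replace(year=d.year + d.month // 12, month=d.month % 12 + 1)

def monthly_range_facet(dates, start, end):
    """Count dates into one-month buckets keyed by each bucket's start date."""
    counts = {}
    lower = start
    while lower < end:
        upper = add_month(lower)
        key = lower.strftime("%Y-%m-%dT%H:%M:%SZ")
        counts[key] = sum(1 for d in dates if lower <= d < upper)
        lower = upper
    return counts

# Hypothetical date values from the search results.
added_dates = [
    datetime(2015, 1, 15),
    datetime(2015, 1, 20),
    datetime(2015, 3, 3),
    datetime(2015, 4, 2),  # falls outside the requested range
]

month_buckets = monthly_range_facet(
    added_dates, start=datetime(2015, 1, 1), end=datetime(2015, 4, 1)
)
print(month_buckets)
```

Notice that February still gets a bucket here even though nothing falls into it.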

Interval faceting is another way we can customize our buckets beyond the strict lines of field faceting.  In interval faceting, we explicitly indicate the bounds of each of our intervals.

We use Solr's range syntax to specify the lower and upper bound values -- square brackets for inclusive limits and parentheses for exclusive limits.  Those can be mixed and matched. We'll see an example of that here in a second.

Several fields can be specified at once, and of course, multiple intervals can be specified.

In this example, we're asking DSE Search to facet by the *genres* field; and we're specifying two intervals: one that starts at A and goes up to B (but not including B) -- so what we're really saying here is any value that starts with A -- and the other covers values that start with the letter T or higher.

Our results come back as a map whose key is the interval we *specified* and whose value is the count of results.
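Here is a minimal Python sketch of how two such intervals carve up a set of genre values. It approximates the comparison with Python's string ordering, which may not match the field's indexed ordering exactly; the interval strings, sample values, and helper names are illustrative.

```python
def make_interval_matcher(spec):
    """Build a predicate from a spec like "[A,B)" or "[T,*]" (* means unbounded)."""
    lower_inclusive = spec[0] == "["
    upper_inclusive = spec[-1] == "]"
    lower, upper = spec[1:-1].split(",")

    def contains(value):
        if lower != "*":
            if value < lower or (value == lower and not lower_inclusive):
                return False
        if upper != "*":
            if value > upper or (value == upper and not upper_inclusive):
                return False
        return True

    return contains

def interval_facet(values, specs):
    """Return a map from each interval spec to the number of values inside it."""
    matchers = {spec: make_interval_matcher(spec) for spec in specs}
    return {spec: sum(1 for v in values if m(v)) for spec, m in matchers.items()}

# Hypothetical genre values from the search results.
genre_names = ["Action", "Adventure", "Animation", "Biography",
               "Comedy", "Thriller", "War", "Western"]

interval_counts = interval_facet(genre_names, ["[A,B)", "[T,*]"])
print(interval_counts)
```

Values that fall in neither interval, such as "Biography" and "Comedy", simply don't contribute to any count.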

Finally, pivot faceting is very similar to field faceting, but instead of specifying a single field, we specify multiple fields using the pivot parameter.

What we'll get in return is a multi-level hierarchy of all the values across the specified fields.  As you can imagine, for fields with many values, these results can be large, so be careful about doing this across too many fields or fields with too many values since the results will grow exponentially and can cause some resource contention.

The output of pivot faceting can be a lot of data. Here, we're looking at a pared down version of some output to give you a sense of it.  You can see that we have counts, first by the genre value, and then the results are further broken down by release year.  We get counts at every level.
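The two-level structure can be sketched in Python like this. The nested dictionary layout is a simplification chosen for readability, not the exact JSON that DSE Search returns, and the sample records are made up.

```python
def pivot_facet(docs, outer_field, inner_field):
    """Count documents per outer value, then per inner value within each."""
    tree = {}
    for doc in docs:
        for outer_value in doc[outer_field]:  # outer field is multi-valued here
            node = tree.setdefault(outer_value, {"count": 0, "pivot": {}})
            node["count"] += 1
            inner_value = doc[inner_field]
            node["pivot"][inner_value] = node["pivot"].get(inner_value, 0) + 1
    return tree

# Hypothetical search results.
pivot_docs = [
    {"genres": ["Horror", "Science Fiction"], "release_year": 1979},
    {"genres": ["Science Fiction"], "release_year": 1986},
    {"genres": ["Science Fiction"], "release_year": 1979},
]

pivot_counts = pivot_facet(pivot_docs, "genres", "release_year")
print(pivot_counts)
```

Every distinct genre gets its own sub-map of years, which is why the output grows so quickly as fields and values are added.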

There are a few faceting parameters we can pass that help us control our output.

One such parameter is mincount. Mincount lets us specify the minimum threshold of count values that we want to see in our groupings.

For example, it's common to want to filter out any groupings that have zero results in them -- which is a possibility due to the way that the docValues structure works.

But, you can also filter out groups based on some higher threshold.

This is a handy feature and can cut down on the volume of results that we get in our faceting output.

The *limit* parameter lets us specify how many groups will be returned.  By default, this value is ONE HUNDRED, but we can ask for more or fewer by changing this parameter.
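The effect of these two parameters can be sketched in Python as a filter followed by a cut-off. The sample counts are made up, and ordering the groups by descending count before applying the limit is an assumption of this sketch.

```python
def apply_facet_options(counts, mincount=0, limit=100):
    """Drop groups below mincount, then keep at most `limit` groups."""
    kept = [(value, n) for value, n in counts.items() if n >= mincount]
    # Assumed ordering: highest count first, ties broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return dict(kept[:limit])

# Hypothetical raw facet counts, including groups with zero results.
raw_counts = {"Action": 12, "Comedy": 0, "Drama": 7, "Horror": 1, "Western": 0}

non_empty = apply_facet_options(raw_counts, mincount=1)
top_two = apply_facet_options(raw_counts, mincount=1, limit=2)
print(non_empty)
print(top_two)
```

With a mincount of one, the empty Comedy and Western groups disappear; adding a limit of two keeps only the two largest groups.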
