Open source Solr has an open ticket, SOLR-8362, for adding docValues support to TextField; it has not been acted on since it was filed in 2015. The main reason for stepping into a discussion of docValues and TextField in Solr is that DataStax has supported docValues on TextField in DSE Search since DSE 4.7. However, there are some important details not to overlook.

A Brief Review of docValues, and Why It’s Important

The column-based index called docValues is straightforward to explain briefly (I hope) when it comes to the simpler field types in Solr. Let's take StrField as an example. A StrField with multiValued="false" has a single term to index, and if docValues="true" in the field definition we also add the document to the docValues index for that term (or create the entry if we haven't seen that term in a previous document). In this way we have both a normal inverted index entry mapping the document to the term, and another index that maps the term to the documents containing it. Let's say we have a StrField named "firstName" and the term "Alice" comes in. By having Alice in both the "regular" inverted index and in docValues, we can efficiently group, sort, and facet on the firstName field without using the low-level Lucene feature called the field cache.
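In schema.xml, such a field might be declared like this (a sketch; the "string" type definition follows the convention of Solr's example schemas):

```xml
<!-- Single-valued string field with docValues enabled for grouping,
     sorting, and faceting -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="firstName" type="string" indexed="true" stored="true"
       multiValued="false" docValues="true"/>
```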

DocValues is important: it allows any query that implies a sort to use a disk-based data structure to satisfy the sort requirement, without pulling all the relevant data into memory and sorting it in the heap. Without docValues="true" on our firstName field, the only way to sort the results of a query on firstName is by sorting in memory. This is where the Lucene field cache comes into play.

The field cache has its haters, and has earned a bad reputation in some circles. Others love the field cache when they have a rare need to sort on a field and plenty of heap memory to satisfy the field cache's needs: the query works, and no time or disk space was spent on an additional index. The biggest problem with the field cache is when a naive query causes nodes in our Search data center to crash with Out-Of-Memory errors (OOM). Even if you have the heap headroom, a query using significant field cache memory can put additional load on the node and impact other queries negatively. To help prevent these problems, starting with DSE 5.0 we made a change requiring queries that would need the field cache to pass useFieldCache=true before they will be executed.

In either case, whether we use docValues or the field cache, when we sort (asc) on the field firstName, Alice comes before Bob and life is good.

TextField, Tokenization and Terms

Unlike the firstName field, which is a StrField, our next field is a TextField. This field type is designed to be tokenized into many terms. One can tokenize with one of the tokenizers bundled with Solr, or write one's own. With tokenizing, the one-to-one mapping of document field to term becomes one-to-many.

Tokenize We Must

When text is what we're dealing with, tokenizing and filtering the input is usually necessary to achieve the goals of the application. There are a few tokenizers available with Solr. Typically, they remove the "noise" of punctuation and whitespace, and the individual terms of interest are indexed. This is also a prerequisite for using term vectors and for highlighting terms in context. We can also filter; the most common filter is lowercasing, which removes case sensitivity. The input string "A Tale of Two Indexes" becomes the terms "tale", "two", "indexes", and if you add stemming to that, "indexes" becomes "index". This is good: one can search for those terms and identify not only which document and field contain the term, but where in the field the term resides.
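An analysis chain along those lines might be defined like this (a sketch; the type name "text_stemmed" is made up, and a stopword filter is included so that "a" and "of" are dropped as in the example above):

```xml
<!-- Tokenizing text type: split on word boundaries, lowercase,
     drop stopwords, then stem -->
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```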

However, any type of aggregation on tokenized text fields, anything that implies sorting, requires the use of the Lucene field cache, and there is no way to avoid it.

DocValues and Tokenize We Must Not

KeywordTokenizer is the "non-tokenizing" tokenizer, and it is the only one supported by DSE's implementation of docValues for TextField. Yet we sometimes find docValues set on tokenized text fields. The KeywordTokenizer doesn't tokenize the input; it takes the input as-is and indexes it as a single term. Unless, that is, the input is over 32k in size, which is Lucene's limit on an individual term. When the input is larger than 32k it's dropped and not added to the index.
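A TextField type restricted to the KeywordTokenizer might look like this (a sketch; the type name "text_keyword" is made up, and the lowercase filter is optional but common for case-insensitive matching):

```xml
<!-- A TextField that keeps the whole input as a single term; the only
     analyzer shape DSE supports for docValues on TextField -->
<fieldType name="text_keyword" class="solr.TextField" docValues="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```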

Tokenized Danger Field

No relation to the late Rodney.

So what happens when one sets docValues="true" for a DSE Search TextField that uses a real tokenizer? Let's take a look using the Wikipedia DSE demo, so those of you following along at home can join in the fun. Stand up your demo cluster; it's hands-on time.

Follow the instructions in dse-demos/wikipedia to set up the wiki.solr core.

With the small sample data set supplied we can run this query (this is using DSE 5.0.8):

curl 'http://localhost:8983/solr/wiki.solr/select?q=*%3A*&sort=body+asc&fl=id&wt=json&indent=true&useFieldCache=true'
{
  "responseHeader":{
    "status":0,
    "QTime":12},
  "response":{"numFound":3579,"start":0,"docs":[
      {
        "id":"23750998"},
      {
        "id":"23744459"},
      {
        "id":"23756443"},
      {
        "id":"23728198"},
      {
        "id":"23753868"},
      {
        "id":"23747942"},
      {
        "id":"23747464"},
      {
        "id":"23733153"},
      {
        "id":"23752816"},
      {
        "id":"23755567"}]
  }}

So we can see that we have 3579 total documents and we retrieved the first 10 results, but behind the scenes all of the body text terms were sorted in the Lucene field cache.

Now, I’m going to add docValues to the field and reindex:
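In schema.xml the change amounts to adding docValues="true" to the body field, something like the following (a sketch; the type name and other attributes here are illustrative and your demo schema may differ):

```xml
<field name="body" type="text" indexed="true" stored="true" docValues="true"/>
```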

Interestingly, there is a note in the schema comments about not doing this. Let's ignore it.

After making the change, we run:

dsetool reload_core wiki.solr schema=schema.xml deleteAll=true reindex=true

This rebuilds the index with docValues on our tokenized text field. A few minutes later our index is ready; we check with dsetool:

~$ dsetool core_indexing_status wiki.solr
[wiki.solr]: FINISHED

And we run our query again. This time we can leave off the useFieldCache parameter: since we have docValues, Solr will use them, and will ignore the parameter if it is supplied.

curl 'http://localhost:8983/solr/wiki.solr/select?q=*%3A*&sort=body+asc&fl=id&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":21},
  "response":{"numFound":3523,"start":0,"docs":[
      {
        "id":"23728577"},
      {
        "id":"23749558"},
      {
        "id":"23757596"},
      {
        "id":"23727743"},
      {
        "id":"23753931"},
      {
        "id":"23757153"},
      {
        "id":"23756516"},
      {
        "id":"23741382"},
      {
        "id":"23758748"},
      {
        "id":"23759925"}]
  }}

Two things jump out right away: The count is not the same, and the results are completely different. What happened?

First of all, if we examine system.log we'll find that the number of occurrences of the following exception equals the difference in "numFound":

Caused by: java.lang.IllegalArgumentException: DocValuesField "body" is too large, must be <= 32766

When we enabled docValues, Lucene acted on the raw field input, and a number of the inputs were too large. Those documents will no longer show up in our query results.

Next, look at the id lists returned by the two query invocations: the results are not even close. When we sorted with the field cache, the individual terms tokenized from the body were put into memory and sorted; that is very different from sorting on the untokenized input, which is all we can get when docValues is set.
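A toy sketch of that difference in pure Python (this is not how Lucene works internally; the two documents and the choice of the smallest term as the per-document sort key are made up for illustration):

```python
# Two tiny "documents": id -> body text
docs = {
    "1": "Zebras are striped",
    "2": "a zoo of animals",
}

def tokenize(text):
    # crude stand-in for an analyzer chain: lowercase, split on whitespace
    return text.lower().split()

# Field-cache-style sort: one term per document ends up as the sort key;
# here we arbitrarily pick the smallest term to stand in for that.
by_terms = sorted(docs, key=lambda d: min(tokenize(docs[d])))

# docValues-style sort: the raw, untokenized input is the sort key.
by_raw = sorted(docs, key=lambda d: docs[d])

print(by_terms)  # ['2', '1']  because "a" sorts before "are"
print(by_raw)    # ['1', '2']  because "Z" sorts before "a" in ASCII
```

Same two documents, two different orderings, and neither sort is "wrong" by its own rules; they simply aren't sorting the same keys.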

Some think we should stop honoring docValues when the tokenizer is not the supported one, to prevent the kind of query-results disaster we just walked through. However, we don't do this, and I think there is one good reason to set docValues on tokenized text: to prevent the field cache from being used at all, despite the incorrect query responses for sorting (or for any query that implies sorting, such as faceting).

So there you have it, a tale of two indexes.