Search Index Management

DSE Version: 6.0

Video

Transcript: 

Previously our discussion has centered on schemas. In this section, we add focus to config, and a number of operations involving DSE Search indexes.

As stated earlier, you can create and maintain DSE Search indexes using CQL alone, or also use the Apache Lucene legacy XML encoded data files, and the DSE dsetool command line utility.

Most of the commands you run using CQL and dsetool will be data center wide in scope. A very small number of commands, like checking new DSE Search index build completeness, require dsetool and are single node in scope.

Creating ( adding ) or updating resource files, like a synonym file, can currently only be completed using dsetool, and is data center wide in scope.

Entering a longish section then, where we detail a number of DSE Search index examples using CQL.

CREATE SEARCH INDEX is always the first command you need to run, after which you often run a number of ALTER SEARCH INDEX statements to get exactly what you want.

In the first example, a DSE Search index is created on every column in the named DSE table. Fun and easy. Please do this only if you're developing, or on non-production systems. For any larger data set, we would only index the columns we are certain will actually use the index.
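A minimal sketch of the index-everything form, assuming a hypothetical keyspace and table named demo.movies (these names are illustrative and do not come from the original slides):

```cql
-- Assumption: demo.movies is a hypothetical table used only for illustration.
-- With no COLUMNS clause, every column in the table is indexed.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies;
```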

The last example begins to explain that a fuller CREATE SEARCH INDEX statement names only specific columns, with specific schema specifications, as well as specific config settings, and more.

Here we see two or three examples, showing that you can name specific columns for the DSE Search index, and even set specific attributes, like docValues=true or false.
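As a rough illustration of naming specific columns, the statement might look like the following. The table demo.movies and the columns title and release_year are hypothetical names chosen for this sketch.

```cql
-- Assumption: demo.movies, title, and release_year are hypothetical names.
-- Only the named columns are indexed; release_year also gets docValues.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies
  WITH COLUMNS title, release_year { docValues:true };
```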

Recall that upon receipt of the first ( and only ) CREATE SEARCH INDEX statement, the DSE Search runtime automatically records metadata about the DSE table primary key.

Here we see that copyField, and specific field attributes, namely, docValues true|false, indexed true|false, and others may be set via CREATE SEARCH INDEX.

You may exclude any DSE table column proper from the DSE Search index, either by never naming it, never adding DSE Search index metadata for it, or by using the excluded switch.
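A short sketch of the excluded switch, again assuming the hypothetical table demo.movies with a column named plot that we do not want indexed:

```cql
-- Assumption: demo.movies and its plot column are hypothetical.
-- Index all columns (*), but explicitly exclude plot.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies
  WITH COLUMNS *, plot { excluded:true };
```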

To aid in setting given DSE Search index schema metadata attributes quickly, a number of profiles exist, that contain many common or popular config related settings.

Each of the profiles acts as a macro for a single attribute, or a set of attributes, that might otherwise have been set at the field type or field level.

Officially Apache Lucene supports table joins, although a better title for this ability might be ( SQL ) nested selects; you can't actually ( join ) two tables using Apache Lucene.

If you need to support (nested selects, aka, joins ) using DSE Search, there are field type and field attributes you need to add. If you never plan to join, you should leave the resources needed to support joins ( un-calculated, uncollected ).

As a profile, spaceSavingNoJoin sets these field type and field attributes for you.

spaceSavingSlowTriePrecision-

At a high level: you can, if you choose, index every integer key value distinctly, yielding the best performance. But performance always costs resources; in this case, additional disk space to record every integer key value distinctly.

If you use the index associated with that integer infrequently, or can suffer a little less performance in favor of saving a good measure of disk space, you can ( dial this back a bit ).

At a high level, imagine DSE Search will index ( ranges ) of integer key values, saving disk space. When looking for a particular integer key value, DSE Search will position at the start of the range containing the integer key value, then scan within said range. Slower, but certainly less disk space.

This precision, this ranging, is available for most or all Trie column types.

And the spaceSavingAll profile applies both profiles above.
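Applying a profile at index creation time might be sketched as follows. The profile name is one of those discussed above; the table demo.movies is a hypothetical name.

```cql
-- Assumption: demo.movies is a hypothetical table.
-- spaceSavingAll combines spaceSavingNoJoin and spaceSavingSlowTriePrecision.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies
  WITH PROFILES spaceSavingAll;
```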

The default query field can be set using config, and at the query level (CQL SELECT).  Here we see the default query field set via config.

Default query fields are handy when trying to write query predicates referencing multiple tokens. For example, find all movie titles containing "Clifford, The Big Red Dog". There are many ways to write this query, and using a default query field is just one.
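One way this could look, assuming a hypothetical demo.movies table whose title column should serve as the default query field:

```cql
-- Assumption: demo.movies and title are hypothetical names.
-- Set the default query field through config at creation time.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies
  WITH CONFIG { defaultQueryField:'title' };

-- With a default query field in place, the predicate need not name a field.
SELECT * FROM demo.movies WHERE solr_query = 'clifford big red dog';
```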

Also using config, you can tune, DSE table by DSE table, how changes to the DSE table proper are written to the DSE Search index, and how certain DSE Search index queries are cached.

DSE Search cached queries require the "fq" filter query syntax.
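A minimal sketch of a query that uses the "fq" filter query syntax, written in the JSON form of solr_query. The table and column names are hypothetical.

```cql
-- Assumption: demo.movies, title, and release_year are hypothetical names.
-- "q" is the main query; "fq" is the filter query, which is what gets cached.
SELECT * FROM demo.movies
  WHERE solr_query = '{"q":"title:clifford", "fq":"release_year:2019"}';
```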

In addition to DSE Search index schemas, with their profiles, and config attributes, are a number of DSE Search index ( options ).

In effect, how should DSE respond if the DSE Search runtime discovers an inconsistency in any of its index lists? The choices here are recovery and reindex, with descriptions as shown.

Also, how should the DSE Search index runtime respond when receiving casting errors, such as the end user application program trying to index unsupported data types and similar?
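A hedged sketch of setting index options at creation time. The recovery option is named above; mapping the unsupported data type behavior to the lenient option is this sketch's reading, not something the transcript states, and demo.movies is a hypothetical table.

```cql
-- Assumption: demo.movies is hypothetical; lenient is assumed to be the
-- option that governs how unsupported column types are handled.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies
  WITH OPTIONS { recovery:true, lenient:true };
```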

Just like the CQL DESCRIBE TABLE command, there is a CQL DESCRIBE SEARCH INDEX command.

As there is a schema, both pending and active for any given DSE Search index, there is also a pending and active config.

Output is written using an XML format.
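For illustration, the pending and active forms of the command might be issued as follows, assuming a hypothetical table demo.movies:

```cql
-- Assumption: demo.movies is a hypothetical table.
-- Show the schema and config as they will be after the next deploy.
DESCRIBE PENDING SEARCH INDEX SCHEMA ON demo.movies;
DESCRIBE PENDING SEARCH INDEX CONFIG ON demo.movies;

-- Show the schema and config currently in use.
DESCRIBE ACTIVE SEARCH INDEX SCHEMA ON demo.movies;
DESCRIBE ACTIVE SEARCH INDEX CONFIG ON demo.movies;
```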

Once you become comfortable with what DSE Search indexes can deliver, you will commonly design a CREATE SEARCH INDEX statement, followed by a number of ALTER SEARCH INDEX statements.

Why?

Largely because of DSE Search text analytics. Specifying the very specific, and perhaps multiple, treatments for any given DSE table column can become several lines long. You could try to fit all of that into one CREATE SEARCH INDEX statement, but why? It would be the equivalent of a relational database stored procedure that is 10 pages long. Don't do it.

The DSE Search index exists as metadata only until you deploy. Why not just simplify, and organize what you want into a small number of well delineated, simple commands?

And that command is, ALTER SEARCH INDEX.

You can alter the schema, or config, or both.

You can add, drop, or modify DSE table columns to the DSE Search index, and more.

In effect, you add, drop, or modify field types and fields, (and more).

Recall that all changes will be PENDING, until you deploy.

Using ALTER SEARCH INDEX, you can add, drop, or modify all of the ( topics ) we discussed previously; field types, fields, attributes to same, config, copyFields, dynamicFields, and more.

In these two examples, we add an existing DSE table column proper, to the DSE Search index.

In the second example, we see we can specify any specific attributes to the field type or field, as required by our application.
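A sketch of both forms, assuming a hypothetical table demo.movies with an existing column named plot. The attribute values are illustrative, the StrField type is assumed to exist in the schema already, and the two statements are alternatives rather than a sequence.

```cql
-- Assumption: demo.movies and plot are hypothetical names.
-- Form 1: add an existing table column with default field settings.
ALTER SEARCH INDEX SCHEMA ON demo.movies ADD FIELD plot;

-- Form 2 (alternative): add the same column while spelling out attributes.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  ADD fields.field[@name='plot', @type='StrField', @indexed='true'];
```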

The note at the bottom is a topic we detail on the next page; use types.fieldType when ( changing ) field types, and use fields.field when ( changing ) fields.

In the first example, we are adding a field type of StrField to the DSE Search index schema.

In the second example, we add a field to the DSE Search index schema. And we specifically set the attributes, docValues=true, and indexed=true.

If there was a DSE table column named, "title_sort", we'd be done.

As it turns out, the third example displays a copyField from "title", to "title_sort".
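The three steps described above can be sketched as one sequence. The table demo.movies is a hypothetical name; title and title_sort follow the names used in the discussion.

```cql
-- Assumption: demo.movies is a hypothetical table with a TEXT column title.
-- 1. Add a field type of StrField.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  ADD types.fieldType[@name='StrField', @class='org.apache.solr.schema.StrField'];

-- 2. Add a field that uses it, with docValues and indexed enabled.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  ADD fields.field[@name='title_sort', @type='StrField', @indexed='true', @docValues='true'];

-- 3. Copy the contents of title into title_sort at index time.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  ADD copyField[@source='title', @dest='title_sort'];
```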

If this group of code is accurate, title would have to be a column in the DSE table proper, probably a DSE column type of TEXT.

Just to be clear; this example could largely have been delivered without the copyField, and instead could have just DSE Search indexed title directly.

As written, the end user program would have to know to range and sort on the DSE Search index field named title_sort, and not title.

  Generally we state:

     You can run DSE Search queries using CQL alone, or using a DSE Search predicate in JSON format, or non-JSON format.

     You can issue CREATE or ALTER SEARCH INDEX statements using CQL JSON, and the legacy Apache Lucene XML encoded data files.

     ( If ) you wish, ALTER SEARCH INDEX statements can also be written using a JSON map format.

Minimally the use case for this is to avoid issuing multiple CQL ALTER SEARCH INDEX statements to accomplish the same end goal.

In this example we see the CQL JSON map formatted ALTER SEARCH INDEX command.

In this example, we add a field type of type TextField.

The analyzer for this field type is specified to be the Standard Tokenizer, (generally: split on white space and discard punctuation). This tokenizer is followed by the lower case filter (in effect, fold the DSE Search index key value to lower case).
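A hedged sketch of what such a JSON map statement might look like. The field type name TextField_lower and the table demo.movies are hypothetical; the tokenizer and filter classes are the standard Solr factories for the behavior described.

```cql
-- Assumption: demo.movies and TextField_lower are hypothetical names.
-- The JSON map carries the analyzer: tokenizer first, then the lower case filter.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  ADD types.fieldType[@name='TextField_lower', @class='org.apache.solr.schema.TextField']
  WITH $$ {"analyzer": {
             "tokenizer": {"class": "solr.StandardTokenizerFactory"},
             "filter": {"class": "solr.LowerCaseFilterFactory"}}} $$;
```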

In this example, we create a second field type, similar to the previous field type, but add a stop word filter. In effect, do not index words like The, And, and A (and others as specified in the "stopwords.txt" data file).

Recall that "stopwords.txt" is a resource file, and must be added to the DSE Search index after the DSE Search index is initially created, but before any DSE ALTER SEARCH INDEX statement makes reference to this resource.
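A sketch of the stop word variant, assuming "stopwords.txt" has already been added as a resource file. The names demo.movies and TextField_stop are hypothetical, and the filter order shown is one reasonable choice rather than the original slide's.

```cql
-- Assumption: demo.movies and TextField_stop are hypothetical names,
-- and stopwords.txt was uploaded to the index beforehand.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  ADD types.fieldType[@name='TextField_stop', @class='org.apache.solr.schema.TextField']
  WITH $$ {"analyzer": {
             "tokenizer": {"class": "solr.StandardTokenizerFactory"},
             "filter": [
               {"class": "solr.LowerCaseFilterFactory"},
               {"class": "solr.StopFilterFactory", "words": "stopwords.txt"}]}} $$;
```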

In most previous examples, we added a field to the DSE Search index schema. Using SET, we can modify a DSE Search index field. Example as shown.
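A rough sketch of the SET form. The exact element path (field versus fields.field) should be checked against the ALTER SEARCH INDEX SCHEMA reference for your DSE version; demo.movies and title are hypothetical names.

```cql
-- Assumption: demo.movies and title are hypothetical; the element path
-- shown here is this sketch's best understanding, not confirmed by the source.
ALTER SEARCH INDEX SCHEMA ON demo.movies
  SET field[@name='title']@docValues = 'true';
```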

Here we see the syntax to drop a field from the DSE Search index; the source column in the DSE table proper is left unchanged, it's just no longer indexed.
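Dropping a field might be sketched as follows, assuming the hypothetical table demo.movies and the title_sort field from the earlier discussion:

```cql
-- Assumption: demo.movies is a hypothetical table.
-- Removes title_sort from the search index schema only;
-- the table itself and its data are untouched.
ALTER SEARCH INDEX SCHEMA ON demo.movies DROP field title_sort;
```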

And then examples dropping a field from the DSE Search index.

The second example details how to drop a copyField.

The last example leaves the DSE Search index column, but drops an attribute on said column, specifically the docValues attribute.

Just as you can add, modify, or drop a DSE Search index schema element, you can perform the same to DSE Search index config elements.

In this example, we modify the manner in which DSE Core propagates data modifications from the base table, to the DSE Search index.
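As an illustration, a config change of this kind might look like the following. The table name is hypothetical and the 30000 millisecond value is an arbitrary example, not a recommendation.

```cql
-- Assumption: demo.movies is hypothetical; 30000 ms is an arbitrary value.
-- autoCommitTime controls how long before table changes become visible
-- to search queries.
ALTER SEARCH INDEX CONFIG ON demo.movies SET autoCommitTime = 30000;
```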

Previously, all of our discussion has centered on changing the DSE Search index schema, or config.

By default, a CQL CREATE SEARCH INDEX will automatically deploy, but all of our CQL ALTER SEARCH INDEX statements only changed metadata.

As an aside; to avoid actually creating a DSE Search index upon receipt of a DSE Search CREATE SEARCH INDEX, use the OPTIONS reindex = false parameter.
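A minimal sketch of creating the index metadata without building the index, assuming a hypothetical table demo.movies:

```cql
-- Assumption: demo.movies is a hypothetical table.
-- The schema and config are recorded, but no index build is started.
CREATE SEARCH INDEX IF NOT EXISTS ON demo.movies
  WITH OPTIONS { reindex:false };
```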

Here we begin a discussion of 4 commands to actually deploy our DSE Search index.

RELOAD SEARCH INDEX moves any DSE Search index schema and config changes from PENDING to ACTIVE. No DSE Search index rebuild occurs as the result of a RELOAD.

Following a RELOAD, the call to REBUILD will actually deploy any changes; cause the new index to be built, and made available to end user applications.

Any current DSE Search index will remain in place, taking end user requests, until the new index build is complete. Once the new index is complete, the ( old ) index is dropped with no interruption of service.

The deleteAll parameter is handy if you do not have space to store two concurrent copies of any index on disk. Be aware, however, newly arriving DSE Search queries may receive incomplete result sets as the old index went away, and the new index is still being built.
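The deploy sequence described above can be sketched as follows, assuming a hypothetical table demo.movies. The second REBUILD is an alternative to the first, not an extra step.

```cql
-- Assumption: demo.movies is a hypothetical table.
-- Promote pending schema and config changes to active.
RELOAD SEARCH INDEX ON demo.movies;

-- Build the new index while the old one keeps serving queries.
REBUILD SEARCH INDEX ON demo.movies;

-- Alternative: drop the old index data first to save disk space,
-- accepting incomplete query results while the build runs.
REBUILD SEARCH INDEX ON demo.movies WITH OPTIONS { deleteAll:true };
```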

You can track DSE Search index build progress via a form of the dsetool command.

And there is a DROP SEARCH INDEX command, obviously.

If you're using this command while developing, be aware that this command also drops any resource files by default.

Use deleteResources=false to override this behavior, if that is your preference.

deleteDataDir calls to actually delete the DSE Search index data files on disk, should that also be your preference.
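A hedged sketch of the drop command with both options spelled out. The table demo.movies is hypothetical, and the option values are illustrative.

```cql
-- Assumption: demo.movies is a hypothetical table.
-- Keep the resource files, but delete the index data files on disk.
DROP SEARCH INDEX ON demo.movies
  WITH OPTIONS { deleteResources:false, deleteDataDir:true };
```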

While changes to the DSE table proper, will automatically update the key value entries in the DSE Search index, and this is tunable, when you are developing, sometimes you are in a hurry.

The COMMIT SEARCH INDEX command has changes to the DSE table proper pushed to the DSE Search index immediately.
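For example, assuming a hypothetical table demo.movies:

```cql
-- Assumption: demo.movies is a hypothetical table.
-- Make recent writes searchable now instead of waiting for the auto commit.
COMMIT SEARCH INDEX ON demo.movies;
```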

And then just a quick note; DSE Search index builds happen asynchronously in the background.

But, if you feel you need it, do not forget about the CQLSH specific environment variable setting, CQLSH_SEARCH_MANAGEMENT_TIMEOUT_SECONDS.
