Schema and Field Types

DSE Version: 6.0

Transcript: 

As discussed, the complete definition of a DSE Search index involves metadata for both the schema and the config. As a generalization, you spend most of your attention defining the schema.

Defining the schema involves specifying field types, and then fields, each having distinct properties.

Let's get started.

Legacy Apache Lucene defined schemas using an XML encoded data file generally titled schema.xml. In DSE, you can fully define the schema using CQL commands alone, but it's best to know both.

First we define ( field types ). A field type includes a number of properties that are then applied to fields. A field type might be applied to one or several fields.  Fields also have properties, and generally, field type properties are inherited by fields.

In this context, the ( field ) is the DSE Search indexed key value. Generally, the key value that is indexed originates from a column in a DSE table, but this DSE Search indexed key value can also be derived. More on this topic as we continue.

Field types and fields are DSE table wide in scope.

As stated earlier, DSE Search index schemas can be specified using CQL, either JSON formatted or not, or via the legacy Lucene XML encoded data files.

While there may be outliers, expect that there is nothing you can accomplish using XML that you cannot also accomplish using CQL.

The CQL DESCRIBE SEARCH INDEX output is XML formatted.

If you have any experience reading XML, you already know that XML data files carry (embed) the schema of the data they contain. Thus, the schema.xml file has a schema of its own, in addition to the DSE Search index schema it defines.

After the XML data file schema, and as stated earlier, the XML schema.xml then has sections for field types, and then fields.

In both schema.xml and any similar CQL command, you will see metadata for both the DSE Search indexed columns and the DSE Core primary key columns, the latter labeled uniqueKey.

Here we have a fragment of a schema.xml file.

Two field types are followed by two fields; purely a coincidence, as a field type might be referenced by a larger number of fields.

A minimum field type definition has a name and a class. The name is an identifier of your choosing; here we see string and long. If you had several different treatments of a string field type, you might see identifiers titled String1, String2, and so on. The class refers to the actual Java class that supports this native data type.

The sub-identifier "Trie" is a reference to the longer phrase, "prefix tree field type", and generally indicates a field type that is also ( optionally available ) for sorting, group by, faceting, and other similar operations.

A minimum field definition has a name and a type. The type must refer to an existing field type name. The name is the DSE Search indexed column name; the same name you apply DSE Search predicates to. This name may or may not match a same named column in the DSE table proper.

If the DSE Search indexed field sources directly from a DSE table column proper, it is common for the name to equal the DSE table column name. If the DSE Search indexed field was derived (does not directly map to a single DSE table column), then this name identifies a column in the DSE Search index, where no same named column exists in the DSE table proper.

To put it another way: there may be index key values in the DSE Search index that do not appear in the DSE table proper.

The uniqueKey at the bottom of this display lists the primary key of the DSE table proper.

Just to be excessively chatty: all of the contents of this schema.xml file might have been generated simply by asking DSE Search to index the DSE table column titled amount.

When the first column in any DSE table is DSE Search indexed, the DSE Search index runtime automatically records metadata for the DSE table primary key proper. In this case, that is the primary key column titled id.
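The slide itself is not reproduced in this transcript, so the following is only a sketch of what such a fragment might look like. The XML embedded below is a reconstruction from the description above (two field types named string and long, two fields named id and amount, and a uniqueKey of id); the class names, the schema name, and the helper function summarize_schema are assumptions made for illustration. The Python around it simply checks the rule stated above: every field's type must refer to an existing field type name.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema.xml fragment, reconstructed from the spoken description.
# The attribute values shown here are assumptions, not the actual slide.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="string" class="org.apache.solr.schema.StrField"/>
    <fieldType name="long" class="org.apache.solr.schema.TrieLongField"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true"/>
    <field name="amount" type="long" indexed="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def summarize_schema(xml_text):
    """Return field types, fields, fields with an unknown type, and the uniqueKey."""
    root = ET.fromstring(xml_text)
    # Field type name -> Java class name.
    field_types = {ft.get("name"): ft.get("class") for ft in root.iter("fieldType")}
    # Field name -> field type name.
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}
    # A field is invalid when its type does not match any field type name.
    unknown = sorted(name for name, type_name in fields.items()
                     if type_name not in field_types)
    unique_key = root.findtext("uniqueKey")
    return field_types, fields, unknown, unique_key


field_types, fields, unknown, unique_key = summarize_schema(SCHEMA_XML)
print("field types:", field_types)
print("fields:", fields)
print("fields with unknown type:", unknown)
print("uniqueKey:", unique_key)
```

Changing the type of amount to a name that is not declared under types would make it show up in the list of fields with an unknown type.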

When discussing the indexing and search of just textual data, generally there are two categories of treatment you can apply: string field and text field.

You can treat the (entire string) as a whole unit, unmodified, and support literal searches, wild card searches, case insensitive or not, and more. In this manner you could query the movie title, "The Flash" with high confidence.

Or, you can tokenize the text and apply various treatments to it, including sounds like, synonyms, phrase and proximity searches, stemming (correcting for past and future tense references, plural versus singular references, and so on), and much more.

With tokenized text, a search for the movie titled "The Flash", might also return the movie titles "Flash Dance", "Flash Gordon", and whatever else.
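As a toy illustration of the difference (this is not how DSE Search is implemented, and a real engine would also score and rank the results), the two treatments can be sketched in a few lines. The function names are hypothetical, and "Top Gun" is a made-up extra sample added only so that one title does not match.

```python
# Sample titles; the first three come from the discussion above,
# "Top Gun" is an added sample that should never match.
TITLES = ["The Flash", "Flash Dance", "Flash Gordon", "Top Gun"]


def string_match(titles, query):
    """String-field style: the whole value is one unit, so only exact matches hit."""
    return [title for title in titles if title == query]


def tokenized_match(titles, query):
    """Text-field style: split into lowercase words and match on any shared word."""
    query_tokens = set(query.lower().split())
    return [title for title in titles
            if query_tokens & set(title.lower().split())]


print(string_match(TITLES, "The Flash"))
print(tokenized_match(TITLES, "The Flash"))
```

The exact match returns only "The Flash", while the tokenized match also returns "Flash Dance" and "Flash Gordon" because they share the token flash.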

When working with textual data as tokenized text, there are specific choices based on language; past tense being corrected (indexed) differently in Finnish than it is in English, I'm told.

And as introduced earlier, DSE Search can index integers, unique identifiers (UUIDs), and other data types, whether scalar or geo-spatial/spatial.

As stated earlier, the minimum field type definition has a name, and a class.

Additionally, and specific to a given ( field type ), zero or more attributes may be applied. The attributes you apply to a decimal field type may differ from those applied to a string.

It would not be unusual to have a single text field definition with 12 or more attributes. But, non-text field types usually have fewer attributes.

Each DSE Search (Apache Lucene) field type has a matching DSE Core column type.

A small subset of this mapping is displayed here.

It may be possible that some DSE Core column types are not supported by DSE Search, Apache Lucene.

As stated above, there are basically two treatments available for textual data; string field, and text field.

String field is the default generated field type for any DSE Core TEXT column type, and similar. You can change these to DSE Search text fields (index them as text) when you have specific application needs.

As text fields are tokenized, and actually indexed as separate entries (multiple tokens) in the DSE Search index proper, expect that you cannot index the primary key of a DSE table as a text field; that is, you cannot tokenize it. You can copy this key value using a technique we cover below, then tokenize it, but this effectively becomes a second column, at least to the DSE Search index.

String fields are not tokenized, and may optionally be used for sorts, exact matches, (and fuzzy matches), faceting, and more.

You cannot sort on tokenized text; what would it mean to return DSE result sets sorted on a tokenized set of text from a paragraph? Do you return the first word first (in effect, not tokenized), or sort by any word found in the paragraph, which kind of wouldn't make sense?

Text fields are perhaps the heart of DSE Search. Certainly, it's the area of DSE Search with the most programmability.

Text fields can be tokenized, are expected to be tokenized, via a construct titled, an analyzer. There are prepackaged (unchained) analyzers, and chained analyzers, where you can specify one or more filters, a tokenizer, and more.

In this example, we see a text field with the unchained (pre-packaged) analyzer titled Standard Tokenizer. Generally, the standard tokenizer breaks (tokenizes) the named DSE table column value proper on whitespace, and discards most or all punctuation.

If this is not the exact behavior you seek, you may choose another unchained analyzer, or specify a (chained) analyzer of your exact requirements. You specify a chained analyzer by assembling a specific tokenizer, zero or more filters, and more.
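To make the idea of a chained analyzer concrete, here is a minimal sketch in plain Python: one tokenizer followed by zero or more filters, applied in order. Everything here is a stand-in written for illustration. simple_tokenizer only approximates "break on whitespace and discard punctuation", and the lowercase and stop word filters are assumed examples of filters, not a statement of what the slide configures.

```python
import re


def simple_tokenizer(text):
    """Rough stand-in for a tokenizer: keep runs of word characters, drop punctuation."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter: normalize every token to lowercase."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stop_words=frozenset({"the", "a", "an"})):
    """Filter: drop very common words (the stop word list is an assumption)."""
    return [token for token in tokens if token not in stop_words]


def make_analyzer(tokenizer, *filters):
    """Chain one tokenizer and zero or more filters into a single analyzer."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyzer = make_analyzer(simple_tokenizer, lowercase_filter, stop_filter)
print(analyzer("The Flash, Gordon!"))
```

Swapping the tokenizer or reordering the filters changes which tokens land in the index, which is why the choice of analyzer matters so much for text fields.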

Any of the field types beginning with the literal string Trie are used for scalar type data; integers, floating point numbers, dates, and more. (Non-tokenized text, strings, uses the string field type discussed above.)

In addition to supporting most query predicates, such as equality and ranges, Trie field types are also optionally able to support sorting.

But note: sorting and other abilities are enabled via the field type and field attributes referred to earlier. Generally these additional abilities consume disk space and potentially other resources. Thus, if you're not sorting on a given column, do not configure it to support a sort.

As stated above, a minimum field definition has a name and a type. The field ( type ), matches an existing field type's ( name ). Also stated above, you may optionally list zero or more field attributes.

One of the many field attributes is titled, indexed.

So let's break this down.

Let's say a DSE table has 10 columns.

DSE Search effectively knows nothing about those 10 columns until you create DSE Search metadata for one or more of those DSE table columns.

This DSE Search index metadata is the schema; the field type and field discussed above. Adding a field makes a DSE table column proper known to DSE Search.

Generally, you might expect that a DSE Search index field is indeed DSE Search indexed; otherwise, why are you bothering to introduce DSE Search metadata about said column?

The use case to not ( index ) a given column is when you use the ( value ) in a DSE table column proper, to derive another key value; in this case, an index key column, not directly mapped to a single DSE table column proper.

This use case may not be entirely common, but it's out there.

Most DSE Search indexed fields are in fact, indexed.

If a DSE Search index field is not indexed, you can expect that this DSE table column proper had its value sourced to derive some other DSE Search key value.

Also on this page we mention the attributes of field types and fields. Attributes set at the field type level may be overridden at the field level.
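The inherit-then-override rule can be pictured as a simple dictionary merge. This is a sketch of the rule only, not of how DSE Search resolves attributes internally; the helper effective_attributes, the field names, and the attribute values are all assumptions chosen for illustration.

```python
def effective_attributes(field_type_attrs, field_attrs):
    """Start from the field type's attributes, then let the field's own win."""
    merged = dict(field_type_attrs)
    merged.update(field_attrs)
    return merged


# Hypothetical field type that sets a default of indexed="true".
long_type = {"class": "org.apache.solr.schema.TrieLongField", "indexed": "true"}

# One field inherits the default, the other overrides it.
amount = effective_attributes(long_type, {"name": "amount", "type": "long"})
amount_raw = effective_attributes(
    long_type, {"name": "amount_raw", "type": "long", "indexed": "false"})

print(amount["indexed"])
print(amount_raw["indexed"])
```

Here amount inherits indexed="true" from its field type, while amount_raw ends up with indexed="false" because the field level setting takes precedence.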

Historically, Apache Lucene has always had the inverted index. Text searches returned results sorted by score; how closely any given element of the result set was believed to match your query criteria. Best matches first.

And Apache Lucene would use a sorting algorithm to sort non-textual data by a natural sort order, ascending or descending; we're talking about scalars and the field types integer, date, other.

To improve performance, Apache Lucene added a second index type, which is enabled by adding the docValues=true attribute to given fields.

docValues=true is the default for certain field types, but again, if you're not actually sorting on that field, turn docValues off: docValues=false.

docValues is false by default for StrField; if you need to sort, set docValues to true.
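One way to picture the two index types, as a rough sketch and not Lucene's actual on-disk layout: the inverted index answers "which documents contain this value", while docValues is commonly described as a column-oriented mapping from each document to its value, which is the lookup a sort needs. The document ids, amounts, and function names below are made up for illustration.

```python
# Made-up documents: document id -> amount.
docs = {1: 30, 2: 10, 3: 20}

# Inverted index: value -> list of document ids containing that value.
inverted_index = {}
for doc_id, value in docs.items():
    inverted_index.setdefault(value, []).append(doc_id)

# docValues-style structure: document id -> value, ready for per-document lookup.
doc_values = dict(docs)


def find(value):
    """Search: the inverted index goes straight from a value to matching documents."""
    return inverted_index.get(value, [])


def sort_hits(hits):
    """Sort: each hit needs its own value, which the doc id -> value mapping provides."""
    return sorted(hits, key=lambda doc_id: doc_values[doc_id])


print(find(10))
print(sort_hits([1, 2, 3]))
```

Without the second structure, sorting the hits would mean working backwards through the inverted index to recover each document's value, which is the cost this attribute is meant to avoid.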

multiValued true or false is a field level attribute, used to inform DSE Search (Apache Lucene) that the source DSE column type is a collection, basically an array. If you are DSE Search indexing DSE column types of set, list, or map, you should set multiValued to true.

multiValued true is ( generally ) used nowhere other than when working with DSE collection column types, and there are additional attributes in this area, not mentioned here.

Dynamic fields are used in many places including the older, legacy Apache Lucene geo-spatial and spatial index definitions.

A current use case for dynamic fields includes support for the DSE column type titled, map. DSE maps are effectively collections of key value pairs.

In the example shown, any key value in the map that begins with the literal string, address_, will automatically be indexed, in this case, as a StrField.

If the map contains keys titled, address line 1, address line 2, then all elements of this map are then DSE Search indexed.
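The prefix rule can be sketched as a pattern lookup. This is an illustration only: the pattern address_* and the StrField type come from the description above, while the exact key spellings (address_line_1 and so on), the phone key, and the helper resolve_dynamic_type are assumptions.

```python
from fnmatch import fnmatchcase

# Dynamic field rules as (name pattern, field type) pairs.
DYNAMIC_FIELDS = [("address_*", "StrField")]


def resolve_dynamic_type(key, dynamic_fields=DYNAMIC_FIELDS):
    """Return the field type of the first pattern the map key matches, else None."""
    for pattern, type_name in dynamic_fields:
        if fnmatchcase(key, pattern):
            return type_name
    return None


# Hypothetical map column value; the key spellings are assumed.
address_map = {
    "address_line_1": "1 Main Street",
    "address_line_2": "Suite 100",
    "phone": "555-0100",
}

for key in address_map:
    print(key, "->", resolve_dynamic_type(key))
```

In this sketch the two address_ keys resolve to StrField and the phone key resolves to nothing, because it does not begin with the literal string address_.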

A comment on the slide cautions about resource consumption. While this example was for address, presumably other use cases, like tags on postings, can be delivered using a single standard TextField that you then tokenize. Much less costly.

It happens that you might want more than one treatment applied to a given DSE Search indexed column. For example, you might want to greatly favor exact string matches, but also support synonym search, or sounds like searches.

In this case, you would use a copyField.

In the example shown, the DSE table column titled, title, is DSE Search indexed first as a StrField. This means that the value inside the DSE column title, will not be tokenized. Searching for a movie titled, "The Flash", exact match, will yield extremely accurate results.

The second line of code in this example, adds metadata for a second DSE Search index field titled, title_tt, of type TextField. At this point in the example, we can't be certain if title_tt is just a normally sourced column found to exist in the DSE table proper.

The field type to title_tt is not displayed, and could include an analyzer to apply sounds like, stemming, synonyms, whatever.

It is the last line in this example where we see that the DSE table column, title, is sourced to populate (copyField) title_tt.

This is one of the derived key value applications we spoke about earlier, albeit a simple one.  In this case, we derive title_tt from title.

Lastly, to be verbose, title allows sorting via the presence of the docValues attribute, whereas title_tt does not and cannot; title_tt being tokenized as a TextField.
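Since the slide is not included in this transcript, here is a sketch of the three lines it appears to describe, embedded as XML and applied to one sample row. The field type names StrField and TextField, the attribute values, the sample row, and the helper apply_copy_fields are assumptions for illustration, and the lowercase-and-split step merely stands in for whatever analyzer the title_tt field type actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the three lines described above.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <fields>
    <field name="title" type="StrField" indexed="true" docValues="true"/>
    <field name="title_tt" type="TextField" indexed="true"/>
  </fields>
  <copyField source="title" dest="title_tt"/>
</schema>
"""


def apply_copy_fields(xml_text, row):
    """Populate each copyField destination from its source column value."""
    root = ET.fromstring(xml_text)
    doc = dict(row)
    for rule in root.iter("copyField"):
        doc[rule.get("dest")] = row[rule.get("source")]
    return doc


doc = apply_copy_fields(SCHEMA_XML, {"id": 1, "title": "The Flash"})

# title stays whole (string field); title_tt is tokenized (text field).
title_value = doc["title"]
title_tt_tokens = doc["title_tt"].lower().split()

print(title_value)
print(title_tt_tokens)
```

The table row still has only the title column; title_tt exists only on the index side, which is what makes it a derived key value.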
