Inverted Indexes

DSE Version: 6.0

Transcript: 

Hello. My name is Daniel Farrell, a staff member here at DataStax, and it is my extreme pleasure to carry you a bit farther into your investigation of DSE Search.

Previously, we detailed elements of the DSE Search runtime environment and a number of the queries you can run using both DSE Core and, now, DSE Search.

In this section, we move on to creating and then managing DSE Search indexes.

At the center of DSE Search text analytics is the inverted index.

We qualify this discussion with "DSE Search" because DSE Core uses a different index type, one best suited to what DSE Core means to deliver: constant-time lookups and an always-on operational capability.

And we qualify this discussion with "text analytics" because DSE Search also delivers scalar search and geospatial/spatial search capabilities. We saw this in a previous section, where we queried integers, date times, location data, and more using DSE Search.

Those are also DSE Search queries, as we will soon detail, but they are not text analytics, and they do not require the additional constructs we use specifically when searching text.

Here we see a diagram with two labels: rows and inverted index.

Rows is meant to represent a common database index: a primary key value of some sort is used to locate the whole of the record stored in a table. That is, a primary key value is used to locate the remainder of the descriptive values, the attributes, the non-key columns of any particular row.

Find the whole of an order row by order number: the order date, the order payment status, the order ship date, and so on.

Find the whole of a customer row by customer id: the customer first name, last name, the customer primary billing postal code, and so on.

The point is that the non-key portions of the row, the values, are retrieved by a unique identifying key value.

Contrast this with the expected use case for an inverted index.

With an inverted index, the values in a row, the descriptive elements, are stored as the index key values.

Take the text of the plot summaries for all movies, and suppose we want to find movies about a lawyer and a civil rights case, told from the perspective of a young child. And the movie should be based on an award-winning book.

Inverted indexes, and DSE Search, support such queries.

To aid in the indexing of text, DSE Search indexes are programmable in the way that they tokenize text; that is, in how the words in a sentence are recognized and stored, and how the tokens in a document are identified and indexed.

In an inverted index, the term lawyer is stored as a key value, along with all movies that have lawyer as a descriptive element. "Which movies" are the values (attributes) for the key value lawyer.

In a common database index (rows), the attribute lawyer would be stored in a distinct column.
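The contrast between the two layouts can be sketched in a few lines of Python. This is only a minimal illustration and not how DSE stores data; the movie ids, titles, keyword lists, and the helper `build_inverted_index` are all hypothetical.

```python
from collections import defaultdict

# Row-style storage: primary key -> the remaining columns.
# All ids, titles, and keywords here are invented for illustration.
movies_by_id = {
    101: {"title": "Movie A", "keywords": ["lawyer", "trial", "child"]},
    102: {"title": "Movie B", "keywords": ["cop", "chase"]},
    103: {"title": "Movie C", "keywords": ["lawyer", "cop"]},
}


def build_inverted_index(rows):
    """Map each keyword to the set of primary keys whose row contains it."""
    index = defaultdict(set)
    for movie_id, columns in rows.items():
        for keyword in columns["keywords"]:
            index[keyword].add(movie_id)
    return dict(index)


inverted = build_inverted_index(movies_by_id)

# Rows: start from the key, arrive at the values.
print(movies_by_id[101]["title"])   # Movie A

# Inverted index: start from a value, arrive at the keys.
print(sorted(inverted["lawyer"]))   # [101, 103]
```

The same data is held both ways; only the direction of the lookup changes.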

Why do we have to mention tokens at all? Isn't it obvious to index all words in a sentence?

Let's find out.

Here we have a number of movie IDs and text movie descriptions, which we use in the example that follows.

Take special notice of the tokens Australia, cop, dystopic, and crime-ridden.

As the previous example rows were inserted into DSE, their descriptive text was indexed by DSE Search, as configured.

Australia, as a token, was found to exist only in movie id 3. The token cop was found to exist in all movie ids. The token dystopic was found for movie ids 2 and 3.

Notice that all of the key values were folded to lower case, to aid in finding rows. Australia is a proper noun and should be capitalized, but what if the end user did not capitalize it in the query?

And what about cop? What if cop always appeared as the first word in every sentence (and was capitalized before it was indexed)?

Notice that the term crime-ridden was split on punctuation and indexed as two separate terms. Should crime-ridden have been indexed as one term? Would the end user have spelled crime-ridden correctly in all cases?

All of the above, and much more, is programmable (configurable) when using DSE Search.
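To make those choices concrete, here is a small Python sketch of a tokenizer with two switches: lower-case folding, and whether to split hyphenated words. It is a stand-in written for this transcript, not DSE Search's actual analyzer configuration; the function name `tokenize`, its parameters, and the sample sentence are all assumptions.

```python
import re


def tokenize(text, lowercase=True, split_hyphens=True):
    """Split text into tokens, with two illustrative (hypothetical) options."""
    if split_hyphens:
        # Every run of letters or digits is its own token.
        pattern = r"[A-Za-z0-9]+"
    else:
        # Keep hyphen-joined runs such as "crime-ridden" as one token.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    tokens = re.findall(pattern, text)
    return [t.lower() for t in tokens] if lowercase else tokens


# Invented sample sentence, not taken from the course data.
sample = "Cop patrols a crime-ridden Australia."

print(tokenize(sample))
# ['cop', 'patrols', 'a', 'crime', 'ridden', 'australia']

print(tokenize(sample, split_hyphens=False))
# ['cop', 'patrols', 'a', 'crime-ridden', 'australia']

print(tokenize(sample, lowercase=False))
# ['Cop', 'patrols', 'a', 'crime', 'ridden', 'Australia']
```

Whichever options are chosen at index time generally need to be applied to the query text as well, otherwise a query for "Australia" would never meet the indexed token "australia".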

And here we have a simple DSE Search query to find movies addressing dystopic-ness in their movie description.

This DSE Search query predicate could be lengthened to serve much more advanced queries, such as: not Australia, yes crime-ridden, and, most importantly, dystopic.

All from the same DSE Search index.

DSE Search will match, or exclude, all tokens as instructed by the query predicates, and produce a set of unique movie ids. It is at this point, as expected, that DSE Core will produce the full movie listing for each (release date, director, and so on) and return it to the calling function, presumably an end user application of some sort.
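That two-step flow, match tokens to get ids and then fetch the full rows by id, can be sketched in Python with plain set operations. This is an illustration under stated assumptions, not DSE's query engine: the postings for australia, cop, and dystopic follow the example above, while the postings for crime and ridden, the row contents, and the helper `search` are invented.

```python
# Token -> set of movie ids. "australia", "cop", and "dystopic" follow the
# example in the transcript; "crime" and "ridden" are assumed values.
postings = {
    "australia": {3},
    "cop": {1, 2, 3},
    "dystopic": {2, 3},
    "crime": {1, 2},
    "ridden": {1, 2},
}

# Stand-in for the table that holds the full rows (placeholder values).
movies = {
    1: {"title": "Movie 1", "director": "Director A"},
    2: {"title": "Movie 2", "director": "Director B"},
    3: {"title": "Movie 3", "director": "Director C"},
}


def search(postings, must=(), must_not=()):
    """Return ids that contain every `must` token and no `must_not` token."""
    if not must:
        return set()
    # Intersect the postings of all required tokens (returns a new set).
    result = set.intersection(*(postings.get(t, set()) for t in must))
    # Remove ids that contain any excluded token.
    for term in must_not:
        result -= postings.get(term, set())
    return result


# "not Australia, yes crime-ridden, and dystopic"
matches = search(postings, must=["dystopic", "crime", "ridden"], must_not=["australia"])
print(matches)  # {2}

# Second step: look up the full rows by primary key.
listing = [movies[movie_id] for movie_id in sorted(matches)]
print(listing)  # [{'title': 'Movie 2', 'director': 'Director B'}]
```

Because crime-ridden was indexed as two tokens in this example, the query asks for both crime and ridden.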
