Lucene

DSE Version: 6.0

Video

Transcript: 

Hi, my name is Joe Chu, and welcome to Lucene Indexes. In past videos, we've talked a bit about how to create indexes, where they are stored, and even a bit about the calculations done to generate scores for each of our documents. In this video though, we'll be talking about what actually gets indexed, and how those indexes allow us to look up terms in a document field.

As we have mentioned throughout the course, the technology behind the indexing engine in DataStax Enterprise Search is Apache Lucene. It is Lucene that allows us to create indexes on different columns and to perform different types of searches using those indexes.

The Lucene index is made up of index segment files, where the combined total of the segment files makes up the search index for a table. Each segment in turn is made up of several different indexes and data structures, which may be saved in different file types. As more data in the table is indexed, more index segments get created. The segments themselves are immutable, so index entries that are updated or marked as deleted are written to a new index segment.

If you're familiar with the Cassandra SSTable, the index segment is a very similar concept. Each segment would represent an "SSTable", which itself is actually made up of several different files. There are multiple files that make up a segment, each with a different purpose: some serve as an index of some kind, some store statistics, and others are bloom filters.

Before we take a look at the files that make up the Lucene index, let's focus on one of the main components of the index: the postings format, which is basically our inverted index. We generally say that the inverted index stores our values and points to the documents, or rows, that contain those values.

There are two main parts to the postings format: the terms dictionary and the postings list, both of which we'll get into in more detail a bit later. Aside from the postings format, there are several other formats that make up the index, including metadata, the LiveDocsFormat, which marks when a document has been deleted, and optional components such as the DocValuesFormat.

Presiding over all of these formats is the codec, a concrete class that implements the reading and writing of these formats and provides the API used by Lucene to interact with the index. There are many different versions of the codec, and it generally changes with each major release of Apache Lucene. However, you should not have to think too much about this; the default codec is sufficient and shouldn't be changed, in order to maintain compatibility with DSE Search.
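
If you're curious which codec is in play, here is a minimal sketch using the Lucene Java API (Codec.getDefault() is part of Lucene itself; the exact name printed depends on the Lucene version on your classpath):

```java
import org.apache.lucene.codecs.Codec;

public class CodecInfo {
    public static void main(String[] args) {
        // Prints the name of the default codec that Lucene will use when
        // writing new index segments on this classpath.
        System.out.println(Codec.getDefault().getName());
    }
}
```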

The first structure in the postings format that we'll be talking about is the terms dictionary. This is a sorted skip list that contains all of the unique terms that can be found in a field, across all of the indexed documents. There would be a separate terms dictionary for each indexed field.

For each unique term in the terms dictionary, there is the document frequency, which is the number of documents that contain the value in the field. In addition, there is also the postings list.

The postings list is also a sorted skip list, though in this case the list contains all of the document ids that contain a term. Note that the document id is an internal id that is generated for each document. This means that document IDs are reused and changed as documents are deleted, or as index segments are merged or optimized. Of course, for data that exists across replica nodes, the document ID will most likely be different as well.

The document ID itself is a 32-bit integer, which means there is a maximum number of documents that can be stored on a node in a single search index: approximately 2.1 billion documents (2^31 - 1, or 2,147,483,647). This limit can be worked around by adding more DSE Search nodes, which reduces the number of rows that need to be indexed on each node.

For each document ID, the postings list will also save the term frequency, which is the number of times the term appears in the document field, and an array with the term positions.

As an example, let's take a look at the postings that are generated when we insert three rows. Here we are indexing two fields, title and mpaa_rating. The three rows are titled The Adventures of Rocky & Bullwinkle, Adventures in Babysitting, and The Many Adventures of Winnie The Pooh, each with a different mpaa_rating. Each row will also eventually have a document id, which is shown here for visualization purposes. The text analysis that we use also helps to determine what the index will look like, so let's say we are tokenizing on spaces, with no filters.

First we'll take a look at the terms dictionary for the mpaa_rating field. Each term has an ordinal, which is basically a term ID, followed by the term itself.

Looking at our first term, G, it has a document frequency of 1 since it only shows up in one of the rows. The postings list shows the document id the term shows up in, the term frequency for the document, which is 1, and the term position, which is 0, since it is at the very beginning of the text.

The second term is PG, which also has a document frequency of 1. The postings list shows the document id of 0, term frequency of 1, and a term position of 0.

The last term is PG-13, with a document frequency of 1. The postings list shows the document id of 2, term frequency of 1, and a term position of 0.

Now let's look at a slightly more complicated example of a terms dictionary, the one built for the title field. Here you can see that there are terms that show up across multiple documents and can be found in different positions in the text.

For example, Adventures has a document frequency of 3, which means it is actually in the title of all 3 of our rows. The postings list reflects this, with 3 document ids and, for each one, a term frequency and the position where the term Adventures can be found. In document 0, the term is found once, as the second term in the title, The Adventures of Rocky and Bullwinkle, so its position is 1. Document 1 also has a term frequency of 1, but the term is at the very beginning of the title, Adventures in Babysitting, so its position is 0. Finally, document 2 has a term frequency of 1 and a term position of 2, meaning it is the third term in the title, The Many Adventures of Winnie the Pooh.

The other terms shown here are pretty straightforward. Again, we are assuming in this example that no filters are being used, so there are actually two entries for the term the: one that is capitalized and one that is not.
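
To make this concrete, here is a minimal, self-contained sketch of the same example using the Lucene Java API directly (assuming a Lucene 5.x/6.x-era classpath, in line with the codecs DSE 6 uses; the class name and setup are ours, for illustration only, since DSE Search builds these indexes for you). It indexes the three titles with a whitespace tokenizer and no filters, then walks the terms dictionary and postings for the title field:

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;

public class PostingsDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        // Tokenize on whitespace with no filters, as in the example above.
        IndexWriterConfig cfg = new IndexWriterConfig(new WhitespaceAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // Index the field with doc ids, term frequencies, and positions.
            FieldType ft = new FieldType();
            ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            ft.setTokenized(true);
            for (String title : new String[] {
                    "The Adventures of Rocky & Bullwinkle",
                    "Adventures in Babysitting",
                    "The Many Adventures of Winnie The Pooh" }) {
                Document doc = new Document();
                doc.add(new Field("title", title, ft));
                writer.addDocument(doc);
            }
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            // Walk the terms dictionary for the title field.
            Terms terms = MultiFields.getTerms(reader, "title");
            TermsEnum te = terms.iterator();
            for (BytesRef term = te.next(); term != null; term = te.next()) {
                System.out.println(term.utf8ToString() + "  docFreq=" + te.docFreq());
                // For each term, walk its postings: doc id, freq, positions.
                PostingsEnum pe = te.postings(null, PostingsEnum.ALL);
                while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    StringBuilder positions = new StringBuilder();
                    for (int i = 0; i < pe.freq(); i++) {
                        positions.append(pe.nextPosition()).append(' ');
                    }
                    System.out.println("  doc=" + pe.docID()
                            + " freq=" + pe.freq() + " positions=" + positions);
                }
            }
        }
    }
}
```

Running this prints each unique term with its document frequency, followed by one line per posting showing the document id, term frequency, and positions, matching the walkthrough above.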

We've mentioned DocValues quite a bit in different parts of this course, and you should already have an understanding that you can enable or disable them in the search index schema, and that they are used with sorting and faceting queries.

Let's also take a quick look at how they work. DocValues are a type of forward index, which is the inverse of an inverted index. In other words, this represents the standard definition of an index: it is keyed by an ID, the document ID in this case, and points to the various values associated with that ID. What makes it useful is that the values are typically compressed, which reduces the memory needed when running search queries with sorting or faceting that would otherwise have to iterate through all of the values in a field.

There are several different types of DocValues indexes, but these are transparent to the end user. The type used actually depends on the field type and whether the field is multi-valued.

An example of the simplest DocValues type is Numeric, which is used for the Trie fields, and can be visualized as an array. Each index in the array refers to the corresponding document id, and the element is the field's value, saved as a long value or some compressed form of it.

Sorted DocValues would be the type used for StrField and UUIDField, and would also be an array of long values. However, the element value represents the term as it is stored in another, separate dictionary map.

Sorted_set DocValues is again very similar, but the array elements would be a set of compressed integers that represent terms in a dictionary map. This type is used for pretty much any field type that is multi-valued.

There are other types that are not covered here, but the three that we've mentioned are the ones primarily used in DSE Search.
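
As a rough illustration of how these three types surface in the Lucene Java API, here is a minimal sketch (the field names are hypothetical, and in practice DSE Search adds these fields for you based on the search index schema):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.util.BytesRef;

public class DocValuesDemo {
    public static void main(String[] args) {
        Document doc = new Document();

        // Numeric: one long per document, keyed by document id.
        doc.add(new NumericDocValuesField("release_year", 2000L));

        // Sorted: the per-document long is an ordinal pointing into a
        // separate sorted dictionary of terms (one value per document).
        doc.add(new SortedDocValuesField("mpaa_rating", new BytesRef("PG")));

        // Sorted_set: a set of ordinals per document, for multi-valued fields.
        doc.add(new SortedSetDocValuesField("genres", new BytesRef("comedy")));
        doc.add(new SortedSetDocValuesField("genres", new BytesRef("family")));

        System.out.println(doc);
    }
}
```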

Some other noteworthy data structures that are created with Lucene indexes include the live documents, which keeps track of documents and whether they have been deleted or are still alive. Documents are not deleted in the traditional way, since each index segment is immutable. However, with the live documents structure, deleted documents can be marked as deleted and kept out of search results.

The norm is a structure that contains the length normalization and boost factor saved for each field and document. The values stored here are used in the relevancy score calculation that helps determine the ranking of documents in search results.

There may also be other structures included in the index, depending on the search components registered in the search index config.

Eventually the index segments are written to disk as various files. The default location for these files is /var/lib/cassandra/data/solr.data, but this can be changed in the dse.yaml configuration file. Since index segments are immutable, each is given an identifier, which is a base 36 integer; in the file name you would see this as the digits 0-9 followed by the letters a-z. The Lucene codec version used for the index is also included in the file name.
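
As a quick illustration of the base 36 naming (plain Java, nothing DSE-specific here), each digit counts up through 0-9 and then a-z:

```java
public class Base36Demo {
    public static void main(String[] args) {
        // Base 36 uses the digits 0-9 followed by the letters a-z, so
        // segment 35 is "z" and segment 36 rolls over to "10".
        System.out.println(Integer.toString(35, 36));   // z
        System.out.println(Integer.toString(36, 36));   // 10
        System.out.println(Integer.toString(1295, 36)); // zz
    }
}
```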

Instead of having to navigate through the filesystem to see the index files, there is a dsetool command, list_index_files, for viewing the files of a search index. Simply pass in the name of the search index, which should be the keyspace dot table name, and it'll return the individual files that make up the index segments, along with other details, like whether the index files are encrypted.
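
For example, for a hypothetical search index on a videos table in a killrvideo keyspace, the invocation would be dsetool list_index_files killrvideo.videos.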

In the table here, you can see all of the different file types that make up each index segment.

The Lucene field info and metadata is stored in a file with the extension .fnm.

The postings list is separated into several files, including the .doc and .pos files.

The terms dictionary, and an index into it, would be in the .tim and .tip files.

There are also bloom filters created for the above files, which are used to reduce unneeded access to those files.

After that, you have the DocValues, saved in the .dvd and .dvm files.

Finally, the live documents would be in the .liv file, and the norms in the .nvd and .nvm files.

The pound sign in the file name represents where the identifier would go. Again, this is a value represented with the digits 0-9 and the characters a-z.

As mentioned, the codec version is included in the file name, which for DSE 6 would default to the Lucene50 and Lucene54 codecs.

For viewers familiar with Apache Solr, one set of files you won't see in DSE is the .fdt and .fdx file types, which are used to hold stored field data. Of course, in DSE the field data is actually saved in the database, and is read from there when needed for search results.

After viewing this video, it's now time for some hands-on! In this exercise, you'll be taking a look at the various files that make up the index segment, and also examining what is stored in the index and how.

Try doing this exercise before moving on to the next unit.
