Text Analysis

DSE Version: 6.0

Video

Transcript: 

Hi, I'm Joe Chu and this is Text Analysis. In this video, we'll be taking a look at how text analysis is defined in your search index schema, and how it affects the way text searches are executed.

First, let's try to understand why text analysis is needed. Here we have some text which represents the description of a movie. We want to make sure that certain search queries are able to match this text.

For example, if someone searches for the term "after", that should match the word in the description here.

If there is a search query looking for the phrase "proton pack", that should match this part of the text that has that phrase as part of the term.

Finally a search query for the word ghostbusters should match the part of the text here.

The first challenge is that none of those search queries would match the complete text, which is generally how matches are found by default. Even assuming that we can tokenize the text and split it into individual words, you can still see that there are slight differences between the search term and what we would want it to match in the description: whether it is case sensitivity, additional words that appear as part of a term, or punctuation within or around a term.

We will need to process this text in such a way that DSE Search will still consider this text a match, even though it is not exactly the same as the search query. This is the purpose of text analysis.

You're probably thinking, this is great, so how do we do this text analysis thing? Again, this is where the search index schema comes into play. When you define a field type, particularly one of the class org.apache.solr.schema.TextField, you can then add another element, called analyzer, which allows you to further define how text analysis will occur for any field that uses this field type.

The easiest way to implement text analysis is to specify a class for the text analysis. One possible class you can use is the WhitespaceAnalyzer class. This class performs pre-defined analysis on the text value that is stored for a field in a document, which is to tokenize the text anywhere there is whitespace.
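To make that concrete, a field type using an analyzer class might look something like the following sketch. The field type name text_ws is just an illustrative placeholder, and the fully qualified class name shown is the one Lucene uses for its whitespace analyzer:

    <fieldType name="text_ws" class="org.apache.solr.schema.TextField">
      <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
    </fieldType>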

Using this analyzer, the illustration shows how the original text gets processed and tokenized into individual terms wherever it can be split on whitespace, and those resulting terms are then indexed. Note that these terms do not have any other processing done, so case sensitivity is still preserved, and punctuation is still kept, as in the case of "university," with its trailing comma.

There are other analyzer classes that you can use, which you can find in either the Lucene documentation or the API javadocs.

If the use of an analyzer class doesn't agree with you, then you will be pleased to know that you can customize your own analyzer through the use of two additional schema elements: the tokenizer and the filter. The tokenizer represents how you want the input text to be tokenized into individual terms, and the filter represents a way in which the individual terms can be processed.

The example below shows a schema definition that has a tokenizer, using the StandardTokenizerFactory class, and a filter, which uses the LowerCaseFilterFactory class. Don't worry about what these two classes do right now, we'll get into that in a short while.
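A minimal version of that kind of definition would look roughly like this; the field type name text_general is just an illustrative placeholder:

    <fieldType name="text_general" class="org.apache.solr.schema.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>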

In the schema that we just saw, the first element within the analyzer element is the tokenizer, which is what always processes the input text first. Only one tokenizer can be defined and used; it reads the input text and generates tokens based on the tokenizer class. When defining a tokenizer, you will always specify the class, which provides the tokenization logic. Some tokenizer classes may also require additional attributes, like in the case of the PatternTokenizerFactory here, which allows you to specify a regular expression pattern that determines where to tokenize.
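As a sketch, a pattern-based tokenizer definition might look like this; the particular regular expression shown here, which splits on commas and hyphens, is only an example:

    <tokenizer class="solr.PatternTokenizerFactory" pattern="[-,]\s*"/>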

There are some common tokenizers you should know and consider using. The first one is the StandardTokenizer, which we have seen in one of our earlier examples. This tokenizer generates a token anywhere it sees whitespace or punctuation, like commas, periods, hyphens, etc. The second is the WhitespaceTokenizer, which treats only whitespace as a delimiter. Finally there is the PatternTokenizer, which is what we saw in our most recent example, and uses a regular expression as the delimiter. Besides these three there are many more that are packaged with Search, so it's a good idea to check out the Solr documentation to see what else is available and what they do.

Here we have an illustration that demonstrates what the three tokenizers would do on our example input text. The standard tokenizer would generate the most tokens as it tokenizes anytime whitespace or punctuation is found, which is why you don't see any of the original punctuation in the resulting tokens.

The whitespace tokenizer produces slightly longer tokens as a result of keeping punctuation, so you see terms such as "proton-pack-toting" with its hyphens intact, "Ghostbusters" surrounded by double quotes, and so on.

The last example here shows the use of the PatternTokenizer, which has a regular expression that tokenizes only on dashes or commas. This results in extremely long tokens that include multiple words.

After the tokenizer, filters are used to manipulate or process the resulting tokens in some way. A filter uses a TokenStream as input, which would first come from a tokenizer; therefore it is not possible to define any filters without having a tokenizer. A filter also generates a TokenStream as its output, which means that it can then be consumed by another filter. You are able to define multiple filters in your analyzer, which will be chained together as the text is indexed.

We can briefly illustrate this here, where the input text goes through a tokenizer and then subsequently through three filters. You can see on the right-hand side here that you cannot start or use a filter without having a tokenizer preceding it.

Again, just like the tokenizer, the filter is defined within the analyzer element in the schema. The filter definitions must follow the tokenizer, and a class name must be specified for each filter, usually with a name ending in FilterFactory. The example below shows two different filters that are added to the analyzer for this TextField field type.
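As a rough sketch, an analyzer with two filters chained after the tokenizer might look like this; the field type name and the stopwords.txt file name are placeholders:

    <fieldType name="text_filtered" class="org.apache.solr.schema.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>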

There are over 40 different filters that come pre-packaged with Search, which I welcome you to study in more detail in the Solr documentation. For the purposes of this video though, we will only be looking at four of them. These are the LowerCaseFilter, the StopFilter, the SynonymFilter, and the PorterStemFilter. You can see here a brief description of each of these filters, but let's take a look at an example of each of these as well.

The first one we'll take a look at is the LowerCaseFilter. When tokens are processed through this filter, you can see that they are all transformed to use lower case characters. The token After, highlighted in red on the left, is capitalized, but after the lowercase filter it is in all lower case, as seen on the right.

The stop filter removes certain terms from being indexed, and is controlled with a user-provided list, which would be defined in the search index schema. Generally the list should include stop words: commonly used words in the English language that are not very useful to index, since a match would return a large number of documents. These include words such as the, and, a, and so forth. In the example here, the terms at and a are two stop words found in the stop word list, and we can see after the stop filter that they are not included in the resulting token stream to be indexed.
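A stop filter definition and its word list might look something like the following sketch; the file name stopwords.txt and the exact words listed are just examples:

    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>

    # stopwords.txt (one stop word per line)
    a
    an
    and
    at
    the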

The synonym filter provides a way to allow certain terms to match their synonyms. This filter is also controlled by a user-provided list. In this example, the list is a text file called synonyms.txt, and contains the following logic: any occurrence of the term academic or education should be removed and replaced with the term underpaid. The term prestigious will be removed and replaced with three new terms: awesome, cool, and lucrative. Finally, any occurrence of the term college will also add the term university, and vice versa. You can see the results of this filter by looking at the terms highlighted in red, green, and purple on either side of the diagram.
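The rules just described could be written in the synonyms file roughly like this; the filter definition is a sketch, and expand="true" is what makes the college/university mapping work in both directions:

    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

    # synonyms.txt
    academic, education => underpaid
    prestigious => awesome, cool, lucrative
    college, university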

The final filter we'll look at here is the porter stem filter. This filter replaces terms with their base form through a process called stemming. This allows terms in a search query to match similar words that share the same base. For example, the words academy and academic have similar meanings and share the same base, academ. If the term academic is indexed with the PorterStemFilter, then a search for the term academic or academy should be able to match. You can take a look at some other terms here and how the porter stem filter changes them in the output token stream.
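The filter itself is defined with just a class and no extra attributes, along these lines:

    <!-- e.g. "academy" and "academic" both reduce to the stem "academ" -->
    <filter class="solr.PorterStemFilterFactory"/>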

Text analysis occurs at two different times. One is during indexing, when the field value is analyzed and the resulting terms are then indexed individually for that field and document. The other is during querying, when the search terms are also analyzed so that they can then be cross-referenced with the index for the field being searched. Each search term is analyzed with the same field type analyzer that is used for the field when indexing, so it is possible that different terms will be analyzed differently if searching through multiple fields.

Our example here shows the text we are using as the value to be indexed. If you recall our three example search queries from the beginning of this video, we'll see how the search is executed, assuming of course that we have already set up the search index schema appropriately.

Starting with the field value, it would first go through the StandardTokenizer, converting the description into a stream of tokens.

Then it would go through the lowercase filter.

And then go through the PorterStem filter, with the resulting terms being indexed.
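Putting those steps together, the analyzer assumed for this walkthrough would look something like the following sketch; the field type name is a placeholder:

    <fieldType name="text_stemmed" class="org.apache.solr.schema.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>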

If we try to run the first search query, looking for the term after, that term will also go through the StandardTokenizer, lowercase filter, and PorterStem filter, which results in one term.

This term matches one of the terms that was indexed from our text input, as highlighted here in red, and therefore the document that contains this value would be included in the search results.

The next query, which searches for the phrase proton pack, also goes through the same analyzer, but results in two terms. Since it is a phrase, these two terms would have to be found in the same order in the index.

Which you can see it does, with the terms highlighted in red.

The last query, looking for the term ghostbusters, results in a single term, which would also be found in the index for the field value.

It is possible to define two different analyzers for a field type: one that would be used for indexing, and another to be used for querying. As shown in our example schema here, we have two separate analyzers, which are differentiated by the type attribute, with the first one marked as index and the second marked as query.
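A sketch of that kind of definition, with one analyzer for indexing and another for querying, might look like this; the particular filters shown (synonyms applied only at index time) are just one common arrangement rather than the exact schema on the slide:

    <fieldType name="text_split" class="org.apache.solr.schema.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>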

Before we end here, I also wanted to mention the analysis screen that you can use on the Solr Admin page. This tool can be used to simulate how an analyzer would work, breaking down the individual steps of the analyzer and showing how an indexed field value or a query value would look. This is particularly useful when a search query doesn't return a document you would have expected; you can then see which terms were able to match and which were not.
