Document Scoring

DSE Version: 6.0

Video

Transcript: 

Hi, I'm Joe Chu and in this video we will be talking about document scoring.

When it comes to text searching, how does the database know what data represents a good match for a search query? It's pretty straightforward when you're trying to match the entire string in a column. If we use our mpaa rating column in our KillrVideo videos table as an example, there is no ambiguity regarding what you are looking for.  There's only a certain number of possible values, and they are all extremely short, being no more than one or two characters at most.

On the other hand, determining the results for a search query gets more complicated when the stored text is longer and consists of multiple words, as in the case of the title column. If searching for a particular word in the title, it's possible that there are several rows that contain the word. In that case, which of the rows should return in the search results, and what order should the rows be in? Generally the rows should be returned in an order that is most relevant to the search query. DSE Search can return results that are relevant to the search query by using a document score that is calculated for each row, based on the provided search query. Once the scores are calculated for each row, they are sorted so that the rows with the highest score, which are the most relevant to the search query, are returned first.

The handling of the document score is done at the Lucene level in DSE Search, which uses two computation models to determine and rank results.  These two models are the Boolean model, and the vector space model.

The Boolean model is the first one used when determining the score for a document, and is an inverse index made of the different unique terms across all values in a column or field, and contains a bit vector that indicates whether the term exists in a particular document or row.

In the example here, there are several terms that are listed. 4 of these terms can be found in document 0, but only 3 in document 1, 2 in document 2, and so forth. With this index, it is very efficient to determine whether a document is a match for a search query by masking the bits used for the search term.

As an example, let's say that there is a search query which is looking for documents that contain the term graceful, or the term ponies. Using the inverse index, you can find the bits assigned to each document for each of the search terms and perform an OR operation on the bits, which will indicate whether the document is a match for the search query. Here it turns out that there are 10 documents, out of 20, that matches the search query.

The next question is, how do we determine what order these documents are returned? That is what the vector space model is for.

The vector space model is a way to model both the query and documents as a vector, using the terms as different dimensions. In this way, it is possible to calculate which documents are closest to the query itself.

A numeric value is also calculated for every document, using a scoring system called tf-idf. Just a quick note, we won't be getting into too much of the details and the math here, but we will try to explain the various parts so that you can understand what gets factored into the score and why.

The factors that make up the score for a document is made up of several components. We mentioned tf-idf just now, which stands for term frequency and inverse document frequency. There is also the coordination factor, and the document length normalization factor. All of these together help to make up the formula used to calculate the document score.

The first part of tf-idf is the term frequency, which calculates how important a term is in a document, based on how often the term appears in the document. The idea is that if a term appears often in a document, then that term probably has some special significance and should be weighed higher if searching for the term.

The example below shows some documents that contains the term "trot", and shows the number of occurrences for the term, as well as the possible tf score for the document and that particular term.  Again, note that we will not be getting into the details on how the tf score is calculated, but can be considered accurate here.

The other part of tf-idf is the inverse document frequency.  This is a score that is used to measure the importance of a term, depending on how often it can be found in documents. If we consider document frequency, that would be the number of documents that a particular can be found in, over the total documents in a table. The inverse document frequency would roughly flip that around as the total number of documents, over the number of documents that contains the term.  The meaning of using the inverse document frequency is so that we can assign a weight to each of the terms, depending on how often that term can be found.

In the example below, the term "unicorn" can be found in only one document out of 20. Since it is so rare, the inverse document frequency places more importance on that document. However if the search query is also searching for the term "the", it would not care about this as much, since it is quite commonly found, with 17 out of 20 documents containing that term. You can see then that the idf score is much higher for the term unicorn, then the score for the word "the", which will then go on to affect the overall score as well.

Another part of the score calculation is the coordination factor, which is a value that is calculated as the fraction of the number of search terms that can be found in a document. This is useful since there can be documents that can be a match for a search query, even if not all of the search terms can be found in the document. However, the more search terms that can be found, the higher the coordination factor will be.

The example show here is a search query that is looking for the term trot, or the term ponies. As long as document has either one of these two terms, it will be a match.

Document 0 actually has both of these terms, and the coordination factor is higher than document 5, which contains only one of the terms.

The last component in the score is the document length normalization, or field normalization. This is a mechanism that modifies the score for a document based on the length of string that is stored. The more terms there are in the document, the lower the document length normalization will be.

In the example below, document 0 has a fairly small string, made up of 9 terms. On the other hand document 15 is rather long with over 30 terms. The document length normalization value for each of these documents are adjusted accordingly, with document 0 having a larger value than document 15.

There are several other ways that a document score can be affected as well. There is  the index-time boosts, which is a way to boost the score for a newly inserted or updated document, when it is being inserted. This course does not talk about this in any further detail since it has been deprecated and no longer supported in current versions of Solr.

The other way is query time boosts, which can modify document scores on a per-query basis. This can be in the form of a term boost, where certain terms may be given a higher score compared to other terms, or with a function query, which is a more programmatically way to influence or modify scores for documents. Both of these will be discussed in further detail in other videos.

No write up.
No Exercises.
No FAQs.
No resources.
Comments are closed.