This article is the first in a series about how relevance scoring works in Manticore Search. We start with the basics of relevance: term frequency and term specificity.
Simple introduction to scoring
Search engines like Sphinx/Manticore Search take text from documents, split it into tokens, and allow searching for a bag of tokens/words over this collection.
In the simplest scenario, this collection contains the tokens/words (in what we call the index dictionary) along with, for each word, the list of documents where it was found and the document fields in which it appears.
The search first checks whether each input word exists in the collection, extracts its document list, and then intersects it with the lists of the other words to compile a list of documents that contain all words from the input.
If more than one document meets this criterion, the next step is to differentiate between them: which of these documents is more relevant for the given input?
A first, simple differentiator is to count the occurrences of each input word in the document. Do more occurrences mean the word is more relevant? It depends.
A word can be found many times in a document, which suggests the word is important to that document, but the same word can also appear in many other documents in the collection, which raises the question of whether it is just a common word whose importance should be diminished.
If we count the occurrences (frequency) of a word in a document, a measure known as the term frequency (TF), we will want to suppress the possible effect of these common words on our score. For this we add another counter that tells us in how many documents (out of the total) the word is found: the document frequency. Its inverse is known as the inverse document frequency (IDF) and is a measure of the specificity of a term.
Together these two form TF-IDF, which tells us how important a word is to a document in a collection. Based on IDF we can also compute another widely used ranking function, BM25.
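For reference, the classic textbook form of these measures looks like this (Manticore's internal variants differ slightly, as we will see below):

```latex
\mathrm{idf}(t) = \log\frac{N}{n_t},
\qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

where N is the total number of documents in the collection, n_t is the number of documents containing the term t, and tf(t,d) is the number of occurrences of t in the document d.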
BM25 and TF-IDF
How are these factors implemented, and how can they be used in Manticore Search?
To start with, there are several built-in rankers (scoring formulas) that can be used right away. But the most important is the expression ranker, which allows you to access these factors and create custom formulas.
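As a minimal sketch (the index name and the formula itself are placeholders; any expression built from the factors discussed below will work), a custom formula is passed per query through the expression ranker:

```sql
-- sum() aggregates a field-level factor over all matched fields;
-- bm25 is a document-level factor
SELECT id, WEIGHT() FROM myindex
WHERE MATCH('hello world')
OPTION ranker=expr('sum(tf_idf)*10000 + bm25');
```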
We mentioned earlier the BM25 ranking function, which in Manticore is part of the default ranker (proximity_bm25) and can also be used as a document-level factor in ranking expressions.
We're not going to go into the details of BM25, as there are plenty of scholarly papers about it; instead we'll focus on the Manticore implementation.
What must be noted about the bm25 factor is that the name is a bit misleading: it is not a traditional implementation of BM25, but rather an estimate equivalent to BM25(k1=1.2, b=0), which in some papers is called BM15.
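For reference, the standard BM25 score of a document d for a query q is:

```latex
\mathrm{BM25}(d,q) = \sum_{t\in q}\mathrm{idf}(t)\cdot
\frac{\mathrm{tf}(t,d)\,(k_1+1)}
     {\mathrm{tf}(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```

With b=0 the document-length normalization (the |d|/avgdl part) drops out entirely, which is the BM15 simplification that the bm25 factor estimates.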
You can still use the classic BM25 and even tune the k1 and b parameters with the bm25a(k1, b) ranking function. An even more advanced version, bm25f(k1, b, {field=weight, ...}), also allows setting per-field weights. Note that this is different from the regular field_weights option (more about this later) and requires the index_field_lengths index option.
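Assuming a hypothetical index with title and body fields (and built with index_field_lengths enabled), the two functions can be used like this:

```sql
-- classic BM25 with tunable k1 and b
SELECT id, WEIGHT() FROM myindex
WHERE MATCH('hello world')
OPTION ranker=expr('10000 * bm25a(1.2, 0.75)');

-- BM25F with per-field weights; requires index_field_lengths
SELECT id, WEIGHT() FROM myindex
WHERE MATCH('hello world')
OPTION ranker=expr('10000 * bm25f(1.2, 0.75, {title=3, body=1})');
```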
At field level, TF-IDF can be accessed in several ways. The **tf\_idf** factor is the sum of the idf values of each occurrence of every matched keyword in the field (equivalent to summing tf*idf over the matched keywords). In addition, **sum\_idf** is the sum of the idf values of the matched keywords (so not multiplied by occurrences), while **max\_idf** and **min\_idf** provide the biggest, respectively the smallest, idf among the matched keywords.
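These are field-level factors, so in an expression ranker they must be wrapped in a field aggregation function such as sum(). A hypothetical example combining two of them:

```sql
-- per field: occurrence-weighted tf_idf plus the strongest keyword's idf,
-- summed over all matched fields
SELECT id, WEIGHT() FROM myindex
WHERE MATCH('hello world')
OPTION ranker=expr('sum(tf_idf + max_idf)*10000');
```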
The IDF calculation can be controlled by the **idf** setting of the OPTION clause. In the initial implementation (pre Sphinx 2.1), IDF values were normalized: common words (found in more than 50% of the documents) got a penalty (idf='normalized'), with the IDF value ranging within [-log(N), log(N)]. In this mode, a match on a "common word + not-so-common word" would get a lower IDF sum than a match containing only the not-so-common word. To overcome this (if desired), a new mode (idf='plain') was added in which there is no penalization and the IDF ranges within [0, log(N)]. On top of this, the IDF value is divided by the query keyword count so that the tf_idf sum fits into the [0, 1] range.
This introduced a problem with boolean queries: a search for "found-word" vs. "found-word OR not-found-word" would produce different weights on the same result set because of the division.
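To see why, take a single matched term t whose idf is divided by the query keyword count:

```latex
\text{query } t:\ \frac{\mathrm{idf}(t)}{1}
\qquad\qquad
\text{query } t\ \mathrm{OR}\ u:\ \frac{\mathrm{idf}(t)}{2}
```

Even though u matches nothing, its mere presence in the query halves the weight of every document in the result set.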
To fix this, the idf option received another pair of settings that control whether the division is applied or not. The default remains idf='normalized,tfidf_normalized' for compatibility reasons, but idf='plain,tfidf_unnormalized' should provide better results in the general case.
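In a query, the option is combined with the ranker like this (the index name is hypothetical):

```sql
-- 'plain' disables the 50%-common-word penalty;
-- 'tfidf_unnormalized' disables the division by query keyword count
SELECT id, WEIGHT() FROM myindex
WHERE MATCH('hello world')
OPTION ranker=expr('sum(tf_idf)*10000 + bm25'),
       idf='plain,tfidf_unnormalized';
```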
What’s next?
At this point we can create a score based on the frequency and specificity of the words, but this only tells us that the words are in the document, not whether they are grouped next to each other or simply spread across the text. This leads to another thing we need to add to the collection: the positions of the words in the text.
Having the positions allows us to know whether the input words match a field of the document exactly, whether they are found verbatim, how close to or far from each other they are, and how close they are to the beginning of the text (as we can consider a match near, or even at, the start of the text to be more relevant than one at the end of the document).
As a document can have more than one field, the importance of the fields can differ as well. For example, if we have 'title' and 'body' fields, matches in 'title' are seen as more important and should get a higher score. But more about this in a future article.