
September 18, 2018

Relevance scoring in Manticore: part I

Simple introduction to scoring

Search engines like Sphinx/Manticore Search take text from documents, split it into tokens and allow searching for a bag of tokens/words over this collection.
In the simplest scenario this collection contains the tokens/words (in what we call the index dictionary) and, for each word, a list of the documents where it was found and the document fields in which it appears.
The search first checks whether each input word exists in the collection, extracts its document list and merges it with the lists of the other words to compile a list of documents that contain all words from the input.
If we have more than one document matching this criterion, the next step is to differentiate between them. Which of these documents is more relevant for the given input?
A first and simple differentiator is to count the occurrences of each input word in the document. Do more occurrences mean a more relevant word? It depends.
A word can be found many times in a document – which suggests the word is important to that document – but the word can also appear in a lot of other documents from the collection, which raises the question of whether it is just a common word whose importance should be diminished.
The count of a word’s occurrences in a document is the “term frequency” (TF). When scoring by it, we want to suppress the possible effect of common words, so we add another counter that tells us in how many documents (compared to the total) the word is found – the document frequency. Its inverse, known as the inverse document frequency (IDF), is a measure of the specificity of a term.
Together these two form TF-IDF, which tells us how important a word is to a document in a collection. Based on IDF we can also compute another widely used ranking function – BM25.
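As a toy illustration of these two counters – the textbook definitions, not Manticore’s internal formula – TF-IDF can be sketched in a few lines of Python:

```python
import math

# A toy collection; each document is a bag of tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "the", "cat"],
]

def tf(word, doc):
    # Term frequency: occurrences of the word in the document.
    return doc.count(word)

def idf(word, docs):
    # Inverse document frequency: log of total document count
    # over the number of documents containing the word.
    n = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" appears in every document, so its idf (and tf-idf) is zero;
# "dog" is rarer and scores higher.
print(tf_idf("the", docs[0], docs))  # 0.0
print(tf_idf("dog", docs[1], docs))  # log(3/1) ≈ 1.0986
```

Note how the common word is neutralized entirely: no matter how many times “the” occurs, its zero IDF erases its contribution.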

BM25 and TF-IDF

How are these factors implemented, and how can they be used in Manticore Search?
For a start, there are several built-in rankers (scoring formulas) that can be used right away. But the most important is the expression ranker, which gives access to these factors for creating custom formulas.
We mentioned earlier the BM25 ranking function, which in Manticore is included in the default ranker (proximity_bm25) and can also be used as a document-level factor in ranking expressions.
We’re not going to go into the details of BM25, as there are plenty of scholarly papers about it; instead we’ll focus on the Manticore implementation.
What must be noted about the bm25 factor is that the name is a bit improper, as it is not a traditional implementation of BM25 but rather an estimated equivalent of BM25(K1=1.2, B=0) – which in some papers is called BM15.
You can still use the classic BM25 and even tune the k1 and b parameters with the bm25a(k1,b) ranking function.
An even more advanced version, bm25f(k1, b, {field=weight, …}), also allows setting weights per field. Note that this is different from the regular field_weights (more about this later) and requires the index_field_lengths option.
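For intuition, here is a sketch of the classic per-term BM25 formula whose k1 and b parameters bm25a exposes. This is the textbook form, not a copy of Manticore’s source, but it shows why b=0 turns BM25 into the length-blind BM15 mentioned above:

```python
import math

def bm25_term(tf, idf, doc_len, avg_len, k1=1.2, b=0.75):
    # Classic BM25 contribution of a single term:
    # idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)

idf = math.log(100 / 10)  # term found in 10 of 100 documents

# With b = 0 the document-length normalization disappears (the BM15
# case): long and short documents get the same score for the same tf.
print(bm25_term(tf=3, idf=idf, doc_len=500, avg_len=100, b=0))
print(bm25_term(tf=3, idf=idf, doc_len=50, avg_len=100, b=0))

# With the default b = 0.75, the longer document is penalized.
print(bm25_term(tf=3, idf=idf, doc_len=500, avg_len=100))
```

The b parameter controls how strongly document length is penalized, while k1 controls how quickly repeated occurrences saturate.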
At field level, TF-IDF can be accessed in several ways. The `tf_idf` factor is the sum of idf values for each occurrence of a matched keyword in the field (so each keyword’s idf is effectively multiplied by its occurrence count). In addition, `sum_idf` is the sum of idf values over the matched keywords (not multiplied by occurrences), while `max_idf` and `min_idf` provide the biggest and the smallest idf value found, respectively.
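To make the relationship between these four factors concrete, here is a toy model in Python; the keyword names, idf values and occurrence counts are invented for illustration:

```python
import math

# Hypothetical per-keyword data for one field of one matched document:
# each matched keyword with its idf and its occurrence count.
matched = {
    "manticore": {"idf": math.log(100 / 2),  "occurrences": 3},
    "search":    {"idf": math.log(100 / 40), "occurrences": 1},
}

# tf_idf: every occurrence contributes its keyword's idf.
tf_idf = sum(k["idf"] * k["occurrences"] for k in matched.values())

# sum_idf: each matched keyword counted once, regardless of occurrences.
sum_idf = sum(k["idf"] for k in matched.values())

# max_idf / min_idf: extremes among the matched keywords' idf values.
max_idf = max(k["idf"] for k in matched.values())
min_idf = min(k["idf"] for k in matched.values())
```

In other words, `tf_idf` rewards repetition while `sum_idf` only rewards keyword coverage; the two coincide when every matched keyword occurs exactly once.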
The IDF calculation can be controlled by the `idf` setting of the OPTION clause. In the initial implementation (pre Sphinx 2.1) IDF values were normalized, meaning common words (found in more than 50% of documents) got a penalty (idf='normalized'). The IDF value ranges between [-log(N), log(N)]. In this mode, a match of a “common word + not-so-common word” would get a lower IDF sum than a match containing only the not-so-common word. To overcome this (if desired), a new mode (idf='plain') was added in which there is no penalty, with the IDF value ranging between [0, log(N)]. On top of this, the IDF value is divided by the query keyword count so that the tf_idf sum fits into the [0,1] range.
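A rough sketch of the two modes, reconstructed from the ranges stated above (Manticore’s actual formulas may apply additional scaling):

```python
import math

N = 1000  # documents in the collection

def idf_plain(n):
    # idf='plain': non-negative, ranging over [0, log(N)].
    return math.log(N / n)

def idf_normalized(n):
    # idf='normalized': words found in more than half of the documents
    # get a negative value; the range is [-log(N), log(N)].
    # (Classic probabilistic idf; the engine's exact scaling may differ.)
    return math.log((N - n + 1) / n)

rare, common = 3, 800          # document counts for two words
print(idf_normalized(common))  # negative: the common word is penalized
print(idf_plain(common))       # still non-negative in plain mode
```

With these definitions, adding a very common word to a query can lower the total score in normalized mode, but never in plain mode.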
This brought a problem with boolean queries: a search for “found-word” vs “found-word OR not-found-word” would produce different weights on the same result set because of the division.
To fix this, the `idf` option received another pair of settings that control whether the division is applied. The default `idf` setting remains normalized,tfidf_normalized for compatibility reasons, but idf='plain,tfidf_unnormalized' should provide better results in general cases.
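The effect of the division can be shown with a small, purely illustrative calculation; the weight function below is hypothetical and only mimics the “divide by query keyword count” step, not the full ranker:

```python
import math

N = 1000
idf = math.log(N / 10)  # idf of the one word that actually matches

def weight(matched_idfs, query_keyword_count, normalized=True):
    # With tfidf_normalized, each keyword's idf contribution is divided
    # by the total number of keywords in the query.
    div = query_keyword_count if normalized else 1
    return sum(i / div for i in matched_idfs)

# Same matching document, two logically equivalent boolean queries:
one_word = weight([idf], 1)   # "found-word"
with_or = weight([idf], 2)    # "found-word OR not-found-word"
print(one_word, with_or)      # the OR query scores only half as much

# With tfidf_unnormalized the division is skipped and both agree.
print(weight([idf], 2, normalized=False))
```

This is exactly the inconsistency described above: the divisor grows with the query’s keyword count even when the extra keywords match nothing.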

What’s next?

At this point we can create a score based on the frequency and specificity of the words, but this only tells us that the words are in the document, not whether they are grouped next to each other or simply spread across the text. This leads to another thing we need to add to the collection: the positions of the words in the text.
Having the positions allows us to know whether the input words match an exact field of the document, whether they are found verbatim, how close or distant they are from one another, and how close they are to the beginning of the text (as we can consider that being closer to, or even starting, the text is more relevant than being at the end of the document).
As a document can have more than one field, the importance of the fields can differ as well. For example, if we have ‘title’ and ‘body’ fields, matches in ‘title’ are seen as more important and should get a higher score. But more about this in a future article.
