Fuzzy matching and 2nd pass query

Many customers that we have helped with integrating search into their applications wanted their search to be more intelligent than just strictly matching a query with documents.
There are many ways to do this. Manticoresearch makes it very easy, as fuzzy matching is included out of the box. It consists of three main components:

1. Quorum operator:

"computing and technology news"/2
This means that at least two words of the phrase should match, i.e. this query would find texts containing both “computing news” and “technology news”.

2. Proximity search operator:

"computing news"~3
This means that there can be less than N non-matching words between the words from the query. Here is few examples given the text is “a b c d e f g h” :
  • a h”~7 would find that since “b c d e f h” are 6 non matching words and this is less than 7
    however “a h”~6 wouldn’t find the text
  • “a d h”~6 would find that
  • “a d h”~5 wouldn’t

3. NEAR operator

computing NEAR/6 "technology news"

The above proximity operator only works on sets of keywords. NEAR is more generic and can accept arbitrary subexpressions as its two arguments, matching the document when both subexpressions are found within N words of each other, no matter in which order.

The above query would:

  • match with a document containing “computing is a popular topic in technology news”, since the 1st word and the phrase are found within 6 words
  • but would not match with “computing nowadays is a popular topic in technology news” since the gap is already 6 and exceeds the limit set in the NEAR operator.

2nd pass logic

Many of our customers like the second pass logic when the first query to Manticoresearch is strict. If nothing is found or not enough results are returned, the second less strict query is issued. There also may be more advanced logic with 3rd and 4th passes. It just depends on your requirements and whether or not you want your user to at least find something in any case. Or, on the contrary, you can let them find something that exactly matches his query.

Sometimes it makes sense to do it the other way around and make the query stricter. For example, if your default matching strategy is ‘any word should match’ and your application doesn’t have any extended syntax to allow users to specify the best query themselves, it makes sense to first try the ‘all words should match’ strategy or even the ‘phrase should match’. This might significantly increase the quality of the search.

In some applications it makes sense to parallelize the 1st/2nd pass queries and so on. This can be easily done using Manticoresearch multiquery. The logic here is to do the 2nd pass query beforehand and if nothing is found by the 1st pass query the results will be ready and since the queries were done simultaneously this improves performance. However, this depends on a lot of things. You should be careful, as it can reduce performance sometimes. These things are:

  • What hardware are you using? If it’s not powerful enough to handle two queries at once or the response time is close to the response is double that of a single query, it makes little sense to use this technique.
  • What is your load? If your Manticoresearch instance/server is already heavily loaded, you will get a worse response time.
  • What are your statistics? If for 99% of queries the 1st pass the query returns results, there’s a little point to make the 2nd pass query along with the 1st one, this will just be a waste of resources.
The article is based on Sphinx in Action: Fuzzy matching and 2nd pass query” by Sergey Nikolaev https://www.ivinco.com/blog/sphinx-in-action-fuzzy-matching-and-2nd-pass-query/, updated in cooperation with the author and publication is authorized by him.

2 thoughts on “Fuzzy matching and 2nd pass query

  • Another type of ‘fuzzy’ matching, not mentioned is allowing for misspellings etc (or at least mismatches between how user types word, and how the word appears in index data). That might be using morphology and stemming. Or even soundex/metaphone. and expand_keywords with infix/prefix index. Again 2nd-pass, could be allowing matches with stemming if first ‘strict’ query not got much.

    Another feature is ‘CALL SUGGEST’ to look for actual misspellings. Could run that on each word from the query, and see if any similar words that (for example have many more matches). Either do that semi-automatically in the 2nd-pass. Or could just prompt the user with a Google style ‘Did you mean?’.

  • We use a multi-pass approach for our myHealthbox (myhealthbox.eu) search engine:

    In order we use:

    1) strict query
    2) proximity operator
    3) quorum operator
    4) we then repeat the above removing any AND clauses in the query (i.e. language=x and country=y clauses)

    We also use CALL SUGGEST but only after the strict query if the search string is made of just 1 term, the SUGGEST operator does not support a phrase at the moment and calculating all possible combinations for all terms (original and suggested) would just complicate things exponentially.

    It all works nicely for us and it is fast, we have not tried parallelizing the queries with multiquery (as this would be a fixed overhead while we run extra queries only if the previous one fails to produce results in a cascade fashion)

    Just our experience …

Leave a Reply