Fuzzy matching and 2nd pass query

Published: Feb 16, 2018

Many customers that we have helped with integrating search into their applications wanted their search to be more intelligent than just strictly matching a query with documents.
There are many ways to do this. Manticore Search makes it very easy, as fuzzy matching is included out of the box. It consists of three main components:

1. Quorum operator:

"computing and technology news"/2This means that at least two words of the phrase should match, i.e. this query would find texts containing both “computing news” and “technology news”.

2. Proximity search operator:

"computing news"~3 This means that there can be less than N non-matching words between the words from the query. Here is few examples given the text is a b c d e f g h:

"a h"~7 would find that since b c d e f h are 6 non matching words and this is less than 7
however "a h"~6 wouldn’t find the text
"a d h"~6 would find that
"a d h"~5 wouldn’t

3. NEAR operator

computing NEAR/6 "technology news"

The above proximity operator only works on sets of keywords. NEAR is more generic and can accept arbitrary subexpressions as its two arguments, matching the document when both subexpressions are found within N words of each other, no matter in which order.

The above query would:

match with a document containing computing is a popular topic in technology news, since the 1st word and the phrase are found within 6 words
but would not match with computing nowadays is a popular topic in* technology news since the gap is already 6 and exceeds the limit set in the NEAR operator.

2nd pass logic

Many of our customers use the second pass logic: this is when you first run a stricter query and then if nothing is found or not enough results are returned, the second less strict query is issued. There also may be more advanced logic with 3rd and 4th passes. It just depends on your requirements and whether or not you want your user to at least find something in any case. Or, on the contrary, you can let them find something that exactly matches his query.

Sometimes it makes sense to do it the other way around and make the query stricter. For example, if your default matching strategy is ‘any word should match’ and your application doesn’t have any extended syntax to allow users to specify the best query themselves, it makes sense to first try the ‘all words should match’ strategy or even the ‘phrase should match’. This might significantly increase the quality of the search.

In some applications it makes sense to parallelize the 1st/2nd pass queries and so on. This can be easily done using Manticore Search multiquery. The logic here is to do the 2nd pass query beforehand and if nothing is found by the 1st pass query the results will be ready and since the queries were done simultaneously this improves performance. However, this depends on a lot of things. You should be careful, as it can reduce performance sometimes. These things are:

What hardware are you using? If it’s not powerful enough to handle two queries at once or the response time is close to the response is double that of a single query, it makes little sense to use this technique.
What is your load? If your Manticore Search instance/server is already heavily loaded, you will get a worse response time.
What are your statistics? If for 99% of queries the 1st pass the query returns results, there’s a little point to make the 2nd pass query along with the 1st one, this will just be a waste of resources.