Suggestions on phrases using a single SUGGEST call

In this article we discuss how a single CALL QSUGGEST can be used to correct phrases in particular cases.

CALL QSUGGEST was introduced in the last version of Sphinx 2.x. The statement finds close matches for an input word in the dictionary of an index that has infixing enabled. The most common use case for this feature is implementing “Did you mean …?” functionality.

Before QSUGGEST was introduced, implementing “Did you mean …?” meant extracting the words from the index dictionary and storing their trigrams in a separate index. A search based on the trigrams of the input word would then be performed against this index. As a follow-up, to improve the quality of the results, Levenshtein distances were calculated on the returned matches.
CALL QSUGGEST eliminates this work: it doesn’t require a separate index and it performs the Levenshtein calculation itself, removing not only the need for another index (which would also need to be refreshed from time to time), but also the extra code needed to implement the feature.
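
For illustration, the older two-step approach (trigram lookup followed by Levenshtein ranking) can be sketched in a few lines of Python. This is a simplified in-memory sketch, not how Sphinx or Manticore Search implement it: the function names and the tiny word list are made up for the example, and a real setup would query a trigram index rather than scan a list.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def suggest(word, dictionary, limit=5):
    """Keep words sharing at least one trigram, then rank them by edit distance."""
    query = trigrams(word)
    candidates = [w for w in dictionary if trigrams(w) & query]
    return sorted(candidates, key=lambda w: (levenshtein(word, w), w))[:limit]


if __name__ == "__main__":
    # Hypothetical dictionary, standing in for the words extracted from an index.
    words = ["craigslist", "craig_zisk", "army_list", "python"]
    print(suggest("craig_list", words))
```

The trigram step only narrows down the candidates; the final ordering comes from the edit distance, which is the part QSUGGEST now computes for us.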

Although CALL QSUGGEST accepts more than one word as input, it only gives suggestions for the last word and ignores the rest. Its counterpart, CALL SUGGEST, uses the first word instead and ignores the rest. If we want suggestions on multiple words, the first choice is to perform multiple QSUGGEST calls. However, there might be cases where our input is a specific term or taxonomy (think product SKUs or tags) made of more than one word, or a single word broken into separate words by user mistake. A single QSUGGEST can be used in these cases.

CALL QSUGGEST is based on trigrams. If we are trying to match terms made of more than one word, we can turn the existing whitespace into indexable characters, which gives us a single “word” that QSUGGEST can work on. For example, we can replace whitespace with an underscore, which is included in the default charset_table. To make this work, we need to add to the index an additional version of these terms in which whitespace (or other non-indexable characters) is replaced with an indexable character. An alternative would be to simply remove the whitespace, but we might end up with the same token for different terms.
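
The additional version of the terms could be produced at indexing time with something like the following. This is a minimal sketch in Python; the function names and the rule of collapsing every run of non-alphanumeric characters into one underscore are assumptions for illustration, not the exact preprocessing used for the index in this article.

```python
import re


def underscore_term(term):
    """Lowercase a term and replace each run of non-alphanumeric characters with '_'."""
    return re.sub(r"[\W_]+", "_", term.lower())


def strip_spaces(term):
    """The alternative: lowercase a term and simply drop all whitespace."""
    return "".join(term.lower().split())


if __name__ == "__main__":
    print(underscore_term("Star Wars"))    # star_wars
    print(underscore_term("Dean's List"))  # dean_s_list

    # Dropping whitespace can merge different terms into the same token,
    # while the underscore version keeps them apart.
    print(strip_spaces("now here"), strip_spaces("no where"))        # nowhere nowhere
    print(underscore_term("now here"), underscore_term("no where"))  # now_here no_where
```

Keeping a separator character preserves the word boundary inside the token, which is why the underscore variant avoids the collisions that plain whitespace removal can cause.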

In this quick example we are using a dump of Wikipedia articles, for which we indexed a separate version of the article titles with the whitespace replaced by underscores.

In the first example, we’re going to misspell the name of one of Manticore Search’s main supporters, Craigslist. The wrong terms are taken from actual searches that hit our website:

mysql> CALL QSUGGEST('craig_list','wikititles', 1 as non_char);
+--------------+----------+------+
| suggest      | distance | docs |
+--------------+----------+------+
| craigslist   | 1        | 1    |
| craig_first  | 2        | 1    |
| craig_zisk   | 2        | 1    |
| craig_ellis  | 3        | 1    |
| craig_ellis_ | 3        | 1    |
+--------------+----------+------+
5 rows in set (0.09 sec)

mysql> CALL QSUGGEST('crages_list','wikititles', 1 as non_char);
+--------------+----------+------+
| suggest      | distance | docs |
+--------------+----------+------+
| craigslist   | 3        | 1    |
| swadesh_list | 4        | 2    |
| danger_list  | 4        | 1    |
| dean_s_list  | 4        | 1    |
| lrus_list    | 4        | 1    |
+--------------+----------+------+
5 rows in set (0.13 sec)


mysql> CALL QSUGGEST('crag_list','wikititles', 1 as non_char);
+------------+----------+------+
| suggest    | distance | docs |
+------------+----------+------+
| craigslist | 2        | 1    |
| craig_zisk | 3        | 1    |
| brad_listi | 3        | 1    |
| army_list  | 3        | 1    |
| cratylism  | 3        | 1    |
+------------+----------+------+
5 rows in set (0.03 sec)

A reverse situation is when the user doesn’t type the space between words, ending up with a single word instead of two.

mysql> CALL QSUGGEST('starwars','wikititles', 1 as non_char);
+------------+----------+------+
| suggest    | distance | docs |
+------------+----------+------+
| star_wars  | 1        | 108  |
| starways   | 1        | 1    |
| star_wars_ | 2        | 8    |
| stakkars   | 2        | 1    |
| stalwart   | 2        | 1    |
+------------+----------+------+
5 rows in set (0.01 sec)

We can also have both words wrong:

mysql> CALL QSUGGEST('abaham_lincol','wikititles', 1 as non_char);
+-----------------+----------+------+
| suggest         | distance | docs |
+-----------------+----------+------+
| abraham_lincoln | 2        | 1    |
| abraham_sinkov  | 4        | 1    |
+-----------------+----------+------+
2 rows in set (0.14 sec)

Because QSUGGEST applies Levenshtein distances, the order of the words matters, and checking terms in reversed order will not return the expected results, for example:

mysql> CALL QSUGGEST('lincol_abaham','wikititles', 1 as non_char);
+---------------+----------+------+
| suggest       | distance | docs |
+---------------+----------+------+
| lincoln_isham | 4        | 1    |
+---------------+----------+------+
1 row in set (0.14 sec)

For this case, a separate QSUGGEST call should be made for each word or, if the terms can be extracted separately from the index, a trigram search should be used.
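
The per-word fallback could be prepared on the application side along these lines. This is only a sketch in Python: the helper name is hypothetical, the index name is taken from the examples above, and it assumes the index dictionary also contains the individual words (for instance from the original title field). The statements are only built here; running them is left to whatever MySQL-protocol client the application already uses.

```python
import re


def per_word_statements(phrase, index="wikititles"):
    """Build one CALL QSUGGEST statement for each word of the input phrase."""
    # Splitting on non-alphanumeric characters also removes quotes,
    # so the words can be embedded in the statement without escaping.
    # The index name is assumed to come from trusted configuration.
    words = [w for w in re.split(r"[\W_]+", phrase.lower()) if w]
    return ["CALL QSUGGEST('{}', '{}');".format(w, index) for w in words]


if __name__ == "__main__":
    for statement in per_word_statements("lincol abaham"):
        print(statement)
```

Each statement then returns suggestions for one word only, so the order in which the user typed the words no longer affects the distances.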

 
