Suggestions on phrases using a single SUGGEST call

In this article we discuss how a single CALL QSUGGEST can be used to correct phrases in particular cases.

CALL QSUGGEST was introduced in the last version of Sphinx 2.x. The statement finds close matches for an input word in the dictionary of an index that has infixing enabled. The most common use case for this feature is implementing “Did you mean …?” functionality.

Before the introduction of QSUGGEST, implementing “Did you mean …?” required extracting the words from the index dictionary and storing their trigrams in a separate index. A search based on the trigrams of the input word would then be performed against this index. As a follow-up, to improve the quality of the results, Levenshtein distances were calculated on the returned matches.
CALL QSUGGEST eliminates this work: it doesn’t require a separate index and it also provides the Levenshtein calculation, removing not only the need for another index (which would also have to be refreshed from time to time), but also the extra code needed to implement the feature.
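
To make the older approach more concrete, here is a minimal in-memory sketch of its first step, the trigram lookup. It is only an illustration: the function names and the tiny dictionary are hypothetical, a real setup stored the trigrams in a separate Sphinx index, and the Levenshtein re-ranking step is left out.

```python
from collections import defaultdict


def trigrams(word):
    """Return the set of 3-character substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def build_trigram_index(dictionary):
    """Map every trigram to the dictionary words that contain it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def candidates(query, index, limit=5):
    """Rank dictionary words by the number of trigrams shared with the query."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for word in index.get(gram, ()):
            counts[word] += 1
    # Most shared trigrams first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Hypothetical dictionary standing in for the words extracted from an index.
index = build_trigram_index(["craigslist", "cratylism", "starways"])
print(candidates("craigslst", index))  # [('craigslist', 5), ('cratylism', 1)]
```

Every piece of this (extracting the words, keeping the trigram index fresh, ranking the candidates) is what QSUGGEST now handles inside the daemon.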

Although CALL QSUGGEST accepts more than one word as input, it only gives suggestions for the last word and ignores the rest. Its counterpart, CALL SUGGEST, instead uses the first word and ignores the rest. If we want suggestions for multiple words, the first choice is to perform multiple QSUGGEST calls. But there might be cases where our input is a specific term or taxonomy (think product SKUs or tags) made up of more than one word, or a word broken into separate words (by user mistake). A single QSUGGEST can be used in these cases.

Like the older approach, CALL QSUGGEST is based on trigrams. If we are trying to match terms made up of more than one word, we can turn the existing whitespace into an indexable character, which gives us a single “word” that QSUGGEST can work on. For example, we can replace whitespace with the underscore, which is included in the default charset_table. To make this work, we need to add to the index an additional version of these terms in which whitespace (or other non-indexable characters) is replaced with an indexable character. An alternative would be to just remove the whitespace, but we might end up with the same token for different terms.
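
The preparation step described above could be sketched as follows. This is an assumption-laden illustration rather than the exact script used for this article: the function names are hypothetical, and the character class only approximates the Latin part of the default charset_table.

```python
import re


def to_suggest_token(title):
    """Lowercase a title and replace each run of non-indexable characters with '_'.

    Assumption: only 0-9, a-z and '_' are treated as indexable here.
    """
    return re.sub(r"[^0-9a-z_]+", "_", title.lower())


def strip_whitespace(title):
    """The alternative mentioned above: simply drop the whitespace."""
    return re.sub(r"\s+", "", title.lower())


print(to_suggest_token("Star Wars"))    # star_wars
print(to_suggest_token("Dean's List"))  # dean_s_list

# Dropping whitespace makes two different terms collide...
print(strip_whitespace("Now Here") == strip_whitespace("Nowhere"))  # True
# ...while the underscore version keeps them apart.
print(to_suggest_token("Now Here") == to_suggest_token("Nowhere"))  # False
```

Note that punctuation at the end of a title produces a trailing underscore with this sketch; whether to strip it is a design choice.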

In this quick example we use a dump of Wikipedia articles, for which we indexed a separate version of the article titles with whitespace replaced by underscores.

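In the first example, we’re going to misspell the name of one of Manticore Search’s main supporters, Craigslist. The wrong terms are taken from actual searches that hit our website. The non_char option is set to 1 so that dictionary words containing non-alphabetic symbols, such as our underscore, are not skipped: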

mysql> CALL QSUGGEST('craig_list','wikititles', 1 as non_char);
+--------------+----------+------+
| suggest      | distance | docs |
+--------------+----------+------+
| craigslist   | 1        | 1    |
| craig_first  | 2        | 1    |
| craig_zisk   | 2        | 1    |
| craig_ellis  | 3        | 1    |
| craig_ellis_ | 3        | 1    |
+--------------+----------+------+
5 rows in set (0.09 sec)

mysql> CALL QSUGGEST('crages_list','wikititles', 1 as non_char);
+--------------+----------+------+
| suggest      | distance | docs |
+--------------+----------+------+
| craigslist   | 3        | 1    |
| swadesh_list | 4        | 2    |
| danger_list  | 4        | 1    |
| dean_s_list  | 4        | 1    |
| lrus_list    | 4        | 1    |
+--------------+----------+------+
5 rows in set (0.13 sec)


mysql> CALL QSUGGEST('crag_list','wikititles', 1 as non_char);
+------------+----------+------+
| suggest    | distance | docs |
+------------+----------+------+
| craigslist | 2        | 1    |
| craig_zisk | 3        | 1    |
| brad_listi | 3        | 1    |
| army_list  | 3        | 1    |
| cratylism  | 3        | 1    |
+------------+----------+------+
5 rows in set (0.03 sec)

The reverse situation is when the user doesn’t type the space between words, ending up with a single word instead of two:

mysql> CALL QSUGGEST('starwars','wikititles', 1 as non_char);
+------------+----------+------+
| suggest    | distance | docs |
+------------+----------+------+
| star_wars  | 1        | 108  |
| starways   | 1        | 1    |
| star_wars_ | 2        | 8    |
| stakkars   | 2        | 1    |
| stalwart   | 2        | 1    |
+------------+----------+------+
5 rows in set (0.01 sec)

We can also have both words wrong:

mysql> CALL QSUGGEST('abaham_lincol','wikititles', 1 as non_char);
+-----------------+----------+------+
| suggest         | distance | docs |
+-----------------+----------+------+
| abraham_lincoln | 2        | 1    |
| abraham_sinkov  | 4        | 1    |
+-----------------+----------+------+
2 rows in set (0.14 sec)

Because QSUGGEST applies Levenshtein distances, the order of the words matters, and checking terms in reversed order will not bring the expected results. For example:

mysql> CALL QSUGGEST('lincol_abaham','wikititles', 1 as non_char);
+---------------+----------+------+
| suggest       | distance | docs |
+---------------+----------+------+
| lincoln_isham | 4        | 1    |
+---------------+----------+------+
1 row in set (0.14 sec)
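
To see why the word order matters, it helps to compute the distances by hand. The sketch below is a textbook Levenshtein implementation, not QSUGGEST's internal code, so treat it as an approximation of what happens inside the daemon.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


target = "abraham_lincoln"
print(levenshtein("abaham_lincol", target))  # 2: only two letters are missing
print(levenshtein("lincol_abaham", target))  # much larger: almost every position differs
```

With the words swapped, nearly every character sits in the wrong position, so the correct title ends up far away in edit-distance terms even though both words are almost right.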

For this case, either a QSUGGEST call should be made for each word or, if the terms can be extracted separately from the index, a trigram search should be used.
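
The per-word fallback could look roughly like the sketch below, which only builds the statements and leaves sending them to your MySQL-protocol client. The helper name and the `wikititles_words` index (an index over the regular, space-separated titles) are hypothetical, and the escaping is deliberately simplistic.

```python
def per_word_qsuggest(phrase, index):
    """Build one CALL QSUGGEST statement per whitespace-separated word."""
    statements = []
    for word in phrase.split():
        # Naive escaping of backslashes and single quotes (an assumption);
        # prefer your client library's own quoting where available.
        escaped = word.replace("\\", "\\\\").replace("'", "\\'")
        statements.append(f"CALL QSUGGEST('{escaped}', '{index}')")
    return statements


for statement in per_word_qsuggest("lincol abaham", "wikititles_words"):
    print(statement)
# CALL QSUGGEST('lincol', 'wikititles_words')
# CALL QSUGGEST('abaham', 'wikititles_words')
```

The best suggestion for each word can then be combined by the application into a corrected phrase, regardless of the order in which the words were typed.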
