
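Wordforms and exceptions

Exceptions and wordforms are two useful tools built into Manticore Search that can improve search recall and precision. They have a lot in common, but there are also important differences, which is what I want to cover in this article.

About tokenization

What’s the difference between full-text search (also called free-text search) and wildcard search, such as:

  • the common LIKE operator in one form or another
  • or more complex regular expressions?

There are many differences, of course, but it all starts with what we do with the initial input text:

  • in the wildcard search approach we usually treat the text as a whole
  • while in the full-text search world the text first has to be tokenized, and then each token is considered a separate entity

When you want to tokenize text, you need to decide how to do it, in particular:

  1. What should be treated as separators and what as word characters. Separators are usually characters that don’t occur inside words, such as punctuation marks: . , ? ! - and so on.
  2. Whether to preserve letter case. Usually you don’t, since it’s bad when a search for orange can’t find Orange.

Manticore does all of this automatically. For example, the text “What do I have? The list is: a cat, a dog and a parrot.” is tokenized as: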

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('What do I have? The list is: a cat, a dog and a parrot.', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | do        | do         |
| 3    | i         | i          |
| 4    | have      | have       |
| 5    | the       | the        |
| 6    | list      | list       |
| 7    | is        | is         |
| 8    | a         | a          |
| 9    | cat       | cat        |
| 10   | a         | a          |
| 11   | dog       | dog        |
| 12   | and       | and        |
| 13   | a         | a          |
| 14   | parrot    | parrot     |
+------+-----------+------------+
14 rows in set (0.00 sec)

As you can see:

  • the punctuation marks were removed
  • all the words were lowercased
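
To make those two decisions concrete, here is a rough Python sketch of such a tokenizer. It is only a toy model, not Manticore’s implementation: the word-character set (ASCII letters and digits) and the names WORD_RE and tokenize are assumptions made for this illustration.

```python
import re

# Toy model only: ASCII letters and digits are word characters and
# everything else is a separator. This is an assumption for the sketch,
# not Manticore's real character rules.
WORD_RE = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text, then keep each run of word characters as a token."""
    return WORD_RE.findall(text.lower())

tokens = tokenize("What do I have? The list is: a cat, a dog and a parrot.")
print(len(tokens), tokens)
```

Under these assumptions the sketch yields the same 14 lowercase tokens as the table above.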

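Problem

Here’s the first problem: in some cases separators should be treated as regular word characters. For example, in “Is c++ the most powerful language?” it’s obvious that c++ is a single word. That’s easy for a human to see, but not for a full-text search algorithm: it sees the plus sign, doesn’t find it in its list of word characters, and removes it from the token, so you end up with: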

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('Is c++ the most powerful language?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

OK, but what’s the problem?

The problem is that after this tokenization, if you search for c#, you’ll find the sentence above:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> insert into t values(0,'Is c++ the most powerful language?');
mysql> select highlight() from t where match('c#');

+-------------------------------------------+
| highlight()                               |
+-------------------------------------------+
| Is <b>c</b>++ the most powerful language? |
+-------------------------------------------+
1 row in set (0.01 sec)

This happens because c# is also tokenized as just c, and then c from the search query matches c from the document, so you get this result.
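
The collision is easy to reproduce outside Manticore. The sketch below is a toy model under stated assumptions: the word-character set is just ASCII letters and digits, and matches is a hypothetical helper that requires every query token to occur in the document, which is much cruder than real full-text matching.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")  # toy word-character set (assumption)

def tokenize(text):
    """Lowercase the text and drop everything that is not a word character."""
    return WORD_RE.findall(text.lower())

def matches(query, document):
    """Crude AND match: every query token must occur in the document."""
    doc_tokens = set(tokenize(document))
    return all(token in doc_tokens for token in tokenize(query))

doc = "Is c++ the most powerful language?"
query_tokens = tokenize("c#")
doc_has_c = "c" in tokenize(doc)
found = matches("c#", doc)
print(query_tokens, doc_has_c, found)
```

Both c# and c++ lose their special characters and collapse to c, so the query matches a document it was never meant to match.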

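What’s the solution? There are a few options. The first that probably comes to mind is:

Why not add + and # to the list of word characters?

Good question. Let’s try it.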

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('Is c++ the most powerful language?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

It works, but putting + into the list immediately starts affecting other words and searches, for example:

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('What is 2+2?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | is        | is         |
| 3    | 2+2       | 2+2        |
+------+-----------+------------+
3 rows in set (0.00 sec)

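You wanted c++ to be a single word, but you didn’t want 2+2 to become one too, did you?

Right, so what can we do?

To handle c++ in a special way, you can make it an exception.

Exceptions

Exceptions (also known as synonyms) let you map one or more terms (including terms with characters that would normally be excluded) to a single keyword.

Let’s make c++ an exception by putting it into an exceptions file: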

➜  ~ cat /tmp/exceptions
c++ => c++

and use the file when creating the table:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is c++ the most powerful language? What is 2+2?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | what      | what       |
| 8    | is        | is         |
| 9    | 2         | 2          |
| 10   | 2         | 2          |
+------+-----------+------------+
10 rows in set (0.01 sec)

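Great, c++ is now a single word that keeps its plus signs, and 2+2 is fine too.

The thing to remember about exceptions is that they are very dumb, not smart at all: they do exactly what you ask and nothing more. In particular:

  • they don’t change letter case
  • if you make a mistake and put in a double space, they won’t collapse it into a single space

and so on. They literally treat your input as an array of bytes.

For example, people write c++ in both lowercase and uppercase. Let’s try the exception above with the uppercase spelling: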

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

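Oops, C++ was tokenized as just c, because the exception is c++ (lowercase), not C++ (uppercase).

But did you notice that an exception is a pair of items rather than a single one: c++ => c++? The left-hand side is what triggers the exceptions algorithm in the text, and the right-hand side is the resulting token. Let’s try adding a C++ => c++ mapping:

➜  ~ cat /tmp/exceptions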
c++ => c++
C++ => c++

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

Now it’s fine again, since both C++ and c++ are tokenized as c++. How satisfying.
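
The case-sensitive behaviour of exceptions can be modelled in a few lines of Python. This is only a sketch of the idea and not how Manticore implements it: tokenize_with_exceptions is a hypothetical name, the word-character set is limited to ASCII letters and digits, and exception keys are simply tried before ordinary words while scanning the raw, not yet lowercased text.

```python
import re

def tokenize_with_exceptions(text, exceptions):
    """Match exception keys literally first, then fall back to plain words."""
    keys = sorted(exceptions, key=len, reverse=True)  # try the longest key first
    parts = [re.escape(key) for key in keys] + [r"[A-Za-z0-9]+"]
    scanner = re.compile("|".join(parts))
    tokens = []
    for match in scanner.finditer(text):
        piece = match.group(0)
        # An exception hit is emitted as its mapped token;
        # an ordinary word is lowercased as usual.
        tokens.append(exceptions[piece] if piece in exceptions else piece.lower())
    return tokens

text = "Is C++ the most powerful language? How about c++?"
lower_only = tokenize_with_exceptions(text, {"c++": "c++"})
both_cases = tokenize_with_exceptions(text, {"c++": "c++", "C++": "c++"})
print(lower_only)
print(both_cases)
```

Because the key is compared before any lowercasing happens, C++ only becomes c++ once it has its own line in the exceptions map.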

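What are other good examples of exceptions?

  • AT&T => AT&T and at&t => AT&T
  • M&M's => M&M's, m&m's => M&M's and M&m's => M&M's
  • U.S.A. => USA and US => USA

What are bad examples?

  • us => USA, because we don’t want every us to become USA

So the rule of thumb with exceptions is:

If a term includes special characters and that’s how it’s normally written in text and in search queries, make it an exception.

Synonyms

Manticore Search users also often call exceptions synonyms, because another use case for them is not just keeping special characters and letter case, but mapping terms that are written completely differently to the same token, for example: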

MS Windows => ms windows
Microsoft Windows => ms windows

Why does this matter? Because it makes it easy to find documents containing Microsoft Windows by searching for MS Windows, and vice versa.

Example:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems');
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976139 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

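At first glance it works fine, but think a bit further and recall that exceptions are case- and byte-sensitive, and you may ask yourself: “Can’t people write MicroSoft windows, MS WINDOWS, microsoft Windows and so on?”

Yes, they can. So if you want to handle this with exceptions, be prepared for what mathematics calls a combinatorial explosion.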
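
To get a feeling for the size of that explosion, here is a back-of-the-envelope count in Python. It assumes that every letter can independently be upper or lower case and that each resulting spelling would need its own line in the exceptions file; case_variants is a name made up for this sketch.

```python
def case_variants(phrase):
    """Count the distinct upper/lower-case spellings of a phrase."""
    letters = sum(1 for ch in phrase if ch.isalpha())
    return 2 ** letters  # each letter is independently upper or lower case

ms = case_variants("ms windows")                # 9 letters
microsoft = case_variants("microsoft windows")  # 16 letters
print(ms, microsoft)
```

That is 512 spellings for the short form and 65536 for the long one, before even thinking about double spaces or punctuation.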

That doesn’t look good at all. What can we do?

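Wordforms

Another tool that is similar to exceptions is wordforms. Unlike exceptions, wordforms are applied after the incoming text has been tokenized. So they are:

  • case-insensitive (unless your charset_table enables case sensitivity)
  • indifferent to special characters

They essentially let you replace one word with another. Normally this is used to bring different word forms to a single normal form, for example to normalize all the variants such as “walks”, “walked” and “walking” to the normal form “walk”: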

➜  ~ cat /tmp/wordforms
walks => walk
walked => walk
walking => walk
mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('walks _WaLkeD! walking', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | walks     | walk       |
| 2    | walked    | walk       |
| 3    | walking   | walk       |
+------+-----------+------------+
3 rows in set (0.00 sec)

As you can see, all 3 words were converted to just walk. Note that even the 2nd word, the heavily deformed _WaLkeD!, was normalized fine. Do you see where I’m going with this? Yes, the MS Windows example. Let’s test whether wordforms can help solve that issue.

Let’s put just 2 lines to the wordforms file:

➜  ~ cat /tmp/wordforms
ms windows => ms windows
microsoft windows => ms windows

and populate the table with a few documents:

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems'), (0, 'porch windows'),(0, 'Windows are rolled down');

mysql> select * from t;
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

Let’s now try various queries:

mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

MS Windows finds Microsoft Windows fine.

mysql> select * from t where match('ms WINDOWS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.01 sec)

ms WINDOWS works fine too.

mysql> select * from t where match('mIcRoSoFt WiNdOwS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

✅ And even mIcRoSoFt WiNdOwS finds the same document.

mysql> select * from t where match('windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

✅ Just basic windows finds all the documents.

So indeed, wordforms helps to solve the issue.

The rule of thumb with the wordforms is:

Use wordforms for words and phrases that can be written in different forms and don’t contain special characters.
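
Why wordforms are immune to letter case and punctuation becomes clearer with a small model. The Python sketch below is a toy under stated assumptions and not Manticore’s code: the text is tokenized first (lowercased, ASCII letters and digits only), and only then does the hypothetical apply_wordforms helper replace the longest matching run of tokens.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")  # toy word-character set (assumption)

def tokenize(text):
    return WORD_RE.findall(text.lower())

def apply_wordforms(tokens, wordforms):
    """Replace the longest matching run of tokens at each position."""
    # Rule sources are tokenized as well, so their case and punctuation vanish.
    rules = {tuple(tokenize(src)): dst.split() for src, dst in wordforms.items()}
    longest = max((len(key) for key in rules), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            replacement = rules.get(tuple(tokens[i:i + n]))
            if replacement is not None:
                out.extend(replacement)
                i += n
                break
        else:
            out.append(tokens[i])  # no rule starts here, keep the token as is
            i += 1
    return out

wordforms = {"walked": "walk", "microsoft windows": "ms windows"}
walked = apply_wordforms(tokenize("_WaLkeD!"), wordforms)
windows = apply_wordforms(tokenize("mIcRoSoFt WiNdOwS"), wordforms)
print(walked, windows)
```

Since the rules only ever see clean lowercase tokens, one line per phrase covers every capitalization, which is exactly what exceptions could not do.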

Floor & Decor

Let’s take a look at another example: we want to improve search for the brand name Floor & Decor. We can assume people can write this name in the following forms:

Floor & Decor
Floor & decor
floor & decor
Floor and Decor
floor and decor

and other letter capitalization combinations.

Also:

Floor & Decor Holdings
Floor & Decor Holdings, inc.

and, again, various combinations with different letters capitalized.

Now that we know how exceptions and wordforms work, what do we do to cover this brand name?

First of all, it’s easy to notice that the canonical brand name is Floor & Decor, i.e. it includes a special character that is normally considered a word separator. So should we use exceptions? But the name is long and can be written in many ways. With exceptions we would end up with a huge list of all the combinations. Moreover, the extended forms Floor & Decor Holdings and Floor & Decor Holdings, inc. would make the list even longer.

The best solution in this case is to just use wordforms like this:

➜  ~ cat /tmp/wordforms
floor & decor => fnd
floor and decor => fnd
floor & decor holdings => fnd
floor and decor holdings => fnd
floor & decor holdings inc => fnd
floor and decor holdings inc => fnd

Why does it include &? Actually you can skip it:

floor decor => fnd
floor and decor => fnd
floor decor holdings => fnd
floor and decor holdings => fnd
floor decor holdings inc => fnd
floor and decor holdings inc => fnd

because wordforms ignore non-word characters anyway; it was left in only to make the file easier to read.

As a result, each combination is tokenized as fnd, which will be our short key for this brand name.

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('Floor & Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('floor and Decor', 't');
+------+-----------------+------------+
| qpos | tokenized       | normalized |
+------+-----------------+------------+
| 1    | floor and decor | fnd        |
+------+-----------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor holdings', 't');
+------+----------------------+------------+
| qpos | tokenized            | normalized |
+------+----------------------+------------+
| 1    | floor decor holdings | fnd        |
+------+----------------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor HOLDINGS INC.', 't');
+------+--------------------------+------------+
| qpos | tokenized                | normalized |
+------+--------------------------+------------+
| 1    | floor decor holdings inc | fnd        |
+------+--------------------------+------------+
1 row in set (0.00 sec)

Is this the perfect final solution? Unfortunately not, as is often the case in full-text search. There are always rare cases, and this one is no exception. For example:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0,'It is located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');

+---------------------+----------------------------------------------------+
| id                  | f                                                  |
+---------------------+----------------------------------------------------+
| 1514841286668976231 | It is located on the 2nd floor. Decor is also nice |
+---------------------+----------------------------------------------------+
1 row in set (0.00 sec)

We can see that Floor & Decor Holdings finds a document in which the first sentence ends with floor and the next one starts with Decor. This happens because floor. Decor is also tokenized as fnd, since the wordforms we use are insensitive to letter case and special characters:

mysql> call keywords('floor. Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

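A false match is not good. To solve this particular problem we can use Manticore’s ability to detect sentences and paragraphs.

If we enable it, we can see that the document no longer matches the keyword:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms' index_sp='1';
mysql> insert into t values(0,'It is located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');

Empty set (0.00 sec)

because:

  1. Floor & Decor, as we remember, is converted to fnd by wordforms
  2. index_sp='1' splits the text into sentences
  3. after the split, floor. and Decor end up in different sentences
  4. so they no longer match fnd, and therefore none of its original forms match either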
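
The four steps above can be mimicked with a small Python model. It only imitates the effect described here and says nothing about how index_sp works internally: the sentence splitter is a crude regular expression, the single hard-coded rule stands in for the floor & decor wordform, and all names are made up for the sketch.

```python
import re

WORD_RE = re.compile(r"[a-z0-9]+")          # toy word-character set (assumption)
SENTENCE_END_RE = re.compile(r"[.!?]+\s+")  # crude sentence splitter (assumption)
PHRASE = ("floor", "decor")                 # token form of "floor & decor => fnd"

def tokenize(text):
    return WORD_RE.findall(text.lower())

def fold_phrase(tokens):
    """Replace each adjacent floor/decor pair with the single token fnd."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == PHRASE:
            out.append("fnd")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def index_tokens(text, detect_sentences):
    """Fold the phrase per sentence when sentence detection is on."""
    chunks = SENTENCE_END_RE.split(text) if detect_sentences else [text]
    tokens = []
    for chunk in chunks:
        tokens.extend(fold_phrase(tokenize(chunk)))
    return tokens

doc = "It is located on the 2nd floor. Decor is also nice"
without_sp = index_tokens(doc, detect_sentences=False)
with_sp = index_tokens(doc, detect_sentences=True)
print("fnd" in without_sp, "fnd" in with_sp)
```

Without sentence detection the pair is folded into fnd and the brand query matches; with it, floor and decor stay separate tokens and the false match disappears.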

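Conclusion

Manticore’s exceptions and wordforms are powerful tools that help you fine-tune your search, improving recall and precision especially for terms that contain special characters and for longer terms that should be aliased to one another. But you need to help Manticore, because it can’t decide for you what the names should be.

Thank you for reading this article!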
