
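Wordforms vs exceptions

Exceptions and wordforms are two useful tools built into Manticore Search that you can use to improve search recall and precision. They have a lot in common, but there are also important differences, which I'd like to discuss in this article.

About tokenization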

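Wildcard search comes in forms such as:

  • the well-known LIKE operator in one form or another
  • or more complex regular expressions

How does full-text search (also known as free-text search) differ from it? There are of course many differences, but it all starts with what we do with the initial input text in each approach:

  • in the wildcard search approach we usually treat the text as a whole
  • in the full-text search area it is essential to first tokenize the text and then treat each token as a separate entity

When you want to tokenize text, you need to decide how to do it, in particular:

  1. What should be treated as separators and what as word characters. Usually a separator is a character that doesn't occur inside a word, e.g. punctuation marks such as . , ? ! - and so on.
  2. Whether the letter case of tokens should be preserved. Usually it should not, since it is bad for search if you can't find Orange by the keyword orange.

Manticore does all of this automatically. For example, the text "What do I have? The list is: a cat, a dog and a parrot." is tokenized into: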

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('What do I have? The list is: a cat, a dog and a parrot.', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | do        | do         |
| 3    | i         | i          |
| 4    | have      | have       |
| 5    | the       | the        |
| 6    | list      | list       |
| 7    | is        | is         |
| 8    | a         | a          |
| 9    | cat       | cat        |
| 10   | a         | a          |
| 11   | dog       | dog        |
| 12   | and       | and        |
| 13   | a         | a          |
| 14   | parrot    | parrot     |
+------+-----------+------------+
14 rows in set (0.00 sec)

As you can see:

  • punctuation was removed
  • all words were lowercased
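
To make these two effects concrete, here is a toy tokenizer in Python. It is only a sketch: the name tokenize and the assumption that word characters are just ASCII letters and digits are mine, while Manticore's real tokenizer is driven by charset_table and handles far more.

```python
import re

def tokenize(text):
    # Assumption: only ASCII letters and digits are word characters.
    # Lowercase first, then keep every run of word characters as one token.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("What do I have? The list is: a cat, a dog and a parrot."))
```

Under these assumptions the sketch yields the same 14 lowercase tokens as the call keywords output above.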

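The problem

Here comes the first problem: in some cases a separator is meant as a regular word character. For example, in "Is c++ the most powerful language?" it is obvious that c++ is a single word. That is easy for a human to see, but not for a full-text search algorithm: it sees the plus sign, doesn't find it in its list of word characters and removes it from the token, so you end up with: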

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> call keywords('Is c++ the most powerful language?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

OK, but what's the problem?

The problem is that after this tokenization, if you search for C#, for example, you will find the sentence above:

mysql> drop table if exists t;
mysql> create table t(f text);
mysql> insert into t values(0,'Is c++ the most powerful language?');
mysql> select highlight() from t where match('c#');

+-------------------------------------------+
| highlight()                               |
+-------------------------------------------+
| Is <b>c</b>++ the most powerful language? |
+-------------------------------------------+
1 row in set (0.01 sec)

This happens because C# is also tokenized to c; the c from the search query then matches the c from the document, and the document is returned.
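
The collision can be reproduced with a toy model. The sketch below assumes, as a simplification, that word characters are only ASCII letters and digits and that a document matches when every query token occurs in it; neither is a description of Manticore's internals.

```python
import re

def tokenize(text):
    # Assumption: + and # are not word characters, so they are dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

doc = tokenize("Is c++ the most powerful language?")
query = tokenize("c#")

print(doc)    # the document keeps only "c" from "c++"
print(query)  # the query keeps only "c" from "c#"
# Simplified matching: every query token must be present in the document.
print(all(token in doc for token in query))
```

Both c++ and c# collapse to the same token c, which is why the query finds the document.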

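What's the solution? There are a few options. The first that comes to mind is:

OK, why don't I put + and # into the list of word characters?

That's a good question. Let's try it.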

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('Is c++ the most powerful language?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
+------+-----------+------------+
6 rows in set (0.00 sec)

It works, but the + in the list immediately starts affecting other words and searches, for example:

mysql> drop table if exists t;
mysql> create table t(f text) charset_table='non_cjk,+,#';
mysql> call keywords('What is 2+2?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | what      | what       |
| 2    | is        | is         |
| 3    | 2+2       | 2+2        |
+------+-----------+------------+
3 rows in set (0.00 sec)

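You wanted C++ to be a separate word, but not 2+2, right?

Right, so what do we do?

To handle C++ in a special way you can make it an exception.

Exceptions

Exceptions (also known as synonyms) let you map one or more terms, including terms with characters that would normally be excluded, to a single keyword.

Let's make C++ an exception by putting it into the exceptions file: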

➜  ~ cat /tmp/exceptions
c++ => c++

and use the file when we create the table:

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is c++ the most powerful language? What is 2+2?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | what      | what       |
| 8    | is        | is         |
| 9    | 2         | 2          |
| 10   | 2         | 2          |
+------+-----------+------------+
10 rows in set (0.01 sec)

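Great, C++ is now a separate word, the plus signs are not lost, and 2+2 is fine too.

What you need to remember about exceptions is that they are deliberately dumb: they do exactly what you ask and nothing more. In particular:

  • they don't change letter case
  • if you make a mistake and put a double space, they won't convert it to a single space

and so on. They literally treat your input as an array of bytes.

For example, people write C++ in both upper and lower case. Let's try the above exception with the upper-case form: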

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c         | c          |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

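Oops, C++ was tokenized as c, because the exception is c++ (lower case), not C++ (upper case).

But did you notice that an exception is a pair of items rather than a single item: c++ => c++? The left part is what triggers the exceptions algorithm in the text, and the right part is the resulting token. Let's try to map C++ to c++:

➜  ~ cat /tmp/exceptions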
c++ => c++
C++ => c++

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> call keywords('Is C++ the most powerful language? How about c++?', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | is        | is         |
| 2    | c++       | c++        |
| 3    | the       | the        |
| 4    | most      | most       |
| 5    | powerful  | powerful   |
| 6    | language  | language   |
| 7    | how       | how        |
| 8    | about     | about      |
| 9    | c++       | c++        |
+------+-----------+------------+
9 rows in set (0.00 sec)

OK, now it's fine again, since both C++ and c++ are tokenized to the token c++. How satisfying.
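
The byte-exact behaviour of exceptions can be sketched in Python. This is a toy model, not Manticore's implementation: the function name tokenize_with_exceptions is hypothetical, word characters are assumed to be ASCII letters and digits, and the sketch does not check word boundaries around an exception key.

```python
import re

def tokenize_with_exceptions(text, exceptions):
    # Exception keys are tried first and compared exactly (case-sensitive).
    # Everything else falls back to lowercased runs of ASCII letters and digits.
    keys = sorted(exceptions, key=len, reverse=True)
    alternatives = [re.escape(key) for key in keys] + [r"[A-Za-z0-9]+"]
    pattern = re.compile("|".join(alternatives))
    tokens = []
    for match in pattern.finditer(text):
        piece = match.group(0)
        tokens.append(exceptions.get(piece, piece.lower()))
    return tokens

text = "Is C++ the most powerful language? How about c++?"
print(tokenize_with_exceptions(text, {"c++": "c++"}))
print(tokenize_with_exceptions(text, {"c++": "c++", "C++": "c++"}))
```

With only the lower-case rule, the upper-case C++ falls through to the regular tokenizer and becomes c; with both rules, both spellings end up as c++.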

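Other good examples of exceptions are:

  • AT&T => AT&T and at&t => AT&T
  • M&M's => M&M's, m&m's => M&M's and M&m's => M&M's
  • U.S.A. => USA and US => USA

What would be a bad example?

  • us => USA, since we don't want every us to become USA

So the rule of thumb with exceptions is:

If a term includes special characters and that is how it is normally written in text and in search queries, make it an exception.

Synonyms

Manticore Search users also often call exceptions synonyms, because another use case for them is not only to preserve special characters and letter case, but also to map terms that are written completely differently to the same token, for example:

MS Windows => ms windows
Microsoft Windows => ms windows

Why is this important? Because it makes it easy to find a document containing Microsoft Windows by searching for MS Windows, and vice versa.

Example: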

mysql> drop table if exists t;
mysql> create table t(f text) exceptions='/tmp/exceptions';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems');
mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976139 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

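At first glance it works fine, but if you think about it further and recall that exceptions are case- and byte-sensitive, you may ask yourself: "Can't people write MicroSoft windows, MS WINDOWS, microsoft Windows and so on?"

Yes, they can. So if you want to use exceptions for that, be prepared for what mathematicians call a combinatorial explosion.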
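
To get a feeling for the size of that explosion, here is a small back-of-the-envelope calculation in Python. It counts only upper/lower case variants of each letter (the helper name case_variants is hypothetical) and ignores double spaces and other byte-level differences, so the real number of exception lines would be even larger.

```python
def case_variants(phrase):
    # Each letter can independently be upper or lower case,
    # so the number of spellings is 2 ** (number of letters).
    letters = sum(1 for ch in phrase if ch.isalpha())
    return 2 ** letters

print(case_variants("ms windows"))         # 9 letters, 512 spellings
print(case_variants("microsoft windows"))  # 16 letters, 65536 spellings
```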

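That doesn't look good at all. What can we do about it?

Wordforms

Another tool, similar to exceptions, is wordforms. Unlike exceptions, wordforms are applied after the incoming text has been tokenized. As a result they are:

  • case insensitive (unless your charset_table enables case sensitivity)
  • indifferent to special characters

They essentially let you replace one word with another. Normally this is used to bring different word forms to a single normal form, for example to normalize all the variants such as "walks", "walked" and "walking" to the normal form "walk":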

➜  ~ cat /tmp/wordforms
walks => walk
walked => walk
walking => walk

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('walks _WaLkeD! walking', 't');

+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | walks     | walk       |
| 2    | walked    | walk       |
| 3    | walking   | walk       |
+------+-----------+------------+
3 rows in set (0.00 sec)

As you can see, all three words were converted to just walk. Note that the second one, _WaLkeD!, was normalized fine too, even though it was heavily deformed. Do you see where I'm going with this? Yes, the MS Windows example. Let's test whether wordforms can help with that issue.

Let's put just two lines into the wordforms file:

➜  ~ cat /tmp/wordforms
ms windows => ms windows
microsoft windows => ms windows

and populate the table with a few documents:

mysql> drop table if exists t;
mysql> create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0, 'Microsoft Windows is one of the first operating systems'), (0, 'porch windows'),(0, 'Windows are rolled down');

mysql> select * from t;
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

Let's now try various queries:

mysql> select * from t where match('MS Windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

✅ MS Windows finds Microsoft Windows fine.

mysql> select * from t where match('ms WINDOWS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.01 sec)

✅ ms WINDOWS works fine too.

mysql> select * from t where match('mIcRoSoFt WiNdOwS');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
+---------------------+---------------------------------------------------------+
1 row in set (0.00 sec)

✅ And even mIcRoSoFt WiNdOwS finds the same document.

mysql> select * from t where match('windows');
+---------------------+---------------------------------------------------------+
| id                  | f                                                       |
+---------------------+---------------------------------------------------------+
| 1514841286668976166 | Microsoft Windows is one of the first operating systems |
| 1514841286668976167 | porch windows                                           |
| 1514841286668976168 | Windows are rolled down                                 |
+---------------------+---------------------------------------------------------+
3 rows in set (0.00 sec)

✅ Just basic windows finds all the documents.

So wordforms do indeed solve the issue.

The rule of thumb with wordforms is:

Use wordforms for words and phrases that can be written in different forms and don't contain special characters.
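
Why wordforms are immune to letter case can be seen in a toy model where the rules are applied to tokens rather than to raw text. The sketch below is an illustration under my own assumptions, not Manticore's implementation: the helper names are hypothetical, word characters are ASCII letters and digits, and a multi-word destination is assumed to produce one token per word.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def load_wordforms(lines):
    # Both sides of "source => destination" go through the same tokenizer,
    # so letter case and special characters in the rules do not matter.
    rules = {}
    for line in lines:
        source, destination = line.split("=>")
        rules[tuple(tokenize(source))] = tokenize(destination)
    return rules

def apply_wordforms(tokens, rules):
    longest = max((len(key) for key in rules), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Prefer the longest rule that matches at position i.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

rules = load_wordforms(["ms windows => ms windows", "microsoft windows => ms windows"])
for text in ["MS Windows", "mIcRoSoFt WiNdOwS", "porch windows"]:
    print(apply_wordforms(tokenize(text), rules))
```

The first two inputs both normalize to the tokens ms and windows, while porch windows is left alone. In this model a plain windows query still matches all three documents, as in the queries above.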

Floor & Decor

Let's take a look at another example: we want to improve search for the brand name Floor & Decor. We can assume people can write this name in the following forms:

Floor & Decor
Floor & decor
floor & decor
Floor and Decor
floor and decor

and other letter capitalization combinations.

Also:

Floor & Decor Holdings
Floor & Decor Holdings, inc.

and, again, various combinations with different letters capitalized.

Now that we know how exceptions and wordforms work, what do we do to cover this brand name?

First of all, the canonical brand name is Floor & Decor, i.e. it includes a special character that is normally considered a word separator, so should we use exceptions? But the name is long and can be written in many ways. With exceptions we could end up with a huge list of all the combinations. Moreover, the extended forms Floor & Decor Holdings and Floor & Decor Holdings, inc. would make the list even longer.

The best solution in this case is to just use wordforms like this:

➜  ~ cat /tmp/wordforms
floor & decor => fnd
floor and decor => fnd
floor & decor holdings => fnd
floor and decor holdings => fnd
floor & decor holdings inc => fnd
floor and decor holdings inc => fnd

Why does it include &? Actually you can skip it:

floor decor => fnd
floor and decor => fnd
floor decor holdings => fnd
floor and decor holdings => fnd
floor decor holdings inc => fnd
floor and decor holdings inc => fnd

because wordforms ignore non-word characters anyway; the & was kept only for readability.

As a result, each combination is tokenized as fnd, which will be our short key for this brand name.

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> call keywords('Floor & Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('floor and Decor', 't');
+------+-----------------+------------+
| qpos | tokenized       | normalized |
+------+-----------------+------------+
| 1    | floor and decor | fnd        |
+------+-----------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor holdings', 't');
+------+----------------------+------------+
| qpos | tokenized            | normalized |
+------+----------------------+------------+
| 1    | floor decor holdings | fnd        |
+------+----------------------+------------+
1 row in set (0.00 sec)

mysql> call keywords('Floor & Decor HOLDINGS INC.', 't');
+------+--------------------------+------------+
| qpos | tokenized                | normalized |
+------+--------------------------+------------+
| 1    | floor decor holdings inc | fnd        |
+------+--------------------------+------------+
1 row in set (0.00 sec)
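
The reason the & could be dropped from the rules can be illustrated with a few lines of Python. This is a sketch under the assumption that the source side of a rule is tokenized like any other text and that & is not a word character; rule_key is a hypothetical helper, not a Manticore function.

```python
import re

def tokenize(text):
    # Assumption: only ASCII letters and digits are word characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def rule_key(rule):
    # Only the source side (left of "=>") decides when the rule fires.
    source, _ = rule.split("=>")
    return tuple(tokenize(source))

print(rule_key("floor & decor => fnd"))
print(rule_key("floor decor => fnd"))
print(rule_key("floor & decor => fnd") == rule_key("floor decor => fnd"))
```

In this model both spellings of the rule reduce to the same token sequence, floor followed by decor, so they are interchangeable.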

Is this the perfect, ultimate solution? Unfortunately not, as is the case with many other things in full-text search. There are always rare edge cases, and this approach has one too. For example:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');

+---------------------+---------------------------------------------------+
| id                  | f                                                 |
+---------------------+---------------------------------------------------+
| 1514841286668976231 | It's located on the 2nd floor. Decor is also nice |
+---------------------+---------------------------------------------------+
1 row in set (0.00 sec)

Here Floor & Decor Holdings finds a document in which floor ends one sentence and Decor starts the next. This happens because floor. Decor also gets tokenized to fnd: we rely on wordforms alone, and they are insensitive to letter case and special characters:

mysql> call keywords('floor. Decor', 't');
+------+-------------+------------+
| qpos | tokenized   | normalized |
+------+-------------+------------+
| 1    | floor decor | fnd        |
+------+-------------+------------+
1 row in set (0.00 sec)

The false match is not good. To solve this particular problem we can use Manticore's ability to detect sentences and paragraphs (the index_sp setting).

Once it is enabled, the document no longer matches the query:

mysql> drop table if exists t; create table t(f text) wordforms='/tmp/wordforms' index_sp='1';
mysql> insert into t values(0,'It\'s located on the 2nd floor. Decor is also nice');
mysql> select * from t where match('Floor & Decor Holdings');

Empty set (0.00 sec)

because:

  1. Floor & Decor, as we remember, is converted into fnd by wordforms
  2. index_sp='1' splits the text into sentences
  3. after splitting, floor. and Decor end up in different sentences
  4. so together they no longer produce fnd and no longer match any of the original forms that map to it
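
The four steps above can be imitated with a toy model in Python. It is only a sketch: the function normalize, its split_sentences flag and the rule that a sentence ends at a period, exclamation mark or question mark are assumptions made for illustration, and index_sp itself is implemented differently inside Manticore.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(text, split_sentences):
    # Toy rule set: only the two shortest rules from the wordforms file.
    rules = {("floor", "decor"): "fnd", ("floor", "and", "decor"): "fnd"}
    # Assumption: a sentence ends at '.', '!' or '?'.
    chunks = re.split(r"[.!?]", text) if split_sentences else [text]
    out = []
    for chunk in chunks:
        tokens, i = tokenize(chunk), 0
        while i < len(tokens):
            for n in (3, 2):
                if tuple(tokens[i:i + n]) in rules:
                    out.append(rules[tuple(tokens[i:i + n])])
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
    return out

text = "It's located on the 2nd floor. Decor is also nice"
print("fnd" in normalize(text, split_sentences=False))  # True: the false match
print("fnd" in normalize(text, split_sentences=True))   # False: rules stop at sentence ends
```

A genuine mention such as Floor & Decor stays inside one sentence, so it still normalizes to fnd in both modes.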

Conclusion

Manticore's exceptions and wordforms are powerful tools that help you fine-tune your search, in particular to improve recall and precision for short terms with special characters that should be retained and for longer terms that should be aliased to one another. But you need to help Manticore do it, since it can't decide for you what those names should be.

Thank you for reading this article!
