Introduction to full-text operators and basic search

Published: Sep 16, 2020

In this tutorial, we will explore the full-text search operators available in Manticore Search.

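All search operations in Manticore Search are based on the standard boolean operators (AND, OR, NOT), which can be combined in any order to include or exclude keywords and get more relevant results.

The default and simplest full-text operator is AND, which is assumed when you simply list a few words in a search.

AND is the default operator: the query fast slow returns documents that contain both ‘fast’ and ‘slow’. If one term is in the document and the other is not, the document is not included in the result list.
By default, words are searched for in all available full-text fields.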

SELECT * FROM testrt WHERE MATCH('fast slow');

OR matches any one of the terms (or both). The terms are separated by a vertical bar, e.g. fast | slow, which finds documents containing either fast or slow.

SELECT * FROM testrt WHERE MATCH('fast | slow');

The OR operator has a higher precedence than AND, so the query ‘find me fast|slow’ is interpreted as ‘find me (fast|slow)’:

SELECT * FROM testrt WHERE MATCH('find me fast | slow');

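NOT makes sure that the term marked with - or ! is absent from the results. Any document containing that term is excluded. For example, fast !slow finds documents that contain fast but not slow. Use it with care, because a query can become too specific and exclude good documents.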

SELECT * FROM testrt WHERE MATCH('find !slow');

SELECT * FROM testrt WHERE MATCH('find -slow');

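MAYBE is a special operator that works like OR, except that the left term must always be present in the result while the right term is optional. When both terms are matched, the document gets a higher search rank. For example, fast MAYBE slow finds documents that contain fast, and those that also contain slow get a higher score.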

SELECT * FROM testrt WHERE MATCH('find MAYBE slow');

Use case examples

Let's connect to Manticore using the mysql client:

# mysql -P9306 -h0

For a boolean search, the OR operator | can be used:

MySQL [(none)]> select * from testrt where match('find | me fast');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    6 |    1 | find me fast now       |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
+------+------+------------------------+----------------+
4 rows in set (0.00 sec)

The OR operator has a higher precedence than AND, so the query find me fast|slow is interpreted as find me (fast|slow):

MySQL [(none)]> SELECT * FROM testrt WHERE MATCH('find me fast|slow');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    6 |    1 | find me fast now       |  quick         |
|    3 |    1 | find me slow           |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
+------+------+------------------------+----------------+
5 rows in set (0.00 sec)

For negation, the NOT operator can be written as - or !:

MySQL [(none)]> select * from testrt where match('find me -fast');
+------+------+--------------+---------+
| id   | gid  | title        | content |
+------+------+--------------+---------+
|    3 |    1 | find me slow |  quick  |
+------+------+--------------+---------+
1 row in set (0.00 sec)

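Note that by default Manticore does not support queries that consist only of a negation, so it is not possible to run just -fast (this is supported starting from v3.5.2).

Another basic operator is MAYBE. The term defined by MAYBE may or may not be present in a document. If it is present, it influences the ranking, and documents containing it are ranked higher.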

MySQL [(none)]> select * from testrt where match('find me MAYBE slow');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    3 |    1 | find me slow           |  quick         |
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
|    6 |    1 | find me fast now       |  quick         |
+------+------+------------------------+----------------+
5 rows in set (0.00 sec)
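
To make the boolean semantics concrete, here is a small Python sketch that replays three of the queries above on the five sample rows. It is only a toy model built on sets of words, not how Manticore evaluates queries; the helper names (`words`, `select`, `maybe`) are made up for this illustration, and the tie-break by id in the MAYBE ranking is an arbitrary assumption.

```python
# Toy model of the boolean operators; this is NOT Manticore's implementation.
# Sample rows copied from the result tables above: id -> (title, content).
DOCS = {
    1: ("find me", "fast and quick"),
    2: ("find me fast", "quick"),
    3: ("find me slow", "quick"),
    5: ("find me quick and fast", "quick"),
    6: ("find me fast now", "quick"),
}


def words(doc_id):
    """Lowercased words of a document across both full-text fields."""
    title, content = DOCS[doc_id]
    return set(f"{title} {content}".lower().split())


def select(predicate):
    """Sorted ids of the documents whose word set satisfies the predicate."""
    return sorted(i for i in DOCS if predicate(words(i)))


# 'find | me fast' is read as (find OR me) AND fast
or_first = select(lambda w: ("find" in w or "me" in w) and "fast" in w)

# 'find me -fast' is read as find AND me AND NOT fast
negated = select(lambda w: {"find", "me"} <= w and "fast" not in w)


def maybe(required, optional):
    """'required MAYBE optional': filter on required, list optional hits first."""
    hits = [i for i in DOCS if required <= words(i)]
    # Ties are broken by id here; a real ranker would use relevance weights.
    return sorted(hits, key=lambda i: (optional not in words(i), i))


print(or_first)                       # [1, 2, 5, 6]
print(negated)                        # [3]
print(maybe({"find", "me"}, "slow"))  # [3, 1, 2, 5, 6]
```

The sketch treats both fields as one bag of words, which is why document 1 matches 'fast' even though the word only appears in its content field.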

Field operator

If we want to limit the search to a specific field, the operator ‘@’ can be used:

mysql> select * from testrt where match('@title find me fast');
+------+------+------------------------+---------+
| id   | gid  | title                  | content |
+------+------+------------------------+---------+
|    2 |    1 | find me fast           |  quick  |
|    6 |    1 | find me fast now       |  quick  |
|    5 |    1 | find me quick and fast |  quick  |
+------+------+------------------------+---------+
3 rows in set (0.00 sec)

We can also specify multiple fields to limit the search to:

mysql> select * from testrt where match('@(title,content) find me fast');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    6 |    1 | find me fast now       |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
+------+------+------------------------+----------------+
4 rows in set (0.00 sec)

The field operator can also be used to restrict the search to the first x words of a field only. For example:

mysql> select * from testrt where match('@title lazy dog');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

However, if we search in the first 5 words only, we get nothing:

mysql> select * from testrt where match('@title[5] lazy dog');
Empty set (0.00 sec)
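
The position limit is easy to picture with a short Python sketch. It assumes whitespace tokenization and uses the three sample titles from this article's result tables; `match_title` and `search` are hypothetical helpers, not part of any Manticore API.

```python
# Toy illustration of '@title[N] ...': only the first N words of the field count.
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}


def match_title(query, title, limit=None):
    """True if every query word occurs in the (optionally truncated) title."""
    tokens = title.lower().split()
    if limit is not None:
        tokens = tokens[:limit]  # keep only the first `limit` words
    return all(word in tokens for word in query.lower().split())


def search(query, limit=None):
    """Sorted ids of the titles matching the query."""
    return sorted(i for i, t in TITLES.items() if match_title(query, t, limit))


print(search("lazy dog"))     # [4, 7, 8]
print(search("lazy dog", 5))  # []
```

In all three titles 'lazy' and 'dog' are the last two words, so cutting the field to its first 5 words leaves nothing to match.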

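In some situations, a search can be performed over multiple indexes that may not have the same full-text fields.
By default, specifying a field that does not exist in an index results in a query error. To overcome this, the special operator @@relaxed can be used:

mysql> select * from testrt where match('@(title,keywords) lazy dog');
ERROR 1064 (42000): index testrt: query error: no field 'keywords' found in schema

mysql> select * from testrt where match('@@relaxed @(title,keywords) lazy dog');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |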
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set, 1 warning (0.01 sec)

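Fuzzy search

Fuzzy matching allows matching only some of the words from the query string, for example:

mysql> select * from testrt where match('"fox bird lazy dog"/3');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |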
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

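In this case we use the QUORUM operator and specify that matching only 3 of the words is enough. A search with /1 is equivalent to an OR boolean search, while a search with /N, where N is the number of input words, is equivalent to an AND search.

Instead of an absolute number, you can also specify a number between 0.0 and 1.0 (standing for 0% and 100%), and Manticore will match only documents that contain at least the specified percentage of the given words. The same example can also be written as "fox bird lazy dog"/0.3, which matches documents containing at least 30% of the 4 words.

mysql> select * from testrt where match('"fox bird lazy dog"/0.3');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |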
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)
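
The quorum rule can be sketched in a few lines of Python. This is a simplified model over the three sample titles, not Manticore's code: `quorum_threshold` and `quorum` are hypothetical names, and rounding the fractional threshold up is an assumption of this sketch (the engine may round differently).

```python
import math

# Toy model of the quorum operator "w1 w2 ..."/N over the sample titles.
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}


def quorum_threshold(arg, n_words):
    """Absolute count for an int argument, share of the words for a float."""
    if isinstance(arg, float):
        # ASSUMPTION: round up; the real rounding rule is not stated in the article.
        return max(1, math.ceil(arg * n_words))
    return arg


def quorum(query, arg):
    """Sorted ids of titles containing at least the required number of query words."""
    q = query.lower().split()
    need = quorum_threshold(arg, len(q))
    return sorted(
        i for i, title in TITLES.items()
        if sum(word in title.lower().split() for word in q) >= need
    )


print(quorum("fox bird lazy dog", 3))    # [4, 7, 8]
print(quorum("fox bird lazy dog", 0.3))  # [4, 7, 8]
print(quorum("fox bird lazy dog", 1))    # [4, 7, 8]  behaves like OR
print(quorum("fox bird lazy dog", 4))    # []         behaves like AND, no title has 'bird'
```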

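Advanced operators

Apart from the simple operators, there are many advanced operators that are used less often but can be absolutely necessary in some cases.

One of the most used advanced operators is the phrase operator.
The phrase operator matches only if the given words are found in exactly the specified order. It also restricts the words to be found in the same field:

mysql> SELECT * FROM testrt WHERE MATCH('"quick brown fox"');
+------+------+-------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                             | content                               |
+------+------+-------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                       |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+-------------------------------------------------------------------+---------------------------------------+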
2 rows in set (0.00 sec)

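A more relaxed version of the phrase operator is the strict order operator.
The order operator requires the words to be found in exactly the specified order, but accepts other words in between:

mysql> SELECT * FROM testrt WHERE MATCH('find << me << fast');
+------+------+------------------------+---------+
| id   | gid  | title                  | content |
+------+------+------------------------+---------+
|    2 |    1 | find me fast           |  quick  |
|    6 |    1 | find me fast now       |  quick  |
|    5 |    1 | find me quick and fast |  quick  |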
+------+------+------------------------+---------+
3 rows in set (0.00 sec)

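Another pair of operators that work with word positions are the start/end field operators.
They restrict a word to be present at the start or at the end of a field.

mysql> SELECT * FROM testrt WHERE MATCH('^find me fast$');
+------+------+------------------------+---------+
| id   | gid  | title                  | content |
+------+------+------------------------+---------+
|    2 |    1 | find me fast           |  quick  |
|    5 |    1 | find me quick and fast |  quick  |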
+------+------+------------------------+---------+
2 rows in set (0.00 sec)
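
The three positional operators above differ only in how strictly they look at word positions. The following Python sketch models them on a single field with whitespace tokenization; the function names are invented for this illustration and the code is a simplification, not Manticore's matching logic.

```python
# Toy models of the phrase, strict order (<<) and start/end (^ $) operators.
def tokens(text):
    return text.lower().split()


def phrase(query, text):
    """'"a b c"': the words must be adjacent and in the given order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


def strict_order(query, text):
    """'a << b << c': the words must appear in this order, gaps are allowed."""
    remaining = iter(tokens(text))
    # `word in remaining` consumes the iterator up to the first hit,
    # so each next word is searched only after the previous match.
    return all(word in remaining for word in tokens(query))


def start_end(first, last, text, others=()):
    """'^first ... last$': first word opens the field, last word closes it."""
    t = tokens(text)
    return bool(t) and t[0] == first and t[-1] == last and all(w in t for w in others)


print(phrase("quick brown fox", "The quick brown fox jumps over the lazy dog"))  # True
print(strict_order("find me fast", "find me quick and fast"))                    # True
print(strict_order("find me fast", "fast find me"))                              # False
print(start_end("find", "fast", "find me fast now", others=("me",)))             # False
```

The last call mirrors why the 'find me fast now' row is missing from the '^find me fast$' result: 'fast' is not the final word of the field.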

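The proximity operator is similar to the AND operator, but adds a maximum distance between the words within which they are still considered a match. Let's take this example that uses only the AND operator:

mysql> SELECT * FROM testrt WHERE MATCH('brown fox jumps');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |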
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

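Our query returns 3 results: in the first one all the words are close to each other, while in the others some of the words are farther apart.
If we want to match only when the words are within a certain distance, we can restrict that with the proximity operator:

mysql> SELECT * FROM testrt WHERE MATCH('"brown fox jumps"~5');
+------+------+---------------------------------------------+---------------------------------------+
| id   | gid  | title                                       | content                               |
+------+------+---------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog |  The five boxing wizards jump quickly |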
+------+------+---------------------------------------------+---------------------------------------+
1 row in set (0.00 sec)
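
Why does only document 4 survive? A rough Python model helps. It assumes the rule, as we understand it from the Manticore documentation, that "w1 ... wk"~N needs all k words inside a span of fewer than N + k words; treat that rule and the helper name `proximity` as assumptions of this sketch rather than a statement about the engine's internals.

```python
from itertools import product

# Toy model of the proximity operator "brown fox jumps"~N on the sample titles.
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}


def proximity(query, n, text):
    """True if all query words fit into a span shorter than n + len(query) words."""
    q, t = query.lower().split(), text.lower().split()
    # Positions of every query word in the text.
    spots = [[i for i, tok in enumerate(t) if tok == word] for word in q]
    if not all(spots):
        return False  # at least one word is missing entirely
    limit = n + len(q)  # ASSUMED rule: span must be shorter than N + word count
    return any(max(combo) - min(combo) + 1 < limit for combo in product(*spots))


matches = sorted(i for i, t in TITLES.items() if proximity("brown fox jumps", 5, t))
print(matches)  # [4]
```

In document 7 the words 'brown' to 'jumps' cover exactly 8 words, which is not shorter than 5 + 3, so under this rule it drops out together with document 8.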

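A more general version of the proximity operator is the NEAR operator. With proximity, a single distance is specified for a whole set of words, while the NEAR operator works with 2 operands, each of which can be a single word or an expression.

In the following example, ‘brown’ and ‘fox’ must be within a distance of 2, and ‘fox’ and ‘jumps’ within a distance of 6:

mysql> SELECT * FROM testrt WHERE MATCH('brown NEAR/2 fox NEAR/6 jumps');
+------+------+-------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                             | content                               |
+------+------+-------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                       |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |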
+------+------+-------------------------------------------------------------------+---------------------------------------+
2 rows in set (0.00 sec)

The query leaves out one document that doesn’t match the first NEAR condition. It is the last row in the result below, where the first distance is relaxed to 3:

mysql> SELECT * FROM testrt WHERE MATCH('brown NEAR/3 fox NEAR/6 jumps');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.09 sec)
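
The difference between the two NEAR queries comes down to word positions, which a short Python sketch can show. It measures distance as the difference between word positions and checks the two NEAR conditions pairwise; that is a simplification of how a chained NEAR expression is evaluated, and `near` and `brown_fox_jumps` are hypothetical helper names.

```python
# Toy model of 'brown NEAR/d1 fox NEAR/d2 jumps' on the sample titles.
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}


def near(a, b, n, toks):
    """True if some occurrence of a is at most n positions away from some b."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)


def brown_fox_jumps(d1, d2):
    """Sorted ids where brown/fox are within d1 and fox/jumps within d2."""
    result = []
    for doc_id, title in TITLES.items():
        toks = title.lower().split()
        if near("brown", "fox", d1, toks) and near("fox", "jumps", d2, toks):
            result.append(doc_id)
    return sorted(result)


print(brown_fox_jumps(2, 6))  # [4, 7]
print(brown_fox_jumps(3, 6))  # [4, 7, 8]
```

In document 8 'brown' and 'fox' are 3 positions apart ('brown and beautiful fox'), so it only matches once the first distance is raised from 2 to 3.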

A variation of the NEAR operator is NOTNEAR, which matches only if the operands have a minimum distance between them.

mysql> SELECT * FROM testrt WHERE MATCH('"brown fox" NOTNEAR/5 jumps');
+------+------+-------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                             | content                               |
+------+------+-------------------------------------------------------------------+---------------------------------------+
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+-------------------------------------------------------------------+---------------------------------------+
1 row in set (0.00 sec)

Manticore can also detect sentences in plain text and paragraphs in HTML content.
To index sentences, the index_sp option needs to be enabled, while paragraphs also require html_strip = 1. Let’s take the following example:

mysql> select * from testrt where match('"the brown fox" jumps')\G
*************************** 1. row ***************************
id: 15
gid: 2
title: The brown fox takes a step back. Then it jumps over the lazydog
content:
1 row in set (0.00 sec)

The document contains 2 sentences: the phrase is found in the first one, while ‘jumps’ appears only in the second.

With the SENTENCE operator we can restrict the search to match only if the operands are in the same sentence:

mysql> select * from testrt where match('"the brown fox" SENTENCE jumps')\G
Empty set (0.00 sec)

We can see that the document is not a match any more. If we correct the search query so all the words are from the same sentence we’ll see a match:

mysql> select * from testrt where match('"the brown fox" SENTENCE back')\G
*************************** 1. row ***************************
id: 15
gid: 2
title: The brown fox takes a step back. Then it jumps over the lazydog
content:
1 row in set (0.00 sec)
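
The two SENTENCE queries can be mimicked with a few lines of Python. The sketch splits the text on '.', '?' and '!', which is a much cruder boundary detection than what index_sp does, and the helper names are made up for this illustration.

```python
import re

# Toy model of '"phrase" SENTENCE word' on the sample document with id 15.
TITLE = "The brown fox takes a step back. Then it jumps over the lazydog"


def sentences(text):
    """Split text into lowercase word lists, one per sentence (naive splitting)."""
    return [s.lower().split() for s in re.split(r"[.?!]", text) if s.strip()]


def has_phrase(phrase_words, toks):
    """True if phrase_words occur adjacently and in order inside toks."""
    n = len(phrase_words)
    return any(toks[i:i + n] == phrase_words for i in range(len(toks) - n + 1))


def sentence_match(phrase, word, text):
    """True if the phrase and the word are found in one and the same sentence."""
    p = phrase.lower().split()
    return any(has_phrase(p, s) and word.lower() in s for s in sentences(text))


print(sentence_match("the brown fox", "jumps", TITLE))  # False
print(sentence_match("the brown fox", "back", TITLE))   # True
```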

To demonstrate the PARAGRAPH operator, let’s use the following search:

mysql> select * from testrt where match('Samsung  Galaxy');
+------+------+-------------------------------------------------------------------------------------+---------+
| id   | gid  | title                                                                               | content |
+------+------+-------------------------------------------------------------------------------------+---------+
|    9 |    2 | <h1>Samsung Galaxy S10</h1>Is a smartphone introduced by Samsung in 2019            |         |
|   10 |    2 | <h1>Samsung</h1>Galaxy,Note,A,J                                                     |         |
+------+------+-------------------------------------------------------------------------------------+---------+
2 rows in set (0.00 sec)

In these 2 documents the search terms are placed differently relative to the HTML tags.

If we add the PARAGRAPH operator, only the document with the search terms found in a single tag will remain.

The more general operators are ZONE and its variant ZONESPAN. A “zone” is the text inside an HTML or XML tag.

The tags to be considered for the zones need to be declared in the index_zones setting, like index_zones = h*, th, title.

For example:

mysql> select * from testrt where match('hello world');
+------+------+-------------------------------+---------+
| id   | gid  | title                         | content |
+------+------+-------------------------------+---------+
|   12 |    2 | Hello world                   |         |
|   14 |    2 | <h1>Hello world</h1>          |         |
|   13 |    2 | <h1>Hello</h1> <h1>world</h1> |         |
+------+------+-------------------------------+---------+
3 rows in set (0.00 sec)

We have 3 documents, where ‘hello’ and ‘world’ are found in plain text, in different zones of the same type or in a single zone.

mysql> select * from testrt where match('ZONE:h1 hello world');
+------+------+-------------------------------+---------+
| id   | gid  | title                         | content |
+------+------+-------------------------------+---------+
|   14 |    2 | <h1>Hello world</h1>          |         |
|   13 |    2 | <h1>Hello</h1> <h1>world</h1> |         |
+------+------+-------------------------------+---------+
2 rows in set (0.00 sec)

In this case, the words are present in H1 zones, but they are not required to be in the same zone. If we want to limit the match to a single zone, we can use ZONESPAN:

mysql> select * from testrt where match('ZONESPAN:h1 hello world');
+------+------+----------------------+---------+
| id   | gid  | title                | content |
+------+------+----------------------+---------+
|   14 |    2 | <h1>Hello world</h1> |         |
+------+------+----------------------+---------+
1 row in set (0.00 sec)
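
The difference between ZONE and ZONESPAN can be reproduced with a small Python sketch over the three sample titles. It extracts zones with a regular expression, which only works for simple, non-nested tags like the ones here; `zones`, `zone` and `zonespan` are hypothetical helpers, not Manticore functions.

```python
import re

# Toy model of ZONE:h1 and ZONESPAN:h1 on the three sample documents.
DOCS = {
    12: "Hello world",
    14: "<h1>Hello world</h1>",
    13: "<h1>Hello</h1> <h1>world</h1>",
}


def zones(html, tag):
    """Word lists of every <tag>...</tag> zone (simple, non-nested tags only)."""
    pattern = rf"<{tag}>(.*?)</{tag}>"
    return [z.lower().split() for z in re.findall(pattern, html, flags=re.S | re.I)]


def zone(query, tag):
    """ZONE: every word must be inside some zone of this type."""
    q = query.lower().split()
    return sorted(
        i for i, html in DOCS.items()
        if all(any(word in z for z in zones(html, tag)) for word in q)
    )


def zonespan(query, tag):
    """ZONESPAN: all words must be inside one single zone."""
    q = query.lower().split()
    return sorted(
        i for i, html in DOCS.items()
        if any(all(word in z for word in q) for z in zones(html, tag))
    )


print(zone("hello world", "h1"))      # [13, 14]
print(zonespan("hello world", "h1"))  # [14]
```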

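We hope this article has shown you how the full-text search operators in Manticore work. If you would like to learn them better by practice, you can try our interactive course right now in your browser.

Interactive course

You can learn more about full-text matching by trying our "Intro to full-text operators" interactive course, which provides a command line for easier learning.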