Introduction to Full-Text Operators and Basic Search

In this tutorial, we will explore the full-text search operators available in Manticore Search.

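All search operations in Manticore Search are based on the standard boolean operators (AND, OR, NOT), which can be combined and freely arranged to include or exclude keywords in a search and get more relevant results.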

The default and simplest full-text operator is AND, which is assumed when you simply list several words in a search.

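AND is the default operator: the query **fast slow** returns documents that contain both terms, 'fast' and 'slow'. If one term is in a document and the other is not, the document is not included in the result list.
By default, words are searched in all available full-text fields.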

SELECT * FROM testrt WHERE MATCH('fast slow');

OR matches any of the terms (or both). The terms are separated by a vertical bar, for example **fast | slow**. This finds documents that contain either fast or slow.

SELECT * FROM testrt WHERE MATCH('fast | slow');

The OR operator has higher precedence than AND, so the query **'find me fast|slow'** is interpreted as 'find me (fast|slow)':

SELECT * FROM testrt WHERE MATCH('find me fast | slow');

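NOT makes sure that a term marked with - or ! is not in the results. Any document that contains such a term is excluded. For example, **fast !slow** finds documents that contain fast, as long as they do not contain slow. Use it with care, so that the search does not become too specific and exclude good documents.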

SELECT * FROM testrt WHERE MATCH('find !slow');

SELECT * FROM testrt WHERE MATCH('find -slow');

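MAYBE is a special operator that works like OR, but requires the left term to always be in the result, while the right term is optional. When both are matched, the document gets a higher search rank. For example, **fast MAYBE slow** finds documents that contain fast, whether or not they also contain slow, and documents that contain both terms score higher.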

SELECT * FROM testrt WHERE MATCH('find MAYBE slow');

Usage examples

Let's connect to Manticore using the mysql client:

# mysql -P9306 -h0

For boolean searches, the OR operator | can be used:

MySQL [(none)]> select * from testrt where match('find | me fast');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    6 |    1 | find me fast now       |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
+------+------+------------------------+----------------+
4 rows in set (0.00 sec)

The OR operator has higher precedence than AND, so the query find me fast|slow is interpreted as find me (fast|slow):

MySQL [(none)]> SELECT * FROM testrt WHERE MATCH('find me fast|slow');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    6 |    1 | find me fast now       |  quick         |
|    3 |    1 | find me slow           |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
+------+------+------------------------+----------------+
5 rows in set (0.00 sec)
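
To make the precedence rule concrete, here is a small Python toy model. It is only a sketch, not Manticore's parser: the `DOCS` dictionary (title and content joined into one string) and the `parse` and `matches` helpers are made up for this illustration, and it handles nothing beyond plain words, implicit AND and `|`.

```python
import re

# Rows copied from the result tables in this post (id -> title + content).
DOCS = {
    1: "find me fast and quick",
    2: "find me fast quick",
    3: "find me slow quick",
    5: "find me quick and fast quick",
    6: "find me fast now quick",
}

def parse(query):
    """Return a list of OR-groups; the groups are ANDed together."""
    # Glue "a | b" into "a|b" so that every whitespace-separated token
    # is one OR-group. This is what "OR binds tighter than AND" means here.
    compact = re.sub(r"\s*\|\s*", "|", query.strip().lower())
    return [token.split("|") for token in compact.split()]

def matches(query, docs=DOCS):
    """Ids of documents that satisfy every OR-group of the query."""
    groups = parse(query)
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(any(term in words for term in group) for group in groups):
            hits.append(doc_id)
    return sorted(hits)

print(parse("find me fast|slow"))    # [['find'], ['me'], ['fast', 'slow']]
print(matches("find | me fast"))     # (find OR me) AND fast
print(matches("find me fast|slow"))  # find AND me AND (fast OR slow)
```

On this toy data the two calls return the same sets of ids as the two result tables above, although the order of rows in Manticore's output comes from its ranking, which the sketch does not model.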

For negation, the NOT operator can be written as - or !:

MySQL [(none)]> select * from testrt where match('find me -fast');
+------+------+--------------+---------+
| id   | gid  | title        | content |
+------+------+--------------+---------+
|    3 |    1 | find me slow |  quick  |
+------+------+--------------+---------+
1 row in set (0.00 sec)

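Note that by default Manticore does not support fully negated queries, so it is not possible to run just -fast (this will become possible starting with v3.5.2).

Another basic operator is MAYBE. The term introduced by MAYBE may or may not be present in a document. If it is present, it affects ranking, and documents that contain it rank higher.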

MySQL [(none)]> select * from testrt where match('find me MAYBE slow');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
|    3 |    1 | find me slow           |  quick         |
|    1 |    1 | find me                |  fast and quick|
|    2 |    1 | find me fast           |  quick         |
|    5 |    1 | find me quick and fast |  quick         |
|    6 |    1 | find me fast now       |  quick         |
+------+------+------------------------+----------------+
5 rows in set (0.00 sec)
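
The effect of MAYBE on ranking can be sketched with a toy scorer in Python. The weights are invented for illustration and have nothing to do with Manticore's real ranker; `DOCS` and `maybe` are hypothetical names used only in this example.

```python
# Rows copied from the result tables in this post (id -> title + content).
DOCS = {
    1: "find me fast and quick",
    2: "find me fast quick",
    3: "find me slow quick",
    5: "find me quick and fast quick",
    6: "find me fast now quick",
}

def maybe(required, optional, docs=DOCS):
    """Toy MAYBE: `required` terms must match, `optional` only raises the score."""
    scored = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if not all(term in words for term in required):
            continue  # the left-hand side is mandatory
        # Made-up weights: 1 for matching, +1 if the optional term is present.
        score = 1 + (1 if optional in words else 0)
        scored.append((score, doc_id))
    # Highest score first; ties are ordered by id only to keep the output stable.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

ranked = maybe(["find", "me"], "slow")
print(ranked)  # document 3, the only one containing "slow", comes first
```

All five documents still match, because "slow" is optional, but the one that contains it moves to the top, which is the behaviour the result table above shows.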

Field operators

If we want to limit the search to a specific field only, we can use the '@' operator:

mysql> select * from testrt where match('@title find me fast');
+------+------+------------------------+---------+
| id   | gid  | title                  | content |
+------+------+------------------------+---------+
|    2 |    1 | find me fast           |  quick  |
|    6 |    1 | find me fast now       |  quick  |
|    5 |    1 | find me quick and fast |  quick  |
+------+------+------------------------+---------+
3 rows in set (0.00 sec)

We can also specify several fields to limit the search to:

mysql> select * from testrt where match('@(title,content) find me fast');
+------+------+------------------------+----------------+
| id   | gid  | title                  | content        |
+------+------+------------------------+----------------+
| 1    |    1 | find me                | fast and quick |
| 2    |    1 | find me fast           | quick          |
| 6    |    1 | find me fast now       | quick          |
| 5    |    1 | find me quick and fast | quick          |
+------+------+------------------------+----------------+
4 rows in set (0.00 sec)

The field operator can also be used to restrict the search to the first x words of a field. For example:

mysql> select * from testrt where match('@title lazy dog');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

However, if we search only in the first 5 words, we get no results:

mysql> select * from testrt where match('@title[5] lazy dog');
Empty set (0.00 sec)
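
Why the limited query comes back empty is easy to see with a toy Python model of the position limit. This is a sketch under simplifying assumptions (whitespace tokenization, title field only); `TITLES` and `match_in_first_words` are hypothetical names, not part of any Manticore API.

```python
# Titles copied from the result table above (id -> title).
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}

def match_in_first_words(terms, limit=None, titles=TITLES):
    """Ids whose title contains all `terms` within its first `limit` words."""
    hits = []
    for doc_id, title in titles.items():
        words = title.lower().split()
        # With no limit the whole field is searched, like a plain @title query.
        window = set(words if limit is None else words[:limit])
        if all(term in window for term in terms):
            hits.append(doc_id)
    return hits

print(match_in_first_words(["lazy", "dog"]))           # all three titles
print(match_in_first_words(["lazy", "dog"], limit=5))  # none: both words come later
print(match_in_first_words(["lazy", "dog"], limit=9))  # only the 9-word title
```

In every title, 'lazy' and 'dog' are the last two words, so a window of 5 words never reaches them.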

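In some cases, a search may run over several indexes that do not share the same full-text fields.
By default, specifying a field that does not exist in the index results in a query error. To work around this, the special @@relaxed operator can be used:

mysql> select * from testrt where match('@(title,keywords) lazy dog');
ERROR 1064 (42000): index testrt: query error: no field 'keywords' found in schema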
mysql> select * from testrt where match('@@relaxed @(title,keywords) lazy dog');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set, 1 warning (0.01 sec)

Fuzzy search

Fuzzy matching allows matching only some of the words in the query string, for example:

mysql> select * from testrt where match('"fox bird lazy dog"/3');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

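In this case we use the QUORUM operator and specify that matching only 3 of the words is enough. A search with /1 is equivalent to an OR boolean search, while a search with /N, where N is the number of input words, is equivalent to an AND search.

You can also specify a number between 0.0 and 1.0 (standing for 0% and 100%), and Manticore will match only documents that contain at least the specified percentage of the given words. The example above can also be written with a fraction, for instance "fox bird lazy dog"/0.3, which matches documents that contain at least 30% of the 4 words. On this data set it returns the same three documents: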

mysql> select * from testrt where match('"fox bird lazy dog"/0.3');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)
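
The counting form of the quorum operator can be modelled in a few lines of Python. This is a toy sketch: `TITLES` and `quorum` are hypothetical names, only the title field is considered, and the fractional form is left out on purpose because the way Manticore turns a percentage into a word count is not described in this post.

```python
# Titles copied from the result tables above (id -> title).
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}

def quorum(terms, threshold, titles=TITLES):
    """Ids whose title contains at least `threshold` of the given terms."""
    hits = []
    for doc_id, title in titles.items():
        words = set(title.lower().split())
        found = sum(1 for term in terms if term in words)
        if found >= threshold:
            hits.append(doc_id)
    return hits

terms = ["fox", "bird", "lazy", "dog"]
print(quorum(terms, 3))  # 'bird' is missing everywhere, the other three are enough
print(quorum(terms, 1))  # behaves like OR
print(quorum(terms, 4))  # behaves like AND, so nothing matches without 'bird'
```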

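Advanced operators

Besides the simple operators, there are many advanced operators that are used less often but can be absolutely necessary in some cases.

One of the most used advanced operators is the phrase operator.
The phrase operator matches only if the given words are found in exactly the order specified. It also requires the words to be found in the same field: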

mysql> SELECT * FROM testrt WHERE MATCH('"quick brown fox"');
+------+------+-------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                             | content                               |
+------+------+-------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                       |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+-------------------------------------------------------------------+---------------------------------------+
2 rows in set (0.00 sec)

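A more relaxed version of the phrase operator is the strict order operator.
The order operator requires the words to be found in exactly the order specified, but accepts other words between them: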

mysql> SELECT * FROM testrt WHERE MATCH('find << me << fast');
+------+------+------------------------+---------+
| id   | gid  | title                  | content |
+------+------+------------------------+---------+
|    2 |    1 | find me fast           |  quick  |
|    6 |    1 | find me fast now       |  quick  |
|    5 |    1 | find me quick and fast |  quick  |
+------+------+------------------------+---------+
3 rows in set (0.00 sec)
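
A toy Python check shows the difference between "in order" and "adjacent". It is only a sketch: `DOCS`, `in_order` and `strict_order` are hypothetical names, and testing each field on its own is an assumption chosen because it reproduces the result above.

```python
# Rows copied from the result tables in this post: id -> (title, content).
DOCS = {
    1: ("find me", "fast and quick"),
    2: ("find me fast", "quick"),
    3: ("find me slow", "quick"),
    5: ("find me quick and fast", "quick"),
    6: ("find me fast now", "quick"),
}

def in_order(terms, words):
    """True if `terms` occur in `words` in this order, with gaps allowed."""
    remaining = iter(words)
    # `term in remaining` advances the iterator past the match, so the next
    # term can only be found further to the right.
    return all(term in remaining for term in terms)

def strict_order(terms, docs=DOCS):
    """Ids where some single field contains the terms in the given order."""
    return [
        doc_id
        for doc_id, fields in docs.items()
        if any(in_order(terms, field.lower().split()) for field in fields)
    ]

print(strict_order(["find", "me", "fast"]))  # "find me quick and fast" still matches
print(strict_order(["fast", "me", "find"]))  # reversed order matches nothing
```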

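Another pair of operators that deal with word position are the start and end field operators.
They restrict a word to appear at the start or at the end of a field.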

mysql> SELECT * FROM testrt WHERE MATCH('^find me fast$');
+------+------+------------------------+---------+
| id   | gid  | title                  | content |
+------+------+------------------------+---------+
|    2 |    1 | find me fast           |  quick  |
|    5 |    1 | find me quick and fast |  quick  |
+------+------+------------------------+---------+
2 rows in set (0.00 sec)

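The proximity operator is similar to the AND operator, but adds a maximum distance between the words within which they are still considered a match. Let's first take an example that uses only the AND operator: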

mysql> SELECT * FROM testrt WHERE MATCH('brown fox jumps');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.00 sec)

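Our query returns 3 results: one in which all the words are next to each other, and two in which some of the words are farther apart.
If we want to match only when the words are within a certain distance of each other, we can restrict that with the proximity operator: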

mysql> SELECT * FROM testrt WHERE MATCH('"brown fox jumps"~5');
+------+------+---------------------------------------------+---------------------------------------+
| id   | gid  | title                                       | content                               |
+------+------+---------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+---------------------------------------------+---------------------------------------+
1 row in set (0.00 sec)
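
One way to picture the proximity check is as a window test, sketched below in Python. The rule used here, "fewer than N other words inside the smallest window that contains all keywords", is an assumed reading of the operator that happens to agree with the result above, not a statement of how Manticore implements it. `TITLES` and `proximity` are hypothetical names.

```python
from itertools import product

# Titles copied from the result tables above (id -> title).
TITLES = {
    4: "The quick brown fox jumps over the lazy dog",
    7: "The quick brown fox take a step back and  jumps over the lazy dog",
    8: "The  brown and beautiful fox take a step back and  jumps over the lazy dog",
}

def proximity(terms, n, titles=TITLES):
    """Ids whose smallest window holding all `terms` has fewer than `n` other words."""
    hits = []
    for doc_id, title in titles.items():
        words = title.lower().split()
        positions = [[i for i, w in enumerate(words) if w == t] for t in terms]
        if not all(positions):
            continue  # at least one keyword is missing from this title
        # Try every combination of occurrences and keep the tightest window.
        window = min(max(combo) - min(combo) + 1 for combo in product(*positions))
        if window - len(terms) < n:
            hits.append(doc_id)
    return hits

terms = ["brown", "fox", "jumps"]
print(proximity(terms, 5))  # only the title where the three words are adjacent
print(proximity(terms, 6))  # one more title fits once 5 extra words are allowed
```

In document 7 the window from 'brown' to 'jumps' holds 5 other words, and in document 8 it holds 7, so under this reading neither fits a distance of 5.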

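A more general form of the proximity operator is the NEAR operator. With proximity, a single distance is specified for a whole set of words, while the NEAR operator works with two operands, each of which can be a single word or an expression.

In the following example, 'brown' and 'fox' must be within a distance of 2, and 'fox' and 'jumps' within a distance of 6: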

mysql> SELECT * FROM testrt WHERE MATCH('brown NEAR/2 fox NEAR/6 jumps');
+------+------+-------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                             | content                               |
+------+------+-------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                       |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+-------------------------------------------------------------------+---------------------------------------+
2 rows in set (0.00 sec)

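The query leaves out one document because it does not satisfy the first NEAR condition. With the first distance relaxed to 3, that document is returned as well (the last one here):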

mysql> SELECT * FROM testrt WHERE MATCH('brown NEAR/3 fox NEAR/6 jumps');
+------+------+----------------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                                      | content                               |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
|    4 |    1 | The quick brown fox jumps over the lazy dog                                |  The five boxing wizards jump quickly |
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog          |  The five boxing wizards jump quickly |
|    8 |    1 | The  brown and beautiful fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+----------------------------------------------------------------------------+---------------------------------------+
3 rows in set (0.09 sec)

A variation of the NEAR operator is NOTNEAR, which matches only if the operands are separated by a minimum distance.

mysql> SELECT * FROM testrt WHERE MATCH('"brown fox" NOTNEAR/5 jumps');
+------+------+-------------------------------------------------------------------+---------------------------------------+
| id   | gid  | title                                                             | content                               |
+------+------+-------------------------------------------------------------------+---------------------------------------+
|    7 |    1 | The quick brown fox take a step back and  jumps over the lazy dog |  The five boxing wizards jump quickly |
+------+------+-------------------------------------------------------------------+---------------------------------------+
1 row in set (0.00 sec)

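Manticore can also detect sentences in plain text and paragraphs in HTML content.
To index sentences, the index_sp option needs to be enabled, while paragraphs additionally require html_strip = 1. Let's take the following example:

mysql> select * from testrt where match('"the brown fox" jumps')\G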
*************************** 1. row ***************************
id: 15
gid: 2
title: The brown fox takes a step back. Then it jumps over the lazydog
content:
1 row in set (0.00 sec)

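The document contains 2 sentences: the phrase is found only in the first one, while 'jumps' appears only in the second.

With the SENTENCE operator, we can restrict the search to match only if the operands are in the same sentence:

mysql> select * from testrt where match('"the brown fox" SENTENCE jumps')\G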
Empty set (0.00 sec)

We can see that the document is no longer a match. If we correct the search query so that all the words come from the same sentence, we get a match:

mysql> select * from testrt where match('"the brown fox" SENTENCE back')\G
*************************** 1. row ***************************
id: 15
gid: 2
title: The brown fox takes a step back. Then it jumps over the lazydog
content:
1 row in set (0.00 sec)

To demonstrate PARAGRAPH, let's use the following search:

mysql> select * from testrt where match('Samsung  Galaxy');
+------+------+-------------------------------------------------------------------------------------+---------+
| id   | gid  | title                                                                               | content |
+------+------+-------------------------------------------------------------------------------------+---------+
|    9 |    2 | <h1>Samsung Galaxy S10</h1>Is a smartphone introduced by Samsung in 2019            |         |
|   10 |    2 | <h1>Samsung</h1>Galaxy,Note,A,J                                                     |         |
+------+------+-------------------------------------------------------------------------------------+---------+
2 rows in set (0.00 sec)

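These 2 documents have different HTML tags.

If we add PARAGRAPH, only the documents in which the search terms are found within a single tag remain.

A more general operator is ZONE, along with its variant ZONESPAN. A "zone" is the text inside an HTML or XML tag.

The tags to be treated as zones need to be declared in the index_zones setting, for example index_zones = h*, th, title.

For example: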

mysql> select * from testrt where match('hello world');
+------+------+-------------------------------+---------+
| id   | gid  | title                         | content |
+------+------+-------------------------------+---------+
|   12 |    2 | Hello world                   |         |
|   14 |    2 | <h1>Hello world</h1>          |         |
|   13 |    2 | <h1>Hello</h1> <h1>world</h1> |         |
+------+------+-------------------------------+---------+
3 rows in set (0.00 sec)

We have 3 documents, in which 'hello' and 'world' are found in plain text, in different zones of the same type, or in a single zone.

mysql> select * from testrt where match('ZONE:h1 hello world');
+------+------+-------------------------------+---------+
| id   | gid  | title                         | content |
+------+------+-------------------------------+---------+
|   14 |    2 | <h1>Hello world</h1>          |         |
|   13 |    2 | <h1>Hello</h1> <h1>world</h1> |         |
+------+------+-------------------------------+---------+
2 rows in set (0.00 sec)

In this case the words appear in H1 zones, but they are not required to be in the same zone. If we want to limit the match to a single zone, we can use ZONESPAN:

mysql> select * from testrt where match('ZONESPAN:h1 hello world');
+------+------+----------------------+---------+
| id   | gid  | title                | content |
+------+------+----------------------+---------+
|   14 |    2 | <h1>Hello world</h1> |         |
+------+------+----------------------+---------+
1 row in set (0.00 sec)

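We hope this article has shown you how the full-text search operators work in Manticore. If you would like more hands-on practice, you can try our interactive course right in your browser.

Interactive course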

<img src="Manticore-Full-text-operators-Interactive-course-optimized.webp" alt="img">

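You can learn more about full-text matching by trying our "Introduction to full-text operators" interactive course, which provides a command line to make learning easier.

Install Manticore Search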