关于我

第二步:REPLACE, UPDATE, 词形
在我写我的 第一篇文章 时,宠物店的朋友们联系了我。他们想使用数据库来管理他们的商店,由于他们的销售人员仍然对牵引绳和项圈感到困惑,我想实现数据库的附加功能。
这篇文章是对 上一篇文章 的延续,并基于它。
与大多数数据库不同,Manticore使用一种先进的模型来识别其存储的文本。文本处理系统基于NLP(自然语言处理)解决方案。
在这里简单描述一下这项技术以便于理解是有益的。NLP旨在识别我们交流时使用的“自然”语言。乍一看,文本识别似乎并不复杂,技术上随着使用机器学习算法的文本处理解决方案的出现而变得如此。在我们的案例中,我们不会深入探讨,而是使用Manticore Search中已经内置的现成文本处理解决方案。该系统使用分词(将文本分离)为小的独立部分:按句子和单个单词,这使得能够快速在数据库中找到所需的单词、短语和段落。(有关数据分词的更多信息,请查看此 链接。 )
以下是Manticore所使用的一些技术:
- 词干提取是将单词简化为其根形式的过程。例如,“walking”、“walks”和“walked”都是“walk”的词干。
- 词形还原是将单词的不同形式转换回其基本形式(称为词元)的过程。例如,单词“eat”可以出现为“eating”、“eats”和“ate”。所有这些变体的词元是“eat”,这是它的基本字典形式。
- 为了提高高级搜索的准确性和质量,还涉及其他几种解决方案:词形、例外和停用词。
词形还原器和词干提取器执行一个共同的功能,即将单词标准化为相同的形式,但方式不同,各有其优缺点。
此外,包含所选语言常用词列表的停用词文件,如冠词、连词和感叹词,有助于加快处理速度。基本上,所有那些让我们的语言听起来优美但对计算机来说并没有太大意义的小词。
如果标准功能集不足以满足搜索引擎的舒适操作,例如,当数据库中有专业术语或地方俚语,并且单词之间存在具有附加语义意义的逻辑关系时,可以使用附加词形文件。数据库管理员可以在文件中添加单词之间的链接,这些单词在定义规则上有所不同,但在上下文中与该数据库相似。例如,牵引绳和马具。在不同的上下文中,这两个词可能具有相同的含义或完全不同。
在表中使用词形文件
新表
宠物店的工作人员提到,他们的销售人员对牵引绳和马具并不太了解。他们建议在询问牵引绳时,也应考虑询问马具,反之亦然。
由于牵引绳和马具属于同一产品类别,并且数据库中没有特定字段来指示它们的组,因此使用词形文件可能会带来好处。该文件可以通过允许在搜索“牵引绳”时添加相关单词来提供帮助。例如,通过将“马具”或“弹性”这样的词添加到词形字典中,搜索“牵引绳”时也会得到“马具”和“弹性”的结果。
让我们看看这家宠物店提供的产品:
| 标题 | 描述 | 价格 | 可用性 |
|---|---|---|---|
| 帆布牵引绳,绿色,承重50公斤,5米 | 适合大型犬种的坚固帆布狗牵引绳,长度为5米 | 5.00€ | 是 |
| 弹性牵引绳,粉色,承重10公斤,3米 | 适合优雅女士及其四条腿伴侣的美丽牵引绳 | 12.00€ | 否 |
| 粉色背带,最多10公斤 | 用于放出到街上的房间守卫,误称为狗 | 8.00€ | 是 |
| flexi可伸缩狗绳,10公斤,5米 | 适用于最多10公斤的狗。长度为5米,考虑到主人伸出的手臂的长度 | 7.50€ | 是 |
| 狗粮,1公斤 | 适合您宠物的干粮 | 4.30€ | 否 |
| 猫粮,1公斤 | 如果您的猫大声叫喊并要求食物! | 2.80€ | 是 |
| 猫用跳蚤项圈 | 猫不应该是跳蚤的携带者。 | 23.20€ | 是 |
| 适用于最多10公斤狗的跳蚤滴剂 | 从您防卫者的皮肤上去除不请自来的乘客 | 14.30€ | 是 |
我们将把 Title 字段设置为字符串,使用 Description 进行全文搜索,将 Price 设置为浮点数,并将 Availability 设置为布尔值。
为了在“绳索”、“背带”和“flexi”之间建立逻辑词连接,我们将它们放在 /tmp/wordforms/ 的文件中,并确保它对系统上的所有用户可共享。
mkdir /tmp/wordforms
cd /tmp/wordforms
mcedit wf_pet_products.txt
MC中的内置编辑器称为 mcedit。要执行它,请输入 mcedit <文件名>。
让我们把我们的词形放在那里:
flexi > leash
harness > leash
如果您做对了所有事情,您将在终端中看到类似这样的内容:
F2 保存
Enter 确认
F10 退出
现在我们有一个包含单词的文件,稍后将与新表一起使用。请记住或记下它保存的位置:
/tmp/wordforms/wf_pet_products.txt
现在,为了对数据库进行更多更改,我们必须连接到它并设置一个包含我们列的表。然后,我们需要链接一个词干提取器和我们之前制作的词形文件:
mysql -h0 -P9306
CREATE TABLE products (name STRING, info TEXT, price FLOAT, avl BOOL) morphology = 'stem_en' wordforms='/tmp/wordforms/wf_pet_products.txt';
现在让我们检查一下我们的表中创建了哪些字段:
DESC products;

我们指定的所有字段都存在,但出现了一个额外的字段 - id。该字段用于Manticore唯一标识数据库中的文档,因此在初始化表时会自动创建,无论规范如何。info字段具有 indexed stored 属性,表示它参与全文搜索过程。需要注意的是,字段的顺序与创建表时指定的顺序不同。因此,在填充表时,必须考虑这一点,特别是在不指定命令中的字段顺序时更新整个行。例如,在我们稍后将讨论的 REPLACE 命令中。
接下来,我们应该验证表的一般参数,例如词形文件和之前连接的词干提取器。当创建表时,如果词形文件名有误,系统将忽略它,并且不会产生任何错误或警告。
SHOW TABLE products SETTINGS;

在上面,您可能会注意到,尽管我们指定了 /tmp/wordforms/ 中的文件路径,但Manticore将其保存在 /var/lib/manticore/products/ 中。这表明Manticore已将文件的副本附加到创建的表中。该文件现在已固定用于索引此表。按设计,动态替换词形文件不可用,因为这将需要重新索引整个表,如果您有一个非常大的表,这可能并不理想。我们稍后将讨论如何安全且可预测地替换这样的文件。
这完成了表的设置,下一步是输入所需的数据。数据输入与任何SQL数据库类似。
INSERT INTO products (name, price, avl) VALUES ('Canvas leash, green, up to 50 kg, 5m', 5.0, 1);
INSERT INTO products (name, price, avl, info) VALUES ('Elastic leash, pink, up to 10 kg, 3m', 12.00, 0, 'A beautiful leash for sophisticated ladies and their four-legged companions');
INSERT INTO products (name, price, avl, info) VALUES ('Pink harness, up to 10 kg', 8.00, 1, 'For room guards released into the street and mistakenly called a dog');
INSERT INTO products (name, price, avl, info) VALUES ('The flexi retractable dog leash, 10 kg, 5m', 7.50, 1, 'A flexi for dogs up to 10 kg. The length is 5 meters, taking into account the length of the owner\'s pulled out arm');
INSERT INTO products (name, price, avl, info) VALUES ('Dog food, 1kg', 4.30, 0, 'Dry food for your pet');
INSERT INTO products (name, price, avl, info) VALUES ('Cat food, 1kg', 2.80, 1, 'If your cat is yelling loudly and demanding food!');
INSERT INTO products (name, price, avl, info) VALUES ('Flea collar for cats , 1kg', 23.20, 1, 'Cats shouldn\'t be flea carriers.');
INSERT INTO products (name, price, avl, info) VALUES ('Flea drops for dogs up to 10 kg', 14.30, 1, 'Drops from uninvited passengers on the skin of your defender');
小心撇号;文本中有使用它们的缩写,重要的是用 \ 将它们与其余文本隔离:'Isn\'t it'。Manticore 不支持双引号 用于字符串,这将有助于避免转义单撇号。
在上面的第一个请求中,故意省略了 info 字段,以演示如何更新全文字段。需要注意的是,更新文本字段和属性字段的处理方式不同。全文字段使用 REPLACE 命令进行更新,这会触发新值的重新索引,而其他字段只需使用 UPDATE 命令即可。这是因为属性字段不参与全文搜索索引过程。
现在,让我们使用 REPLACE 命令向现有记录的某些字段添加数据。要使用此命令,您需要我们想要更改或添加信息的行的唯一ID。首先,我们将使用 SELECT * FROM products 获取必要的数据;
SELECT * FROM products WHERE name = 'Canvas leash, green, up to 50 kg, 5m';

确保记住字段的位置。我们稍后需要这些信息。此外,我们需要字段中的所有当前数据,因为替换命令将使用这些数据更新整个文件。如果我们不指定所有数据,则未指定的字段将被重置。根据字段位置填写 REPLACE 命令。
REPLACE INTO products VALUES (<id>, <info>, <avl>, <price>, <name>);
在开发版本中,您阅读时可能已经发布,可以替换特定字段。有关更多详细信息,请参阅 文档 。
REPLACE INTO products VALUES (8217224102746783745, 'Sturdy canvas dog leash for 5 meters, suitable for large dog breeds', 1, 5.0, 'Canvas leash, green, up to 50 kg, 5m');
小心这里的代码,如果您只是从文章中复制命令,您表中的ID字段值将不同!如果Manticore未找到具有指定ID的记录,将创建一条新记录。
To update the "attributes" fields, you can use the UPDATE command. Just as a side note, while entering the data, the price dropped slightly on the green leash:
update products set price = 4.6 where id = 8217224102746783745;
Let's check the result:
SELECT * FROM products WHERE id = 8217224102746783745;

We have achieved the desired result, now let's try the search, because we are creating a database of products for the store, to make it easier for the seller to find the product. The command used to search in the table is select * from <table> where match('<query>').
SELECT * FROM products WHERE match('harness');
SELECT * FROM products WHERE match ('leash');

Great, now the database is providing answers to our queries, using the connection we created through the word form file. But it seems like something is missing in the output? Where is the entry for a pink leash for indoor guards?
In the info field, there are no words like leash or collar, they only appear in the name field, so that entry didn't make it into the output. Let's fix that:
SELECT * FROM products WHERE name = 'Pink harness, up to 10 kg';

REPLACE INTO products VALUES (8217224102746783747,'Harness for room guards released into the street and mistakenly called a dog', 1, 8.0, 'Pink harness, up to 10 kg');
Let's check what happened:
SELECT * FROM products WHERE match('harness');

Now this record is included in the output. As you can see from the example, only the information related to indexed fields is involved in the search; the rest of the fields are attributes for indexed fields.
Extending the table
At the pet store they brought in new stuff, now they have aquarium equipment too. To make sure we can easily find words related to aquarium stuff, like pump or drive, we need to add entries to our word list.
| Title | Description | Price | Availability |
|---|---|---|---|
| The pump for the aquarium. | A pump with a built-in filter for the aquarium. Capacity 150 l/h | 32.00€ | yes |
| Automatic filter for aquarium | Disposable drive with filter, capacity 100 l/h. | 28.00€ | yes |
| Fish net | Atraumatic aquarium fish net | 3.00€ | yes |
Let's add them to the database:
INSERT INTO products (name, info, price, avl) VALUES ('The pump for the aquarium.', 'Pump with built-in aquarium filter. Capacity 150 l/h', 32, 1), ('Automatic filter for aquarium', 'Disposable drive with filter, capacity 100 l/h', 28.00, 1), ('Fish net', 'Atraumatic aquarium fish net', 3, 1);
Here we did the filling with one command with a comma-separated enumeration of new rows, so you can add a large batch of documents with one command.
Let's check the search:
SELECT * FROM products WHERE match ('pump');
SELECT * FROM products WHERE match ('filter');

Okay, that means something. However, the drive is not considered a pump. This should be added to the list of words we use…
Let's do it.
Oh, there's a little problem here. It's not that easy...
When we create a search system for text, we place all the words into a table and convert them into tokens to enhance the efficiency of the search. These tokens are not updated subsequently to expedite the search process. Nevertheless, there are cases when we need to update the wordforms file and consequently modify the tokens as well. Let's update the list of words in our wordforms file and also include the product name in the name field to the search index.
To update the wordforms file in the table, I followed these steps:
- Create a dump of this table using
mysqldump. - Update the wordforms file.
- Drop the old table.
- Create a new table with the updated word forms in the morphology section.
- Populate the new table from the dump file.
Creating dump file (backup):
In any unclear situation, make a backup of the table to prevent data loss. (This rule can be displayed on the wall as a reminder).
This also helps us reach our goal.
To do this job, we need to disconnect from the SQL connection and use the mysqldump tool.
exit;
mkdir /tmp/manticore_dumps
cd /tmp/manticore_dumps
mysqldump -h0 -P9306 -tc --compact manticore products > products.sql
The flags I used were:
-t- to exclude thecreate tablestatement from the dump.-c- to specify column names in theINSERTcommands.--compact- to perform a database dump in a compact format, which omits procedures for checking the availability of tables, deleting them, creating new ones, prescribing configurations for created tables, and other organizational actions for deployment. In our case, these procedures were not necessary, as we needed to register a new dictionary file in the table.manticore- is the database name formysqldump, and must always bemanticorewhen usingmysqldumpwith Manticore.products- is the name of our table for which the dump is being created.products.sql- is the name of the file where the dump will be uploaded. The directory for placing the file defaults to the directory from whichmysqldumpwas launched. I recommend specifying/tmp/to ensure the examples work successfully further. Alternatively, you can type the full path:/tmp/manticore_dumps/products.sql.
Don't worry if you see: "-- Warning: column statistics not supported by the server." It's normal when you use this mysql tool with Manticore.
Great, we have our first dump!
Updating wordforms file:
Have you forgotten where the source file for the word forms is located?
cd /tmp/wordforms
If you are using MC and for some reason do not see columns with files, try using the Ctrl + O combination. If nothing changes, perhaps MC is not running.

Next, use the arrow keys to navigate. When you need to select a file, press F4 to open the selected file.
Add new string: "drive > pump"
Save & close. F2, Enter, F10.
File updated.
Deleting the table and creating a new one:
Before deleting the table, please make sure that you have the dump and that it definitely contains our data.
cd /tmp/manticore_dumps/
Choose the dump file and press the F3 button to view it.
You can do the same in the console using the command cat /tmp/manticore_dumps/products.sql.
您应该看到类似这样的内容:
好的,让我们继续删除旧表并创建一个新表:
mysql -h0 -P9306
DROP TABLE products;
CREATE TABLE products (name TEXT, info TEXT, price FLOAT, avl BOOL) morphology = 'stem_en' wordforms='/tmp/wordforms/wf_pet_products.txt';
所以,我们删除了这个表,然后立即重新创建了它(使用与之前相同的命令,但有一个小变化:现在名称字段也用于文本),现在我们需要向其中添加信息。
exit;
从转储中重新填充:
mysql -h0 -P9306 < /tmp/manticore_dumps/products.sql
mysql -h0 -P9306
我们填好了。让我们检查一下最后的结果:
SELECT * FROM products WHERE match ('pump');
太好了,一切正常!
在更新表时,特别是大型表,有一个时期旧表已被删除而新表尚未创建。在此期间,系统可能会对请求作出错误响应。为了确保Manticore的顺利操作,系统中有几种机制可以帮助避免丢失用户请求。我稍后会学习如何实现这一点。
在这篇文章中,我展示了如何为宠物商店的库存设置和使用Manticore Search。通过使用词形和形态学,Manticore帮助改善搜索结果,通过链接相关的产品名称和类型。我介绍了如何添加新项目,更新现有项目,并确保在重大更新(如更改词形文件)期间数据的一致性。这有助于初学者有效理解和应用Manticore Search的功能,使数据搜索更加高效。未来的帖子将探索更多功能,因此让我们一起继续尝试Manticore,以增强您的项目。
今天就到这里。迈克,签名离开。

