Advanced Full-Text Matching with Manticore Search’s REGEX Operator

Introduction

Advanced full-text matching is key to accurate, relevant search results. It matters most in areas such as patent analysis, contract review, clause identification, and trademark search, where precision is essential. Manticore Search’s move from a simple REGEX() function to a full-text REGEX operator, available since version 6.3.0, is a significant development here: it allows intricate, nuanced pattern matching directly in queries.

The new REGEX operator is especially beneficial for complex search scenarios. For instance, in data analytics platforms, it can search for specific patterns in log files or code repositories, identifying unique error codes or programming constructs. In academic research databases, it enables researchers to locate publications with specific citation styles or bibliographic reference patterns. Furthermore, in trademark searches, the REGEX operator is invaluable for finding exact or similar trademarks, given the varied and complex nature of trademark texts.

Using REGEX in Queries

To use the REGEX operator, your table needs to be configured with either min_infix_len or min_prefix_len. This setup might already be familiar if you use substring search. The new REGEX operator is akin to an advanced wildcard operator. For example, REGEX(/t.?e/) matches any term that starts with ’t’, is optionally followed by one arbitrary character, and ends with ’e’ (the ? makes the preceding . optional, so both "the" and "te" qualify). This shows how a short pattern can cover several spellings of a term at once.
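
To make the meaning of /t.?e/ concrete, here is a minimal Python sketch. It is not Manticore code: it uses Python's re module as a stand-in for RE2 (for a pattern this simple the two agree), it assumes the pattern has to match a whole term (modelled with re.fullmatch), and the term list is hypothetical.

```python
import re

# Assumption: a REGEX pattern must match the entire indexed term,
# which is modelled here with re.fullmatch. Python's `re` is not RE2,
# but `t.?e` means the same thing in both syntaxes.
PATTERN = re.compile(r"t.?e")

# Hypothetical dictionary terms, chosen only for illustration.
terms = ["te", "the", "tie", "toe", "tree", "time", "ate"]

# `.?` allows zero or one arbitrary character between 't' and 'e',
# so only two- and three-letter terms can match.
matched = [t for t in terms if PATTERN.fullmatch(t)]
print(matched)  # ['te', 'the', 'tie', 'toe']
```

"tree" and "time" are rejected because they have two characters between the first ’t’ and the final ’e’, and "ate" is rejected because it does not start with ’t’.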

Example

Let’s assume we have this table:

create table brands(name text) min_infix_len='2' charset_table='non_cjk, -';

select * from brands;
--------------

+---------------------+------------------------+
| id                  | name                   |
+---------------------+------------------------+
| 1515699435999330620 | SeaCrest Flower        |
| 1515699435999330621 | C-Crest Flour          |
| 1515699435999330622 | CCrest Flower          |
| 1515699435999330623 | Flower SeaCrest        |
| 1515699435999330624 | RightWrite Stationery  |
| 1515699435999330625 | WriteRight Stationery  |
| 1515699435999330626 | SoleSoul Footwear      |
| 1515699435999330627 | SoulSole Footwear      |
| 1515699435999330628 | PeakBeak Aviaries      |
| 1515699435999330629 | BeakPeak Aviaries      |
| 1515699435999330630 | GrateGreat Kitchenware |
| 1515699435999330631 | GreatGrate Kitchenware |
| 1515699435999330632 | Sunnyside Cyder        |
| 1515699435999330633 | SunnyCide Cyder        |
| 1515699435999330634 | ThymeTime Cooking      |
| 1515699435999330635 | TimeThyme Cooking      |
| 1515699435999330636 | KnightNight Security   |
| 1515699435999330637 | NightKnight Security   |
| 1515699435999330638 | PearPair Electronics   |
| 1515699435999330639 | PairPear Electronics   |
+---------------------+------------------------+
20 rows in set (0.00 sec)

Previously, finding the brands SeaCrest Flower, C-Crest Flour, and CCrest Flower, which all sound similar, required spelling out every variant in the query:

select * from brands where match('"SeaCrest flower"|"SeaCrest flour"|"CCrest flower"|"CCrest flour"|"C-Crest flower"|"C-Crest flour"');
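
The exhaustive query is the cross product of every spelling of the first word with every spelling of the second, so it grows multiplicatively as variants are added. The sketch below rebuilds it in Python to make that visible. The variable names are illustrative only, and the script just assembles the query string without contacting a server.

```python
from itertools import product

# Spelling variants taken from the exhaustive query above.
first_words = ["SeaCrest", "CCrest", "C-Crest"]
second_words = ["flower", "flour"]

# Every first-word variant has to be paired with every second-word variant.
phrases = [f'"{a} {b}"' for a, b in product(first_words, second_words)]
match_expr = "|".join(phrases)
query = f"select * from brands where match('{match_expr}');"

print(len(phrases))  # 6
print(query)
```

Three variants of one word and two of the other already need six phrases. A fourth spelling of the first word would raise that to eight.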

With the new REGEX operator, we can simplify this query to:

select * from brands where match('"REGEX(/(c|sea).*crest/) REGEX(/flo(we|u)r/)"');
--------------

+---------------------+-----------------+
| id                  | name            |
+---------------------+-----------------+
| 1515699435999330620 | SeaCrest Flower |
| 1515699435999330621 | C-Crest Flour   |
| 1515699435999330622 | CCrest Flower   |
+---------------------+-----------------+
3 rows in set (0.00 sec)
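
To see why exactly these three rows come back, the following Python sketch imitates the phrase query outside Manticore. It rests on several assumptions: names are lowercased and split on whitespace, the hyphen stays inside a term (as charset_table='non_cjk, -' suggests), each REGEX has to match a whole term (modelled with re.fullmatch), and Python's re module stands in for RE2. It does not show how Manticore evaluates the query internally.

```python
import re

# Assumption: each REGEX must match an entire term; re.fullmatch models that.
FIRST = re.compile(r"(c|sea).*crest")
SECOND = re.compile(r"flo(we|u)r")

names = [
    "SeaCrest Flower", "C-Crest Flour", "CCrest Flower", "Flower SeaCrest",
    "RightWrite Stationery", "WriteRight Stationery",
    "SoleSoul Footwear", "SoulSole Footwear",
    "PeakBeak Aviaries", "BeakPeak Aviaries",
    "GrateGreat Kitchenware", "GreatGrate Kitchenware",
    "Sunnyside Cyder", "SunnyCide Cyder",
    "ThymeTime Cooking", "TimeThyme Cooking",
    "KnightNight Security", "NightKnight Security",
    "PearPair Electronics", "PairPear Electronics",
]

def phrase_matches(name: str) -> bool:
    """True if two adjacent terms match FIRST and SECOND, in that order."""
    # Assumed tokenization: lowercase, split on whitespace, '-' kept in the term.
    terms = name.lower().split()
    return any(
        bool(FIRST.fullmatch(a) and SECOND.fullmatch(b))
        for a, b in zip(terms, terms[1:])
    )

hits = [n for n in names if phrase_matches(n)]
print(hits)  # ['SeaCrest Flower', 'C-Crest Flour', 'CCrest Flower']
```

"Flower SeaCrest" is left out because the quoted phrase requires the crest term to come first and the flower or flour term to follow it directly.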

RE2 Syntax and Performance

The REGEX operator in Manticore Search follows the RE2 syntax. RE2, developed by Google, is known for its performance and safety: it avoids common pitfalls of regex processing, such as catastrophic backtracking, which makes it efficient and secure. Even so, a REGEX term can slow a query down, especially when it has to scan a large dictionary, so users may need to trade search expressiveness against query performance. In performance-critical situations, simplifying REGEX patterns and adding further search filters can help reduce search times.

Conclusion

The introduction of the REGEX operator in Manticore Search marks a notable advancement in full-text search capabilities. It elevates the precision and relevance of search results and opens new avenues for complex query handling across various domains, including data analytics, academic research, and trademark searches. By leveraging the REGEX operator, users can access a higher level of search efficiency and accuracy, making Manticore Search an even more powerful and versatile tool in the full-text search engine landscape.

Install Manticore Search