In this article we talk about regular expressions and how you can use that in Manticore Search
If we talk about text matching, the simplest one is just string comparison. Both the search string and the document or a field must be exactly the same. The next text search method is by performing a simple pattern match, which can allow to match the search string as a sub-part of the field. This simple pattern matching can make use of basic operators like wildcards, where a meta character like *
or %
means ‘match any characters’. In databases this is usually accomplished by LIKE operator, for example:
SELECT * FROM mytable WHERE title LIKE '%and%'
will match rows with titles containing literal ‘and’ substring.
Regular expressions operate on plain texts, same as LIKE, but they come with a powerful set of meta characters-operators that allow to build complex match patterns. A regex pattern works at character level and can match letters, numeric digits and other characters, either by literal value (like /abc/
will match abc
) or by class type, for example \w
will match [A-Aa-z0-9_]
. The pattern can also include boolean or
, like abc|xyz
, quantifications - ?
marks zero or one occurrences of preceding characters or wildcards (metacharacter .
).
Unlike regexps, full-text searches don’t operate on plain texts, but they parse the texts and segment them into words. A fulltext search doesn’t try to find a pattern on a string, but looks in a collection of words. While regular expressions can find and isolate words, they do have natural limitations as they can’t perform easy binary searches like abc -xyz
or deal with morphology functionality. Some of the features found in regular expressions can be also found in fulltext searches, like start/end operators. Some work different way. For example, wildcards in regexps can hold place for any number of characters, but in case of fulltext searches the wildcards can only expand within the bounds of a word or they can be full substitutes of a word.
How regexp are used in Manticore Search?
Regular expressions were available until now as an optional feature at indexing time by the regexp_filter. Regexps can be used to perform transformations of strings which otherwise are harder to implement by the regular tokenization options. For example ‘ae’ ligature can be replaced with ‘a’ or you could replace a pattern like XX"
to XX inch
.
Since 2.7.5 regular expressions can be used in SELECT queries using the REGEX() function. REGEX() accepts a string at input, that can come from a string attribute or a JSON string property and a regular expression that will be tested against the input. REGEX only works with attribute values, it can’t be used for fulltext fields.
Why the need of a REGEX function when we can already have texts indexed in the fulltext fields?
First, we could have strings (either as attributes or in JSON data) on which we need to apply a search pattern that is not necessary related to full-text matching. For example the metadata can contain some codes and we want to narrow the search down to provide only records when these codes start with a digit like
SELECT * FROM myindex WHERE MATCH('something') AND REGEX(json_attr.code, "^\d\w+")
Another reason is that currently the text data from JSON attributes is not indexed as fulltext. If we need to perform on some string properties more than a comparison match, we would need to bring them in a fulltext field ( and make use of ZONEs to allow limit matching to specific labels) or in multiple fulltext fields (which can be a problem if these properties can vary in number).
For example, metadata of a product encapsulated in a JSON object can contain a property with list of compatible devices of that product. With REGEX we can create a pattern that can match a product that we want or even a series of that product:
SELECT * FROM myindex WHERE MATCH('red case') AND REGEX(json_attr.compatible_devices, "Galaxy S[7|8|9]")
In some cases REGEX() can be indispensable. Let us know if you find it useful!