Regular expressions in Manticore Search

In this article we talk about regular expressions and how you can use that in Manticore Search

If we talk about text matching, the simplest one is just string comparison. Both the search string and the document or a field must be exactly the same. The next text search method is by performing a simple pattern match, which can allow to match the search string as a sub-part of the field. This simple pattern matching can make use of basic operators like wildcards, where a meta character like * or % means 'match any characters'. In databases this is usually accomplished by LIKE operator, for example:

SELECT * FROM mytable WHERE title LIKE '%and%'

will match rows with titles containing literal 'and' substring.

Regular expressions operate on plain texts, same as LIKE, but they come with a powerful set of meta characters-operators that allow to build complex match patterns.  A regex pattern works at character level  and can match letters, numeric digits and other characters, either by literal value (like /abc/ will match abc) or by class type, for example \w will match [A-Aa-z0-9_]. The pattern can also include boolean or, like abc|xyz, quantifications - ? marks zero or one occurrences of preceding characters or wildcards (metacharacter .).

Unlike regexps, full-text searches don't operate on plain texts, but they parse the texts and segment them into words. A fulltext search doesn't try to find a pattern on a string, but looks  in a collection of words.  While regular expressions can find and isolate words, they do have natural limitations as they can't perform easy binary searches like abc -xyz or deal with morphology functionality. Some of the features found in regular expressions can be also found in fulltext searches, like start/end operators. Some work different way. For example, wildcards in regexps can hold place for any number of characters, but in case of fulltext searches the wildcards can only expand within the bounds of a word or they can be full substitutes of a word.

How regexp are used in Manticore Search?

Regular expressions were available until now as an optional feature at indexing time by the regexp_filter. Regexps can be used to perform transformations of strings which otherwise are harder to implement by the regular tokenization options. For example 'ae' ligature can be replaced with 'a'  or you could replace a pattern like XX" to XX inch.

Since 2.7.5 regular expressions can be used  in SELECT queries using the REGEX() function. REGEX() accepts a string at input, that can come from a string attribute or a JSON string property and a regular expression that will be tested against the input. REGEX only works with attribute values, it can't be used for fulltext fields.

Why the need of a REGEX function when we can already have texts indexed in the fulltext fields?
First, we could have strings (either as attributes or in JSON data) on which we need to apply a search pattern that is not necessary related to full-text matching. For example the metadata can contain some codes and we want to narrow the search down to provide only records when these codes start with a digit like

SELECT * FROM myindex WHERE MATCH('something') AND REGEX(json_attr.code, "^\d\w+")

Another reason is that currently the text data from JSON attributes is not indexed as fulltext. If we need to perform on some string properties more than a comparison match, we would need to bring them in a fulltext field ( and make use of ZONEs to allow limit matching to specific labels) or in multiple fulltext fields (which can be a problem if these properties can vary in number).

For example, metadata of a product encapsulated in a JSON object can contain a property with list of compatible devices of that product. With REGEX we can create a pattern that can match a product that we want or even a series of that product:

SELECT * FROM myindex WHERE MATCH('red case') AND REGEX(json_attr.compatible_devices, "Galaxy S[7|8|9]")

In some cases REGEX() can be indispensable. Let us know if you find it useful!

Leave a Reply

Training

Personal and team training will maximize them performance. 

Custom development

Need cone custom or individual features?

Fill the form and don’t forget to make the description of what you need.

Free config review

There are often optimizations that can be made to a Sphinx / Manticore setup by changing some simple directives in the configuration or making quick changes to an index definition.

Some common mistakes and issues can include:

  • doing main+delta without kill-lists, even if the delta does include updated records found in the main
  • using wildcarding with very short prefix/infix which can hammer performance in some cases
  • disabled (unintentional) seamless rotates and getting stalls on index rotations
  • adding texts as string attributes even if they are not using for any kind of operation (filtering, grouping, sorting) or mandatory to be present in results
  • using deprecated settings 

Having a quick look on the configuration can show issues or potential issues, this is why we want to offer a gift to our growing community!

When uploading your configuration file, we recommend to remove any database credentials first.

We suggest also you give as many possible details about your setup: how big is the data you have, how typical queries look and what issues you experience.

Contact us