
How to Make xt850 Match xt 850

TL;DR

Since version 23.0.0, Manticore can make searches like xt850 match xt 850 using bigram_delimiter together with digit-aware bigram_index modes.

This solves a common tokenization mismatch in product search, where users remove spaces from model names but the source data stores them as separate tokens.

Assumptions and verification

This article assumes:

  • RT tables created with the SQL examples exactly as shown
  • default tokenization unless the example explicitly changes a setting
  • ASCII digits in model names, because second_numeric and second_has_digit are digit-aware modes built around 0-9

All SQL examples and expected outputs in this article were verified against a real Manticore 23.0.0 instance before publishing, using fresh tables created from scratch for each scenario.

The broader search problem

Imagine a catalog containing:

  • xt 850 action camera
  • iphone 5se battery case
  • canon eos 80d body
  • thinkpad x1 carbon

Now imagine users searching for:

  • xt850
  • iphone5se
  • eos80d
  • thinkpadx1

From the user's point of view, these should obviously match. From the engine's point of view, they often do not, because the indexed text is tokenized as separate terms.

Search systems usually attack that mismatch in one of four ways:

  • index prefixes or infixes
  • add custom normalization rules
  • duplicate content into alternate normalized fields
  • index adjacent token pairs and optionally store glued variants too

Manticore's newer bigram functionality is a structured way to do the fourth option without awkward field duplication.

Baseline: why xt850 fails by default

Here is the problem in its simplest form:

DROP TABLE IF EXISTS bi_default_demo;

CREATE TABLE bi_default_demo(title text);

INSERT INTO bi_default_demo VALUES
  (1,'xt 850 action camera');

SELECT id, title FROM bi_default_demo WHERE MATCH('xt850');

Expected result:

Empty set

Why does this fail?

Because the document is indexed as two separate tokens, xt and 850, while the query is a single token, xt850.

By default, Manticore does not assume that:

  • xt850 should be split into xt + 850
  • or xt + 850 should also be searchable as xt850

So this is not really a typo-tolerance problem or a phrase problem. It is a tokenization mismatch: the index sees two tokens, while the query provides one.

That is the gap the newer bigram settings are designed to close. They let Manticore index selected adjacent token pairs in a form that can also match glued queries.
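
To see the mismatch rather than infer it, you can ask Manticore how it tokenizes each string. The sketch below assumes the bi_default_demo table from the example above still exists. It uses CALL KEYWORDS, which returns one row per token for the given text and table.

```sql
-- Tokenize the document text the way the table would index it.
CALL KEYWORDS('xt 850 action camera', 'bi_default_demo');

-- Tokenize the glued query the same way.
CALL KEYWORDS('xt850', 'bi_default_demo');
```

With default tokenization, the first call should list xt and 850 as separate tokens, while the second should return the single token xt850.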

Why bigrams help here

bigram_index can help with both phrase acceleration and model-name matching. This article focuses on the second: the xt 850 vs xt850 problem.

The key idea is simple:

  • detect adjacent token pairs that look like model names
  • store those pairs in a glued form too
  • let queries such as xt850, iphone5se, or thinkpadx1 hit the spaced text

That is where bigram_delimiter matters.

A note about bigram_delimiter

bigram_index decides which adjacent pairs are eligible.

bigram_delimiter decides how eligible bigrams are stored:

  • true: internal delimited token only
  • none: glued token only, such as galaxy24
  • both: both forms

The practical difference is easiest to understand from the query side:

  • with true, Manticore keeps the internal bigram form used for phrase optimization, but it does not keep the glued user-facing form, so a query like xt850 will not match xt 850
  • with none, Manticore keeps only the glued form, so xt850 can match xt 850, but you are leaning entirely on the glued representation for those pairs
  • with both, Manticore keeps both the internal bigram representation and the glued form, so xt850 can match xt 850 without giving up ordinary phrase behavior

For this use case, both is usually the safer default because it covers the user-visible problem directly while keeping behavior less surprising for normal phrase queries and mixed workloads.
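
A quick way to compare the three values is to build three otherwise identical tables and run the same glued query against each. This is a sketch, not one of the verified examples. The table names are made up for illustration, it borrows the second_numeric mode described in the next section, and it assumes the true and none values are passed as quoted strings the same way both is.

```sql
-- Hypothetical table names; only bigram_delimiter differs between them.
DROP TABLE IF EXISTS bi_delim_true_demo;
DROP TABLE IF EXISTS bi_delim_none_demo;
DROP TABLE IF EXISTS bi_delim_both_demo;

CREATE TABLE bi_delim_true_demo(title text)
  bigram_index='second_numeric'
  bigram_delimiter='true';

CREATE TABLE bi_delim_none_demo(title text)
  bigram_index='second_numeric'
  bigram_delimiter='none';

CREATE TABLE bi_delim_both_demo(title text)
  bigram_index='second_numeric'
  bigram_delimiter='both';

-- Same single document in every table.
INSERT INTO bi_delim_true_demo VALUES (1,'xt 850 action camera');
INSERT INTO bi_delim_none_demo VALUES (1,'xt 850 action camera');
INSERT INTO bi_delim_both_demo VALUES (1,'xt 850 action camera');

-- Same glued query against each table.
SELECT id, title FROM bi_delim_true_demo WHERE MATCH('xt850');
SELECT id, title FROM bi_delim_none_demo WHERE MATCH('xt850');
SELECT id, title FROM bi_delim_both_demo WHERE MATCH('xt850');
```

Going by the descriptions above, the true table should return an empty set, while the none and both tables should return document 1.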

Mode 1: second_numeric

bigram_index = second_numeric
bigram_delimiter = both

This mode is aimed at model names where the second token is purely numeric.

That is common in product catalogs:

  • xt 850
  • galaxy 24
  • playstation 5
  • pixel 8

Users often search these as glued terms such as xt850, galaxy24, or playstation5, even though the source text stores them with a space.

second_numeric stores the pair only when the second token is ASCII digits only.

Use it when:

  • you have product generations and numbered models
  • users often remove spaces in search
  • the second token is usually just digits

Example

DROP TABLE IF EXISTS bi_second_numeric_demo;

CREATE TABLE bi_second_numeric_demo(title text)
  bigram_index='second_numeric'
  bigram_delimiter='both';

INSERT INTO bi_second_numeric_demo VALUES
  (1,'xt 850 action camera'),
  (2,'galaxy 24 ultra'),
  (3,'playstation 5 slim'),
  (4,'iphone 5se case'),
  (5,'canon eos 80d body'),
  (6,'thinkpad x1 carbon');

Then test the queries one by one:

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('xt850');

+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('galaxy24');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('playstation5');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    3 | playstation 5 slim |
+------+--------------------+

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('iphone5se');

Empty set

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('eos80d');

Empty set

SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('thinkpadx1');

Empty set

That boundary is the whole point of the mode:

  • 24 and 5 qualify
  • 5se, 80d, and x1 do not
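
According to the bigram_delimiter notes above, both adds the glued form without giving up ordinary phrase behavior, so the spaced spelling should still match. A minimal check, assuming the bi_second_numeric_demo table above is still populated:

```sql
-- Plain query with the space kept: both tokens must be present.
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('xt 850');

-- Phrase query: the tokens must be adjacent and in this order.
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('"xt 850"');
```

Both queries are expected to return document 1. They are not among the outputs listed above, so confirm them on your own instance.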

Mode 2: second_has_digit

bigram_index = second_has_digit
bigram_delimiter = both

This mode is the more flexible sibling of second_numeric.

It stores the pair when the second token contains at least one ASCII digit. That makes it a much better fit for real product catalogs, where model identifiers are often mixed alphanumeric strings:

  • xt 850
  • iphone 5se
  • eos 80d
  • thinkpad x1

Use it when:

  • your model names mix letters and digits
  • users frequently remove spaces in their searches
  • you want catalog-friendly matching without indexing every pair in the table

Example

DROP TABLE IF EXISTS bi_second_has_digit_demo;

CREATE TABLE bi_second_has_digit_demo(title text)
  bigram_index='second_has_digit'
  bigram_delimiter='both';

INSERT INTO bi_second_has_digit_demo VALUES
  (1,'xt 850 action camera'),
  (2,'galaxy 24 ultra'),
  (3,'playstation 5 slim'),
  (4,'iphone 5se case'),
  (5,'canon eos 80d body'),
  (6,'thinkpad x1 carbon'),
  (7,'kindle paperwhite signature');

Then test the queries one by one:

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('xt850');

+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('galaxy24');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('iphone5se');

+------+---------------------+
| id   | title               |
+------+---------------------+
|    4 | iphone 5se case     |
+------+---------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('eos80d');

+------+---------------------+
| id   | title               |
+------+---------------------+
|    5 | canon eos 80d body  |
+------+---------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('thinkpadx1');

+------+---------------------+
| id   | title               |
+------+---------------------+
|    6 | thinkpad x1 carbon  |
+------+---------------------+

SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('kindlesignature');

Empty set

This is often the better fit for mixed model identifiers, because real catalog data frequently includes forms like 5se, 80d, or x1 rather than only clean numeric suffixes like 24.

How to choose between the two

If your search problem is specifically "How do I make xt850 find xt 850?", the practical rule is:

  • use second_numeric when the second token is digits-only
  • use second_has_digit when the second token may be mixed, like 5se, 80d, or x1

One practical caveat concerns other common text-processing settings. In the straightforward case they are compatible: xt 850 still matches xt850 with morphology='stem_en' enabled and with a wordforms rule enabled.

But that does not mean those settings rewrite the glued query for you. In tests, iphones 5 matched iphones5, but not iphone5, even with stemming or a wordforms rule mapping iphones to iphone. So the short version is: basic xt 850 vs xt850 matching stays compatible with morphology and wordforms, but if you rely on them, test the exact query shape you care about.
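
A sketch of how such a check could look for the stemming case follows. The table name and sample rows are illustrative, not the exact setup used for the tests mentioned above. It assumes morphology can be combined with the bigram settings in one CREATE TABLE statement.

```sql
-- Hypothetical table combining stemming with the digit-aware bigram mode.
DROP TABLE IF EXISTS bi_morph_demo;

CREATE TABLE bi_morph_demo(title text)
  morphology='stem_en'
  bigram_index='second_numeric'
  bigram_delimiter='both';

INSERT INTO bi_morph_demo VALUES
  (1,'xt 850 action camera'),
  (2,'iphones 5 case');

-- Straightforward case: glued query against spaced text.
SELECT id, title FROM bi_morph_demo WHERE MATCH('xt850');

-- Glued query that keeps the plural exactly as it appears in the document.
SELECT id, title FROM bi_morph_demo WHERE MATCH('iphones5');

-- Glued query that relies on stemming to bridge iphones and iphone.
SELECT id, title FROM bi_morph_demo WHERE MATCH('iphone5');
```

Per the results described above, the first two queries should return a row and the third should not.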

Final takeaway

The xt850 problem is not really about one product name. It is about a broader mismatch between how users type model names and how search engines tokenize them.

Since version 23.0.0, Manticore gives you a built-in way to handle that mismatch with bigram_delimiter plus the digit-aware bigram_index modes, which is much cleaner than duplicating fields or inventing custom preprocessing pipelines.

If your main problem is phrase-search performance rather than glued model-name matching, see How to Speed Up Phrase Search with bigram_index.

Install Manticore Search