TL;DR
Since version 23.0.0, Manticore can make searches like xt850 match xt 850 using bigram_delimiter together with digit-aware bigram_index modes.
This solves a common tokenization mismatch in product search, where users remove spaces from model names but the source data stores them as separate tokens.
Assumptions and verification
This article assumes:
- RT tables created with SQL examples exactly as shown
- default tokenization unless the example explicitly changes a setting
- ASCII digits in model names, because second_numeric and second_has_digit are digit-aware modes built around 0-9
All SQL examples and expected outputs in this article were verified against a real Manticore 23.0.0 instance before publishing, using fresh tables created from scratch for each scenario.
The broader search problem
Imagine a catalog containing:
- xt 850 action camera
- iphone 5se battery case
- canon eos 80d body
- thinkpad x1 carbon
Now imagine users searching for:
- xt850
- iphone5se
- eos80d
- thinkpadx1
From the user's point of view, these should obviously match. From the engine's point of view, they often do not, because the indexed text is tokenized as separate terms.
Search systems usually attack that mismatch in one of four ways:
- index prefixes or infixes
- add custom normalization rules
- duplicate content into alternate normalized fields
- index adjacent token pairs and optionally store glued variants too
Manticore's newer bigram functionality is a structured way to do the fourth option without awkward field duplication.
Baseline: why xt850 fails by default
Here is the problem in its simplest form:
DROP TABLE IF EXISTS bi_default_demo;
CREATE TABLE bi_default_demo(title text);
INSERT INTO bi_default_demo VALUES
(1,'xt 850 action camera');
SELECT id, title FROM bi_default_demo WHERE MATCH('xt850');
Expected result:
Empty set
Why does this fail?
Because the document is indexed as two separate tokens, xt and 850, while the query is a single token, xt850.
By default, Manticore does not assume that:
- xt850 should be split into xt + 850
- or xt + 850 should also be searchable as xt850
So this is not really a typo-tolerance problem or a phrase problem. It is a tokenization mismatch: the index sees two tokens, while the query provides one.
That is the gap the newer bigram settings are designed to close. They let Manticore index selected adjacent token pairs in a form that can also match glued queries.
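The mismatch can be illustrated with a toy model. The sketch below is not Manticore's tokenizer: it assumes a simplified whitespace tokenizer (the hypothetical tokenize helper) purely to show why a one-token query cannot match a two-token document.

```python
def tokenize(text: str) -> list[str]:
    """Toy tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


doc_tokens = tokenize("xt 850 action camera")
query_tokens = tokenize("xt850")

print(doc_tokens)    # ['xt', '850', 'action', 'camera']
print(query_tokens)  # ['xt850']

# A plain keyword match needs every query token to exist in the document.
matches = all(token in doc_tokens for token in query_tokens)
print(matches)       # False
```

The document contains xt and 850, but nothing in it is spelled xt850, so the lookup comes back empty.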
Why bigrams help here
bigram_index can help with both phrase acceleration and model-name matching, and in this article we focus on the xt 850 vs xt850 problem.
The key idea is simple:
- detect adjacent token pairs that look like model names
- store those pairs in a glued form too
- let queries such as xt850, iphone5se, or thinkpadx1 hit the spaced text
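The three steps above can be sketched as a toy model. This is not Manticore's implementation: the whitespace tokenizer, the looks_like_model predicate, and the set-based index are all assumptions made for illustration.

```python
def tokenize(text: str) -> list[str]:
    # Toy tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def looks_like_model(second: str) -> bool:
    # Hypothetical eligibility rule: the second token contains an ASCII digit.
    return any(ch in "0123456789" for ch in second)


def index_terms(text: str) -> set[str]:
    tokens = tokenize(text)
    terms = set(tokens)  # ordinary single-token terms
    for first, second in zip(tokens, tokens[1:]):
        if looks_like_model(second):
            terms.add(first + second)  # glued variant, e.g. "xt850"
    return terms


catalog = {
    1: "xt 850 action camera",
    4: "iphone 5se battery case",
    6: "thinkpad x1 carbon",
}
index = {doc_id: index_terms(title) for doc_id, title in catalog.items()}


def search(query: str) -> list[int]:
    # A document matches when it contains every query token.
    wanted = tokenize(query)
    return sorted(
        doc_id for doc_id, terms in index.items()
        if all(token in terms for token in wanted)
    )


print(search("xt850"))         # [1]
print(search("iphone5se"))     # [4]
print(search("thinkpadx1"))    # [6]
print(search("actioncamera"))  # []
```

Only pairs that pass the predicate get a glued variant, which is why actioncamera stays unmatched in this model.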
That is where bigram_delimiter matters.
A note about bigram_delimiter
bigram_index decides which adjacent pairs are eligible.
bigram_delimiter decides how eligible bigrams are stored:
- true: internal delimited token only
- none: glued token only, such as galaxy24
- both: both forms
The practical difference is easiest to understand from the query side:
- with true, Manticore keeps the internal bigram form used for phrase optimization, but it does not keep the glued user-facing form, so a query like xt850 will not match xt 850
- with none, Manticore keeps only the glued form, so xt850 can match xt 850, but you are leaning entirely on the glued representation for those pairs
- with both, Manticore keeps both the internal bigram representation and the glued form, so xt850 can match xt 850 without giving up ordinary phrase behavior
For this use case, both is usually the safer default because it covers the user-visible problem directly while keeping behavior less surprising for normal phrase queries and mixed workloads.
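As a rough mental model of the three values, the sketch below lists which forms a single eligible pair would contribute to the index. The separator used for the internal form (INTERNAL_SEP) is a made-up placeholder; the real internal representation is an engine detail that this article does not specify.

```python
INTERNAL_SEP = "\x01"  # hypothetical placeholder, not Manticore's actual delimiter


def stored_forms(first: str, second: str, delimiter_mode: str) -> set[str]:
    """Return the bigram forms kept for one eligible pair in this toy model."""
    internal = first + INTERNAL_SEP + second  # form used for phrase optimization
    glued = first + second                    # what a user types, e.g. "xt850"
    if delimiter_mode == "true":
        return {internal}
    if delimiter_mode == "none":
        return {glued}
    if delimiter_mode == "both":
        return {internal, glued}
    raise ValueError(f"unknown bigram_delimiter value: {delimiter_mode!r}")


for mode in ("true", "none", "both"):
    forms = stored_forms("xt", "850", mode)
    print(mode, "xt850" in forms)
# true False
# none True
# both True
```

In this model the glued query xt850 can only hit when the glued form is among the stored forms, which mirrors the behavior described for none and both.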
Mode 1: second_numeric
bigram_index = second_numeric
bigram_delimiter = both
This mode is aimed at model names where the second token is purely numeric.
That is common in product catalogs:
- xt 850
- galaxy 24
- playstation 5
- pixel 8
The idea is simple: users often search these as glued terms such as xt850, galaxy24, or playstation5, even though the source text stores them with a space.
second_numeric stores the pair only when the second token is ASCII digits only.
Use it when:
- you have product generations and numbered models
- users often remove spaces in search
- the second token is usually just digits
Example
DROP TABLE IF EXISTS bi_second_numeric_demo;
CREATE TABLE bi_second_numeric_demo(title text)
bigram_index='second_numeric'
bigram_delimiter='both';
INSERT INTO bi_second_numeric_demo VALUES
(1,'xt 850 action camera'),
(2,'galaxy 24 ultra'),
(3,'playstation 5 slim'),
(4,'iphone 5se case'),
(5,'canon eos 80d body'),
(6,'thinkpad x1 carbon');
Then test the queries one by one:
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('xt850');
+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('galaxy24');
+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('playstation5');
+------+--------------------+
| id   | title              |
+------+--------------------+
|    3 | playstation 5 slim |
+------+--------------------+
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('iphone5se');
Empty set
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('eos80d');
Empty set
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('thinkpadx1');
Empty set
That boundary is the whole point of the mode:
- 24 and 5 qualify
- 5se, 80d, and x1 do not
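The boundary can be expressed as a one-line predicate. The function below is an illustrative approximation of the rule as described in this article (the second token consists of ASCII digits only), not code taken from Manticore.

```python
def second_is_numeric(second: str) -> bool:
    # str.isdigit() also accepts some non-ASCII digit characters,
    # so isascii() restricts the check to plain 0-9.
    return second.isascii() and second.isdigit()


for token in ["850", "24", "5", "5se", "80d", "x1"]:
    print(f"{token}: {second_is_numeric(token)}")
# 850: True
# 24: True
# 5: True
# 5se: False
# 80d: False
# x1: False
```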
Mode 2: second_has_digit
bigram_index = second_has_digit
bigram_delimiter = both
This mode is the more flexible sibling of second_numeric.
It stores the pair when the second token contains at least one ASCII digit. That makes it a much better fit for real product catalogs, where model identifiers are often mixed alphanumeric strings:
- xt 850
- iphone 5se
- eos 80d
- thinkpad x1
Use it when:
- your model names mix letters and digits
- users frequently remove spaces in their searches
- you want catalog-friendly matching without indexing every pair in the table
Example
DROP TABLE IF EXISTS bi_second_has_digit_demo;
CREATE TABLE bi_second_has_digit_demo(title text)
bigram_index='second_has_digit'
bigram_delimiter='both';
INSERT INTO bi_second_has_digit_demo VALUES
(1,'xt 850 action camera'),
(2,'galaxy 24 ultra'),
(3,'playstation 5 slim'),
(4,'iphone 5se case'),
(5,'canon eos 80d body'),
(6,'thinkpad x1 carbon'),
(7,'kindle paperwhite signature');
Then test the queries one by one:
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('xt850');
+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('galaxy24');
+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('iphone5se');
+------+---------------------+
| id   | title               |
+------+---------------------+
|    4 | iphone 5se case     |
+------+---------------------+
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('eos80d');
+------+---------------------+
| id   | title               |
+------+---------------------+
|    5 | canon eos 80d body  |
+------+---------------------+
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('thinkpadx1');
+------+---------------------+
| id   | title               |
+------+---------------------+
|    6 | thinkpad x1 carbon  |
+------+---------------------+
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('kindlesignature');
Empty set
This is often the better fit for mixed model identifiers, because real catalog data frequently includes forms like 5se, 80d, or x1 rather than only clean numeric suffixes like 24.
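To see which pairs this rule would pick up in the demo data, here is a small sketch. It approximates the rule as described in this article (the second token contains at least one ASCII digit) and assumes plain whitespace tokenization; glued_pairs is a hypothetical helper, not a Manticore function.

```python
ASCII_DIGITS = set("0123456789")


def second_has_digit(second: str) -> bool:
    # True when at least one character of the token is an ASCII digit.
    return any(ch in ASCII_DIGITS for ch in second)


def glued_pairs(title: str) -> list[str]:
    # Glue each adjacent pair whose second token passes the rule.
    tokens = title.lower().split()
    return [a + b for a, b in zip(tokens, tokens[1:]) if second_has_digit(b)]


for title in [
    "xt 850 action camera",
    "iphone 5se case",
    "canon eos 80d body",
    "thinkpad x1 carbon",
    "kindle paperwhite signature",
]:
    print(title, "->", glued_pairs(title))
# xt 850 action camera -> ['xt850']
# iphone 5se case -> ['iphone5se']
# canon eos 80d body -> ['eos80d']
# thinkpad x1 carbon -> ['thinkpadx1']
# kindle paperwhite signature -> []
```

The last title produces no glued pair because none of its tokens contains a digit, which is consistent with kindlesignature returning an empty set above.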
How to choose between the two
If your search problem is specifically "How do I make xt850 find xt 850?", the practical rule is:
- use second_numeric when the second token is digits-only
- use second_has_digit when the second token may be mixed, like 5se, 80d, or x1
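If you are unsure which shape dominates your catalog, a quick scan of sample titles can help. The helper below is a hypothetical heuristic, not part of Manticore: it assumes whitespace tokenization and simply reports the narrowest mode that would cover every digit-bearing second token it sees.

```python
from typing import Optional


def classify_second(token: str) -> str:
    if token.isascii() and token.isdigit():
        return "numeric"  # ASCII digits only, e.g. "24"
    if any(ch in "0123456789" for ch in token):
        return "mixed"    # letters and digits, e.g. "5se"
    return "none"


def suggest_mode(titles: list[str]) -> Optional[str]:
    kinds = set()
    for title in titles:
        tokens = title.lower().split()
        # Every token except the first can be the second half of a pair.
        for second in tokens[1:]:
            kinds.add(classify_second(second))
    if "mixed" in kinds:
        return "second_has_digit"
    if "numeric" in kinds:
        return "second_numeric"
    return None


print(suggest_mode(["galaxy 24 ultra", "playstation 5 slim"]))  # second_numeric
print(suggest_mode(["galaxy 24 ultra", "iphone 5se case"]))     # second_has_digit
print(suggest_mode(["kindle paperwhite signature"]))            # None
```

Treat the result as a starting point only; the real test is running your users' actual glued queries against a table built with the chosen mode.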
One practical note: in the straightforward case, this matching stays compatible with other common text-processing settings. A query for xt850 still matches xt 850 with morphology='stem_en' enabled and with a wordforms rule enabled.
But that does not mean those settings rewrite the glued query for you. In tests, iphones 5 matched iphones5, but not iphone5, even with stemming or a wordforms rule mapping iphones to iphone. So the short version is: basic xt 850 vs xt850 matching stays compatible with morphology and wordforms, but if you rely on them, test the exact query shape you care about.
Final takeaway
The xt850 problem is not really about one product name. It is about a broader mismatch between how users type model names and how search engines tokenize them.
Since version 23.0.0, Manticore gives you a built-in way to handle that mismatch with bigram_delimiter plus the digit-aware bigram_index modes, which is much cleaner than duplicating fields or inventing custom preprocessing pipelines.
If your main problem is phrase-search performance rather than glued model-name matching, see How to Speed Up Phrase Search with bigram_index.
