# How to Make xt850 Match xt 850

A practical guide to solving glued-vs-spaced search queries like xt850 vs xt 850, with clear explanations of bigram_delimiter, second_numeric, second_has_digit, and reproducible examples.

## TL;DR

Since version `23.0.0`, Manticore can make searches like `xt850` match `xt 850` using [bigram_delimiter](https://manual.manticoresearch.com/dev/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_delimiter) together with digit-aware [bigram_index](https://manual.manticoresearch.com/dev/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_index) modes.

This solves a common tokenization mismatch in product search, where users remove spaces from model names but the source data stores them as separate tokens.

## Assumptions and verification

This article assumes:

- RT tables created with SQL examples exactly as shown
- default tokenization unless the example explicitly changes a setting
- ASCII digits in model names, because `second_numeric` and `second_has_digit` are digit-aware modes built around `0-9`

All SQL examples and expected outputs in this article were verified against a real Manticore `23.0.0` instance before publishing, using fresh tables created from scratch for each scenario.

## The broader search problem

Imagine a catalog containing:

- `xt 850 action camera`
- `iphone 5se battery case`
- `canon eos 80d body`
- `thinkpad x1 carbon`

Now imagine users searching for:

- `xt850`
- `iphone5se`
- `eos80d`
- `thinkpadx1`

From the user's point of view, these should obviously match. From the engine's point of view, they often do not, because the indexed text is tokenized as separate terms.

Search systems usually attack that mismatch in one of four ways:

- index prefixes or infixes
- add custom normalization rules
- duplicate content into alternate normalized fields
- index adjacent token pairs and optionally store glued variants too

Manticore's newer bigram functionality is a structured way to do the fourth option without awkward field duplication.

## Baseline: why `xt850` fails by default

Here is the problem in its simplest form:

```sql
DROP TABLE IF EXISTS bi_default_demo;

CREATE TABLE bi_default_demo(title text);

INSERT INTO bi_default_demo VALUES
  (1,'xt 850 action camera');

SELECT id, title FROM bi_default_demo WHERE MATCH('xt850');
```

Expected result:

```sql
Empty set
```

Why does this fail?

Because the document is indexed as two separate tokens, `xt` and `850`, while the query is a single token, `xt850`.

By default, Manticore does not assume that:

- `xt850` should be split into `xt` + `850`
- or `xt` + `850` should also be searchable as `xt850`

So this is not really a typo-tolerance problem or a phrase problem. It is a tokenization mismatch: the index sees two tokens, while the query provides one.

That is the gap the newer bigram settings are designed to close. They let Manticore index selected adjacent token pairs in a form that can also match glued queries.
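The mismatch can be reproduced outside the engine with a toy model. The sketch below assumes a simplified whitespace tokenizer (the hypothetical `tokenize` helper). It is not Manticore's real tokenizer, but it is close enough to show why the two sides never meet.

```python
def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase and split on whitespace (an assumption, not Manticore's real rules)."""
    return text.lower().split()

doc_tokens = tokenize("xt 850 action camera")
query_tokens = tokenize("xt850")

# A plain keyword match needs every query token to be present in the document.
matches = all(tok in doc_tokens for tok in query_tokens)

print(doc_tokens)    # ['xt', '850', 'action', 'camera']
print(query_tokens)  # ['xt850']
print(matches)       # False
```

The query token `xt850` is simply not in the document's token list, so nothing short of changing what gets indexed (or how the query is split) can make it match.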

## Why bigrams help here

[bigram_index](https://manual.manticoresearch.com/dev/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_index) can help with both [phrase acceleration](/blog/how-to-speed-up-phrase-search-with-bigram-index/) and model-name matching. This article focuses on the second: the `xt 850` vs `xt850` problem.

The key idea is simple:

- detect adjacent token pairs that look like model names
- store those pairs in a glued form too
- let queries such as `xt850`, `iphone5se`, or `thinkpadx1` hit the spaced text

That is where [bigram_delimiter](https://manual.manticoresearch.com/dev/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_delimiter) matters.

## A note about [bigram_delimiter](https://manual.manticoresearch.com/dev/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_delimiter)

`bigram_index` decides which adjacent pairs are eligible.

`bigram_delimiter` decides how eligible bigrams are stored:

- `true`: internal delimited token only
- `none`: glued token only, such as `galaxy24`
- `both`: both forms

The practical difference is easiest to understand from the query side:

- with `true`, Manticore keeps the internal bigram form used for phrase optimization, but it does not keep the glued user-facing form, so a query like `xt850` will not match `xt 850`
- with `none`, Manticore keeps only the glued form, so `xt850` can match `xt 850`, but you are leaning entirely on the glued representation for those pairs
- with `both`, Manticore keeps both the internal bigram representation and the glued form, so `xt850` can match `xt 850` without giving up ordinary phrase behavior

For this use case, `both` is usually the safer default because it covers the user-visible problem directly while keeping behavior less surprising for normal phrase queries and mixed workloads.
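To make the three storage options concrete, here is a toy model of what an index might keep for one eligible pair. The `stored_forms` function and the `DELIM` placeholder are hypothetical. The real internal delimiter and storage layout are engine details this article does not describe.

```python
DELIM = "\x01"  # hypothetical placeholder for the internal delimiter (assumption)

def stored_forms(first: str, second: str, mode: str) -> set[str]:
    """Return the extra tokens a toy index would keep for one eligible pair."""
    delimited = first + DELIM + second  # internal form used for phrase optimization
    glued = first + second              # user-facing form such as 'xt850'
    if mode == "true":
        return {delimited}
    if mode == "none":
        return {glued}
    if mode == "both":
        return {delimited, glued}
    raise ValueError(f"unknown mode: {mode}")

for mode in ("true", "none", "both"):
    forms = stored_forms("xt", "850", mode)
    print(mode, "xt850" in forms)
# true False
# none True
# both True
```

Only `none` and `both` leave a token that a glued query can hit, which mirrors the query-side behavior described above.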

## Mode 1: `second_numeric`

```ini
bigram_index = second_numeric
bigram_delimiter = both
```

This mode is aimed at model names where the second token is purely numeric.

That is common in product catalogs:

- `xt 850`
- `galaxy 24`
- `playstation 5`
- `pixel 8`

Users often search these as glued terms such as `xt850`, `galaxy24`, or `playstation5`, even though the source text stores them with a space.

`second_numeric` stores the pair only when the second token consists entirely of ASCII digits.

Use it when:

- you have product generations and numbered models
- users often remove spaces in search
- the second token is usually just digits

### Example

```sql
DROP TABLE IF EXISTS bi_second_numeric_demo;

CREATE TABLE bi_second_numeric_demo(title text)
  bigram_index='second_numeric'
  bigram_delimiter='both';

INSERT INTO bi_second_numeric_demo VALUES
  (1,'xt 850 action camera'),
  (2,'galaxy 24 ultra'),
  (3,'playstation 5 slim'),
  (4,'iphone 5se case'),
  (5,'canon eos 80d body'),
  (6,'thinkpad x1 carbon');
```

Then test the queries one by one:

```sql
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('xt850');

+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+
```

```sql
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('galaxy24');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+
```

```sql
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('playstation5');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    3 | playstation 5 slim |
+------+--------------------+
```

```sql
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('iphone5se');

Empty set
```

```sql
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('eos80d');

Empty set
```

```sql
SELECT id, title FROM bi_second_numeric_demo WHERE MATCH('thinkpadx1');

Empty set
```

That boundary is the whole point of the mode:

- `850`, `24`, and `5` qualify
- `5se`, `80d`, and `x1` do not
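The digits-only rule can be written as a one-line predicate. This is a minimal sketch of the rule as the article states it, not Manticore's implementation. The `second_is_numeric` name is made up for illustration.

```python
ASCII_DIGITS = set("0123456789")

def second_is_numeric(second: str) -> bool:
    """True when the token is non-empty and made of ASCII digits only."""
    return bool(second) and all(ch in ASCII_DIGITS for ch in second)

for token in ("850", "24", "5", "5se", "80d", "x1"):
    print(token, second_is_numeric(token))
# 850 True
# 24 True
# 5 True
# 5se False
# 80d False
# x1 False
```

The explicit digit set is deliberate: Python's `str.isdigit()` also accepts non-ASCII digits, which would not reflect the ASCII-only assumption stated earlier.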

## Mode 2: `second_has_digit`

```ini
bigram_index = second_has_digit
bigram_delimiter = both
```

This mode is the more flexible sibling of `second_numeric`.

It stores the pair when the second token contains at least one ASCII digit. That makes it a much better fit for real product catalogs, where model identifiers are often mixed alphanumeric strings:

- `xt 850`
- `iphone 5se`
- `eos 80d`
- `thinkpad x1`

Use it when:

- your model names mix letters and digits
- users frequently remove spaces in their searches
- you want catalog-friendly matching without indexing every pair in the table

### Example

```sql
DROP TABLE IF EXISTS bi_second_has_digit_demo;

CREATE TABLE bi_second_has_digit_demo(title text)
  bigram_index='second_has_digit'
  bigram_delimiter='both';

INSERT INTO bi_second_has_digit_demo VALUES
  (1,'xt 850 action camera'),
  (2,'galaxy 24 ultra'),
  (3,'playstation 5 slim'),
  (4,'iphone 5se case'),
  (5,'canon eos 80d body'),
  (6,'thinkpad x1 carbon'),
  (7,'kindle paperwhite signature');
```

Then test the queries one by one:

```sql
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('xt850');

+------+----------------------+
| id   | title                |
+------+----------------------+
|    1 | xt 850 action camera |
+------+----------------------+
```

```sql
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('galaxy24');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    2 | galaxy 24 ultra |
+------+-----------------+
```

```sql
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('iphone5se');

+------+-----------------+
| id   | title           |
+------+-----------------+
|    4 | iphone 5se case |
+------+-----------------+
```

```sql
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('eos80d');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    5 | canon eos 80d body |
+------+--------------------+
```

```sql
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('thinkpadx1');

+------+--------------------+
| id   | title              |
+------+--------------------+
|    6 | thinkpad x1 carbon |
+------+--------------------+
```

```sql
SELECT id, title FROM bi_second_has_digit_demo WHERE MATCH('kindlesignature');

Empty set
```

This is often the better fit for mixed model identifiers, because real catalog data frequently includes forms like `5se`, `80d`, or `x1` rather than only clean numeric suffixes like `24`.
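As a rough mental model, the sketch below lists which glued forms the "second token contains a digit" rule would produce for a few titles from the demo table. It assumes a toy whitespace tokenizer, and `glued_pairs` is a hypothetical helper, not an engine API.

```python
ASCII_DIGITS = set("0123456789")

def glued_pairs(title: str) -> list[str]:
    """Glue each adjacent pair whose second token has at least one ASCII digit."""
    tokens = title.lower().split()  # toy whitespace tokenizer (assumption)
    return [
        first + second
        for first, second in zip(tokens, tokens[1:])
        if any(ch in ASCII_DIGITS for ch in second)
    ]

print(glued_pairs("iphone 5se case"))              # ['iphone5se']
print(glued_pairs("canon eos 80d body"))           # ['eos80d']
print(glued_pairs("thinkpad x1 carbon"))           # ['thinkpadx1']
print(glued_pairs("kindle paperwhite signature"))  # []
```

Pairs such as `canon eos` or `5se case` produce nothing because their second token has no digit, which is how this mode avoids indexing every adjacent pair.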

## How to choose between the two

If your search problem is specifically "How do I make `xt850` find `xt 850`?", the practical rule is:

- use `second_numeric` when the second token is digits-only
- use `second_has_digit` when the second token may be mixed, like `5se`, `80d`, or `x1`
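If you are unsure which shape dominates your catalog, a quick offline count over a sample of titles can help. The sketch below is a hypothetical helper built on a toy whitespace tokenizer. It only approximates how the engine would see your data.

```python
ASCII_DIGITS = set("0123456789")

def classify(second: str) -> str:
    """Label a token as 'numeric' (digits only), 'mixed' (digits plus other chars), or 'none'."""
    digits = sum(ch in ASCII_DIGITS for ch in second)
    if digits == 0:
        return "none"
    return "numeric" if digits == len(second) else "mixed"

def count_pairs(titles: list[str]) -> dict[str, int]:
    """Count adjacent pairs by the shape of their second token."""
    counts = {"numeric": 0, "mixed": 0, "none": 0}
    for title in titles:
        tokens = title.lower().split()  # toy whitespace tokenizer (assumption)
        for second in tokens[1:]:       # every token after the first is the second of one pair
            counts[classify(second)] += 1
    return counts

sample = [
    "xt 850 action camera",
    "galaxy 24 ultra",
    "iphone 5se case",
    "canon eos 80d body",
    "thinkpad x1 carbon",
]
print(count_pairs(sample))  # {'numeric': 2, 'mixed': 3, 'none': 7}
```

A noticeable `mixed` count, as in this sample, points toward `second_has_digit`. If nearly all digit-bearing tokens are `numeric`, `second_numeric` covers them.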

One practical note on compatibility: glued matching keeps working alongside other common text-processing settings in the straightforward case. `xt 850` still matches `xt850` with `morphology='stem_en'` enabled and with a wordforms rule enabled.

But that does not mean those settings rewrite the glued query for you. In tests, a document containing `iphones 5` was matched by the query `iphones5` but not by `iphone5`, even with stemming or a wordforms rule mapping `iphones` to `iphone`. In short: basic `xt 850` vs `xt850` matching stays compatible with morphology and wordforms, but if you rely on them, test the exact query shape you care about.

## Final takeaway

The `xt850` problem is not really about one product name. It is about a broader mismatch between how users type model names and how search engines tokenize them.

Since version `23.0.0`, Manticore gives you a built-in way to handle that mismatch with `bigram_delimiter` plus the digit-aware `bigram_index` modes, which is much cleaner than duplicating fields or inventing custom preprocessing pipelines.

If your main problem is phrase-search performance rather than glued model-name matching, see [How to Speed Up Phrase Search with bigram_index](/blog/how-to-speed-up-phrase-search-with-bigram-index/).
