# How to Speed Up Phrase Search with bigram_index

A practical guide to using `bigram_index` to accelerate phrase queries in Manticore Search, with clear explanations of `all`, `first_freq`, and `both_freq`, and a reproducible `manticore-load` benchmark.

## TL;DR

[bigram_index](https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_index) can be used for several purposes, and in this article we focus specifically on phrase-search performance: on the 1M-document benchmark below, `bigram_index='all'` improved QPS by about `2.9x` and cut average phrase-query latency by about `3.2x`.

If your main problem is matching `xt850` against `xt 850` rather than speeding up phrase search, see [How to Make xt850 Match xt 850](/blog/how-to-make-searches-like-xt850-match-xt-850/).

Phrase search can be expensive. Even when a query is short, the engine still has to verify ordering and adjacency, and that work gets more noticeable when:

- the individual words are common
- the dataset is large
- phrase queries are frequent in your workload

That is exactly what [bigram_index](https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Low-level_tokenization#bigram_index) is for.

## What bigram indexing actually does

Normally, a phrase like `"noise cancelling headphones"` is handled as separate tokens that also need to appear in the right order and next to each other. Bigram indexing lets Manticore pre-store adjacent token pairs such as:

- `noise cancelling`
- `cancelling headphones`

That gives the engine a faster way to narrow down candidate documents during phrase matching.
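To make the idea of "adjacent token pairs" concrete, here is a minimal Python sketch. It is only a conceptual model: the helper name `adjacent_pairs` is hypothetical, and the lowercase-and-split tokenizer is an assumption standing in for Manticore's real tokenization.

```python
def adjacent_pairs(text: str) -> list[str]:
    """Return every adjacent token pair (bigram) in order of appearance."""
    # Assumption: lowercasing and whitespace splitting stand in for
    # the engine's actual tokenizer.
    tokens = text.lower().split()
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


print(adjacent_pairs("noise cancelling headphones"))
```

For the three-word phrase above, this yields the same two pairs listed in the bullets; a single-word input yields no pairs at all.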

This article focuses specifically on phrase acceleration.

## Important caveat: bigrams work at tokenization level

This is the part that is easy to miss when you only look at the happy-path speedup story.

`bigram_index` works at the tokenization level only. It does not account for later transformations such as morphology, wordforms, or stopwords, and that can materially change phrase-matching expectations.

The practical conclusion is simple: bigrams can be excellent for phrase speed, but if your index relies heavily on morphology, wordforms, or stopwords, test the actual phrase behavior you care about before rolling the setting out broadly.

## Mode 1: Default behavior

This is the baseline (`none`, the default value of `bigram_index`). No explicit bigram indexing is enabled, so no bigram posting lists are stored.

Use it when:

- phrase search is rare
- documents are short
- you want the leanest indexing path

### Example

```sql
DROP TABLE IF EXISTS bi_none_demo;

CREATE TABLE bi_none_demo(title text);

INSERT INTO bi_none_demo VALUES
  (1,'wireless noise cancelling headphones'),
  (2,'noise cancelling microphone'),
  (3,'wireless gaming headset');

SELECT id, title FROM bi_none_demo WHERE MATCH('"noise cancelling"');
```

The query matches rows 1 and 2, as expected, but Manticore has no precomputed bigram posting lists to help resolve the phrase more efficiently.

## Mode 2: `all`

```ini
bigram_index = all
```

This is the most aggressive phrase-acceleration mode. Every adjacent token pair gets indexed as a bigram.

Use it when:

- exact phrase search is a core feature
- phrase queries often include common words and produce many candidates
- you want the strongest phrase acceleration
- you do not want to tune a frequent-word list

### Example

```sql
DROP TABLE IF EXISTS bi_all_demo;

CREATE TABLE bi_all_demo(title text)
  bigram_index='all';

INSERT INTO bi_all_demo VALUES
  (1,'lord of the rings trilogy'),
  (2,'house of the dragon season 2'),
  (3,'made for iphone charger');

SELECT id, title FROM bi_all_demo WHERE MATCH('"house of the dragon"');
SELECT id, title FROM bi_all_demo WHERE MATCH('"made for iphone"');
```

The point here is not that the matches differ, but that the indexing strategy does: `all` stores every adjacent pair, so phrase queries have the maximum amount of bigram help available at search time.

Choose `all` when phrase search is expensive because many documents match the individual words, which forces Manticore to do more positional verification to confirm the exact phrase. `all` helps by narrowing candidates earlier.

## Mode 3: `first_freq`

```ini
bigram_index = first_freq
bigram_freq_words = for, of, the, with
```

This mode stores a pair only when the first token is in your frequent-word list.

Use it when:

- phrase search matters
- you want a lighter alternative to `all`
- many phrases in your data contain words that are genuinely frequent in your own corpus

With the list above:

- `for iphone` is eligible
- `of the` is eligible
- `the dragon` is eligible
- `made for` is not eligible
- `lord of` is not eligible
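The eligibility rule behind this list can be sketched in a few lines of Python. This is an illustration of the rule as described here, not Manticore's implementation; `first_freq_eligible` is a hypothetical helper name, and pairs are assumed to be two whitespace-separated tokens.

```python
# Mirrors the bigram_freq_words list used in this section.
FREQ_WORDS = frozenset({"for", "of", "the", "with"})


def first_freq_eligible(pair: str, freq_words: frozenset = FREQ_WORDS) -> bool:
    """True when the first token of a two-word pair is a frequent word."""
    first, _second = pair.split()
    return first in freq_words


for pair in ["for iphone", "of the", "the dragon", "made for", "lord of"]:
    status = "eligible" if first_freq_eligible(pair) else "not eligible"
    print(f"{pair}: {status}")
```

Only the position of the frequent word matters: `made for` contains `for`, but as the second token, so the pair is skipped.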

For production use, do not pick `bigram_freq_words` from memory. Derive it from your own data. A practical way is to dump dictionary stats with [indextool](https://manual.manticoresearch.com/Miscellaneous_tools#indextool) using `--dumpdict ... --stats`, review the most frequent tokens, and then build a small `bigram_freq_words` list from those results.
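If you want a rough first pass before running `indextool`, a token count over a sample of your own documents gives a similar picture. The sketch below is an assumption-laden stand-in, not a replacement for real dictionary stats: `suggest_freq_words` is a hypothetical helper, the sample titles are the demo rows from this article, and whitespace splitting ignores your actual tokenization settings.

```python
from collections import Counter


def suggest_freq_words(docs: list[str], top_n: int) -> str:
    """Return the top_n most common tokens as a comma-separated string."""
    counts = Counter()
    for doc in docs:
        # Assumption: lowercase + whitespace split approximates tokenization.
        counts.update(doc.lower().split())
    return ",".join(word for word, _count in counts.most_common(top_n))


sample = [
    "lord of the rings trilogy",
    "house of the dragon season 2",
    "made for iphone charger",
]
print(suggest_freq_words(sample, top_n=2))
```

On this tiny sample, the two most common tokens are `of` and `the`. On a real corpus you would review a longer list by hand before copying anything into `bigram_freq_words`.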

### Example

```sql
DROP TABLE IF EXISTS bi_first_freq_demo;

CREATE TABLE bi_first_freq_demo(title text)
  bigram_index='first_freq'
  bigram_freq_words='for,of,the,with';

INSERT INTO bi_first_freq_demo VALUES
  (1,'made for iphone charger'),
  (2,'lord of the rings trilogy'),
  (3,'house of the dragon season 2');

SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"made for iphone"');
SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"lord of the"');
```

The queries still return the expected rows. What changes is which pairs get indexed:

- `"made for iphone"` benefits from `for iphone`
- `"lord of the"` benefits from `of the`

This makes `first_freq` a lighter alternative to `all` when many useful phrases involve common bridge words.

## Mode 4: `both_freq`

```ini
bigram_index = both_freq
bigram_freq_words = for, of, the, with
```

This is the narrowest frequency-based mode. A pair is stored only when both tokens are in the frequent-word list.

Use it when:

- you want the most conservative bigram footprint
- you mainly care about pairs built from words that are highly frequent in your corpus
- you are tuning a large corpus and do not want to index every adjacent pair

With the same list:

- `of the` is eligible
- `for iphone` is not eligible
- `the dragon` is not eligible

### Example

```sql
DROP TABLE IF EXISTS bi_both_freq_demo;

CREATE TABLE bi_both_freq_demo(title text)
  bigram_index='both_freq'
  bigram_freq_words='for,of,the,with';

INSERT INTO bi_both_freq_demo VALUES
  (1,'lord of the rings trilogy'),
  (2,'house of the dragon season 2'),
  (3,'made for iphone charger');

SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"lord of the"');
SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"made for iphone"');
```

Both queries still match, but only one of them can use a stored bigram:

- `"lord of the"` includes `of the`, which `both_freq` is willing to store
- `"made for iphone"` includes `for iphone`, which `first_freq` would store but `both_freq` does not, so this query gets no bigram help
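To compare the footprint of the modes side by side, here is a small Python model of which pairs each mode would store for one demo title. It follows the rules as described in this article and is not Manticore's code: `stored_pairs` is a hypothetical helper, `none` stands for the default no-bigram behavior, and tokenization is simplified to a whitespace split.

```python
FREQ_WORDS = frozenset({"for", "of", "the", "with"})


def stored_pairs(text: str, mode: str, freq_words: frozenset = FREQ_WORDS) -> list[str]:
    """Model which adjacent pairs a given bigram mode would store."""
    tokens = text.lower().split()  # assumption: simplified tokenizer
    pairs = list(zip(tokens, tokens[1:]))
    if mode == "none":
        kept = []
    elif mode == "all":
        kept = pairs
    elif mode == "first_freq":
        kept = [(a, b) for a, b in pairs if a in freq_words]
    elif mode == "both_freq":
        kept = [(a, b) for a, b in pairs if a in freq_words and b in freq_words]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [f"{a} {b}" for a, b in kept]


title = "house of the dragon season 2"
for mode in ("none", "all", "first_freq", "both_freq"):
    print(f"{mode}: {stored_pairs(title, mode)}")
```

For this title the model keeps five pairs under `all`, two under `first_freq` (`of the`, `the dragon`), and one under `both_freq` (`of the`), which is the narrowing the mode descriptions above are getting at.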

## Which performance mode should you choose?

The benchmark in this article shows that `all` can deliver a strong speedup, but it is still just one benchmark on one workload.

Manticore's own documentation says that for most use cases, `both_freq` is the best mode. That is a sensible default because it aims for a more balanced trade-off between phrase acceleration and indexing cost.

Use the modes like this:

- choose `both_freq` as the default starting point for general phrase-search workloads
- choose `all` when phrase search is especially important and you want the strongest acceleration, accepting higher indexing cost
- choose `first_freq` when many useful phrases in your data involve common bridge words and you want something broader than `both_freq`
- choose the default behavior when phrase acceleration is not important

## Benchmark: does bigram indexing really speed up phrase search?

Yes. In a simple local benchmark, the difference was easy to measure.

I used `manticore-load` to build two 1M-document tables against the same Manticore instance:

- one with no explicit `bigram_index` setting
- one with `bigram_index='all'`

The documents were random 60-80 word texts, and the benchmark repeatedly ran random 2-word phrase queries.

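For clarity, both indexing and search were run with `--threads=1`. Multi-threaded numbers would of course be higher, but single-thread runs make it easier to see what the feature changes on one CPU core.

Each search request used the following query template, where `bench_bigram_*` stands for the table under test: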

```sql
SELECT COUNT(*) FROM bench_bigram_* WHERE MATCH('"<text/2/2>"')
```

### Benchmark setup

Data load without bigrams:

```bash
manticore-load \
  --drop \
  --wait \
  --threads=1 \
  --batch-size=1000 \
  --total=1000000 \
  --init="CREATE TABLE bench_bigram_none_rand(title text)" \
  --load="INSERT INTO bench_bigram_none_rand(id,title) VALUES(<increment>,'<text/60/80>')"
```

Data load with all bigrams:

```bash
manticore-load \
  --drop \
  --wait \
  --threads=1 \
  --batch-size=1000 \
  --total=1000000 \
  --init="CREATE TABLE bench_bigram_all_rand(title text) bigram_index='all'" \
  --load="INSERT INTO bench_bigram_all_rand(id,title) VALUES(<increment>,'<text/60/80>')"
```

Search benchmark without bigrams:

```bash
manticore-load \
  --threads=1 \
  --total=5000 \
  --load="SELECT COUNT(*) FROM bench_bigram_none_rand WHERE MATCH('\\\"<text/2/2>\\\"')"
```

Search benchmark with all bigrams:

```bash
manticore-load \
  --threads=1 \
  --total=5000 \
  --load="SELECT COUNT(*) FROM bench_bigram_all_rand WHERE MATCH('\\\"<text/2/2>\\\"')"
```

### What I observed

On this local run:

| Table | QPS | Avg latency |
| --- | ---: | ---: |
| `bench_bigram_none_rand` | `755` | `1.3 ms` |
| `bench_bigram_all_rand` | `2175` | `0.4 ms` |

That is roughly a `2.9x` improvement in QPS and about a `3.2x` reduction in average latency on the same 1M-document workload.

Indexing was roughly `2.6x` slower with `bigram_index='all'`, which is expected because every adjacent pair has to be stored:

- without bigrams: about `45k docs/sec`
- with `all`: about `17k docs/sec`

That trade-off is exactly why multiple modes exist.
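For transparency, the ratios quoted in this section come from simple division of the numbers reported above. The variable names below are just labels for those figures, and the latency values are already rounded, so that ratio is only approximate.

```python
# Figures reported in this article's benchmark run.
qps_none, qps_all = 755, 2175
latency_none_ms, latency_all_ms = 1.3, 0.4
index_rate_none, index_rate_all = 45_000, 17_000  # docs/sec, approximate

qps_gain = qps_all / qps_none
latency_gain = latency_none_ms / latency_all_ms
indexing_slowdown = index_rate_none / index_rate_all

print(f"QPS gain: {qps_gain:.2f}x")
print(f"Latency reduction: {latency_gain:.2f}x")
print(f"Indexing slowdown: {indexing_slowdown:.2f}x")
```

The three ratios work out to about 2.88, 3.25, and 2.65, so on this workload the search-side gain and the indexing-side cost are of a similar order of magnitude.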

## Final takeaway

If your main problem is phrase-search performance, treat `bigram_index` first and foremost as an acceleration feature.

For most real workloads, start with `both_freq` and measure. Move to `all` if you need a stronger effect and can afford the extra indexing cost. Consider `first_freq` when your phrase workload is heavily shaped by common bridge words.
