
How to Speed Up Phrase Search with bigram_index

TL;DR

bigram_index can be used for several purposes, and in this article we focus specifically on phrase-search performance: on the 1M-document benchmark below, bigram_index='all' improved QPS by about 2.9x and cut average phrase-query latency by about 3.2x.

If your main problem is matching xt850 against xt 850 rather than speeding up phrase search, see How to Make xt850 Match xt 850.

Phrase search can be expensive. Even when a query is short, the engine still has to verify ordering and adjacency, and that work gets more noticeable when:

  • the individual words are common
  • the dataset is large
  • phrase queries are frequent in your workload

That is exactly what bigram_index is for.

What bigram indexing actually does

Normally, a phrase like "noise cancelling headphones" is handled as separate tokens that also need to appear in the right order and next to each other. Bigram indexing lets Manticore pre-store adjacent token pairs such as:

  • noise cancelling
  • cancelling headphones

That gives the engine a faster way to narrow down candidate documents during phrase matching.
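As a mental model (illustrative only, not Manticore's actual implementation), generating every adjacent pair from a token stream looks like this; the split() call stands in for Manticore's real tokenizer:

```python
def bigrams(tokens):
    """Return every adjacent token pair, as bigram_index='all' would store them."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = "wireless noise cancelling headphones".split()
print(bigrams(tokens))
# ['wireless noise', 'noise cancelling', 'cancelling headphones']
```

Each pair becomes its own dictionary entry with its own posting list, which is exactly what makes it usable as a phrase pre-filter later.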

This article focuses specifically on phrase acceleration.

Important caveat: bigrams work at tokenization level

This is the part that is easy to miss when you only look at the happy-path speedup story.

bigram_index works at the tokenization level only. It does not account for later transformations such as morphology, wordforms, or stopwords, and that can materially change phrase-matching expectations.

The practical conclusion is simple: bigrams can be excellent for phrase speed, but if your index relies heavily on morphology, wordforms, or stopwords, test the actual phrase behavior you care about before rolling the setting out broadly.
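A toy illustration of why this matters (pure Python, not Manticore internals; the stopword list is a made-up example): bigrams are built from the raw token stream, but if stopwords are later removed, the token sequence the phrase matcher reasons about no longer lines up with the pairs that were stored.

```python
STOPWORDS = {"of", "the"}  # hypothetical stopword list for illustration

raw = "house of the dragon".split()
after_stopwords = [t for t in raw if t not in STOPWORDS]

# Pairs as bigram indexing would see them (raw tokenization level)
raw_bigrams = [f"{a} {b}" for a, b in zip(raw, raw[1:])]
# Pairs as they look after stopword removal
filtered_bigrams = [f"{a} {b}" for a, b in zip(after_stopwords, after_stopwords[1:])]

print(raw_bigrams)       # ['house of', 'of the', 'the dragon']
print(filtered_bigrams)  # ['house dragon']
```

The two views disagree, which is exactly the kind of mismatch worth testing for before enabling bigrams on an index that uses stopwords, wordforms, or morphology.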

Mode 1: Default behavior

This is the baseline. No explicit bigram indexing is enabled, so no bigram posting lists are stored.

Use it when:

  • phrase search is rare
  • documents are short
  • you want the leanest indexing path

Example

DROP TABLE IF EXISTS bi_none_demo;

CREATE TABLE bi_none_demo(title text);

INSERT INTO bi_none_demo VALUES
  (1,'wireless noise cancelling headphones'),
  (2,'noise cancelling microphone'),
  (3,'wireless gaming headset');

SELECT id, title FROM bi_none_demo WHERE MATCH('"noise cancelling"');

This is the baseline behavior. The query matches the expected rows, but Manticore has no precomputed bigram posting lists to help resolve the phrase more efficiently.

Mode 2: all

bigram_index = all

This is the most aggressive phrase-acceleration mode. Every adjacent token pair gets indexed as a bigram.

Use it when:

  • exact phrase search is a core feature
  • phrase queries often include common words and produce many candidates
  • you want the strongest phrase acceleration
  • you do not want to tune a frequent-word list

Example

DROP TABLE IF EXISTS bi_all_demo;

CREATE TABLE bi_all_demo(title text)
  bigram_index='all';

INSERT INTO bi_all_demo VALUES
  (1,'lord of the rings trilogy'),
  (2,'house of the dragon season 2'),
  (3,'made for iphone charger');

SELECT id, title FROM bi_all_demo WHERE MATCH('"house of the dragon"');
SELECT id, title FROM bi_all_demo WHERE MATCH('"made for iphone"');

The important point here is not different matches, but different indexing strategy: all stores every adjacent pair, so phrase queries have the maximum amount of bigram help available at search time.

Choose all when phrase search is expensive because many documents match the individual words, forcing Manticore to do extra positional verification to confirm the exact phrase. all helps by narrowing the candidate set earlier.
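To see why the pre-stored pairs help, here is a toy sketch of the two strategies (illustrative only, not engine code; the posting lists are invented). Without bigrams, the engine intersects per-word document lists and then checks positions; for a two-word phrase, a bigram posting list already encodes adjacency, so the expensive verification step is avoided (for longer phrases, bigrams still narrow candidates before positional checks):

```python
# Toy per-word posting lists: word -> {doc_id: [positions]}
postings = {
    "noise":      {1: [2], 2: [1], 4: [7]},
    "cancelling": {1: [3], 2: [2], 4: [1]},
}
# Toy bigram posting list, as bigram_index='all' would pre-store it
bigram_postings = {"noise cancelling": {1, 2}}

# Without bigrams: intersect the word lists, then verify adjacency per document
candidates = postings["noise"].keys() & postings["cancelling"].keys()
verified = {
    doc for doc in candidates
    if any(p + 1 in postings["cancelling"][doc] for p in postings["noise"][doc])
}

# With bigrams: one lookup, no per-document position check needed
assert verified == bigram_postings["noise cancelling"]
```

Document 4 contains both words but not adjacently; the bigram list never includes it, so it is filtered out without any positional work.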

Mode 3: first_freq

bigram_index = first_freq
bigram_freq_words = for, of, the, with

This mode stores a pair only when the first token is in your frequent-word list.

Use it when:

  • phrase search matters
  • you want a lighter alternative to all
  • many phrases in your data contain words that are genuinely frequent in your own corpus

With the list above:

  • for iphone is eligible
  • of the is eligible
  • the dragon is eligible
  • made for is not eligible
  • lord of is not eligible

For production use, do not pick bigram_freq_words from memory. Derive it from your own data. A practical way is to dump dictionary stats with indextool using --dumpdict ... --stats, review the most frequent tokens, and then build a small bigram_freq_words list from those results.
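indextool gives you real dictionary statistics; as a sketch of the same idea in plain Python (the tiny corpus here is a stand-in for your own data), count token frequencies and take the top entries as freq-word candidates:

```python
from collections import Counter

# Stand-in corpus; in practice, use indextool's dictionary stats instead
corpus = [
    "lord of the rings trilogy",
    "house of the dragon season 2",
    "made for iphone charger",
]

freq = Counter(tok for doc in corpus for tok in doc.split())
top = [tok for tok, _ in freq.most_common(2)]
print(top)  # ['of', 'the']
```

The point is that the list should reflect what is actually frequent in your corpus, not a generic English stopword list.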

Example

DROP TABLE IF EXISTS bi_first_freq_demo;

CREATE TABLE bi_first_freq_demo(title text)
  bigram_index='first_freq'
  bigram_freq_words='for,of,the,with';

INSERT INTO bi_first_freq_demo VALUES
  (1,'made for iphone charger'),
  (2,'lord of the rings trilogy'),
  (3,'house of the dragon season 2');

SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"made for iphone"');
SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"lord of the"');

The queries still return the expected rows. What changes is which pairs get indexed:

  • "made for iphone" benefits from for iphone
  • "lord of the" benefits from of the

This makes first_freq a lighter alternative to all when many useful phrases involve common bridge words.

Mode 4: both_freq

bigram_index = both_freq
bigram_freq_words = for, of, the, with

This is the narrowest frequency-based mode. A pair is stored only when both tokens are in the frequent-word list.

Use it when:

  • you want the most conservative bigram footprint
  • you mainly care about pairs built from words that are highly frequent in your corpus
  • you are tuning a large corpus and do not want to index every adjacent pair

With the same list:

  • of the is eligible
  • for iphone is not eligible
  • the dragon is not eligible

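The eligibility rules for the three bigram-storing modes boil down to a small predicate. This is a restatement of the documented rules in Python, not engine code:

```python
FREQ = {"for", "of", "the", "with"}  # the bigram_freq_words list from above

def stored(pair, mode):
    """Would this adjacent pair be stored as a bigram under the given mode?"""
    first, second = pair.split()
    if mode == "all":
        return True
    if mode == "first_freq":
        return first in FREQ
    if mode == "both_freq":
        return first in FREQ and second in FREQ
    raise ValueError(f"unknown mode: {mode}")

assert stored("of the", "both_freq")          # both tokens are frequent
assert not stored("for iphone", "both_freq")  # second token is not in the list
assert stored("for iphone", "first_freq")     # first token alone is enough here
```

Reading the modes as predicates makes the footprint ordering obvious: both_freq stores a subset of what first_freq stores, which in turn is a subset of all.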
Example

DROP TABLE IF EXISTS bi_both_freq_demo;

CREATE TABLE bi_both_freq_demo(title text)
  bigram_index='both_freq'
  bigram_freq_words='for,of,the,with';

INSERT INTO bi_both_freq_demo VALUES
  (1,'lord of the rings trilogy'),
  (2,'house of the dragon season 2'),
  (3,'made for iphone charger');

SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"lord of the"');
SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"made for iphone"');

The queries still match, but the internal selectivity differs:

  • "lord of the" includes of the, which both_freq is willing to store
  • "made for iphone" includes for iphone, which first_freq would cover but both_freq would not

Which performance mode should you choose?

The benchmark in this article shows that all can deliver a strong speedup, but it is still just one benchmark on one workload.

Manticore's own documentation says that for most use cases, both_freq is the best mode. That is a sensible default because it aims for a more balanced trade-off between phrase acceleration and indexing cost.

Use the modes like this:

  • choose both_freq as the default starting point for general phrase-search workloads
  • choose all when phrase search is especially important and you want the strongest acceleration, accepting higher indexing cost
  • choose first_freq when many useful phrases in your data involve common bridge words and you want something broader than both_freq
  • choose the default behavior when phrase acceleration is not important

Does bigram_index actually make a measurable difference?

Yes. In a simple local benchmark, the difference was easy to measure.

I used manticore-load to build two 1M-document tables against the same Manticore instance:

  • one with no explicit bigram_index setting
  • one with bigram_index='all'

The documents were random 60-80 word texts, and the benchmark repeatedly ran random 2-word phrase queries.

For clarity, both indexing and search were run with --threads=1. Multi-threaded numbers would of course be higher, but single-thread runs make it easier to see what the feature changes on one CPU core.

Each query had the form:

SELECT COUNT(*) FROM bench_bigram_* WHERE MATCH('"<text/2/2>"')

Benchmark setup

Data load without bigrams:

manticore-load \
  --drop \
  --wait \
  --threads=1 \
  --batch-size=1000 \
  --total=1000000 \
  --init="CREATE TABLE bench_bigram_none_rand(title text)" \
  --load="INSERT INTO bench_bigram_none_rand(id,title) VALUES(<increment>,'<text/60/80>')"

Data load with all bigrams:

manticore-load \
  --drop \
  --wait \
  --threads=1 \
  --batch-size=1000 \
  --total=1000000 \
  --init="CREATE TABLE bench_bigram_all_rand(title text) bigram_index='all'" \
  --load="INSERT INTO bench_bigram_all_rand(id,title) VALUES(<increment>,'<text/60/80>')"

Search benchmark without bigrams:

manticore-load \
  --threads=1 \
  --total=5000 \
  --load="SELECT COUNT(*) FROM bench_bigram_none_rand WHERE MATCH('\\\"<text/2/2>\\\"')"

Search benchmark with all bigrams:

manticore-load \
  --threads=1 \
  --total=5000 \
  --load="SELECT COUNT(*) FROM bench_bigram_all_rand WHERE MATCH('\\\"<text/2/2>\\\"')"

What I observed

On this local run:

Table                     QPS     Avg latency
bench_bigram_none_rand    755     1.3 ms
bench_bigram_all_rand     2175    0.4 ms

That is roughly a 2.9x improvement in QPS and about a 3.2x improvement in average latency on the same 1M-document workload.

Indexing was slower with bigram_index='all', which is expected:

  • without bigrams: about 45k docs/sec
  • with all: about 17k docs/sec

That trade-off is exactly why multiple modes exist.

Final takeaway

If your main problem is phrase-search performance, treat bigram_index first and foremost as an acceleration feature.

For most real workloads, start with both_freq and measure. Move to all if you need a stronger effect and can afford the extra indexing cost. Consider first_freq when your phrase workload is heavily shaped by common bridge words.

Install Manticore Search
