TL;DR
`bigram_index` can be used for several purposes, and in this article we focus specifically on phrase-search performance: on the 1M-document benchmark below, `bigram_index='all'` improved QPS by about 2.9x and cut average phrase-query latency by about 3.2x.
If your main problem is matching `xt850` against `xt 850` rather than speeding up phrase search, see "How to Make xt850 Match xt 850" instead.
Phrase search can be expensive. Even when a query is short, the engine still has to verify ordering and adjacency, and that work gets more noticeable when:
- the individual words are common
- the dataset is large
- phrase queries are frequent in your workload
That is exactly what bigram_index is for.
What bigram indexing actually does
Normally, a phrase like "noise cancelling headphones" is handled as separate tokens that also need to appear in the right order and next to each other. Bigram indexing lets Manticore pre-store adjacent token pairs such as:
- `noise cancelling`
- `cancelling headphones`
That gives the engine a faster way to narrow down candidate documents during phrase matching.
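To make the idea concrete, here is a minimal Python sketch of the pairing step. It assumes a plain lowercase whitespace split, which is a simplification of Manticore's real, configurable tokenizer:

```python
def bigram_pairs(text):
    """Return the adjacent token pairs that bigram_index='all' would store.

    Tokenization is simplified to a lowercase whitespace split; the real
    tokenizer in Manticore is configurable and more involved.
    """
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(bigram_pairs("noise cancelling headphones"))
# -> ['noise cancelling', 'cancelling headphones']
```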
This article focuses specifically on phrase acceleration.
Important caveat: bigrams work at tokenization level
This is the part that is easy to miss when you only look at the happy-path speedup story.
bigram_index works at the tokenization level only. It does not account for later transformations such as morphology, wordforms, or stopwords, and that can materially change phrase-matching expectations.
The practical conclusion is simple: bigrams can be excellent for phrase speed, but if your index relies heavily on morphology, wordforms, or stopwords, test the actual phrase behavior you care about before rolling the setting out broadly.
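To make the caveat concrete, here is a hedged Python sketch of the ordering problem: bigram pairs come from the raw token stream, while transformations like stopword filtering happen later, so the two views of the same text can disagree. The stopword list and the whitespace tokenizer here are invented for illustration:

```python
STOPWORDS = {"of", "the"}  # illustrative list, not Manticore's default

def tokenize(text):
    # stand-in for the real tokenizer
    return text.lower().split()

def bigram_pairs(tokens):
    # bigrams are derived from the raw token stream...
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def after_stopwords(tokens):
    # ...while stopword filtering is applied afterwards
    return [t for t in tokens if t not in STOPWORDS]

raw = tokenize("lord of the rings")
print(bigram_pairs(raw))     # pairs still contain 'of' and 'the'
print(after_stopwords(raw))  # ['lord', 'rings'] -- the words the pairs reference are filtered out
```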
Mode 1: Default behavior
This is the baseline. No explicit bigram indexing is enabled, so no bigram posting lists are stored.
Use it when:
- phrase search is rare
- documents are short
- you want the leanest indexing path
Example
DROP TABLE IF EXISTS bi_none_demo;
CREATE TABLE bi_none_demo(title text);
INSERT INTO bi_none_demo VALUES
(1,'wireless noise cancelling headphones'),
(2,'noise cancelling microphone'),
(3,'wireless gaming headset');
SELECT id, title FROM bi_none_demo WHERE MATCH('"noise cancelling"');
This is the baseline behavior. The query matches the expected rows, but Manticore has no precomputed bigram posting lists to help resolve the phrase more efficiently.
Mode 2: all
bigram_index = all
This is the most aggressive phrase-acceleration mode. Every adjacent token pair gets indexed as a bigram.
Use it when:
- exact phrase search is a core feature
- phrase queries often include common words and produce many candidates
- you want the strongest phrase acceleration
- you do not want to tune a frequent-word list
Example
DROP TABLE IF EXISTS bi_all_demo;
CREATE TABLE bi_all_demo(title text)
bigram_index='all';
INSERT INTO bi_all_demo VALUES
(1,'lord of the rings trilogy'),
(2,'house of the dragon season 2'),
(3,'made for iphone charger');
SELECT id, title FROM bi_all_demo WHERE MATCH('"house of the dragon"');
SELECT id, title FROM bi_all_demo WHERE MATCH('"made for iphone"');
The important point here is not different matches, but different indexing strategy: all stores every adjacent pair, so phrase queries have the maximum amount of bigram help available at search time.
The time to choose `all` is when phrase search gets expensive because many documents match the individual words, forcing Manticore to do more positional verification to confirm the exact phrase. `all` helps by narrowing candidates earlier.
Mode 3: first_freq
bigram_index = first_freq
bigram_freq_words = for, of, the, with
This mode stores a pair only when the first token is in your frequent-word list.
Use it when:
- phrase search matters
- you want a lighter alternative to `all`
- many phrases in your data contain words that are genuinely frequent in your own corpus
With the list above:
- `for iphone` is eligible
- `of the` is eligible
- `the dragon` is eligible
- `made for` is not eligible
- `lord of` is not eligible
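The rule can be expressed as a one-line predicate. A small Python sketch of it, again with whitespace splitting standing in for the real tokenizer:

```python
FREQ_WORDS = {"for", "of", "the", "with"}  # mirrors the bigram_freq_words list above

def first_freq_eligible(pair):
    """A pair is stored under first_freq only if its first token is frequent."""
    first, _second = pair.split()
    return first in FREQ_WORDS

for pair in ["for iphone", "of the", "the dragon", "made for", "lord of"]:
    print(pair, "->", first_freq_eligible(pair))
```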
For production use, do not pick `bigram_freq_words` from memory. Derive it from your own data. A practical way is to dump dictionary stats with `indextool` using `--dumpdict ... --stats`, review the most frequent tokens, and then build a small `bigram_freq_words` list from those results.
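Turning those stats into a list is a small sorting exercise. Here is a hedged Python sketch that assumes a simplified `token docs hits` line format; the exact column layout of the `indextool` dump may differ between versions, so adapt the parsing to what yours actually prints:

```python
def top_freq_words(dumpdict_lines, top_n=4):
    """Pick candidate bigram_freq_words from dictionary stats.

    Assumes each line looks like 'token docs hits' (an assumption for this
    sketch, not a guaranteed indextool format). Returns a comma-separated
    list ready to paste into bigram_freq_words.
    """
    rows = []
    for line in dumpdict_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[1].isdigit():
            rows.append((parts[0], int(parts[1])))
    rows.sort(key=lambda r: r[1], reverse=True)  # most frequent first
    return ",".join(token for token, _ in rows[:top_n])

sample = ["the 981234 1500000", "of 774210 900100", "dragon 1201 1300",
          "with 401770 500000", "for 655002 700000"]
print(top_freq_words(sample))  # -> 'the,of,for,with'
```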
Example
DROP TABLE IF EXISTS bi_first_freq_demo;
CREATE TABLE bi_first_freq_demo(title text)
bigram_index='first_freq'
bigram_freq_words='for,of,the,with';
INSERT INTO bi_first_freq_demo VALUES
(1,'made for iphone charger'),
(2,'lord of the rings trilogy'),
(3,'house of the dragon season 2');
SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"made for iphone"');
SELECT id, title FROM bi_first_freq_demo WHERE MATCH('"lord of the"');
The queries still return the expected rows. What changes is which pairs get indexed:
"made for iphone"benefits fromfor iphone"lord of the"benefits fromof the
This makes first_freq a lighter alternative to all when many useful phrases involve common bridge words.
Mode 4: both_freq
bigram_index = both_freq
bigram_freq_words = for, of, the, with
This is the narrowest frequency-based mode. A pair is stored only when both tokens are in the frequent-word list.
Use it when:
- you want the most conservative bigram footprint
- you mainly care about pairs built from words that are highly frequent in your corpus
- you are tuning a large corpus and do not want to index every adjacent pair
With the same list:
- `of the` is eligible
- `for iphone` is not eligible
- `the dragon` is not eligible
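The stricter predicate differs from `first_freq` by a single condition. A small Python sketch, under the same whitespace-tokenization simplification:

```python
FREQ_WORDS = {"for", "of", "the", "with"}  # mirrors the bigram_freq_words list above

def both_freq_eligible(pair):
    """A pair is stored under both_freq only if both tokens are frequent."""
    first, second = pair.split()
    return first in FREQ_WORDS and second in FREQ_WORDS

for pair in ["of the", "for iphone", "the dragon"]:
    print(pair, "->", both_freq_eligible(pair))
```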
Example
DROP TABLE IF EXISTS bi_both_freq_demo;
CREATE TABLE bi_both_freq_demo(title text)
bigram_index='both_freq'
bigram_freq_words='for,of,the,with';
INSERT INTO bi_both_freq_demo VALUES
(1,'lord of the rings trilogy'),
(2,'house of the dragon season 2'),
(3,'made for iphone charger');
SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"lord of the"');
SELECT id, title FROM bi_both_freq_demo WHERE MATCH('"made for iphone"');
The queries still match, but the internal selectivity differs:
"lord of the"includesof the, whichboth_freqis willing to store"made for iphone"includesfor iphone, whichfirst_freqwould cover butboth_freqwould not
Which performance mode should you choose?
The benchmark in this article shows that all can deliver a strong speedup, but it is still just one benchmark on one workload.
Manticore's own documentation says that for most use cases, both_freq is the best mode. That is a sensible default because it aims for a more balanced trade-off between phrase acceleration and indexing cost.
Use the modes like this:
- choose `both_freq` as the default starting point for general phrase-search workloads
- choose `all` when phrase search is especially important and you want the strongest acceleration, accepting the higher indexing cost
- choose `first_freq` when many useful phrases in your data involve common bridge words and you want something broader than `both_freq`
- choose the default behavior when phrase acceleration is not important
Benchmark: does bigram indexing really speed up phrase search?
Yes. In a simple local benchmark, the difference was easy to measure.
I used manticore-load to build two 1M-document tables against the same Manticore instance:
- one with no explicit `bigram_index` setting
- one with `bigram_index='all'`
The documents were random 60-80 word texts, and the benchmark repeatedly ran random 2-word phrase queries.
For clarity, both indexing and search were run with --threads=1. Multi-threaded numbers would of course be higher, but single-thread runs make it easier to see what the feature changes on one CPU core.
Each search had the form:
SELECT COUNT(*) FROM bench_bigram_* WHERE MATCH('"<text/2/2>"')
Benchmark setup
Data load without bigrams:
manticore-load \
--drop \
--wait \
--threads=1 \
--batch-size=1000 \
--total=1000000 \
--init="CREATE TABLE bench_bigram_none_rand(title text)" \
--load="INSERT INTO bench_bigram_none_rand(id,title) VALUES(<increment>,'<text/60/80>')"
Data load with all bigrams:
manticore-load \
--drop \
--wait \
--threads=1 \
--batch-size=1000 \
--total=1000000 \
--init="CREATE TABLE bench_bigram_all_rand(title text) bigram_index='all'" \
--load="INSERT INTO bench_bigram_all_rand(id,title) VALUES(<increment>,'<text/60/80>')"
Search benchmark without bigrams:
manticore-load \
--threads=1 \
--total=5000 \
--load="SELECT COUNT(*) FROM bench_bigram_none_rand WHERE MATCH('\\\"<text/2/2>\\\"')"
Search benchmark with all bigrams:
manticore-load \
--threads=1 \
--total=5000 \
--load="SELECT COUNT(*) FROM bench_bigram_all_rand WHERE MATCH('\\\"<text/2/2>\\\"')"
What I observed
On this local run:
| Table | QPS | Avg latency |
|---|---|---|
| `bench_bigram_none_rand` | 755 | 1.3 ms |
| `bench_bigram_all_rand` | 2175 | 0.4 ms |
That is roughly a 2.9x improvement in QPS and about a 3.2x improvement in average latency on the same 1M-document workload.
Indexing was slower with bigram_index='all', which is expected:
- without bigrams: about 45k docs/sec
- with `all`: about 17k docs/sec
That trade-off is exactly why multiple modes exist.
Final takeaway
If your main problem is phrase-search performance, treat bigram_index first and foremost as an acceleration feature.
For most real workloads, start with both_freq and measure. Move to all if you need a stronger effect and can afford the extra indexing cost. Consider first_freq when your phrase workload is heavily shaped by common bridge words.
