
Parallel chunk merging in Manticore Search

Starting with Manticore Search 24.4.0, RT table compaction has a more capable execution model. Instead of merging chunk pairs one by one in a serial flow, optimization now brings two important improvements:

  • disk chunk merges can run in parallel

  • each merge job can merge more than two chunks at once

Two new settings control this behavior:

  • parallel_chunk_merges : how many RT disk chunk merge jobs may run at the same time

  • merge_chunks_per_job : how many RT disk chunks a single job can merge in one pass

The compaction docs were also updated to describe optimization as an N-way merge handled by a background worker pool rather than a single serial merge thread.
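To see why an N-way merge helps, it is enough to count merge jobs: a 2-way merge replaces two chunks with one, so each job reduces the chunk count by exactly one, while a 5-way merge removes four chunks per job. A back-of-the-envelope sketch of that counting argument (not Manticore's actual scheduler):

```python
def merge_jobs_needed(chunks: int, target: int, chunks_per_job: int) -> int:
    """Count merge jobs to reduce `chunks` to at most `target`,
    assuming each job replaces `chunks_per_job` chunks with one."""
    jobs = 0
    while chunks > target:
        # Never merge more chunks than needed to hit the target.
        merged = min(chunks_per_job, chunks - target + 1)
        chunks -= merged - 1  # k chunks become 1
        jobs += 1
    return jobs

# 53 chunks down to 16, as in the benchmark below:
print(merge_jobs_needed(53, 16, 2))  # 2-way merges: 37 jobs
print(merge_jobs_needed(53, 16, 5))  # 5-way merges: 10 jobs
```

Fewer, larger jobs mean less repeated rewriting of the same data, and running several of them at once shortens the wall-clock time further.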

Why this matters

For RT workloads, the interesting number is often not just how fast you can insert documents, but how long it takes until compaction catches up and the table returns to its target chunk count.

That is especially noticeable when:

  • you ingest data at a sustained rate
  • optimize_cutoff is low enough that merges kick in early
  • you wait for compaction to finish before considering the load fully complete

This matters most in two common cases:

  • you are doing an initial bulk upload into a real-time table and want the table not just searchable, but already compacted to its steady state before putting more pressure on it
  • you regularly ingest large batches and want each batch to finish cleanly before the next one arrives

The table is searchable before compaction finishes, but "fully searchable" and "fully optimized" are not the same thing. A higher chunk count can still matter if you care about keeping the table close to its target shape, limiting background merge work before the next ingest wave, or reducing the window where storage is busy with post-load compaction.

To show the difference, we loaded 10 million documents into an RT table. Each document contains:

  • id bigint
  • name text with generated text between 10 and 100 words
  • type int

The table was created with:

CREATE TABLE test(id bigint, name text, type int) optimize_cutoff='16'

So the target was to compact the table back down to roughly 16 disk chunks.
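While the load runs, you can watch the table converge toward that cutoff. One way (assuming the statement names in the current Manticore docs; verify them on your version) is to poll the table status, which reports a disk chunk counter, and to trigger compaction explicitly if needed:

```sql
-- Poll during or after the load; the disk chunk count should trend toward 16.
SHOW TABLE test STATUS;

-- Kick off compaction explicitly instead of waiting for auto-optimize:
OPTIMIZE TABLE test OPTION cutoff=16;
```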

For the benchmark we used manticore-load, our load-generation and benchmarking tool. It is useful for reproducing scenarios like this, stress-testing ingestion, and comparing configuration changes without writing custom scripts every time.

The data was loaded with:

manticore-load \
  --cache-gen-workers=5 \
  --drop \
  --batch-size=1000 \
  --threads=5 \
  --total=10000000 \
  --init="CREATE TABLE test(id bigint, name text, type int) optimize_cutoff='16'" \
  --load="INSERT INTO test(id,name,type) VALUES(<increment>,'<text/10/100>',<int/1/100>)" \
  --wait

Before: one merge job, two chunks at a time

With the old behavior forced explicitly:

mysql -P9306 -h0 -e "set global parallel_chunk_merges=1; set global merge_chunks_per_job=2"

the run looked like this:

  • merging started at 14 seconds, when about 1.8M documents had been inserted
  • all 10M documents were loaded after 1 minute 18 seconds
  • at that point the data was already fully searchable
  • compaction kept running in the background until 3 minutes 23 seconds

At 01:18, the table still had more than 50 chunks. Near the end of loading the status looked like:

17:14:50  01:17     98%         133      128.4K   21%     5          53        1         4.22GB      9.9M
17:14:51  01:18     100%        131      310.9K   15%     1          53        1         4.27GB      10.0M
...
17:16:55  03:22     100%        0        49.4K    4%      1          17        1         4.27GB      10.0M
...
Total time:       03:23

This is the classic pattern of a healthy ingest pipeline followed by a long merge tail.

After: parallel merges plus larger merge jobs

With the new settings:

mysql -P9306 -h0 -e "set global parallel_chunk_merges=3; set global merge_chunks_per_job=5"

the same workload finished much faster:

  • merging again started at about 14 seconds
  • all 10M documents were again loaded after about 1 minute 18 seconds
  • full compaction finished after only 1 minute 31 seconds

The end of the run looked like this:

17:19:22  01:17     99%         127      127.9K   28%     6          26        1         4.22GB      9.9M
17:19:23  01:18     100%        132      1883.8K  17%     1          23        1         4.25GB      10.0M
...
17:19:36  01:31     100%        0        110.2K   3%      1          17        1         4.25GB      10.0M
...
Total time:       01:31

What changed in practice

The ingest phase itself stayed roughly the same:

  • old settings: 1:18 to load all data
  • new settings: 1:18 to load all data

The big gain came from post-ingest compaction:

  • old settings: about 2:05 of additional merge time after loading finished
  • new settings: about 0:13 of additional merge time after loading finished

That is roughly:

  • 55% lower total time overall, from 3:23 down to 1:31
  • about 90% less merge tail after the last document was inserted
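Those percentages follow directly from the timings above; a quick sanity check of the arithmetic:

```python
def to_seconds(mmss: str) -> int:
    """Parse an m:ss timing like '3:23' into seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

load = to_seconds("1:18")  # load phase, same in both runs
old_total, new_total = to_seconds("3:23"), to_seconds("1:31")

total_drop = 1 - new_total / old_total                    # overall reduction
tail_drop = 1 - (new_total - load) / (old_total - load)   # merge-tail reduction
print(f"{total_drop:.0%} lower total, {tail_drop:.0%} shorter merge tail")
```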

Chunk pressure during ingest was much lower too. Near the end of loading:

  • old settings: 53 chunks
  • new settings: 23 chunks

So the improvement is not just that compaction finishes sooner. It also keeps the chunk count under control much more aggressively while data is still being inserted.

What about the new defaults?

On this server, with the new default settings and no explicit tuning at all, the same workload finished in:

Total time:       01:57

That already cuts the old 03:23 result substantially, while still leaving room for additional tuning with:

  • parallel_chunk_merges
  • merge_chunks_per_job

In other words, the new defaults already improve the out-of-the-box experience, and systems with enough I/O headroom can push compaction even further by increasing both settings carefully.

Broader benchmark results: row-wise and columnar storage

The 10M-document example above shows the mechanics clearly, but the larger picture is even more interesting. In a wider test matrix we measured the combined load + optimize time for both row-wise and columnar storage across multiple values of:

  • parallel_chunk_merges
  • merge_chunks_per_job

The headline result is that, in some cases, tuning these settings makes the combined load + optimize run:

  • up to 4.6x faster for row-wise storage
  • up to 6.8x faster for columnar storage

Here is the best-vs-worst picture from that test set:

Storage    Best settings                                     Best time   Slowest settings                                  Slowest time   Improvement
Row-wise   parallel_chunk_merges=4, merge_chunks_per_job=5   14:35       parallel_chunk_merges=1, merge_chunks_per_job=2   67:15          4.61x
Columnar   parallel_chunk_merges=4, merge_chunks_per_job=5   15:10       parallel_chunk_merges=1, merge_chunks_per_job=2   99:14          6.80x

There is also a useful tuning pattern in the full results:

  • the best runs for both storage modes clustered around parallel_chunk_merges=4..5
  • the best runs also clustered around merge_chunks_per_job=4..5
  • the slowest results were consistently at parallel_chunk_merges=1 with merge_chunks_per_job=2

In other words, the old serial two-chunk pattern is not just a little slower. On large workloads it can become dramatically slower, especially with columnar storage.

How to think about the two settings

The new docs describe two separate levers:

  • parallel_chunk_merges increases how many merge jobs can run at once
  • merge_chunks_per_job increases how many chunks each job can consume

Lower merge_chunks_per_job values make it easier to schedule more jobs in parallel because each job consumes fewer chunks from the available pool. If a table has many chunks waiting to be compacted, smaller jobs leave more independent chunks available for other workers, so the scheduler can keep several merges active at once. Higher values reduce the total number of merge steps, but each job becomes heavier and grabs a larger portion of the available chunks, which can leave less room for concurrent jobs.
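That interplay can be illustrated with a toy model (synchronous "waves" of merges, not the real scheduler): with P workers each consuming K chunks per job, one wave removes up to P*(K-1) chunks from the backlog, but only while enough independent chunks remain to feed all P jobs.

```python
def waves_to_drain(chunks: int, target: int, workers: int, per_job: int) -> int:
    """Toy model: synchronous merge waves needed to reach `target` chunks,
    with up to `workers` concurrent jobs of `per_job` chunks each."""
    waves = 0
    while chunks > target:
        # Jobs schedulable this wave (at least one, merging whatever is left).
        jobs = min(workers, chunks // per_job) or 1
        chunks -= min(jobs * (per_job - 1), chunks - target)
        waves += 1
    return waves

# 53 chunks down to 16, as in the benchmark above:
print(waves_to_drain(53, 16, workers=1, per_job=2))  # serial 2-way: 37 waves
print(waves_to_drain(53, 16, workers=3, per_job=5))  # 3 workers, 5-way: 4 waves
```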

The right balance depends on your storage and workload, but the benchmark above shows that combining both approaches can dramatically reduce the time spent waiting for RT chunk compaction to finish.

Takeaway

If your RT workloads spend too long waiting for chunk compaction after bulk inserts, the new parallel merge model changes that equation significantly.

On this 10M-document test with optimize_cutoff=16:

Mode                                                               Searchable at   Fully optimized at
Old settings: parallel_chunk_merges=1, merge_chunks_per_job=2      1:18            3:23
New defaults                                                       1:18            1:57
Tuned new settings: parallel_chunk_merges=3, merge_chunks_per_job=5  1:18          1:31

  • the time until all data became searchable stayed the same
  • the time until chunk compaction completed dropped from 3:23 to 1:31
  • even the new defaults reduced the total time to 1:57

This is exactly the kind of improvement that matters for operational RT indexing. The data is searchable as soon as it is loaded, and that point stayed about the same in both runs. The difference is what happens after that: how long the server keeps spending time compacting chunks in the background before the table returns to its target shape. If your workflow depends on the table becoming compact again before the next heavy ingest, before a maintenance window closes, or before you hand the system over to a search workload that should run with fewer chunks and less background merge pressure, the improvement is substantial.

Install Manticore Search
