# Parallel chunk merging in Manticore Search

Manticore Search now supports parallel RT disk chunk merging and N-way merges, reducing compaction time dramatically while keeping ingest throughput stable.

Starting from **Manticore Search 24.4.0**, RT table compaction has a more capable execution model. Instead of merging chunk pairs one-by-one in a serial flow, optimization now supports two important improvements:

- disk chunk merges can run in parallel
- each merge job can merge more than two chunks at once

- [parallel_chunk_merges](https://manual.manticoresearch.com/Server_settings/Searchd#parallel_chunk_merges): how many RT disk chunk merge jobs may run at the same time
- [merge_chunks_per_job](https://manual.manticoresearch.com/Server_settings/Searchd#merge_chunks_per_job): how many RT disk chunks a single job can merge in one pass

The compaction docs were also updated to describe optimization as an **N-way merge** handled by a **background worker pool** rather than a single serial merge thread.

## Why this matters

For RT workloads, the interesting number is often not just how fast you can insert documents, but how long it takes until compaction catches up and the table returns to its target chunk count.

That is especially noticeable when:

- you ingest data at a sustained rate
- `optimize_cutoff` is low enough that merges kick in early
- you wait for compaction to finish before considering the load fully complete

This matters most in two common cases:

- you are doing an initial bulk upload into a real-time table and want the table not just searchable, but already compacted to its steady state before putting more pressure on it
- you regularly ingest large batches and want each batch to finish cleanly before the next one arrives

The table is searchable before compaction finishes, but "fully searchable" and "fully optimized" are not the same thing. A higher chunk count can still matter if you care about keeping the table close to its target shape, limiting background merge work before the next ingest wave, or reducing the window where storage is busy with post-load compaction.

To show the difference, we loaded **10 million documents** into an RT table. Each document contains:

- `id bigint`
- `name text` with generated text between 10 and 100 words
- `type int`

The table was created with:

```sql
CREATE TABLE test(id bigint, name text, type int) optimize_cutoff='16'
```

So the target was to compact the table back down to roughly 16 disk chunks.

For the benchmark we used [manticore-load](/blog/manticore-load/), our load generation and benchmarking tool. It is useful for reproducing scenarios like this, stress-testing ingestion, and comparing configuration changes without building custom scripts every time.

The data was loaded with:

```bash
manticore-load \
  --cache-gen-workers=5 \
  --drop \
  --batch-size=1000 \
  --threads=5 \
  --total=10000000 \
  --init="CREATE TABLE test(id bigint, name text, type int) optimize_cutoff='16'" \
  --load="INSERT INTO test(id,name,type) VALUES(<increment>,'<text/10/100>',<int/1/100>)" \
  --wait
```

## Before: one merge job, two chunks at a time

With the old behavior forced explicitly:

```bash
mysql -P9306 -h0 -e "set global parallel_chunk_merges=1; set global merge_chunks_per_job=2"
```

the run looked like this:

- merging started at **14 seconds**, when about **1.8M** documents had been inserted
- all **10M** documents were loaded after **1 minute 18 seconds**
- at that point the data was already fully searchable
- compaction kept running in the background until **3 minutes 23 seconds**

At `01:18`, the table still had more than 50 chunks. Near the end of loading the status looked like:

```text
17:14:50  01:17     98%         133      128.4K   21%     5          53        1         4.22GB      9.9M
17:14:51  01:18     100%        131      310.9K   15%     1          53        1         4.27GB      10.0M
...
17:16:55  03:22     100%        0        49.4K    4%      1          17        1         4.27GB      10.0M
...
Total time:       03:23
```

This is the classic pattern of a healthy ingest pipeline followed by a long merge tail.

## After: parallel merges plus larger merge jobs

With the new settings:

```bash
mysql -P9306 -h0 -e "set global parallel_chunk_merges=3; set global merge_chunks_per_job=5"
```

the same workload finished much faster:

- merging again started at about **14 seconds**
- all **10M** documents were again loaded after about **1 minute 18 seconds**
- full compaction finished after only **1 minute 31 seconds**

The end of the run looked like this:

```text
17:19:22  01:17     99%         127      127.9K   28%     6          26        1         4.22GB      9.9M
17:19:23  01:18     100%        132      1883.8K  17%     1          23        1         4.25GB      10.0M
...
17:19:36  01:31     100%        0        110.2K   3%      1          17        1         4.25GB      10.0M
...
Total time:       01:31
```

## What changed in practice

The ingest phase itself stayed roughly the same:

- old settings: **1:18** to load all data
- new settings: **1:18** to load all data

The big gain came from post-ingest compaction:

- old settings: about **2:05** of additional merge time after loading finished
- new settings: about **0:13** of additional merge time after loading finished

That is roughly:

- **55% lower total time** overall, from **3:23** down to **1:31**
- about **90% less merge tail** after the last document was inserted

Chunk pressure during ingest was much lower too. Near the end of loading:

- old settings: **53 chunks**
- new settings: **23 chunks**

So the improvement is not just that compaction finishes sooner. It also keeps the chunk count under control much more aggressively while data is still being inserted.

## What about the new defaults?

On this server, with the new default settings and no explicit tuning at all, the same workload finished in:

```text
Total time:       01:57
```

That already cuts the old `03:23` result substantially, while still leaving room for additional tuning with:

- `parallel_chunk_merges`
- `merge_chunks_per_job`

In other words, the new defaults already improve the out-of-the-box experience, and systems with enough I/O headroom can push compaction even further by increasing both settings carefully.

## Broader benchmark results: row-wise and columnar storage

The 10M-document example above shows the mechanics clearly, but the larger picture is even more interesting. In a wider test matrix we measured the combined **load + optimize** time for both row-wise and columnar storage across multiple values of:

- `parallel_chunk_merges`
- `merge_chunks_per_job`

The headline result is that, in some cases, tuning these settings can reduce total load + optimize time by:

- up to **4.6x** for **row-wise** storage
- up to **6.8x** for **columnar** storage

Here is the best-vs-worst picture from that test set:

| Storage | Best settings | Best time | Slowest settings | Slowest time | Improvement |
| --- | --- | --- | --- | --- | --- |
| Row-wise | `parallel_chunk_merges=4`, `merge_chunks_per_job=5` | 14:35 | `parallel_chunk_merges=1`, `merge_chunks_per_job=2` | 67:15 | 4.61x |
| Columnar | `parallel_chunk_merges=4`, `merge_chunks_per_job=5` | 15:10 | `parallel_chunk_merges=1`, `merge_chunks_per_job=2` | 99:14 | 6.80x |

There is also a useful tuning pattern in the full results:

- the best runs for both storage modes clustered around `parallel_chunk_merges=4..5`
- the best runs also clustered around `merge_chunks_per_job=4..5`
- the slowest results were consistently at `parallel_chunk_merges=1` with `merge_chunks_per_job=2`

In other words, the old serial two-chunk pattern is not just a little slower. On large workloads it can become dramatically slower, especially with columnar storage.

## How to think about the two settings

The new docs describe two separate levers:

- `parallel_chunk_merges` increases how many merge jobs can run at once
- `merge_chunks_per_job` increases how many chunks each job can consume

Lower `merge_chunks_per_job` values make it easier to schedule more jobs in parallel because each job consumes fewer chunks from the available pool. If a table has many chunks waiting to be compacted, smaller jobs leave more independent chunks available for other workers, so the scheduler can keep several merges active at once. Higher values reduce the total number of merge steps, but each job becomes heavier and grabs a larger portion of the available chunks, which can leave less room for concurrent jobs.

The right balance depends on your storage and workload, but the benchmark above shows that combining both approaches can dramatically reduce the time spent waiting for RT chunk compaction to finish.

## Takeaway

If your RT workloads spend too long waiting for chunk compaction after bulk inserts, the new parallel merge model changes that equation significantly.

On this 10M-document test with `optimize_cutoff=16`:

| Mode | Searchable at | Fully optimized at |
| --- | --- | --- |
| Old settings: `parallel_chunk_merges=1`, `merge_chunks_per_job=2` | 1:18 | 3:23 |
| New defaults | 1:18 | 1:57 |
| Tuned new settings: `parallel_chunk_merges=3`, `merge_chunks_per_job=5` | 1:18 | 1:31 |

- the time until all data became searchable stayed the same
- the time until chunk compaction completed dropped from **3:23** to **1:31**
- even the new defaults reduced the total time to **1:57**

This is exactly the kind of improvement that matters for operational RT indexing. The data is searchable as soon as it is loaded, and that point stayed about the same in both runs. The difference is what happens after that: how long the server keeps spending time compacting chunks in the background before the table returns to its target shape. If your workflow depends on the table becoming compact again before the next heavy ingest, before a maintenance window closes, or before you hand the system over to a search workload that should run with fewer chunks and less background merge pressure, the improvement is substantial.
