Starting from Manticore Search 24.4.0, RT table compaction has a more capable execution model. Instead of merging chunk pairs one by one in a serial flow, optimization now supports two important improvements:
- disk chunk merges can run in parallel
- each merge job can merge more than two chunks at once

Two new settings control this behavior:

- parallel_chunk_merges : how many RT disk chunk merge jobs may run at the same time
- merge_chunks_per_job : how many RT disk chunks a single job can merge in one pass
The compaction docs were also updated to describe optimization as an N-way merge handled by a background worker pool rather than a single serial merge thread.
Why this matters
For RT workloads, the interesting number is often not just how fast you can insert documents, but how long it takes until compaction catches up and the table returns to its target chunk count.
That is especially noticeable when:
- you ingest data at a sustained rate
- optimize_cutoff is low enough that merges kick in early
- you wait for compaction to finish before considering the load fully complete
This matters most in two common cases:
- you are doing an initial bulk upload into a real-time table and want the table not just searchable, but already compacted to its steady state before putting more pressure on it
- you regularly ingest large batches and want each batch to finish cleanly before the next one arrives
The table is searchable before compaction finishes, but "fully searchable" and "fully optimized" are not the same thing. A higher chunk count can still matter if you care about keeping the table close to its target shape, limiting background merge work before the next ingest wave, or reducing the window where storage is busy with post-load compaction.
To show the difference, we loaded 10 million documents into an RT table. Each document contains:
- id bigint
- name text, with generated text between 10 and 100 words
- type int
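For illustration only: manticore-load generates these values itself through its `<increment>`, `<text/10/100>`, and `<int/1/100>` placeholders, but a minimal Python sketch of rows with the same shape (the vocabulary below is made up) could look like this:

```python
import random

# Hypothetical vocabulary; manticore-load's <text/10/100> placeholder
# generates its own random text, this sketch only mimics the row shape.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]

def make_row(doc_id: int) -> tuple:
    """Build one (id, name, type) row matching the benchmark schema:
    10..100 random words of text and a type value in 1..100."""
    word_count = random.randint(10, 100)
    name = " ".join(random.choices(WORDS, k=word_count))
    return (doc_id, name, random.randint(1, 100))

print(make_row(1))  # e.g. (1, 'echo alpha bravo ...', 57)
```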
The table was created with:
CREATE TABLE test(id bigint, name text, type int) optimize_cutoff='16'
So the target was to compact the table back down to roughly 16 disk chunks.
For the benchmark we used manticore-load, our load generation and benchmarking tool. It is useful for reproducing scenarios like this, stress-testing ingestion, and comparing configuration changes without building custom scripts every time.
The data was loaded with:
manticore-load \
--cache-gen-workers=5 \
--drop \
--batch-size=1000 \
--threads=5 \
--total=10000000 \
--init="CREATE TABLE test(id bigint, name text, type int) optimize_cutoff='16'" \
--load="INSERT INTO test(id,name,type) VALUES(<increment>,'<text/10/100>',<int/1/100>)" \
--wait
Before: one merge job, two chunks at a time
With the old behavior forced explicitly:
mysql -P9306 -h0 -e "set global parallel_chunk_merges=1; set global merge_chunks_per_job=2"
the run looked like this:
- merging started at 14 seconds, when about 1.8M documents had been inserted
- all 10M documents were loaded after 1 minute 18 seconds
- at that point the data was already fully searchable
- compaction kept running in the background until 3 minutes 23 seconds
At 01:18, the table still had more than 50 chunks. Near the end of loading the status looked like:
17:14:50 01:17 98% 133 128.4K 21% 5 53 1 4.22GB 9.9M
17:14:51 01:18 100% 131 310.9K 15% 1 53 1 4.27GB 10.0M
...
17:16:55 03:22 100% 0 49.4K 4% 1 17 1 4.27GB 10.0M
...
Total time: 03:23
This is the classic pattern of a healthy ingest pipeline followed by a long merge tail.
After: parallel merges plus larger merge jobs
With the new settings:
mysql -P9306 -h0 -e "set global parallel_chunk_merges=3; set global merge_chunks_per_job=5"
the same workload finished much faster:
- merging again started at about 14 seconds
- all 10M documents were again loaded after about 1 minute 18 seconds
- full compaction finished after only 1 minute 31 seconds
The end of the run looked like this:
17:19:22 01:17 99% 127 127.9K 28% 6 26 1 4.22GB 9.9M
17:19:23 01:18 100% 132 1883.8K 17% 1 23 1 4.25GB 10.0M
...
17:19:36 01:31 100% 0 110.2K 3% 1 17 1 4.25GB 10.0M
...
Total time: 01:31
What changed in practice
The ingest phase itself stayed roughly the same:
- old settings: 1:18 to load all data
- new settings: 1:18 to load all data
The big gain came from post-ingest compaction:
- old settings: about 2:05 of additional merge time after loading finished
- new settings: about 0:13 of additional merge time after loading finished
That is roughly:
- 55% lower total time overall, from 3:23 down to 1:31
- about 90% less merge tail after the last document was inserted
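Those percentages follow directly from the measured times. Converting to seconds makes the arithmetic easy to check:

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' duration into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

old_total = to_seconds("3:23")  # 203 s total, old settings
new_total = to_seconds("1:31")  # 91 s total, new settings
old_tail = to_seconds("2:05")   # merge tail after loading, old settings
new_tail = to_seconds("0:13")   # merge tail after loading, new settings

total_cut = (old_total - new_total) / old_total
tail_cut = (old_tail - new_tail) / old_tail
print(f"total time reduced by {total_cut:.0%}")  # prints "total time reduced by 55%"
print(f"merge tail reduced by {tail_cut:.0%}")   # prints "merge tail reduced by 90%"
```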
Chunk pressure during ingest was much lower too. Near the end of loading:
- old settings: 53 chunks
- new settings: 23 chunks
So the improvement is not just that compaction finishes sooner. It also keeps the chunk count under control much more aggressively while data is still being inserted.
What about the new defaults?
On this server, with the new default settings and no explicit tuning at all, the same workload finished in:
Total time: 01:57
That already cuts the old 03:23 result substantially, while still leaving room for additional tuning with:
- parallel_chunk_merges
- merge_chunks_per_job
In other words, the new defaults already improve the out-of-the-box experience, and systems with enough I/O headroom can push compaction even further by increasing both settings carefully.
Broader benchmark results: row-wise and columnar storage
The 10M-document example above shows the mechanics clearly, but the larger picture is even more interesting. In a wider test matrix we measured the combined load + optimize time for both row-wise and columnar storage across multiple values of:
- parallel_chunk_merges
- merge_chunks_per_job
The headline result is that, in some cases, tuning these settings can reduce total load + optimize time by:
- up to 4.6x for row-wise storage
- up to 6.8x for columnar storage
Here is the best-vs-worst picture from that test set:
| Storage | Best settings | Best time | Slowest settings | Slowest time | Improvement |
|---|---|---|---|---|---|
| Row-wise | parallel_chunk_merges=4, merge_chunks_per_job=5 | 14:35 | parallel_chunk_merges=1, merge_chunks_per_job=2 | 67:15 | 4.61x |
| Columnar | parallel_chunk_merges=4, merge_chunks_per_job=5 | 15:10 | parallel_chunk_merges=1, merge_chunks_per_job=2 | 99:14 | 6.80x |
There is also a useful tuning pattern in the full results:
- the best runs for both storage modes clustered around parallel_chunk_merges=4..5
- the best runs also clustered around merge_chunks_per_job=4..5
- the slowest results were consistently at parallel_chunk_merges=1 with merge_chunks_per_job=2
In other words, the old serial two-chunk pattern is not just a little slower. On large workloads it can become dramatically slower, especially with columnar storage.
How to think about the two settings
The new docs describe two separate levers:
- parallel_chunk_merges increases how many merge jobs can run at once
- merge_chunks_per_job increases how many chunks each job can consume
Lower merge_chunks_per_job values make it easier to schedule more jobs in parallel because each job consumes fewer chunks from the available pool. If a table has many chunks waiting to be compacted, smaller jobs leave more independent chunks available for other workers, so the scheduler can keep several merges active at once. Higher values reduce the total number of merge steps, but each job becomes heavier and grabs a larger portion of the available chunks, which can leave less room for concurrent jobs.
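A simplified back-of-envelope model (not the actual scheduler) helps picture the interaction: an N-way merge replaces N chunks with one, so each job removes N-1 chunks from the backlog.

```python
import math

def merge_jobs_needed(chunks: int, cutoff: int, chunks_per_job: int) -> int:
    """How many merge jobs it takes to shrink `chunks` down to `cutoff`,
    assuming each N-way job replaces N chunks with one (removing N - 1).
    This ignores re-merging of freshly produced chunks and scheduling
    details, so it is only a rough lower bound."""
    excess = chunks - cutoff
    return math.ceil(excess / (chunks_per_job - 1))

# 53 chunks were observed near the end of the serial run, with a cutoff of 16:
print(merge_jobs_needed(53, 16, 2))  # prints 37 (serial 2-way jobs)
print(merge_jobs_needed(53, 16, 5))  # prints 10 (5-way jobs)
```

Under this model, larger jobs cut the number of merge steps almost fourfold, and with several workers running independent jobs concurrently the wall-clock time shrinks further, which matches the direction of the benchmark results above.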
The right balance depends on your storage and workload, but the benchmark above shows that combining both approaches can dramatically reduce the time spent waiting for RT chunk compaction to finish.
Takeaway
If your RT workloads spend too long waiting for chunk compaction after bulk inserts, the new parallel merge model changes that equation significantly.
On this 10M-document test with optimize_cutoff=16:
| Mode | Searchable at | Fully optimized at |
|---|---|---|
| Old settings: parallel_chunk_merges=1, merge_chunks_per_job=2 | 1:18 | 3:23 |
| New defaults | 1:18 | 1:57 |
| Tuned new settings: parallel_chunk_merges=3, merge_chunks_per_job=5 | 1:18 | 1:31 |
- the time until all data became searchable stayed the same
- the time until chunk compaction completed dropped from 3:23 to 1:31
- even the new defaults reduced the total time to 1:57
This is exactly the kind of improvement that matters for operational RT indexing. The data is searchable as soon as it is loaded, and that point stayed about the same in both runs. The difference is what happens after that: how long the server keeps spending time compacting chunks in the background before the table returns to its target shape. If your workflow depends on the table becoming compact again before the next heavy ingest, before a maintenance window closes, or before you hand the system over to a search workload that should run with fewer chunks and less background merge pressure, the improvement is substantial.
