blog-post

Faster KNN search in Manticore: 2-pass HNSW, batched distances, and AVX-512

author image

TL;DR: Three changes to the HNSW search engine improve KNN throughput by up to 29% at high k, with over 20% gains under concurrent load. No API changes, no index rebuild, no configuration. Just faster searches.

Faster KNN search in Manticore

Manticore's KNN search is built on top of hnswlib , an open-source HNSW implementation. Historically, most of our KNN work focused on custom distance functions, such as those used for binary quantization, rather than on hnswlib's core search loop. We also added features like prefiltering with ACORN-1 and early termination , but the main search loop stayed the same: hnswlib still visited neighbors, computed distances, and maintained its set of candidates the same way.

These changes go further, modifying hnswlib's core search loop itself - restructuring how it traverses neighbors, how it calls distance functions, and how it interacts with the CPU's memory hierarchy. Combined with new AVX-512 distance implementations in the columnar library, these changes target three sources of overhead: inefficient memory access patterns, redundant data loads, and indirect function call overhead.

Compile-time distance function specialization

Previously, the distance function was a runtime function pointer stored in the HNSW index and called for every candidate. For large search budgets, that can mean a large number of indirect calls per query. Indirect calls prevent the compiler from inlining the distance function into the search loop, and they create branch prediction overhead.

The new code resolves the distance function at compile time using C++ templates. When the search begins, a single switch statement selects the right template specialization based on the distance metric and quantization settings. From that point on, the entire inner loop - neighbor traversal, distance computation, candidate set updates - runs as one monolithic function with the distance calculation fully inlined. The compiler can now optimize register allocation, instruction scheduling, and loop unrolling across the distance computation boundary.

2-pass neighbor processing

The HNSW algorithm explores the graph by visiting nodes and computing distances to their neighbors. In the original implementation, each neighbor was processed in a single pass: check if visited, fetch its vector data, compute distance, update the set of candidates. This meant that memory prefetch hints had little time to take effect before the data was needed.

The new implementation splits this into two passes. Pass 1 iterates all neighbors of the current node, skips already-visited ones, and collects the unvisited neighbors into a small batch array. As each neighbor is added to the batch, a prefetch hint is issued for its vector data. Pass 2 iterates the batch and computes distances. By the time Pass 2 reaches each vector, the prefetch from Pass 1 has had time to bring the data into cache.

Pass 2 walks a compact sequential array of candidate IDs, not the graph structure itself. The underlying vector loads are still scattered, but the data has been prefetched ahead of time.

For unfiltered queries (no WHERE clause on the KNN search), the new code also takes a fast path that eliminates the per-candidate filter check entirely.

Batched distance computation

The 2-pass structure helps in two ways: it gives prefetching more time to work, and it makes batching easy. Once Pass 2 has a compact list of candidates, it can score them two at a time instead of one by one.

When scoring two candidates, the query vector is loaded once per SIMD iteration and reused for both distance computations, eliminating redundant loads.

This reduces repeated query-side loads and lets the scoring loop process candidates in pairs, with a fallback for an odd remainder. Batch-2 functions are provided for inner product, L2, and their binary-quantized variants.

AVX-512 support

The new AVX-512 distance code processes 16 floats per iteration instead of 8 with AVX2. For inner product and L2 distance, the core loop uses fused multiply-add (_mm512_fmadd_ps), which combines multiplication and accumulation in a single instruction. For binary-quantized vectors, the AVX-512 VPOPCNTDQ extension speeds up bit-counting operations used in distance calculation.

Manticore now ships three library variants: a baseline build, an AVX2 build, and an AVX-512 build. At startup, Manticore detects the CPU's capabilities and loads the appropriate library automatically. No configuration is needed.

Benchmark results

The following benchmarks were run on the dbpedia-openai-1M-1536-angular dataset (1M vectors, 1536 dimensions, cosine distance) on an AMD Ryzen 7 9700X (Zen 5, 8 physical cores / 16 logical cores). All data uses 1-bit binary quantization with oversampling and rescoring disabled. For multithreaded runs, throughput is reported as average per-thread queries per second: each worker runs its own batch of queries, its QPS is measured independently, and the final number is the average across workers. Each result is the average of 6 independent runs. Early termination was also disabled to isolate the effect of these optimizations on raw HNSW traversal.

Zen 5 was chosen because it supports AVX-512 with native 512-bit datapaths, avoiding the split-512 execution behavior and heavy AVX-512 downclocking associated with some older Intel processors. This helps isolate the algorithmic effects of these changes from CPU-specific AVX-512 throttling behavior.

Algorithmic improvements alone

The first chart isolates the effect of the algorithmic changes (2-pass processing, batched distances, compile-time dispatch) by comparing the new AVX2 build against the previous AVX2 build. Both builds use the same SIMD instruction set, so the difference is purely from the new code structure.

AVX2 (new) vs AVX2 (old) throughput improvement

On a single thread, the gain grows steadily from +3% at k=10 to +24% at k=1000 as distance computation comes to dominate the search workload. With more threads competing for memory bandwidth, the per-thread gain shrinks: +9-10% at 4 or 8 threads, and only +2-5% at 16 threads.

The 16-thread case is SMT (each physical core runs two threads). Distance computation is memory-bound, so when two threads share a core's L1/L2 caches, the prefetching and batching wins are partially absorbed by shared-resource contention. The algorithmic improvements still help, but the headroom shrinks.

SIMD width benefit (AVX-512 vs AVX2)

The second chart isolates the effect of AVX-512 by comparing the AVX-512 build against the new AVX2 build (both share the same algorithmic improvements).

AVX-512 vs AVX2 (new) throughput difference

AVX-512 is slightly slower than AVX2 at k=10 (around -2%) regardless of thread count. This is specific to AVX-512: the algorithmic improvements alone don't show this regression, so it's not a uniform per-query overhead. From k=30 upward, AVX-512 pulls ahead at every thread count.

The interesting pattern is that AVX-512's benefit grows with thread count. Although this benchmark disables oversampling, the default Manticore KNN query uses LIMIT 20, and with the default oversampling=3.0 (which multiplies the effective HNSW search budget for rescoring after quantized search) that becomes k=60 internally. At k=60, AVX-512 vs AVX2 (new) is +1.2% on a single thread, +2.6% at 4 threads, +3.4% at 8 threads, and +6.5% at 16 threads.

Combined improvement (AVX-512 vs the old code)

The third chart shows the cumulative effect: AVX-512 with all the new code, compared against the previous AVX2 build. This is what a user upgrading from the previous Manticore version to the new one would see if their CPU supports AVX-512.

AVX-512 (new) vs AVX2 (old) throughput improvement

The single-thread curve climbs from +0.5% at k=10 to +29% at k=1000. The multi-thread curves all reach +22-24% at k=1000. The improvement is broadly distributed across thread counts - the algorithmic and SIMD gains compose differently at different concurrency levels, but the combined result is consistently large at moderate-to-high k.

Why the gain grows with k

All three charts show the same shape: small improvement at low k, large at high k. The reason is that low-k queries spend a larger share of their time on graph traversal (visiting nodes, checking visited bits, popping the candidate set) - work that scales with the graph structure, not k. As k grows, the effective search budget grows proportionally, and the queries spend more time on distance computation. The optimizations target distance computation and the loops around it, so their benefit scales with the share of work that distance computation represents.

What this means for you

These improvements require no action. They are available in the recent Manticore Search 27.1.5 release ; there are no API changes, no new configuration options, and no need to rebuild indexes.

The gains stack with the KNN early termination : early termination reduces the number of distance computations per query, and these optimizations make each computation faster.

The biggest improvements show up with:

  • High-dimensional vectors (more arithmetic per distance computation, more SIMD benefit)
  • Large k values (more total distance computations, more opportunity for batching and cache optimization)
  • Queries with oversampling (oversampling multiplies the effective k, pushing queries into the range where gains are largest)

Further reading

Install Manticore Search

Install Manticore Search