How kill-lists changed in Manticore Search 3

In this article we’ll discuss how kill-lists work in Manticore Search 3.

Plain indexes are immutable in terms of adding new documents: once an index is created, it’s not possible to add more data; you can only update attributes of existing documents. To keep the index in line with the primary data (which can be a database or files), it needs to be refreshed by performing a complete rebuild. This operation can take time: in some cases a full rebuild takes hours or even more.

This means the searchable data might lag behind the original data storage, and the latest content has to wait until the index is rebuilt. To fix that, the concept of a delta index was introduced. The delta is an index with the same structure as the bigger (aka ‘main’) index, but its job is to pick up documents that were added to the database after the main index was built. The cut-off point is usually a document id or a point in time.

While the delta captures newer documents (ones that never existed in the main index), in most cases it’s also desirable to capture modified documents. Another common requirement is to be able to discard documents that were deleted. Modified documents can be included in the delta, but they introduce a problem: searching both the main and delta indexes returns two versions of the same document (the old one from the main and the newer one from the delta), and the engine doesn’t know which one to pick.

To overcome these issues, a new concept was added: the kill-list, a list of document ids attached to the delta index that are known to be either modified or deleted, so that the engine ignores them when searching in the main index.

So far so good. However, in v2 the engine had to apply the kill-list to the result set extracted from the main index before merging it with the result set from the delta to produce the final result. While this worked fine for many users, applying the kill-lists on every query does affect performance. The effect is not noticeable on small indexes or small kill-lists, but it starts to hurt on big indexes that ‘take ages’ to rebuild, for which the delta indexes can end up with large kill-lists.
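To make the per-query cost concrete, here is a minimal Python sketch of the 2.x approach described above. It is an illustration only, not Manticore's actual code: the function `search_v2`, its arguments, and the sample documents are all hypothetical, and documents are modeled as plain dicts.

```python
def search_v2(main_hits, delta_hits, kill_list):
    """Illustrative model of 2.x behavior: the kill-list is applied on every query."""
    killed = set(kill_list)
    # Extra per-query work: each hit from the main index is checked against the kill-list.
    surviving = [doc for doc in main_hits if doc["id"] not in killed]
    # Only then are the main and delta result sets merged.
    return surviving + delta_hits


# Hypothetical result sets for a single query.
main_hits = [
    {"id": 1, "title": "old"},
    {"id": 2, "title": "stale"},
    {"id": 3, "title": "deleted"},
]
delta_hits = [{"id": 2, "title": "fresh"}, {"id": 4, "title": "new"}]
kill_list = [2, 3]  # id 2 was modified, id 3 was deleted

results = search_v2(main_hits, delta_hits, kill_list)
print([(doc["id"], doc["title"]) for doc in results])
```

In this toy model the filtering step runs again for every query, so its cost grows with both the size of the main result set and the size of the kill-list.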

Kill-lists in 3.x

Manticore 3 introduced a breaking change by doing a massive upgrade to the index storage engine. The kill-lists needed a change as well.

There were two reasons for this: the existing kill-list mechanism proved to be problematic in some edge cases, and it didn’t fit well with the new storage engine.

So instead of applying the kill-lists every time a query is made, why not do it once? In v3 we still define a query for the kill-list (sql_query_killlist) in the source(s) of a delta index, but in the index configuration we now also need to define the targets of the kill-list.

When the indexes are loaded, the engine checks whether there is a kill-list that needs to be applied to each index. If one or more are found, it applies them to the index by marking the matching documents as deleted. When a query is performed over the index, these marked documents are simply ignored, as if they didn’t exist. Because the suppression is already done at startup or index rotation, the delta no longer needs to be present for deleted documents to disappear from the results.
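The apply-once idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the engine's implementation: the `Index` class, its `dead` set, and the method names are hypothetical, and `search` returns every visible document instead of matching a real query.

```python
class Index:
    """Toy stand-in for a plain index: documents plus a set of ids marked as deleted."""

    def __init__(self, docs):
        self.docs = {doc["id"]: doc for doc in docs}
        self.dead = set()

    def apply_kill_list(self, ids):
        # Done once, at load or rotation time; only ids present in this index are marked.
        self.dead.update(i for i in ids if i in self.docs)

    def search(self):
        # At query time marked documents are skipped; no kill-list is consulted here.
        return [doc for doc_id, doc in self.docs.items() if doc_id not in self.dead]


main = Index([{"id": 1}, {"id": 2}, {"id": 3}])
main.apply_kill_list([2, 3, 99])  # 99 is not in the main index, so nothing is marked for it

# The delta is not involved in this search, yet ids 2 and 3 no longer show up.
print([doc["id"] for doc in main.search()])
```

The point of the sketch is where the work happens: the kill-list is walked once when the index is loaded, and every later query only sees the already reduced set of documents.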

Kill-list targets are defined in the killlist_target index directive. The directive expects a list of target indexes, each with an optional mode that controls which ids are used for suppression. There are 3 modes supported right now:

  • kl – document ids from the defined kill-list (produced by sql_query_killlist) are used for suppression
  • id – document ids from the index itself are used instead (and any defined sql_query_killlist is ignored)
  • default – if you specify neither ‘kl’ nor ‘id’, both the document ids from the index and those from sql_query_killlist are used

Example

index delta {
  ...
  killlist_target = index_one:kl, index_two:id
  ...
}
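As a rough illustration of how the modes differ, the following Python sketch computes which ids a delta would suppress in each target for a value shaped like the one in the example. The helpers `parse_targets` and `suppression_ids` are hypothetical, the parsing is deliberately simplified and is not Manticore's own parser, and the sample ids are made up.

```python
def parse_targets(value):
    """Split a killlist_target-style string into (index_name, mode) pairs."""
    targets = []
    for item in value.split(","):
        name, _, mode = item.strip().partition(":")
        targets.append((name, mode or None))  # None means no mode was given
    return targets


def suppression_ids(mode, delta_doc_ids, kill_list_ids):
    """Return the ids suppressed in a target index for the given mode."""
    if mode == "kl":
        return set(kill_list_ids)
    if mode == "id":
        return set(delta_doc_ids)
    if mode is None:  # default: both sources are combined
        return set(delta_doc_ids) | set(kill_list_ids)
    raise ValueError(f"unknown mode: {mode!r}")


delta_doc_ids = [2, 4]  # hypothetical documents stored in the delta
kill_list_ids = [2, 3]  # hypothetical ids returned by sql_query_killlist

for name, mode in parse_targets("index_one:kl, index_two:id"):
    print(name, sorted(suppression_ids(mode, delta_doc_ids, kill_list_ids)))
```

With these sample ids, index_one loses the documents named in the kill-list, while index_two loses every document whose id also exists in the delta.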

The target directive can be changed with the ALTER TABLE command, but for the change to take effect the target indexes have to be reloaded. Also, ALTER cannot undo deletions already applied to a target index: taking the index out of the target list does not restore the suppressed ids.

 

The new way kill-lists work improves query performance, as there is no longer an extra query-time operation to compare the found document ids with a suppression list. Moreover, permanently marking the suppressed document ids means a shorter list of ids to search on every query. It also adds flexibility: not only the ids from the dedicated list (generated by sql_query_killlist) can be used for suppression, but also the document ids of an entire index if needed.
