Manticore Search kill-list feature

The text data in plain indexes is immutable, which means that refreshing the data requires a full reindexing. In many cases, the reindexing can take a long time. To deal with that, a main+delta schema is used.

The concept assumes a big index that holds a snapshot of the data at a given time and a smaller index that holds the changes (the delta) made since the snapshot was taken. As the latter is smaller, it can be reindexed more frequently. The delta changes can be new, updated or deleted records. Updated and deleted records introduce an issue: when the engine searches both indexes, it doesn’t know that a record in the main index is no longer current. As a result, it keeps showing records that have actually been deleted or, for updated records, returns old versions instead of the newer ones from the delta index.

To overcome this, the kill-list feature has been introduced. The kill-list is a list of document IDs attached to the delta index; it tells the engine that those records should be ignored in the preceding indexes.

sql_query_killlist = \
    SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
    SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex

In this example, the kill-list contains the IDs of documents updated since @last_reindex (the time the main index was last built) as well as the IDs of deleted documents. The documents_deleted table can be filled manually whenever a record is deleted from documents, or a trigger can be used.
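To see what the query above returns, here is a minimal sketch that runs the same UNION against an in-memory SQLite database. SQLite only stands in for the real source database; the title column, the trigger name documents_after_delete and the build_kill_list helper are assumptions made for illustration, and the @last_reindex variable is replaced by a bound parameter.

```python
import sqlite3
import time


def build_kill_list(conn, last_reindex):
    """Run the kill-list query: IDs updated or deleted since last_reindex."""
    rows = conn.execute(
        "SELECT id FROM documents WHERE updated_ts >= :last_reindex "
        "UNION "
        "SELECT id FROM documents_deleted WHERE deleted_ts >= :last_reindex",
        {"last_reindex": last_reindex},
    ).fetchall()
    return sorted(row[0] for row in rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, updated_ts INTEGER);
    CREATE TABLE documents_deleted (id INTEGER PRIMARY KEY, deleted_ts INTEGER);

    -- Hypothetical trigger: remember every deleted ID with the deletion time.
    CREATE TRIGGER documents_after_delete AFTER DELETE ON documents
    BEGIN
        INSERT OR REPLACE INTO documents_deleted (id, deleted_ts)
        VALUES (OLD.id, CAST(strftime('%s', 'now') AS INTEGER));
    END;
""")

now = int(time.time())
last_reindex = now - 3600  # pretend the main index was built one hour ago

conn.executemany(
    "INSERT INTO documents (id, title, updated_ts) VALUES (?, ?, ?)",
    [
        (1, "untouched since the main build", now - 7200),
        (2, "updated after the main build", now - 60),
        (3, "about to be deleted", now - 7200),
    ],
)
conn.execute("DELETE FROM documents WHERE id = 3")  # fires the trigger

kill_list = build_kill_list(conn, last_reindex)
print(kill_list)
```

Document 1 stays out of the kill-list because it has not changed since the main index was built, while documents 2 (updated) and 3 (deleted) are both listed.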

An important thing to remember about the kill-list is that it removes documents only from the indexes that precede it, in the order the indexes are declared.

If you are doing a sequential search on the indexes, the delta must come after the main index:

   mysql> SELECT * FROM main,delta WHERE MATCH('...');

The same applies if we are using multiple deltas (like delta_daily and delta_hourly): the sequence should be main,delta_daily,delta_hourly and not main,delta_hourly,delta_daily.
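The effect of the ordering can be illustrated with a small toy model. This is only a sketch of the rule described above, not how the engine is implemented: the ToyIndex class and the search function are hypothetical, and a document is simply dropped when its ID appears in the kill-list of any index listed after the one it comes from.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a plain index: documents plus a kill-list."""
    name: str
    docs: dict                                  # document ID -> stored version
    kill_list: set = field(default_factory=set)


def search(indexes):
    """Return (index name, doc ID, version) matches for indexes queried in order."""
    matches = []
    for position, index in enumerate(indexes):
        # Collect the kill-lists of every index listed after this one.
        killed = set()
        for later in indexes[position + 1:]:
            killed |= later.kill_list
        for doc_id, version in index.docs.items():
            if doc_id not in killed:
                matches.append((index.name, doc_id, version))
    return matches


# Document 1 was updated twice; document 2 was deleted after the main build.
main = ToyIndex("main", {1: "v1", 2: "v1"})
delta_daily = ToyIndex("delta_daily", {1: "v2"}, kill_list={1, 2})
delta_hourly = ToyIndex("delta_hourly", {1: "v3"}, kill_list={1})

right_order = search([main, delta_daily, delta_hourly])
wrong_order = search([main, delta_hourly, delta_daily])

print(right_order)  # only the newest version (v3) of document 1 survives
print(wrong_order)  # the stale daily version (v2) hides the hourly one
```

In the wrong order, the kill-list of delta_daily is applied to delta_hourly, so the newest version of document 1 is suppressed and the older one is returned.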

The kill-list is also used in distributed indexes made of local indexes, and the order in which the indexes are declared matters in this case too, even if the local indexes are processed in parallel (using dist_threads > 0):

  index dist
  {
     type = distributed
     local = main
     local = delta
  }

 

The article is based on “SphinxSearch kill list feature” by Yaroslav Vorozhko https://www.ivinco.com/blog/sphinxsearch-kill-list-feature/ and publication is authorized by the owner.
