Dictionary types: CRC vs keywords

In this article we’ll discuss differences between the two dictionary types available in Manticore Search.

A dictionary is an index component that stores the indexed words. The first indexes used the ‘crc’ dictionary type, in which each word is replaced with its checksum, computed with either CRC32 or FNV64 depending on whether Sphinx was compiled with --enable-id64. In Manticore only FNV64 is used, since 32-bit IDs have been removed.
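To make the difference concrete, here is a minimal Python sketch of the two approaches. It is only an illustration: the function names are hypothetical, and the FNV-1a variant shown is the textbook one, which is an assumption and may not match the exact hash variant used by the engine.

```python
import zlib

# Textbook 64-bit FNV-1a constants (assumption: the engine's exact
# FNV64 variant may differ from this one).
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of data."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def crc_style_dictionary(words):
    """Hypothetical 'crc'-like dictionary: only fixed-size hashes are kept."""
    return {fnv1a_64(w.encode("utf-8")) for w in words}


def keywords_style_dictionary(words):
    """Hypothetical 'keywords'-like dictionary: the words themselves, sorted."""
    return sorted(set(words))


words = ["search", "index", "dictionary", "search"]
print(crc_style_dictionary(words))       # hashes only, words are not recoverable
print(keywords_style_dictionary(words))  # actual words, available for suggestions
print(hex(zlib.crc32(b"search")))        # 32-bit alternative used with 32-bit IDs
```

Since only the hash is stored in the first case, anything that needs the original word (suggestions, query-time wildcard expansion) has nothing to work with.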

With the introduction of RealTime indexes, and because of some drawbacks of ‘crc’, the need for an alternative dictionary type arose. In Sphinx 2 the ‘keywords’ dictionary was added. It stores the actual words and fixes the drawbacks of ‘crc’ discussed below, but it comes with its own minor issues.

RealTime indexes (as well as the new percolate index) can only use the ‘keywords’ type, while the ‘crc’ type is available only to plain indexes. A plain index must use the ‘keywords’ dictionary to be convertible to RealTime. Some newer functionality, such as CALL QSUGGEST, requires storing the actual words, which can’t be done with the ‘crc’ dictionary.

When prefix/infix indexing is not enabled, indexes with the ‘keywords’ dictionary are slower to build. The first reason is that ‘keywords’ stores the whole word in the dictionary (up to 127 chars), while ‘crc’ folds any word into a fixed-size hash (4 bytes for CRC32, 8 bytes for FNV64). The second is that at the final stage of indexing ‘keywords’ has to sort the words, which can take some time. In short, indexing with ‘keywords’ can be 10%-40% slower than with ‘crc’. Index size and search time are similar when prefix/infix are disabled.

When prefix/infix indexing is enabled, ‘crc’ needs to build additional word permutations for the substrings and can be slower to index than ‘keywords’. The ‘keywords’ dictionary has no additional steps to perform during indexing when only prefixes are enabled; only for infixes does it extract and collect trigrams from words (for QSUGGEST). Due to the additional word permutations, the size of a ‘crc’ index can grow more than 10x compared to the same index with prefix/infix disabled. As it doesn’t need to store additional substrings, an index with the ‘keywords’ dictionary doesn’t grow in size compared to the version with prefix/infix disabled.
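The size difference is easier to see by counting the extra entries. The Python sketch below is a rough model, not the engine's actual code: the helper names are hypothetical, and min_len stands in for a minimum prefix or infix length setting.

```python
def prefixes(word: str, min_len: int) -> set:
    """All prefixes of word with at least min_len characters (word included)."""
    return {word[:n] for n in range(min_len, len(word) + 1)}


def infixes(word: str, min_len: int) -> set:
    """All substrings of word with at least min_len characters (word included)."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    }


def trigrams(word: str) -> set:
    """All 3-character substrings of word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


word = "search"
# Entries a 'crc'-like dictionary would have to hash and store for one word:
print(len(prefixes(word, 2)), sorted(prefixes(word, 2)))
print(len(infixes(word, 2)), sorted(infixes(word, 2)))
# Trigrams collected from the same word when infixes are enabled with 'keywords':
print(len(trigrams(word)), sorted(trigrams(word)))
```

In this model a single six-letter word turns into 5 entries with prefixes and 15 with infixes, which gives an idea of why a ‘crc’ index can grow so much once substrings are indexed.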

For wildcard searches in ‘keywords’ mode, a keyword is expanded at query time into all the words stored in the index that match the wildcard. In some edge cases the expansion can run to thousands of words, which can slow down the query. The effect can be limited with ‘expansion_limit’ or by avoiding problematic expansions. ‘crc’ doesn’t have this problem, as the expansions are calculated at indexing time and not at query time. While it has speed on its side, index size remains the main problem for the ‘crc’ type. With ‘crc’ it’s also not possible to use the special symbols '?' and '%' in wildcard searches like 't?st*' or 'run%'.
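Query-time expansion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's implementation: the function names are hypothetical, '*' is taken to match any number of characters, '?' exactly one and '%' zero or one, and the limit parameter simply truncates the match list in dictionary order, which may differ from how ‘expansion_limit’ selects the words it keeps.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any number of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "%":
            parts.append(".?")   # zero or one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand(pattern: str, dictionary, limit: int = 0):
    """Return the stored words matching pattern; limit=0 means no limit."""
    regex = wildcard_to_regex(pattern)
    matches = [w for w in sorted(dictionary) if regex.fullmatch(w)]
    return matches[:limit] if limit > 0 else matches


# A tiny stand-in for the words stored in a 'keywords' dictionary.
stored_words = ["run", "runs", "runner", "test", "tests", "text", "toast", "tst"]

print(expand("t?st*", stored_words))
print(expand("run%", stored_words))
print(expand("t*", stored_words, limit=2))
```

Every matching word becomes part of the actual query, so a short pattern against a large dictionary is where the cost shows up.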

Overall, the ‘keywords’ type is the better choice, as it provides more functionality and keeps the index size under control when wildcard searches are needed. However, wildcard searches can require attention in some cases.
While ‘crc’ was marked as deprecated in Sphinx, we decided to remove this mark, as ‘crc’ can still be useful in some cases.

The advantage of ‘crc’ is indexing speed, which can matter when using delta indexes that need to be built within a certain time window. For wildcard searching, ‘crc’ is double-edged: while it doesn’t have the potential performance issues of ‘keywords’, its space requirements must be considered.
