Dictionary types: CRC vs keywords

In this article we’ll discuss differences between the two dictionary types available in Manticore Search.

A dictionary is an index component that stores indexed words. The first indexes used the ‘crc’ dictionary type, in which words are replaced with their checksum, computed with either CRC32 or FNV64 depending on whether Sphinx was compiled with --enable-id64. In Manticore only FNV64 is used, since 32-bit IDs have been removed.
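
As a rough illustration of what "replacing a word with its checksum" means, here is a minimal Python sketch. It assumes the textbook FNV-1a 64-bit variant and the standard zlib CRC32; the exact hash variant used inside the engine may differ, and the function names are hypothetical.

```python
import zlib

# Standard FNV-1a 64-bit constants (assumption: the engine's own
# FNV64 implementation may use a different variant).
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def crc32_id(word: str) -> int:
    """Fold a word into a 32-bit checksum."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF


def fnv64_id(word: str) -> int:
    """Fold a word into a 64-bit FNV-1a hash."""
    h = FNV64_OFFSET_BASIS
    for byte in word.encode("utf-8"):
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def build_crc_dictionary(words):
    """Hypothetical 'crc'-style dictionary: only the hashes are kept."""
    return {fnv64_id(w) for w in words}


if __name__ == "__main__":
    for w in ["test", "tests", "running"]:
        print(w, hex(crc32_id(w)), hex(fnv64_id(w)))
```

Whatever the word length, the stored value has a fixed size, and the original word cannot be recovered from it. That is why features that need the actual words cannot work on top of this kind of dictionary.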

With the introduction of RealTime indexes, and because of some drawbacks of ‘crc’, the need for an alternative dictionary type arose. Sphinx 2 added the ‘keywords’ dictionary, which stores the actual words and fixes the drawbacks of ‘crc’ discussed below, but comes with its own minor issues.

RealTime indexes (as well as the new percolate index) can only use the ‘keywords’ type, while the ‘crc’ type is available only to plain indexes. A plain index must use the ‘keywords’ dictionary to be convertible to RealTime. Some newer functionality, like CALL QSUGGEST, requires the actual words to be stored, which can’t be done with the ‘crc’ dictionary.

When prefix/infix indexing is not enabled, indexes with the ‘keywords’ dictionary are slower to build. The first reason is that ‘keywords’ stores the whole word in the dictionary (up to 127 chars), while ‘crc’ folds every word into a fixed-size hash (4 bytes for CRC32, 8 bytes for FNV64). The second is that, at the final stage of indexing, ‘keywords’ has to sort the words, which can take some time. In short, indexing with ‘keywords’ can be 10%-40% slower than with ‘crc’. Index size and search time are similar when prefix/infix indexing is disabled.

When prefix/infix indexing is enabled, ‘crc’ needs to build additional word permutations for the substrings and can be slower than ‘keywords’. The ‘keywords’ dictionary has no additional steps to perform during indexing when only prefixing is enabled; only for infixing does it extract and collect trigrams from words (for QSUGGEST). Because of the additional word permutations, the size of a ‘crc’ index can grow more than 10x compared to the version with prefix/infix disabled. A ‘keywords’ index doesn’t need to store additional substrings, so it doesn’t grow in size compared to the version with prefix/infix disabled.
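
To get a feeling for why the ‘crc’ index grows so much, the sketch below enumerates the substrings that would have to be stored for a single word. It is only an approximation under stated assumptions: the function names are hypothetical, the minimum lengths stand in for settings such as min_prefix_len and min_infix_len, and the engine's actual rules for which substrings it stores may differ. The trigram helper shows the kind of data a ‘keywords’ index might collect for infixes; the real on-disk format is not described here.

```python
def prefixes(word, min_prefix_len):
    """All prefixes of `word` at least `min_prefix_len` chars long."""
    return [word[:n] for n in range(min_prefix_len, len(word) + 1)]


def infixes(word, min_infix_len):
    """All distinct substrings of `word` at least `min_infix_len` chars long."""
    found = set()
    for start in range(len(word)):
        for end in range(start + min_infix_len, len(word) + 1):
            found.add(word[start:end])
    return sorted(found)


def trigrams(word):
    """Consecutive 3-character windows of `word`."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


if __name__ == "__main__":
    word = "running"
    print("prefixes:", len(prefixes(word, 2)))
    print("infixes: ", len(infixes(word, 2)))
    print("trigrams:", trigrams(word))
```

A single 7-letter word already yields 21 distinct substrings of length 2 or more, each of which becomes its own hashed entry in this model. The number of substrings grows quadratically with word length, while the number of trigrams grows only linearly.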

For wildcard searches in ‘keywords’ mode, a keyword is expanded at query time into all the words stored in the index that match the wildcard. In some edge cases the expansion can number thousands of words, which can slow down the query. The effect can be limited with ‘expansion_limit’ or by avoiding problematic expansions. ‘crc’ doesn’t have this problem, as the expansions are calculated at indexing time and not at query time. While it has speed on its side, index size remains the main problem for the ‘crc’ type. With ‘crc’ it’s also not possible to use the special symbols '?' and '%' in wildcard searches like 't?st*' or 'run%'.
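
The query-time expansion can be pictured with the following Python sketch. It is a simplified model, not the engine's algorithm: the function names are hypothetical, the wildcard semantics assumed are '*' for any number of characters, '?' for exactly one character and '%' for zero or one character, and the limit simply truncates the list of matches, whereas the real ‘expansion_limit’ may choose which expansions to keep differently.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any number of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "%":
            parts.append(".?")   # zero or one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand_wildcard(pattern, dictionary_words, expansion_limit=0):
    """Return stored words matching `pattern`; 0 means no limit (assumption)."""
    regex = wildcard_to_regex(pattern)
    matches = []
    for word in dictionary_words:
        if regex.fullmatch(word):
            matches.append(word)
            if expansion_limit and len(matches) >= expansion_limit:
                break
    return matches


if __name__ == "__main__":
    words = sorted(["test", "tests", "tast", "toast", "run", "runs", "running"])
    print(expand_wildcard("t?st*", words))
    print(expand_wildcard("run%", words))
    print(expand_wildcard("run*", words, expansion_limit=2))
```

The cost of the scan depends on how many stored words match, which is why a very short pattern over a large dictionary can expand into thousands of terms. This kind of matching is only possible because the dictionary holds the actual words.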

Overall, the ‘keywords’ type is the better choice, as it provides more functionality and keeps the index size under control when wildcard searches are needed. However, wildcard searches can require attention in some cases.
While ‘crc’ was marked as deprecated in Sphinx, we decided to remove this mark, as ‘crc’ can still be useful in some cases.

The advantage of ‘crc’ is indexing speed, which can matter when using delta indexes that need to be built within a certain time window. For wildcard searching, ‘crc’ is double-edged: while it doesn’t have the potential performance issues of ‘keywords’, its space requirements must be taken into account.
