A new way of tokenizing Chinese

Why is the Chinese language considered difficult for search engines to handle? The answer is that Chinese belongs to the so-called CJK group of languages, which includes, alongside Chinese, Japanese and Korean. CJK languages have no spaces or any other separators between words. Why is that a problem? Because sentences, whether divided by spaces or not, still consist of words, and to find a correct match in full-text search we need to tokenize the text, i.e. to determine the boundaries between its words.

Chinese text segmentation

The common approaches to Chinese text segmentation rely on the "N-gram" or "RLP" algorithms.
The "N-gram" algorithm is free to use, but is known to lack quality and introduces considerable overhead that grows with the length of the text being processed (the latter issue comes from the fact that N-gram regards each symbol as a potential word and has to store it internally while processing the text).
The "RLP" algorithm is an effective and efficient instrument. However, it has one major disadvantage - it's quite expensive.

In version 3.1.0 we introduced a new way of tokenizing Chinese texts, based on the ICU text segmentation algorithm.

ICU is a set of open-source libraries providing Unicode and Globalization support for software applications. Among many other features, it solves the task of text boundary detection. ICU algorithms locate the positions of words, sentences, and paragraphs within a range of text, or identify locations suitable for line wrapping when displaying the text.
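As a quick illustration of the boundary analysis ICU provides, here is a minimal C++ sketch (not Manticore's code) that uses ICU's word BreakIterator to split a Chinese string into words; the sample string is an arbitrary assumption:

  #include <unicode/brkiter.h>
  #include <unicode/unistr.h>
  #include <unicode/locid.h>
  #include <iostream>
  #include <memory>

  int main() {
      UErrorCode status = U_ZERO_ERROR;
      // A word-boundary iterator; for Han text ICU switches to its
      // dictionary-based segmentation rules automatically.
      std::unique_ptr<icu::BreakIterator> bi(
          icu::BreakIterator::createWordInstance(icu::Locale::getChinese(), status));
      if (U_FAILURE(status)) return 1;

      icu::UnicodeString text = icu::UnicodeString::fromUTF8("我是一名程序员"); // "I am a programmer"
      bi->setText(text);

      int32_t start = bi->first();
      for (int32_t end = bi->next(); end != icu::BreakIterator::DONE; start = end, end = bi->next()) {
          std::string word;
          text.tempSubStringBetween(start, end).toUTF8String(word);
          std::cout << word << "\n"; // each detected word on its own line
      }
      return 0;
  }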

The way ICU tokenization works in Manticore can be briefly described as follows (a code sketch illustrating these steps follows the list):

  • The original text is regarded as an array of symbols
  • All the parts of the text that consist of Chinese symbols only are identified and preprocessed by the ICU library
  • The results of the previous step, i.e. the segmented parts of the Chinese text, replace the original, unsegmented parts
  • The other tokenization tools (charset_table, wordforms, etc.) are applied to the modified text as in the common tokenization workflow
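The sketch below (plain C++ with ICU, not Manticore's actual source code) illustrates this workflow on a mixed string: contiguous runs of Chinese (Han) symbols are located, only those runs are segmented with ICU, and everything else is left untouched for the regular tokenizer. All names in the sketch are made up for the example:

  #include <unicode/brkiter.h>
  #include <unicode/unistr.h>
  #include <unicode/uscript.h>
  #include <unicode/locid.h>
  #include <iostream>
  #include <memory>

  // Segment a run of Chinese text into space-separated words using ICU.
  static icu::UnicodeString SegmentHanRun(const icu::UnicodeString& run) {
      UErrorCode status = U_ZERO_ERROR;
      std::unique_ptr<icu::BreakIterator> bi(
          icu::BreakIterator::createWordInstance(icu::Locale::getChinese(), status));
      if (U_FAILURE(status)) return run; // on error keep the run unsegmented
      icu::UnicodeString out;
      bi->setText(run);
      int32_t start = bi->first();
      for (int32_t end = bi->next(); end != icu::BreakIterator::DONE; start = end, end = bi->next()) {
          if (!out.isEmpty()) out.append(u' ');
          out.append(run.tempSubStringBetween(start, end));
      }
      return out;
  }

  int main() {
      // Mixed text: a Latin word followed by Chinese ("supports Chinese search").
      icu::UnicodeString text = icu::UnicodeString::fromUTF8("Manticore支持中文搜索");
      icu::UnicodeString result;
      int32_t i = 0;
      while (i < text.length()) {
          UErrorCode status = U_ZERO_ERROR;
          if (uscript_getScript(text.char32At(i), &status) == USCRIPT_HAN) {
              // Collect the whole contiguous run of Chinese symbols...
              int32_t runStart = i;
              while (i < text.length()) {
                  status = U_ZERO_ERROR;
                  if (uscript_getScript(text.char32At(i), &status) != USCRIPT_HAN) break;
                  i = text.moveIndex32(i, 1);
              }
              // ...and replace it with its ICU-segmented version.
              result.append(SegmentHanRun(text.tempSubStringBetween(runStart, i)));
          } else {
              // Non-Chinese symbols are copied as-is; charset_table, wordforms,
              // etc. handle them later in the usual way.
              result.append(text.char32At(i));
              i = text.moveIndex32(i, 1);
          }
      }
      std::string utf8;
      result.toUTF8String(utf8);
      std::cout << utf8 << "\n"; // the Chinese part comes out word-separated
      return 0;
  }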

To enable ICU-Chinese tokenization, the following index configuration options must be set:

  • morphology = icu_chinese
  • charset_table = cjk

If you need to handle both cjk and non-cjk symbols, set the charset_table option as shown below:

  • charset_table = cjk, non_cjk
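
Putting it all together, a plain index definition with ICU-Chinese tokenization enabled could look roughly like this (the index name, source and path below are just placeholders):

  index chinese_idx
  {
      source          = chinese_src
      path            = /var/lib/manticore/chinese_idx
      morphology      = icu_chinese
      charset_table   = cjk, non_cjk
  }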

You can find an illustrative example of how ICU-Chinese tokenization works in the following interactive tutorial.

Interactive course on Chinese segmentation
