Using Manticore Search with Chinese, Japanese, and Korean language documents

This article explains, step by step, how to implement full-text search on a set of documents written in Chinese, Japanese, and Korean (CJK).

About CJK languages

The CJK languages together use more than 40,000 characters, most of them Chinese. You may also see the acronym CJKV, where “V” stands for Vietnamese.

CJK characters include:

  1. For the Chinese language: hànzì – traditional Chinese characters; Bopomofo – the Chinese phonetic alphabet; Pinyin – the romanization of Chinese (a concept close to transliteration).
  2. For the Japanese language: Hiragana – Japanese syllabary; Katakana – Japanese syllabary; Arabic numerals.
  3. For the Korean language: Hangul – the Korean alphabet.

In addition, each language has a set of hieroglyphic keys (radicals). A radical acts either as a grouping element used to look up characters in a dictionary, or as a semantic element that defines the meaning of the characters following it.

To display text in CJK languages you can use the following encodings: Big5, EUC-JP, EUC-KR, ISO 2022-JP, KS C 5861, Shift-JIS, Unicode, etc. The following Unicode blocks cover the CJK scripts (http://www.unicode.org/Public/UNIDATA/Blocks.txt):

Range | Block | Comments
1100 .. 11FF | Hangul Jamo | Jamo: the individual letters of the Korean Hangul alphabet, used to form Hangul syllables
2E80 .. 2EFF | CJK Radicals Supplement | Key (radical): an element of the hieroglyphic alphabet that allows grouping of words or acts as a semantic element defining the meaning of the following characters
2F00 .. 2FDF | Kangxi Radicals | The Kangxi list of keys adopted in Japan, Korea, and Taiwan; traditionally includes 214 characters
3000 .. 303F | CJK Symbols and Punctuation | Ideographic characters and punctuation
3040 .. 309F | Hiragana | Japanese syllabary
30A0 .. 30FF | Katakana | Japanese syllabary
3100 .. 312F | Bopomofo | Chinese phonetic alphabet
3130 .. 318F | Hangul Compatibility Jamo |
3190 .. 319F | Kanbun | Kanbun (also spelled kambun): one of the written languages of medieval Japan
31A0 .. 31BF | Bopomofo Extended |
31C0 .. 31EF | CJK Strokes | Simple strokes (elements) of characters
31F0 .. 31FF | Katakana Phonetic Extensions |
3200 .. 32FF | Enclosed CJK Letters and Months | CJK letters and months in circles
3300 .. 33FF | CJK Compatibility |
3400 .. 4DBF | CJK Unified Ideographs Extension A | CJK ideographs
4DC0 .. 4DFF | Yijing Hexagram Symbols |
4E00 .. 9FFF | CJK Unified Ideographs | Ideograph: a written sign or conventional picture that stands for a whole word rather than a speech sound
A000 .. A48F | Yi Syllables | Yi language, spoken in the south of Sichuan province
A490 .. A4CF | Yi Radicals |
AC00 .. D7AF | Hangul Syllables | Hangul syllables
D7B0 .. D7FF | Hangul Jamo Extended-B |
20000 .. 2A6DF | CJK Unified Ideographs Extension B |
2A700 .. 2B73F | CJK Unified Ideographs Extension C |
2F800 .. 2FA1F | CJK Compatibility Ideographs Supplement |
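
To check which block a given character falls into, a small lookup over the ranges is enough. The following is a minimal sketch in Python; the names CJK_BLOCKS and block_of are made up for this illustration, and only a few rows of the table are included.

```python
from typing import Optional

# A few (start, end, block name) rows taken from the table above.
# Hypothetical helper data for illustration; extend with the rows you need.
CJK_BLOCKS = [
    (0x1100, 0x11FF, "Hangul Jamo"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3100, 0x312F, "Bopomofo"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    code = ord(ch)
    for start, end, name in CJK_BLOCKS:
        if start <= code <= end:
            return name
    return None


if __name__ == "__main__":
    for ch in "漢ひカ한a":
        print(f"U+{ord(ch):04X} -> {block_of(ch)}")
```

A character that returns None here is either outside the CJK blocks or in a row that was not copied into the list.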

Note that the Arabic numerals used in CJK texts often correspond to fullwidth character codes (U+FF10 .. U+FF19, in the block FF00 .. FFEF; Halfwidth and Fullwidth Forms) rather than to the ASCII digits.

To see what particular characters look like, use a chart such as http://www.utf8-chartable.de/.

How to tell Manticore Search that your document has CJK characters?

Manticore Search filters text at the character level. Characters that are not accepted for tokenization are considered invalid and are replaced with whitespace, which acts as a separator. By default, only English and Russian letters are tokenized (along with the underscore and digits).
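
The effect of this character-level filtering can be imitated in a few lines of Python. This is only a rough sketch of the idea, not Manticore's implementation: the function filter_text and the ACCEPTED set are hypothetical, and case folding is reduced to a plain lowercase call.

```python
def filter_text(text, accepted):
    """Lowercase the text, then replace every character outside
    `accepted` with a space so that it acts as a separator."""
    return "".join(ch if ch in accepted else " " for ch in text.lower())


# Hypothetical accepted set: ASCII letters, digits and underscore only.
ACCEPTED = set("abcdefghijklmnopqrstuvwxyz0123456789_")

if __name__ == "__main__":
    words = filter_text("Hello, world_1! 你好", ACCEPTED).split()
    print(words)  # ['hello', 'world_1']
```

With such an accepted set the two Chinese characters are dropped entirely, which is why the index settings have to be adjusted before CJK text becomes searchable.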

CJK languages feature characters that can form unsegmented texts. For these types of characters, Manticore can index contiguous groups of these characters as n-grams.

In the index configuration we need to adjust 3 settings:

  1. charset_table – the main parameter describing the accepted characters. It contains a table of symbols and the rules for case folding.
  2. ngram_chars – the list of characters used to split CJK text into words with the N-gram model.
  3. ngram_len – set this value to 1 to enable the n-gram feature. Currently only 1-grams are supported (a text “ABCDEF”, where A to F are in the ngram_chars list, is indexed as “A B C D E F”).
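
The 1-gram behaviour can be sketched as follows. This is an illustrative Python approximation under simple assumptions, not the engine's tokenizer: the NGRAM_RANGES list and the tokenize function are hypothetical names, and every character outside the n-gram ranges is treated as part of an ordinary word if it is alphanumeric or an underscore.

```python
# Hypothetical n-gram ranges: Hiragana + Katakana, CJK Unified Ideographs,
# Hangul Syllables.
NGRAM_RANGES = [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)]


def in_ranges(ch, ranges):
    """True if the code point of ch lies inside any (start, end) range."""
    code = ord(ch)
    return any(start <= code <= end for start, end in ranges)


def tokenize(text, ngram_ranges=NGRAM_RANGES):
    """Emit each n-gram character as its own token; group the rest into words."""
    tokens, word = [], []

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if in_ranges(ch, ngram_ranges):
            flush()             # close any pending ordinary word
            tokens.append(ch)   # one token per n-gram character
        elif ch.isalnum() or ch == "_":
            word.append(ch.lower())
        else:
            flush()             # any other character is a separator
    flush()
    return tokens


if __name__ == "__main__":
    print(tokenize("Manticore 全文检索"))
    # ['manticore', '全', '文', '检', '索']
```

Because every CJK character becomes a separate token, a query for a two-character word matches documents where those two tokens appear next to each other.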

How to create descriptions for the parameters charset_table and ngram_chars

Or in other words, how do you explain to Manticore Search which UTF-8 character codes belong to the family of CJK languages?

Update: newer versions of Manticore Search introduce a charset_table alias containing all the CJK characters needed, as well as a Chinese ICU morphology processor. For working with CJK in the latest version, read the following article.

You can take the ready-made sets for language blocks from the charset_tables page of the Sphinx wiki, or build your own descriptions of the options (see 1-3 above) from the data in the table above and the syntax rules of charset_table. Be careful and double-check that every block of character ranges you need is included in the index character description in the configuration file. For example, if you use the character range descriptions from the link above to index documents in the Lisu or Vai languages, search will not work properly.
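
Writing the range descriptions by hand is error-prone, so it can help to generate them from a list of blocks. The sketch below is in Python and rests on a few assumptions: the helper names are hypothetical, the ranges use the U+XXXX..U+YYYY notation of charset_table and ngram_chars, and the CJK ranges are placed only in ngram_chars. Whether they must also appear in charset_table depends on your version, so check the documentation for the release you run.

```python
# Hypothetical selection of blocks from the table above.
CJK_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def fmt_range(start, end):
    """Format one range as U+XXXX..U+YYYY (hexadecimal code points)."""
    return f"U+{start:X}..U+{end:X}"


def build_settings(ranges):
    """Return the three index settings as configuration lines."""
    ngram_chars = ", ".join(fmt_range(s, e) for s, e in ranges)
    return [
        # Assumption: a plain Latin table with case folding, no CJK ranges.
        "charset_table = 0..9, A..Z->a..z, _, a..z",
        f"ngram_chars = {ngram_chars}",
        "ngram_len = 1",
    ]


if __name__ == "__main__":
    for line in build_settings(CJK_RANGES):
        print(line)
```

The printed lines can be pasted into the index section of the configuration file and then compared against the block table to confirm that nothing you need is missing.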

Pay special attention to setting the ngram_chars parameter correctly. When searching, Manticore Search will not treat these characters as search matches.

An example of an index with a CJK setup can be found in cjk_index_example.zip.

Useful links:

http://en.wikipedia.org/wiki/CJK
http://en.wikipedia.org/wiki/Chinese_character
http://en.wikipedia.org/wiki/Pinyin
http://en.wikipedia.org/wiki/Space_%28punctuation%29
http://www.babelstone.co.uk/Yi/unicode.html
