Default charset tables and stopwords files

This article covers the new default character set tables and stopwords files, which aim to simplify the use of these options when configuring indexes.

When initially analyzing a document text, Manticore Search needs to know which symbols in the text are meaningful for further processing (breaking a full text into separate words, processing morphology and so on) and which are not. To define valid text characters, the charset_table option is used. Through this option you can specify the set of symbols you want to work with.

Also, to provide better text search quality, Manticore Search performs so-called character folding. For example, when your search query is analyzed, uppercase symbols are mapped to lowercase ones, symbols with diacritics (such as â, ê, î, ô etc.) are mapped to their basic equivalents and so on. The charset table is the mechanism that defines all these transformations.
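To make the idea of folding concrete, here is a minimal Python sketch. It is not how Manticore Search implements charset tables; it only approximates case, diacritic and width folding with the standard unicodedata module, and the function name fold is hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate character folding (an illustration, not Manticore's code)."""
    # NFKD splits "é" into "e" plus a combining accent and also maps
    # compatibility forms such as full-width letters to their plain forms.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left over from the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case folding: map everything to lowercase.
    return stripped.lower()


print(fold("Crème BRÛLÉE"))  # creme brulee
```

A query for "creme brulee" and a document containing "Crème BRÛLÉE" fold to the same string, which is why they can match each other.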

Languages such as Chinese are a special case: their text is unsegmented and has no clear word separators, so it cannot be easily split into separate words. (‘CJK’, short for Chinese-Japanese-Korean, is a widely used term for these languages.) To provide effective search for these languages, use the ngram_chars option together with ngram_len. With these options set, the text is treated as a set of separate N-grams, each N-gram being a sequence of characters of length ngram_len. (Currently only ngram_len = 1 is supported.) These characters need to be defined in the ngram_chars directive and cannot appear in the regular charset_table.
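The effect of N-grams of length 1 can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the helper names is_cjk and tokenize are hypothetical, the code point ranges are a rough subset of what a real CJK table covers, and Manticore Search's actual tokenizer works differently.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Rough check against a few CJK blocks (an assumption, not the real table)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def tokenize(text: str) -> List[str]:
    """Emit each CJK character as its own token; keep other words whole."""
    tokens: List[str] = []
    word: List[str] = []

    def flush() -> None:
        # Close the non-CJK word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)          # N-gram of length 1
        elif ch.isalnum():
            word.append(ch.lower())    # regular word character
        else:
            flush()                    # separator
    flush()
    return tokens


print(tokenize("Manticore 搜索引擎 rocks"))
# ['manticore', '搜', '索', '引', '擎', 'rocks']
```

Because every CJK character becomes a separate token, a query can match any run of characters inside the unsegmented text without a word-segmentation step.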

So if you want your search to support different languages, you’ll need to define sets of valid characters and folding rules for all of them, which can be quite a laborious task.

We’ve performed this task for you by preparing default charset tables that cover non-CJK and CJK languages respectively. These tables should be sufficient in most cases. They are based on ICU character folding data that can be seen here and define the following foldings:

  • ascii native digit folding
  • case folding
  • diacritic folding
  • Han radical folding
  • Hiragana folding
  • Katakana folding
  • letterform folding
  • simplified Han folding
  • subscript folding
  • superscript folding
  • width folding

Here you can see the source files used for the default charsets: cjk.txt, non_cjk.txt

 

To work with both CJK and non-CJK languages, set the options in your configuration file as shown below:

charset_table     = non_cjk

...

ngram_len         = 1

ngram_chars     = cjk

If you don’t need support for CJK languages, you can simply omit the ngram_len and ngram_chars options.

If you still need specific rules for certain languages, you can create the appropriate character tables manually and put them in your configuration file instead of the default ones.

In addition to the above-mentioned options, you can use the stopwords option to exclude from further processing and indexing some of the most frequent words, which usually have little relevance to search queries. For English, for example, those will probably include a, the, this, that and so on. To enable this option, create your stopwords file and set the path to it in your configuration file:

stopwords         = /path/to/your/stopwords/file.txt

A stopwords file is plain text containing the list of all words you would like to be treated as stop words, one word per line. To create a stopwords file, you would either have to build it manually or use the indexer tool, which can provide a list of the highest-frequency words in the index.
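The file format and the frequency-based approach can be sketched in plain Python. This is not the indexer tool and does not reproduce its output; all function names below are hypothetical, and the tiny corpus exists only for illustration.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, List, Set


def top_words(docs: Iterable[str], n: int) -> List[str]:
    """Return the n most frequent whitespace-separated words in docs."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return [word for word, _ in counts.most_common(n)]


def write_stopwords(path: Path, words: Iterable[str]) -> None:
    """Write the words as plain text, one word per line."""
    path.write_text("\n".join(words) + "\n", encoding="utf-8")


def load_stopwords(path: Path) -> Set[str]:
    """Read a one-word-per-line file, ignoring empty lines."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopwords]


docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a cat and a dog and a bird",
]
print(top_words(docs, 2))  # ['the', 'a']
```

A list produced this way still deserves a manual review before use, since very frequent words are not always irrelevant to your queries.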

Alternatively, just as with the charset_table and ngram_chars options, you can now use one of our default stopwords files. Stopwords for 50 languages are currently available. Here is the full list of their aliases:

  • af – Afrikaans
  • ar – Arabic
  • bg – Bulgarian
  • bn – Bengali
  • ca – Catalan
  • ckb – Kurdish
  • cz – Czech
  • da – Danish
  • de – German
  • el – Greek
  • en – English
  • eo – Esperanto
  • es – Spanish
  • et – Estonian
  • eu – Basque
  • fa – Persian
  • fi – Finnish
  • fr – French
  • ga – Irish
  • gl – Galician
  • hi – Hindi
  • he – Hebrew
  • hr – Croatian
  • hu – Hungarian
  • hy – Armenian
  • id – Indonesian
  • it – Italian
  • ja – Japanese
  • ko – Korean
  • la – Latin
  • lt – Lithuanian
  • lv – Latvian
  • mr – Marathi
  • nl – Dutch
  • no – Norwegian
  • pl – Polish
  • pt – Portuguese
  • ro – Romanian
  • ru – Russian
  • sk – Slovak
  • sl – Slovenian
  • so – Somali
  • st – Sotho
  • sv – Swedish
  • sw – Swahili
  • th – Thai
  • tr – Turkish
  • yo – Yoruba
  • zh – Chinese
  • zu – Zulu

 

For example, to use stopwords for Italian, just put the following line in your config file:

stopwords         = it

 

If you need to use stopwords for multiple languages, list all their aliases, separated by commas:

stopwords         = en,it,ru

 

All default stopwords files are stored in a separate folder. You can find them in /share/stopwords within your Manticore installation folder on Windows machines and in /usr/local/share/manticore/stopwords on Linux machines. The same folder also contains languages_list.txt, which lists all languages for which default stopwords are currently provided.

 


© 2019 Manticore Software Ltd. Registered Address: Office 2, Derby House, 123 Watling Street, Gillingham, Kent, ME7 2YY
Company No. 10772872