Default charset tables and stopwords files

Author: Sergey Nikolaev
Published: Jan 28, 2019 - 4 Min read

In this article we talk about the new additions in the character set tables and stopwords which aim to simplify the usage of these options when configuring indexes.

When initially analyzing a document text, Manticore Search needs to know which symbols in the text are meaningful for further processing (breaking the full text into separate words, processing morphology and so on) and which are not. The charset_table option defines the valid text characters: through it you specify the set of symbols you want to work with.

Also, to provide better search quality, Manticore Search performs so-called character folding. For example, when your search query is analyzed, uppercase symbols are mapped to lowercase ones, symbols with diacritics (such as â, ê, î, ô etc.) are mapped to their base equivalents, and so on. The charset table is the mechanism that defines all these transformations.
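To make the idea of folding concrete, here is a minimal Python sketch of case and diacritic folding. It is only an illustration built on Python's standard unicodedata module; Manticore itself applies the explicit mapping rules listed in charset_table, and the fold function name is hypothetical.

```python
import unicodedata

def fold(text):
    """Lowercase the text and strip combining diacritical marks."""
    # NFD decomposition splits "â" into "a" plus a combining circumflex.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category Mn) and keep the base letters.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()

print(fold("Âge Être Île Ôter"))  # age etre ile oter
```

With folding like this applied to both documents and queries, a search for "etre" can match a document containing "Être".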

A special case is languages like Chinese, which feature unsegmented text that cannot be easily split into separate words because there are no clear word separators. ('CJK' is a widely used term for these languages, an abbreviation of Chinese-Japanese-Korean.) To provide effective search for these languages, the ngram_chars option should be used together with ngram_len. The text is then treated as a set of separate N-grams, each N-gram being a sequence of characters whose length equals the ngram_len value. (Currently only ngram_len = 1 is supported.) These characters need to be defined in the ngram_chars directive and cannot appear in the regular charset_table.
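The effect of ngram_len = 1 can be sketched in a few lines of Python. This is a simplified model, not Manticore's tokenizer: it assumes that only the CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as N-gram characters, and the is_cjk and tokenize names are hypothetical.

```python
def is_cjk(ch):
    # Assumption: only the CJK Unified Ideographs block is treated as n-gram chars.
    return "\u4e00" <= ch <= "\u9fff"

def tokenize(text):
    """Split text into regular words and single-character CJK tokens."""
    tokens = []
    word = ""
    for ch in text:
        if is_cjk(ch):
            # Flush any pending regular word, then emit the 1-gram.
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)
        elif ch.isalnum():
            # Regular characters accumulate into a lowercase word.
            word += ch.lower()
        else:
            # Anything else acts as a separator.
            if word:
                tokens.append(word)
                word = ""
    if word:
        tokens.append(word)
    return tokens

print(tokenize("Manticore 搜索引擎 rocks"))
# ['manticore', '搜', '索', '引', '擎', 'rocks']
```

Each CJK character becomes its own token, while the surrounding Latin text is still split on separators as usual.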

So if you want your search to support different languages, you’ll need to define sets of valid characters and folding rules for all of them, which can be quite a laborious task.

We’ve performed this task for you by preparing default charset tables that cover non-CJK and CJK languages respectively. These tables should be sufficient in most cases. They are based on ICU character foldings data that can be seen here and define the following foldings:

ascii native digit folding
case folding
diacritic folding
Han radical folding
Hiragana folding
Katakana folding
letterform folding
simplified Han folding
subscript folding
superscript folding
width folding

Here you can see the source files used for the default charsets: cjk.txt, non_cjk.txt

To be able to work with both CJK and non-CJK languages, set the options in your configuration file as shown below:

charset_table     = non_cjk

...

ngram_len         = 1

ngram_chars     = cjk

If you don’t need support for CJK languages, you can just omit the ngram_len and ngram_chars options.

If you still need some specific rules to be enabled for certain languages, you can create the appropriate character tables manually and put them in your configuration file instead of the default ones.

In addition to the above-mentioned options, you can use the stopwords option to exclude from further processing and indexing some of the most frequent words, which usually have little relevance to search queries. For English, for example, those will probably include a, the, this, that and so on. To enable this option, create your stopwords file and set the path to it in your configuration file:

stopwords         = /path/to/your/stopwords/file.txt

The stopwords file is plain text containing the list of all words you would like to be treated as stop words, one word per line. To create a stopwords file, you either build it manually or use the indexer tool, which can provide a list of the highest-frequency words in the index.
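As a rough illustration of the file format and of frequency-based selection, the Python sketch below counts words in a few sample documents and writes the most frequent ones, one per line. It is a stand-in for demonstration only and does not reproduce what the indexer tool does; the top_words and write_stopwords helpers and the sample documents are hypothetical.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def top_words(documents, limit):
    """Return the `limit` most frequent lowercase words across documents."""
    counts = Counter()
    for doc in documents:
        # Simplification: only ASCII letters are treated as word characters.
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [word for word, _ in counts.most_common(limit)]

def write_stopwords(words, path):
    """Write the words as plain text, one word per line."""
    Path(path).write_text("".join(f"{w}\n" for w in words), encoding="utf-8")

docs = ["the cat and the dog", "the bird and a fish"]
stop = top_words(docs, 2)

# Write to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "stopwords.txt"
    write_stopwords(stop, target)
    print(target.read_text(encoding="utf-8"))
```

A list produced this way should still be reviewed by hand, since a frequent word is not necessarily an irrelevant one.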

Alternatively, just as with the charset_table and ngram_chars options, you can now use one of our default stopwords files. Currently, stopwords for 50 languages are available. Here is the full list of their aliases:

af - Afrikaans
ar - Arabic
bg - Bulgarian
bn - Bengali
ca - Catalan
ckb - Kurdish
cz - Czech
da - Danish
de - German
el - Greek
en - English
eo - Esperanto
es - Spanish
et - Estonian
eu - Basque
fa - Persian
fi - Finnish
fr - French
ga - Irish
gl - Galician
hi - Hindi
he - Hebrew
hr - Croatian
hu - Hungarian
hy - Armenian
id - Indonesian
it - Italian
ja - Japanese
ko - Korean
la - Latin
lt - Lithuanian
lv - Latvian
mr - Marathi
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sl - Slovenian
so - Somali
st - Sotho
sv - Swedish
sw - Swahili
th - Thai
tr - Turkish
yo - Yoruba
zh - Chinese
zu - Zulu

For example, to use the stopwords for Italian, just put the following line in your config file:

stopwords         = it

If you need to use stopwords for multiple languages, list all their aliases, separated with commas:

stopwords         = en,it,ru
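If configuration files are generated by a script, a small helper can catch a mistyped alias before it reaches the config. The Python sketch below is one hypothetical way to do this: the alias set is copied from the list above, and the stopwords_directive function is not part of Manticore.

```python
# Aliases copied from the list of default stopwords files in this article.
DEFAULT_ALIASES = {
    "af", "ar", "bg", "bn", "ca", "ckb", "cz", "da", "de", "el",
    "en", "eo", "es", "et", "eu", "fa", "fi", "fr", "ga", "gl",
    "hi", "he", "hr", "hu", "hy", "id", "it", "ja", "ko", "la",
    "lt", "lv", "mr", "nl", "no", "pl", "pt", "ro", "ru", "sk",
    "sl", "so", "st", "sv", "sw", "th", "tr", "yo", "zh", "zu",
}

def stopwords_directive(aliases):
    """Build a `stopwords = ...` config line from known language aliases."""
    unknown = [a for a in aliases if a not in DEFAULT_ALIASES]
    if unknown:
        raise ValueError("unknown stopwords aliases: " + ", ".join(unknown))
    # Aliases are joined with commas and no spaces, as in the example above.
    return "stopwords = " + ",".join(aliases)

print(stopwords_directive(["en", "it", "ru"]))  # stopwords = en,it,ru
```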

All default stopwords files are stored in a separate folder. You can find them in /share/stopwords within your Manticore installation folder on Windows machines and in /usr/local/share/manticore/stopwords on Linux machines. The list of all languages for which default stopwords are currently provided can also be found there, in languages_list.txt.
