Manticore Search now supports inline specification of tokenization dictionary settings directly in the CREATE TABLE statement. This enhancement eliminates the need for external files when configuring stopwords, exceptions, wordforms, and hitless words, making table creation more streamlined and deployment-friendly.
New Features
Four new configuration options are now available in RT mode:
- stopwords_list: Specify stop words directly in the table definition
- exceptions_list: Define tokenization exceptions inline
- wordforms_list: Configure word form mappings without external files
- hitless_words_list: Set hitless words as part of table creation
All of these options use semicolon (;) as a separator between entries, making them easy to use in SQL and HTTP JSON interfaces.
The Problem They Solve
Traditionally, configuring tokenization dictionaries required creating external files that Manticore would read during table creation. While this approach works well in many scenarios, it presents several challenges:
File Permission Issues
Web applications running under restricted user accounts often struggle to create files in directories that are both:
- Writable by the web server process
- Readable by the Manticore daemon process
This is particularly problematic in shared hosting environments (such as Virtualmin or similar control panel setups), where web applications run under restricted user accounts: user home directories are typically readable only by their owner, and system directories often have restrictive permissions.
Sticky Directory Problems
Using system temporary directories (like /tmp) introduces another issue: because of the sticky bit on these directories, only a file's owner (or root) can delete it, which can prevent proper cleanup of stopword files created by a different process. When indexes are frequently rebuilt, orphaned files can accumulate, consuming disk space and creating maintenance headaches.
File Lifecycle Management
When tables are frequently created and destroyed, managing the associated tokenization dictionary files becomes cumbersome. Developers must:
- Create the file before table creation
- Ensure the file is readable by Manticore
- Remember to clean up the file when the table is dropped
This manual process is error-prone and can lead to file system clutter.
The New Options
The new *_list options let you specify tokenization dictionary settings directly in the CREATE TABLE statement. With external files, SHOW CREATE TABLE shows file paths, and you maintain the dictionary content in separate files. With the inline options, you never create or reference external paths: the dictionary content lives in the DDL (internally it still ends up as files in the table directory, the same as with file paths). SHOW CREATE TABLE shows the full dictionary settings inline (e.g., stopwords_list = 'a; the; an'), so the table definition is self-contained in one statement. That makes it easier to version control, copy, or share, and portable across different environments.
Usage Examples
Stopwords
Instead of creating a stopwords file:
-- Old way (requires external file)
CREATE TABLE products(title text, price float)
stopwords = '/usr/local/manticore/data/stopwords.txt'
You can now specify stopwords inline:
-- New way (no external file needed)
CREATE TABLE products(title text, price float)
stopwords_list = 'a; the; an; and; or; but'
Exceptions
Exceptions (synonyms) can be defined inline:
CREATE TABLE products(title text, price float)
exceptions_list = 'AT&T => ATT; MS Windows => ms windows; C++ => cplusplus'
Wordforms
Word form mappings can be specified directly:
CREATE TABLE products(title text, price float)
wordforms_list = 'walks > walk; walked > walk; walking > walk'
Hitless Words
Hitless words can be configured inline:
CREATE TABLE products(title text, price float)
hitless_words_list = 'hello; world; test'
Combining Multiple Options
You can combine all these options in a single CREATE TABLE statement:
CREATE TABLE products(title text, price float)
stopwords_list = 'a; the; an'
exceptions_list = 'AT&T => ATT'
wordforms_list = 'walks > walk; walked > walk'
hitless_words_list = 'hello; world'
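For applications that create tables programmatically, the combined statement above can be assembled from plain lists. The following is a minimal Python sketch, not part of Manticore: the helper name build_create_table is hypothetical, and it assumes the entries contain no single quotes or literal semicolons (those would need escaping before use).

```python
# Hypothetical helper: assembles a CREATE TABLE statement with inline *_list options.
# Assumption: entries contain no single quotes or literal semicolons.
ALLOWED_OPTIONS = {
    "stopwords_list",
    "exceptions_list",
    "wordforms_list",
    "hitless_words_list",
}


def build_create_table(table, columns, **lists):
    """Return a CREATE TABLE statement with one line per inline option."""
    lines = [f"CREATE TABLE {table}({columns})"]
    for option, entries in lists.items():
        if option not in ALLOWED_OPTIONS:
            raise ValueError(f"unsupported option: {option}")
        # Entries are joined with the semicolon separator used by all *_list options.
        value = "; ".join(entries)
        lines.append(f"{option} = '{value}'")
    return "\n".join(lines)


statement = build_create_table(
    "products",
    "title text, price float",
    stopwords_list=["a", "the", "an"],
    exceptions_list=["AT&T => ATT"],
    wordforms_list=["walks > walk", "walked > walk"],
    hitless_words_list=["hello", "world"],
)
print(statement)
```

The printed statement has the same shape as the combined example above and can be sent to Manticore through whichever client the application already uses.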
When to Use Inline Configuration
Inline configuration is ideal when:
- Small to Medium Lists: The lists are reasonably sized (typically under a few hundred entries). For very large dictionaries, external files may still be more practical.
- Dynamic Table Creation: Your application programmatically creates and destroys tables, making file management cumbersome.
- Restricted File System Access: You're running in an environment with limited file system permissions (shared hosting, containers, etc.).
- Simplified Deployment: You want to avoid managing additional files as part of your deployment process.
- Frequent Index Rebuilding: Tables are frequently recreated, making file cleanup a maintenance burden.
When External Files Are Better
While inline configuration is convenient, external files remain the better choice in these scenarios:
- Large Dictionaries: When you have thousands of entries, external files are more manageable and don't bloat your CREATE TABLE statements.
- Shared Dictionaries: If the same dictionary is used across multiple tables, an external file allows you to define it once and reference it from multiple tables, reducing duplication.
- Version Control: External files can be easily tracked in version control systems, making it easier to review changes and maintain history.
- Dynamic Updates: If you need to update dictionaries without recreating tables, you can modify the external files and then run ALTER TABLE <table_name> RECONFIGURE to apply the changes. For RT tables, this makes the new tokenization settings take effect for new documents (existing documents remain unchanged). For plain tables, rotation is required to pick up changes from modified dictionary files.
- Complex Formatting: Very complex wordform or exception rules may be easier to edit in a dedicated file with proper formatting and comments.
- Legacy Systems: If you already have well-maintained external dictionary files, there's no need to migrate unless you're facing the specific problems that inline configuration solves.
Format Details
Separator
All *_list options use semicolons (;) to separate entries. Spaces around semicolons are normalized, so 'word1; word2' and 'word1 ; word2' are equivalent.
Escaping
If you need to use a semicolon as part of the value itself (not as a separator), escape it with a backslash: \;. For example, if you want to map a source form that contains a semicolon:
exceptions_list = 'test\;value => testvalue; another => mapping'
This creates two mappings:
- test;value (with a semicolon) → testvalue
- another → mapping
The escaped semicolon (\;) is treated as a literal semicolon character, not as a separator between entries.
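To make the separator and escaping rules concrete, here is a small Python sketch that models them. It is an illustration of the rules as described in this post, not Manticore's actual parser; the function name split_entries is hypothetical, and a backslash that is itself escaped before a semicolon is not handled.

```python
import re


def split_entries(value):
    """Split a *_list value on unescaped semicolons and trim each entry."""
    # (?<!\\); matches a semicolon that is not preceded by a backslash.
    parts = re.split(r"(?<!\\);", value)
    # Drop empty entries, trim whitespace, and turn \; back into a literal semicolon.
    return [part.strip().replace("\\;", ";") for part in parts if part.strip()]


entries = split_entries(r"test\;value => testvalue; another => mapping")
print(entries)

# Whitespace around the separator does not change the result.
print(split_entries("word1; word2") == split_entries("word1 ; word2"))
```

Running a value through a check like this before issuing CREATE TABLE is one way to confirm that it splits into the entries you intended.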
Wordforms Format
Wordforms support both > and => as separators:
wordforms_list = 'word1 > form1; word2 => form2'
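If your application validates wordform entries before building the statement, both separators need to be recognized. The sketch below is a hypothetical Python helper (parse_wordform is not a Manticore function) that assumes each entry is a single "source separator destination" pair, as in the example above.

```python
def parse_wordform(entry):
    """Return (source, destination) for an entry that uses '>' or '=>'."""
    # Check "=>" first: it contains ">", so testing ">" first would
    # leave a stray "=" at the end of the source form.
    for separator in ("=>", ">"):
        if separator in entry:
            source, destination = entry.split(separator, 1)
            return source.strip(), destination.strip()
    raise ValueError(f"no wordform separator found in: {entry!r}")


pairs = [parse_wordform(e) for e in ["word1 > form1", "word2 => form2"]]
print(pairs)
```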
Exceptions Format
Exceptions use => as the separator between source and destination forms:
exceptions_list = 'source form => destination form'
Note: When using exceptions_list, you may see warnings in the searchd log about mapping token (=>) not found in temporary exception files. These warnings are harmless and can be safely ignored—the exceptions function correctly despite these messages. The warnings occur during internal file processing and don't affect the actual exception mapping behavior.
Example: Stopwords, Wordforms, and Exceptions Together
Here's a practical example using inline stopwords, wordforms, and exceptions on a single table. Wordforms normalize variants to a single form (e.g. "learning" → "learn"); exceptions map shorthand to a normalized form (e.g. "JS" → "javascript") so that both "JS" and "JavaScript" match the same documents. Use lowercase in the exception destination so it matches the token form produced by charset_table.
-- Create a table with inline stopwords, wordforms, and exceptions
CREATE TABLE articles(id bigint, title text)
stopwords_list = 'a; the; an; and; or; but; in; on; at; to; for; of; with'
wordforms_list = 'learning > learn; programming > program; reference > refer; introduction > intro; complete > complet; basics > basic'
exceptions_list = 'JS => javascript; ML => machine learning';
-- Insert test data
INSERT INTO articles VALUES
(1, 'The Quick Guide to Python Programming'),
(2, 'A Complete Reference for JavaScript'),
(3, 'An Introduction to Machine Learning'),
(4, 'Python Programming Basics'),
(5, 'Getting Started with JS');
Stopwords: queries with or without stopwords match the same documents.
SELECT * FROM articles WHERE MATCH('python');
| id | title |
|---|---|
| 1 | The Quick Guide to Python Programming |
| 4 | Python Programming Basics |
SELECT * FROM articles WHERE MATCH('the python');
| id | title |
|---|---|
| 1 | The Quick Guide to Python Programming |
| 4 | Python Programming Basics |
Phrase search: stopwords are skipped for matching but still affect positions (tunable with stopword_step).
SELECT * FROM articles WHERE MATCH('"the quick"');
| id | title |
|---|---|
| 1 | The Quick Guide to Python Programming |
Wordforms: "learn" matches "Learning" via the wordform.
SELECT * FROM articles WHERE MATCH('learn');
| id | title |
|---|---|
| 3 | An Introduction to Machine Learning |
Exceptions: the mapping JS => javascript normalizes "JS" to "javascript" when it appears in text or in the query. Because the destination is lowercase, it matches the token form that charset_table produces for "JavaScript", so both MATCH('JavaScript') and MATCH('JS') return the same rows.
SELECT * FROM articles WHERE MATCH('JavaScript');
| id | title |
|---|---|
| 2 | A Complete Reference for JavaScript |
| 5 | Getting Started with JS |
SELECT * FROM articles WHERE MATCH('JS');
| id | title |
|---|---|
| 2 | A Complete Reference for JavaScript |
| 5 | Getting Started with JS |
Benefits Summary
- No File Management: Eliminates the need to create, manage, and clean up external files
- Simplified Deployment: Configuration is part of the table definition, making deployments more straightforward
- Permission Independence: No file system permission issues between web server and Manticore processes
- Better for Automation: Easier to script and automate table creation
- Self-Contained and Self-Documenting: Table configuration is complete in the CREATE TABLE statement, and SHOW CREATE TABLE shows the full dictionary content inline, so definitions are easy to share and version control without managing separate dictionary files
Migration Path
If you're currently using external files, you can easily migrate to inline configuration:
- Read your existing file content
- Convert the format to use semicolons as separators
- Replace the file path with the *_list option in your CREATE TABLE statement
For example, if you have a stopwords.txt file containing:
a
the
an
and
You can convert it to:
stopwords_list = 'a; the; an; and'
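The conversion can also be scripted. The following Python sketch assumes a stopwords file with one word per line, as in the example above; the function name stopwords_file_to_list is hypothetical, and words containing single quotes are not handled.

```python
def stopwords_file_to_list(text):
    """Convert one-word-per-line file content into a stopwords_list option."""
    # Skip blank lines and trim surrounding whitespace.
    words = [line.strip() for line in text.splitlines() if line.strip()]
    # A literal semicolon inside a word must be escaped as \; so it is not
    # treated as an entry separator.
    escaped = [word.replace(";", "\\;") for word in words]
    return "stopwords_list = '" + "; ".join(escaped) + "'"


file_content = "a\nthe\nan\nand\n"
option = stopwords_file_to_list(file_content)
print(option)
```

In a real migration you would read the content from your existing file (for example with open(path).read()) and paste the resulting option into the CREATE TABLE statement in place of the stopwords file path.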
Conclusion
The new inline tokenization dictionary configuration options (stopwords_list, exceptions_list, wordforms_list, and hitless_words_list) provide a cleaner, more maintainable way to configure tokenization settings. They're particularly valuable in environments where file management is challenging or when you want to simplify your deployment process and keep table definitions self-contained. While external files remain supported for large dictionaries, inline configuration offers a convenient alternative for most use cases.
