Inline Stopwords, Exceptions, and Wordforms

Manticore Search now supports inline specification of tokenization dictionary settings directly in the CREATE TABLE statement. This enhancement eliminates the need for external files when configuring stopwords, exceptions, wordforms, and hitless words, making table creation more streamlined and deployment-friendly.

New Features

Four new configuration options are now available in RT mode:

  • stopwords_list - Specify stop words directly in the table definition
  • exceptions_list - Define tokenization exceptions inline
  • wordforms_list - Configure word form mappings without external files
  • hitless_words_list - Set hitless words as part of table creation

All of these options use semicolon (;) as a separator between entries, making them easy to use in SQL and HTTP JSON interfaces.

The Problem They Solve

Traditionally, configuring tokenization dictionaries required creating external files that Manticore would read during table creation. While this approach works well in many scenarios, it presents several challenges:

File Permission Issues

Web applications running under restricted user accounts often struggle to create files in directories that are both:

  • Writable by the web server process
  • Readable by the Manticore daemon process

This is particularly problematic in shared hosting environments (such as Virtualmin or similar control panel setups), where user home directories are typically readable only by their owner and system directories often have restrictive permissions.

Sticky Directory Problems

Using system temporary directories (like /tmp) introduces another issue: the sticky bit on these directories can prevent proper cleanup of stopword files. When indexes are frequently rebuilt, orphaned files can accumulate, consuming disk space and creating maintenance headaches.

File Lifecycle Management

When tables are frequently created and destroyed, managing the associated tokenization dictionary files becomes cumbersome. Developers must:

  1. Create the file before table creation
  2. Ensure the file is readable by Manticore
  3. Remember to clean up the file when the table is dropped

This manual process is error-prone and can lead to file system clutter.

The New Options

The new *_list options let you specify tokenization dictionary settings directly in the CREATE TABLE statement, so you never create or reference external paths. With external files, SHOW CREATE TABLE shows file paths, and the dictionary content lives in separate files that you maintain. With the inline options, the dictionary content lives in the DDL (internally it still ends up as files in the table directory, same as with file paths), and SHOW CREATE TABLE shows the full dictionary settings inline (e.g., stopwords_list = 'a; the; an'). The table definition is therefore self-contained in one statement: easier to version control, to copy or share, and to move between environments.

Usage Examples

Stopwords

Instead of creating a stopwords file:

-- Old way (requires external file)
CREATE TABLE products(title text, price float) 
stopwords = '/usr/local/manticore/data/stopwords.txt'

You can now specify stopwords inline:

-- New way (no external file needed)
CREATE TABLE products(title text, price float) 
stopwords_list = 'a; the; an; and; or; but'

Exceptions

Exceptions (synonyms) can be defined inline:

CREATE TABLE products(title text, price float) 
exceptions_list = 'AT&T => ATT; MS Windows => ms windows; C++ => cplusplus'

Wordforms

Word form mappings can be specified directly:

CREATE TABLE products(title text, price float) 
wordforms_list = 'walks > walk; walked > walk; walking > walk'

Hitless Words

Hitless words can be configured inline:

CREATE TABLE products(title text, price float) 
hitless_words_list = 'hello; world; test'

Combining Multiple Options

You can combine all these options in a single CREATE TABLE statement:

CREATE TABLE products(title text, price float) 
stopwords_list = 'a; the; an' 
exceptions_list = 'AT&T => ATT' 
wordforms_list = 'walks > walk; walked > walk' 
hitless_words_list = 'hello; world'
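
When tables are created programmatically, the statement above can be assembled from ordinary lists. The following is a minimal Python sketch, not part of Manticore: build_create_table and ALLOWED_OPTIONS are hypothetical names introduced here, the table name and column definition are assumed to come from trusted code, and entries containing a single quote are rejected because quote escaping is not covered in this post.

```python
ALLOWED_OPTIONS = ("stopwords_list", "exceptions_list",
                   "wordforms_list", "hitless_words_list")


def build_create_table(table, columns, dictionaries):
    """Return a CREATE TABLE statement with inline *_list options.

    `dictionaries` maps an option name to a list of entries, for example
    {"stopwords_list": ["a", "the", "an"]}.
    """
    lines = [f"CREATE TABLE {table}({columns})"]
    for option, entries in dictionaries.items():
        if option not in ALLOWED_OPTIONS:
            raise ValueError(f"unknown option: {option}")
        # Entries are joined with the semicolon separator described in this post.
        value = "; ".join(entry.strip() for entry in entries)
        if "'" in value:
            # Assumption: quote escaping is out of scope for this sketch.
            raise ValueError(f"{option} contains a single quote")
        lines.append(f"{option} = '{value}'")
    return "\n".join(lines)


statement = build_create_table(
    "products",
    "title text, price float",
    {
        "stopwords_list": ["a", "the", "an"],
        "exceptions_list": ["AT&T => ATT"],
        "wordforms_list": ["walks > walk", "walked > walk"],
        "hitless_words_list": ["hello", "world"],
    },
)
print(statement)
```

The printed statement has the same shape as the combined example above and can be sent through whichever client your application already uses.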

When to Use Inline Configuration

Inline configuration is ideal when:

  1. Small to Medium Lists: The lists are reasonably sized (typically under a few hundred entries). For very large dictionaries, external files may still be more practical.
  2. Dynamic Table Creation: Your application programmatically creates and destroys tables, making file management cumbersome.
  3. Restricted File System Access: You're running in an environment with limited file system permissions (shared hosting, containers, etc.).
  4. Simplified Deployment: You want to avoid managing additional files as part of your deployment process.
  5. Frequent Index Rebuilding: Tables are frequently recreated, making file cleanup a maintenance burden.

When External Files Are Better

While inline configuration is convenient, external files remain the better choice in these scenarios:

  1. Large Dictionaries: When you have thousands of entries, external files are more manageable and don't bloat your CREATE TABLE statements.
  2. Shared Dictionaries: If the same dictionary is used across multiple tables, an external file allows you to define it once and reference it from multiple tables, reducing duplication.
  3. Version Control: External files can be easily tracked in version control systems, making it easier to review changes and maintain history.
  4. Dynamic Updates: If you need to update dictionaries without recreating tables, you can modify the external files and then run ALTER TABLE <table_name> RECONFIGURE to apply the changes. For RT tables, the new tokenization settings then take effect for new documents (existing documents remain unchanged). For plain tables, rotation is required to pick up changes from modified dictionary files.
  5. Complex Formatting: Very complex wordform or exception rules may be easier to edit in a dedicated file with proper formatting and comments.
  6. Legacy Systems: If you already have well-maintained external dictionary files, there's no need to migrate unless you're facing the specific problems that inline configuration solves.

Format Details

Separator

All *_list options use semicolons (;) to separate entries. Spaces around semicolons are normalized, so 'word1; word2' and 'word1 ; word2' are equivalent.

Escaping

If you need to use a semicolon as part of the value itself (not as a separator), escape it with a backslash: \;. For example, if you want to map a source form that contains a semicolon:

exceptions_list = 'test\;value => testvalue; another => mapping'

This creates two mappings:

  • test;value (with a semicolon) → testvalue
  • another → mapping

The escaped semicolon (\;) is treated as a literal semicolon character, not as a separator between entries.
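
To make the separator and escaping rules concrete, here is a small Python sketch that reads a list value the way this section describes: split on unescaped semicolons, trim the surrounding spaces, then turn \; back into a literal semicolon. It is an illustration only, not Manticore's parser; split_entries is a hypothetical helper, and an escaped backslash directly before a semicolon is not handled.

```python
import re


def split_entries(value):
    """Split a *_list value into entries, honoring the \\; escape."""
    # Split only on semicolons that are not preceded by a backslash.
    raw_entries = re.split(r"(?<!\\);", value)
    entries = []
    for raw in raw_entries:
        entry = raw.strip().replace("\\;", ";")
        if entry:
            entries.append(entry)
    return entries


print(split_entries(r"test\;value => testvalue; another => mapping"))
print(split_entries("word1; word2") == split_entries("word1 ; word2"))
```

The first call yields the two mappings listed above, and the second shows that spacing around the separator does not change the result.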

Wordforms Format

Wordforms support both > and => as separators:

wordforms_list = 'word1 > form1; word2 => form2'
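
As a rough illustration of accepting both separators, the Python sketch below turns a wordforms_list value into a source-to-destination mapping. It is not Manticore's parser: parse_wordforms is a hypothetical helper, escaped semicolons are ignored for brevity, and each source is assumed to map to one destination.

```python
import re


def parse_wordforms(value):
    """Parse 'src > dst; src => dst' into a dict of source -> destination."""
    mapping = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # '=?>' matches '=>' as well as a bare '>'.
        parts = re.split(r"\s*=?>\s*", entry, maxsplit=1)
        if len(parts) != 2 or not parts[0] or not parts[1]:
            raise ValueError(f"not a wordform mapping: {entry!r}")
        mapping[parts[0]] = parts[1]
    return mapping


print(parse_wordforms("word1 > form1; word2 => form2"))
```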

Exceptions Format

Exceptions use => as the separator between source and destination forms:

exceptions_list = 'source form => destination form'

Note: When using exceptions_list, you may see warnings in the searchd log about mapping token (=>) not found in temporary exception files. These warnings are harmless and can be safely ignored: they occur during internal file processing, and the exceptions function correctly despite them.

Example: Stopwords, Wordforms, and Exceptions Together

Here's a practical example using inline stopwords, wordforms, and exceptions on a single table. Wordforms normalize variants to a single form (e.g. "learning" → "learn"); exceptions map shorthand to a normalized form (e.g. "JS" → "javascript") so that both "JS" and "JavaScript" match the same documents. Use lowercase in the exception destination so it matches the token form produced by charset_table.

-- Create a table with inline stopwords, wordforms, and exceptions
CREATE TABLE articles(id bigint, title text)
stopwords_list = 'a; the; an; and; or; but; in; on; at; to; for; of; with'
wordforms_list = 'learning > learn; programming > program; reference > refer; introduction > intro; complete > complet; basics > basic'
exceptions_list = 'JS => javascript; ML => machine learning';

-- Insert test data
INSERT INTO articles VALUES
  (1, 'The Quick Guide to Python Programming'),
  (2, 'A Complete Reference for JavaScript'),
  (3, 'An Introduction to Machine Learning'),
  (4, 'Python Programming Basics'),
  (5, 'Getting Started with JS');

Stopwords: queries with or without stopwords match the same documents.

SELECT * FROM articles WHERE MATCH('python');

id | title
1  | The Quick Guide to Python Programming
4  | Python Programming Basics

SELECT * FROM articles WHERE MATCH('the python');

id | title
1  | The Quick Guide to Python Programming
4  | Python Programming Basics

Phrase search: stopwords are skipped for matching but still affect positions (tunable with stopword_step).

SELECT * FROM articles WHERE MATCH('"the quick"');

id | title
1  | The Quick Guide to Python Programming

Wordforms: "learn" matches "Learning" via the wordform.

SELECT * FROM articles WHERE MATCH('learn');

id | title
3  | An Introduction to Machine Learning

Exceptions: the mapping JS => javascript normalizes "JS" to "javascript" when it appears in text or in the query. Because the destination is lowercase, it matches the token form that charset_table produces for "JavaScript", so both MATCH('JavaScript') and MATCH('JS') return the same rows.

SELECT * FROM articles WHERE MATCH('JavaScript');

id | title
2  | A Complete Reference for JavaScript
5  | Getting Started with JS

SELECT * FROM articles WHERE MATCH('JS');

id | title
2  | A Complete Reference for JavaScript
5  | Getting Started with JS
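
To see why these queries return the rows they do, it can help to trace the normalization by hand. The Python sketch below is a deliberately simplified toy model, not Manticore's tokenizer: it splits on whitespace, applies the case-sensitive exception to the raw word, lowercases everything else, applies wordforms, and drops stopwords. The processing order and the tokens and match helpers are assumptions made for illustration, and phrase search and positions are not modeled.

```python
STOPWORDS = set("a the an and or but in on at to for of with".split())
WORDFORMS = {
    "learning": "learn", "programming": "program", "reference": "refer",
    "introduction": "intro", "complete": "complet", "basics": "basic",
}
EXCEPTIONS = {"JS": "javascript", "ML": "machine learning"}

DOCS = {
    1: "The Quick Guide to Python Programming",
    2: "A Complete Reference for JavaScript",
    3: "An Introduction to Machine Learning",
    4: "Python Programming Basics",
    5: "Getting Started with JS",
}


def tokens(text):
    """Normalize text into index terms under the toy model."""
    result = []
    for raw in text.split():
        # Toy assumption: exceptions match the raw word before lowercasing.
        token = EXCEPTIONS.get(raw, raw.lower())
        token = WORDFORMS.get(token, token)
        if token not in STOPWORDS:
            result.append(token)
    return result


def match(query):
    """Return ids of documents that contain every normalized query term."""
    wanted = set(tokens(query))
    return [doc_id for doc_id, title in DOCS.items()
            if wanted <= set(tokens(title))]


for query in ("python", "the python", "learn", "JavaScript", "JS"):
    print(query, "->", match(query))
```

Under these assumptions the toy model returns the same document ids as the queries above, because both the stored text and the query pass through the same normalization.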

Benefits Summary

  1. No File Management: Eliminates the need to create, manage, and clean up external files
  2. Simplified Deployment: Configuration is part of the table definition, making deployments more straightforward
  3. Permission Independence: No file system permission issues between web server and Manticore processes
  4. Better for Automation: Easier to script and automate table creation
  5. Self-Contained and Self-Documenting: Table configuration is complete in the CREATE TABLE statement, and SHOW CREATE TABLE shows the full dictionary content inline, so definitions are easy to share and version control without managing separate dictionary files

Migration Path

If you're currently using external files, you can easily migrate to inline configuration:

  1. Read your existing file content
  2. Convert the format to use semicolons as separators
  3. Replace the file path with the *_list option in your CREATE TABLE statement

For example, if you have a stopwords.txt file containing:

a
the
an
and

You can convert it to:

stopwords_list = 'a; the; an; and'
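
For longer files the conversion can be scripted. Here is a minimal Python sketch, assuming a stopwords file that holds plain words separated by newlines or other whitespace, with no semicolons or single quotes in them; stopwords_file_to_option is a hypothetical helper name.

```python
from pathlib import Path


def stopwords_file_to_option(path):
    """Convert a whitespace-separated stopwords file to a stopwords_list option."""
    words = Path(path).read_text(encoding="utf-8").split()
    for word in words:
        # Assumption: plain words only; anything else needs manual review.
        if ";" in word or "'" in word:
            raise ValueError(f"unsupported character in stopword: {word!r}")
    return "stopwords_list = '" + "; ".join(words) + "'"
```

Calling stopwords_file_to_option on the stopwords.txt shown above produces the stopwords_list line from this section, ready to paste into a CREATE TABLE statement.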

Conclusion

The new inline tokenization dictionary configuration options (stopwords_list, exceptions_list, wordforms_list, and hitless_words_list) provide a cleaner, more maintainable way to configure tokenization settings. They're particularly valuable in environments where file management is challenging or when you want to simplify your deployment process and keep table definitions self-contained. External files remain supported and are still the better fit for large or shared dictionaries, but inline configuration is a convenient alternative for most other use cases.

Install Manticore Search