# Inline Stopwords, Exceptions, and Wordforms

Define stopwords, exceptions, wordforms, and hitless words inline in CREATE TABLE (RT mode) to remove external files and simplify deployment and table definitions.

Manticore Search [now supports](/blog/manticore-search-17-5-1/) inline specification of tokenization dictionary settings directly in the `CREATE TABLE` statement. This enhancement eliminates the need for external files when configuring stopwords, exceptions, wordforms, and hitless words, making table creation more streamlined and deployment-friendly.

## New Features

Four new configuration options are now available in [RT mode](https://manual.manticoresearch.com/Read_this_first#Real-time-mode-vs-plain-mode):

- **`stopwords_list`** - Specify stop words directly in the table definition
- **`exceptions_list`** - Define tokenization exceptions inline
- **`wordforms_list`** - Configure word form mappings without external files
- **`hitless_words_list`** - Set hitless words as part of table creation

All of these options use a semicolon (`;`) as the separator between entries, making them easy to use in SQL and HTTP JSON interfaces.

## The Problem They Solve

Traditionally, configuring tokenization dictionaries required creating external files that Manticore would read during table creation. While this approach works well in many scenarios, it presents several challenges:

### File Permission Issues

Web applications running under restricted user accounts often struggle to create files in directories that are both:
- Writable by the web server process
- Readable by the Manticore daemon process

This is particularly problematic in shared hosting environments (such as [Virtualmin](https://www.virtualmin.com/) or similar control panel setups), where user home directories are typically readable only by their owner and system directories may have restrictive permissions.

### Sticky Directory Problems

Using system temporary directories (like `/tmp`) introduces another issue: the sticky bit on these directories lets only a file's owner (or a privileged user) delete or rename it, so a process other than the one that created a stopword file cannot clean it up. When indexes are frequently rebuilt, orphaned files can accumulate, consuming disk space and creating maintenance headaches.

### File Lifecycle Management

When tables are frequently created and destroyed, managing the associated tokenization dictionary files becomes cumbersome. Developers must:
1. Create the file before table creation
2. Ensure the file is readable by Manticore
3. Remember to clean up the file when the table is dropped

This manual process is error-prone and can lead to file system clutter.

### The New Options

The new `*_list` options let you specify tokenization dictionary settings directly in the `CREATE TABLE` statement, so you never create or reference external paths. With external files, `SHOW CREATE TABLE` shows file paths, and the dictionary content lives in separate files that you maintain. With the inline options, the dictionary content lives in the DDL, and `SHOW CREATE TABLE` shows it in full (e.g., `stopwords_list = 'a; the; an'`). Internally the content still ends up as files in the table directory, same as with file paths, but the table definition is self-contained in one statement. That makes it easier to version control, copy, or share, and portable across environments.

## Usage Examples

### Stopwords

Instead of creating a stopwords file:

```sql
-- Old way (requires external file)
CREATE TABLE products(title text, price float) 
stopwords = '/usr/local/manticore/data/stopwords.txt'
```

You can now specify stopwords inline:

```sql
-- New way (no external file needed)
CREATE TABLE products(title text, price float) 
stopwords_list = 'a; the; an; and; or; but'
```

### Exceptions

Exceptions (synonyms) can be defined inline:

```sql
CREATE TABLE products(title text, price float) 
exceptions_list = 'AT&T => ATT; MS Windows => ms windows; C++ => cplusplus'
```

### Wordforms

Word form mappings can be specified directly:

```sql
CREATE TABLE products(title text, price float) 
wordforms_list = 'walks > walk; walked > walk; walking > walk'
```

### Hitless Words

Hitless words can be configured inline:

```sql
CREATE TABLE products(title text, price float) 
hitless_words_list = 'hello; world; test'
```

### Combining Multiple Options

You can combine all these options in a single `CREATE TABLE` statement:

```sql
CREATE TABLE products(title text, price float) 
stopwords_list = 'a; the; an' 
exceptions_list = 'AT&T => ATT' 
wordforms_list = 'walks > walk; walked > walk' 
hitless_words_list = 'hello; world'
```
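For applications that create tables programmatically, the combined statement above could be assembled from plain data. The following is a minimal sketch, not part of any Manticore client library: `build_create_table` is a hypothetical helper, and it assumes entries contain no single quotes, backslashes, or semicolons so that it does not have to guess at SQL string escaping rules.

```python
from __future__ import annotations

# The four inline options described in this article.
ALLOWED_OPTIONS = (
    "stopwords_list",
    "exceptions_list",
    "wordforms_list",
    "hitless_words_list",
)


def build_create_table(table: str, columns: str, options: dict[str, list[str]]) -> str:
    """Assemble a CREATE TABLE statement with inline *_list options (hypothetical helper)."""
    lines = [f"CREATE TABLE {table}({columns})"]
    for name, entries in options.items():
        if name not in ALLOWED_OPTIONS:
            raise ValueError(f"unknown option: {name}")
        for entry in entries:
            # Assumption: reject characters that would need escaping instead of guessing.
            if any(ch in entry for ch in "';\\"):
                raise ValueError(f"entry needs escaping, not handled here: {entry!r}")
        value = "; ".join(entries)
        lines.append(f"{name} = '{value}'")
    return "\n".join(lines)


statement = build_create_table(
    "products",
    "title text, price float",
    {
        "stopwords_list": ["a", "the", "an"],
        "exceptions_list": ["AT&T => ATT"],
        "wordforms_list": ["walks > walk", "walked > walk"],
        "hitless_words_list": ["hello", "world"],
    },
)
print(statement)
```

The table and column names are passed through unchanged, so this sketch is only suitable for trusted input.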

## When to Use Inline Configuration

Inline configuration is ideal when:

1. **Small to Medium Lists**: The lists are reasonably sized (typically under a few hundred entries). For very large dictionaries, external files may still be more practical.
2. **Dynamic Table Creation**: Your application programmatically creates and destroys tables, making file management cumbersome.
3. **Restricted File System Access**: You're running in an environment with limited file system permissions (shared hosting, containers, etc.).
4. **Simplified Deployment**: You want to avoid managing additional files as part of your deployment process.
5. **Frequent Index Rebuilding**: Tables are frequently recreated, making file cleanup a maintenance burden.

## When External Files Are Better

While inline configuration is convenient, external files remain the better choice in these scenarios:

1. **Large Dictionaries**: When you have thousands of entries, external files are more manageable and don't bloat your `CREATE TABLE` statements.
2. **Shared Dictionaries**: If the same dictionary is used across multiple tables, an external file allows you to define it once and reference it from multiple tables, reducing duplication.
3. **Version Control**: External files can be easily tracked in version control systems, making it easier to review changes and maintain history.
4. **Dynamic Updates**: If you need to update dictionaries without recreating tables, you can modify the external files and then run `ALTER TABLE <table_name> RECONFIGURE` to apply the changes. For RT tables, this makes the new tokenization settings take effect for new documents (existing documents remain unchanged). For plain tables, rotation is required to pick up changes from modified dictionary files.
5. **Complex Formatting**: Very complex wordform or exception rules may be easier to edit in a dedicated file with proper formatting and comments.
6. **Legacy Systems**: If you already have well-maintained external dictionary files, there's no need to migrate unless you're facing the specific problems that inline configuration solves.

## Format Details

### Separator

All `*_list` options use semicolons (`;`) to separate entries. Spaces around semicolons are normalized, so `'word1; word2'` and `'word1 ; word2'` are equivalent.

### Escaping

If you need to use a semicolon as part of the value itself (not as a separator), escape it with a backslash: `\;`. For example, to map a source form that contains a semicolon:

```sql
exceptions_list = 'test\;value => testvalue; another => mapping'
```

This creates two mappings:
- `test;value` (with a semicolon) → `testvalue`
- `another` → `mapping`

The escaped semicolon (`\;`) is treated as a literal semicolon character, not as a separator between entries.
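The separator and escaping rules can be modeled on the client side. The sketch below is an illustration of the rules as described in this section, not Manticore's actual parser: `join_entries` and `split_entries` are hypothetical helpers, backslashes other than `\;` are left untouched (an assumption), and quoting the value for a SQL string literal is out of scope.

```python
from __future__ import annotations


def escape_entry(entry: str) -> str:
    """Escape literal semicolons so they are not read as separators."""
    return entry.strip().replace(";", r"\;")


def join_entries(entries: list[str]) -> str:
    """Build a *_list value: escaped entries separated by '; '."""
    return "; ".join(escape_entry(e) for e in entries if e.strip())


def split_entries(value: str) -> list[str]:
    """Split on unescaped semicolons and turn each escaped one back into ';'."""
    entries: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] == ";":
            current.append(";")  # escaped semicolon: keep as a literal character
            i += 2
        elif ch == ";":
            entries.append("".join(current).strip())  # unescaped: entry boundary
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    entries.append("".join(current).strip())
    return [e for e in entries if e]


value = join_entries(["test;value => testvalue", "another => mapping"])
print(value)
print(split_entries(value))
```

Round-tripping a list through `join_entries` and `split_entries` returns the original entries, which is a quick way to check that a generated value will be read as intended.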

### Wordforms Format

Wordforms support both `>` and `=>` as separators:

```sql
wordforms_list = 'word1 > form1; word2 => form2'
```
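If you generate `wordforms_list` values in application code, a small client-side check can catch malformed entries before the statement is sent. This is a minimal sketch under the two-separator rule stated above; `parse_wordform` is a hypothetical helper, it only handles simple `source > destination` pairs, and it does not reproduce Manticore's own parsing.

```python
from __future__ import annotations

import re

# Matches either separator, '>' or '=>', with optional surrounding spaces.
SEPARATOR = re.compile(r"\s*=?>\s*")


def parse_wordform(entry: str) -> tuple[str, str]:
    """Split one wordform entry into (source, destination)."""
    parts = SEPARATOR.split(entry.strip(), maxsplit=1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not a valid wordform entry: {entry!r}")
    return parts[0], parts[1]


pairs = [parse_wordform(e) for e in "word1 > form1; word2 => form2".split(";")]
print(pairs)
```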

### Exceptions Format

Exceptions use `=>` as the separator between source and destination forms:

```sql
exceptions_list = 'source form => destination form'
```

**Note**: When using `exceptions_list`, you may see warnings in the searchd log about `mapping token (=>) not found` in temporary exception files. These warnings occur during internal file processing, do not affect the exception mappings, and can be safely ignored.

## Example: Stopwords, Wordforms, and Exceptions Together

Here's a practical example using inline stopwords, wordforms, and exceptions on a single table. Wordforms normalize variants to a single form (e.g., "learning" → "learn"); exceptions map shorthand to a normalized form (e.g., "JS" → "javascript") so that both "JS" and "JavaScript" match the same documents. Use lowercase in the exception destination so it matches the token form produced by `charset_table`.

```sql
-- Create a table with inline stopwords, wordforms, and exceptions
CREATE TABLE articles(id bigint, title text)
stopwords_list = 'a; the; an; and; or; but; in; on; at; to; for; of; with'
wordforms_list = 'learning > learn; programming > program; reference > refer; introduction > intro; complete > complet; basics > basic'
exceptions_list = 'JS => javascript; ML => machine learning';

-- Insert test data
INSERT INTO articles VALUES
  (1, 'The Quick Guide to Python Programming'),
  (2, 'A Complete Reference for JavaScript'),
  (3, 'An Introduction to Machine Learning'),
  (4, 'Python Programming Basics'),
  (5, 'Getting Started with JS');
```

**Stopwords:** queries with or without stopwords match the same documents.

```sql
SELECT * FROM articles WHERE MATCH('python');
```

| id | title |
|----|-------|
| 1 | The Quick Guide to Python Programming |
| 4 | Python Programming Basics |

```sql
SELECT * FROM articles WHERE MATCH('the python');
```

| id | title |
|----|-------|
| 1 | The Quick Guide to Python Programming |
| 4 | Python Programming Basics |

**Phrase search:** stopwords are skipped for matching but still affect positions (tunable with [stopword_step](https://manual.manticoresearch.com/Creating_a_table/NLP_and_tokenization/Ignoring_stop-words#stopword_step)).

```sql
SELECT * FROM articles WHERE MATCH('"the quick"');
```

| id | title |
|----|-------|
| 1 | The Quick Guide to Python Programming |

**Wordforms:** "learn" matches "Learning" via the wordform.

```sql
SELECT * FROM articles WHERE MATCH('learn');
```

| id | title |
|----|-------|
| 3 | An Introduction to Machine Learning |

**Exceptions:** the mapping `JS => javascript` normalizes "JS" to "javascript" when it appears in text or in the query. Because the destination is lowercase, it matches the token form that `charset_table` produces for "JavaScript", so both `MATCH('JavaScript')` and `MATCH('JS')` return the same rows.

```sql
SELECT * FROM articles WHERE MATCH('JavaScript');
```

| id | title |
|----|-------|
| 2 | A Complete Reference for JavaScript |
| 5 | Getting Started with JS |

```sql
SELECT * FROM articles WHERE MATCH('JS');
```

| id | title |
|----|-------|
| 2 | A Complete Reference for JavaScript |
| 5 | Getting Started with JS |


## Benefits Summary

1. **No File Management**: Eliminates the need to create, manage, and clean up external files
2. **Simplified Deployment**: Configuration is part of the table definition, making deployments more straightforward
3. **Permission Independence**: No file system permission issues between web server and Manticore processes
4. **Better for Automation**: Easier to script and automate table creation
5. **Self-Contained and Self-Documenting**: Table configuration is complete in the `CREATE TABLE` statement, and `SHOW CREATE TABLE` shows the full dictionary content inline, so definitions are easy to share and version control without managing separate dictionary files

## Migration Path

If you're currently using external files, you can easily migrate to inline configuration:

1. Read your existing file content
2. Convert the format to use semicolons as separators, escaping any literal semicolons in the entries as `\;`
3. Replace the file path with the `*_list` option in your `CREATE TABLE` statement

For example, suppose you have a `stopwords.txt` file containing:

```
a
the
an
and
```

You can convert it to:

```sql
stopwords_list = 'a; the; an; and'
```
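The conversion steps above can be scripted. The following is a minimal sketch, assuming the stopwords file holds whitespace-separated words (one per line in the example above); `stopwords_text_to_clause` is a hypothetical helper, and it rejects words containing single quotes or backslashes rather than guessing at SQL string escaping.

```python
from __future__ import annotations


def stopwords_text_to_clause(text: str) -> str:
    """Turn stopwords file content into a stopwords_list clause (hypothetical helper)."""
    words = text.split()  # assumption: words are separated by whitespace or newlines
    for word in words:
        # Assumption: refuse characters that would need SQL string escaping.
        if "'" in word or "\\" in word:
            raise ValueError(f"unsupported character in stopword: {word!r}")
    # Literal semicolons are escaped as \; as described in the Escaping section.
    escaped = [w.replace(";", r"\;") for w in words]
    return "stopwords_list = '" + "; ".join(escaped) + "'"


file_content = "a\nthe\nan\nand\n"  # same content as the stopwords.txt example
clause = stopwords_text_to_clause(file_content)
print(clause)  # prints: stopwords_list = 'a; the; an; and'
```

In a real migration you would read `file_content` from the existing file and paste the resulting clause into the `CREATE TABLE` statement in place of the `stopwords` path.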

## Conclusion

The new inline tokenization dictionary configuration options (`stopwords_list`, `exceptions_list`, `wordforms_list`, and `hitless_words_list`) provide a cleaner, more maintainable way to configure tokenization settings. They're particularly valuable in environments where file management is challenging or when you want to simplify your deployment process and keep table definitions self-contained. External files remain supported and are still the better fit for large or shared dictionaries; inline configuration is a convenient alternative for most other use cases.
