Searching in Vietnamese with Manticore Search

Manticore Search provides basic support for Vietnamese language search. Vietnamese uses the Latin script with diacritics (accent marks), and Manticore folds these diacritics to their base letters automatically, so text matches whether or not it is accented. This covers the fundamentals, but advanced features such as Vietnamese-specific stemming or stopwords are not included.

Quick Start

Vietnamese search works with the default configuration, so no special setup is required. Simply create a table with charset_table='non_cont', which is also the default:

CREATE TABLE vietnamese_content (
    id bigint,
    title text,
    content text
) charset_table='non_cont';
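
To confirm which settings a table was actually created with, Manticore's SHOW TABLE ... SETTINGS statement can be used. A minimal check, assuming the vietnamese_content table above has just been created:

```sql
-- Show the settings stored for the table; charset_table should be among them.
SHOW TABLE vietnamese_content SETTINGS;
```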

Automatic Diacritics Handling

One of the key features of Manticore's Vietnamese support is automatic diacritics normalization. This means users can search with or without diacritics, and both will match:

  • "tiếng" matches "tieng" and vice versa
  • "Hà Nội" matches "Ha Noi" and vice versa
  • "đường phố" matches "duong pho" and vice versa

This is particularly useful because:

  • Users often type without diacritics for speed
  • Different input methods may produce different diacritic combinations
  • It provides a better user experience by being forgiving of typing variations

Example Usage

Let's create a simple example with Vietnamese content:

-- Create table
CREATE TABLE vietnamese_news (
    id bigint,
    title text,
    body text
) charset_table='non_cont';

-- Insert some Vietnamese content
INSERT INTO vietnamese_news VALUES
(1, 'Tin tức về Hà Nội', 'Hà Nội là thủ đô của Việt Nam'),
(2, 'Tin tuc ve Ha Noi', 'Ha Noi la thu do cua Viet Nam'),
(3, 'Tiếng Việt trong công nghệ', 'Công nghệ thông tin phát triển nhanh'),
(4, 'Tieng Viet trong cong nghe', 'Cong nghe thong tin phat trien nhanh');

Now you can search using either form:

-- Search with diacritics
SELECT * FROM vietnamese_news WHERE MATCH('Hà Nội');

Result:

+------+--------------------------+-------------------------------------------+
| id   | title                    | body                                      |
+------+--------------------------+-------------------------------------------+
|    1 | Tin tức về Hà Nội        | Hà Nội là thủ đô của Việt Nam             |
|    2 | Tin tuc ve Ha Noi        | Ha Noi la thu do cua Viet Nam             |
+------+--------------------------+-------------------------------------------+

-- Search without diacritics - still matches!
SELECT * FROM vietnamese_news WHERE MATCH('Ha Noi');

Result:

+------+--------------------------+-------------------------------------------+
| id   | title                    | body                                      |
+------+--------------------------+-------------------------------------------+
|    1 | Tin tức về Hà Nội        | Hà Nội là thủ đô của Việt Nam             |
|    2 | Tin tuc ve Ha Noi        | Ha Noi la thu do cua Viet Nam             |
+------+--------------------------+-------------------------------------------+

As you can see, both queries return the same results, demonstrating that Manticore automatically handles diacritics normalization.
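
The same comparison can be repeated for the second pair of rows. This sketch assumes the vietnamese_news table and the four rows inserted above. If diacritics are folded as described, both statements should return ids 3 and 4:

```sql
-- Accented query
SELECT id, title FROM vietnamese_news WHERE MATCH('công nghệ') ORDER BY id ASC;

-- Unaccented query: expected to return the same ids as the one above
SELECT id, title FROM vietnamese_news WHERE MATCH('cong nghe') ORDER BY id ASC;
```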

How It Works

Manticore's non_cont charset table (which is the default) includes comprehensive mappings for Vietnamese diacritics:

  • ă, â → a
  • ê → e
  • ô, ơ → o
  • ư → u
  • đ → d
  • All tone marks (à, á, ả, ã, ạ, etc.) are also normalized

These mappings are applied during both indexing and searching, ensuring consistent matching regardless of how the text is entered.
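
One way to observe these mappings is CALL KEYWORDS, which reports how a string is tokenized and normalized under a given table's settings. A minimal sketch, assuming the vietnamese_news table from the example above exists:

```sql
-- Ask Manticore how it would tokenize and normalize this text for the table.
-- The result has one row per token, with `tokenized` and `normalized` columns.
CALL KEYWORDS('đường phố Hà Nội', 'vietnamese_news');
```

If the folding described above is in effect, the normalized column should show the tokens without diacritics.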

Limitations

Manticore's Vietnamese support is basic. Here is what it does and does not include:

  • Diacritics normalization - Works automatically
  • Basic tokenization - Handles Vietnamese characters correctly
  • No Vietnamese stemming - Word forms are not normalized
  • No Vietnamese stopwords - Common words like "của", "và", "là" are not filtered by default
  • No advanced morphology - No dictionary-based lemmatization

For many applications, the basic diacritics handling is sufficient. However, if you need more sophisticated Vietnamese language processing (like handling word variations or filtering stopwords), you may need to implement custom solutions or use external preprocessing.

Best Practices

  1. Use the default charset_table: No need to customize unless you have specific requirements
  2. Create custom stopwords if needed: You can create a stopwords file to filter common Vietnamese words like "của", "và", "là", "với", etc.
  3. Test your queries: Always test with both diacritic and non-diacritic forms to ensure your search works as expected
  4. Consider preprocessing: For better results, you might want to normalize text before indexing (e.g., handling word variations)
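
Item 2 can be sketched with the stopwords table setting. The table name and file path below are hypothetical, and the file is assumed to be UTF-8 plain text containing the words to ignore:

```sql
-- Hypothetical file /etc/manticore/stopwords-vi.txt containing, for example:
--   của và là với
CREATE TABLE vietnamese_news_filtered (
    id bigint,
    title text,
    body text
) charset_table='non_cont' stopwords='/etc/manticore/stopwords-vi.txt';
```

Because the stopwords file is tokenized with the table's charset_table, an entry such as "là" would be expected to also suppress the unaccented token "la". It is worth reviewing the list for words that become ambiguous once diacritics are removed.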

Advanced Configuration

If you need case-sensitive matching or want to preserve diacritics as distinct characters, you can customize the charset_table. However, for most use cases, the default behavior (matching with and without diacritics) provides the best user experience.
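
As a rough sketch of such a customization, the table below keeps "đ" distinct from "d" by appending an explicit mapping after the non_cont alias. It assumes that mappings listed after an alias take precedence over the alias's own folding for those characters, which should be verified against the Manticore documentation. The table name is hypothetical:

```sql
-- U+0111 is "đ" and U+0110 is "Đ": keep the letter as its own character
-- and fold only its case, instead of folding it to "d".
CREATE TABLE vietnamese_strict (
    id bigint,
    title text,
    content text
) charset_table='non_cont, U+0111, U+0110->U+0111';
```

With a table like this, a search for "duong" would no longer be expected to match "đường". Extending the idea to the accented vowels means listing each code point the same way.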

Conclusion

Manticore Search provides basic Vietnamese language support that handles the most common use case: matching text with and without diacritics. This makes search more forgiving when users type without accents. For simple search applications, this may be sufficient. However, for applications requiring advanced Vietnamese language processing (stemming, morphological analysis, or sophisticated relevance ranking), you may need to supplement Manticore with additional preprocessing or consider other solutions.

Install Manticore Search
