Manticore Search Indexes and document storage

Starting with version 3.2.0, Manticore Search introduces a new feature: document storage.

Historically, Manticore Search was a text search engine that indexes texts but doesn’t keep the originals. A text is processed and transformed from a plain string into special structures that form the full-text index and allow fast text searching. In the result set, the user didn’t receive the original text, as recomposing it from the full-text index can be a complex process. In addition, indexing settings may include clauses that prevent certain words from being indexed (such as minimum word length or stopwords), which can make the reconstruction impossible.

Manticore Search Document Storage


A typical search workflow performs the full-text search in Manticore and then, based on the result set, performs an additional lookup in the original data source (either a database or files) to retrieve the content that was only indexed in Manticore. In some cases, the additional lookup is needed to retrieve not just the ‘missing’ texts but other data not included in the index because it’s not used in search operations and would only increase the index size unnecessarily. But in many cases the only purpose of the lookup is to get the ‘missing’ texts, and this extra step would be good to eliminate: removing it simplifies the application code and avoids extra load, resource consumption and another point of failure in the data storage.

There are also scenarios where the original data does not reside in a traditional database and retrieving the original texts at query time can be slow. Having to store this data in a database may be unwanted if the only reason for using the database is to compensate for the search engine’s inability to store the original content. Such scenarios include:

  • a company may want to search texts in office documents, like Word files or PDFs – these can be part of internal tools or a portal offering scientific, legal or technical papers
  • data coming from a web service or from crawled pages
  • streaming data such as logs, messages or emails – reading these from their sources can be difficult, as in many cases they are stored in distributed systems, and reading them at query time complicates the application code

You can run through our interactive course to learn how to not only index documents but also store their content.

Now it’s possible to store the original texts and retrieve them back in the results.

We should mention that the Document Storage feature doesn’t make Manticore a traditional database. Manticore supports transactions and binlog recovery, and while it offers most ACID properties, it’s not considered a fully ACID-compliant database.

There is also no encryption of the storage at this point, so sensitive or critical data, like user credentials or payment transactions, is not suitable for storing in Manticore. Also, some features of traditional databases, like JOINs, are not yet supported, but we will work on them in future versions.

Still, there are many cases where using Manticore to store the original data and skip the additional lookups makes a lot of sense.

The document storage can reduce the workload on the original data warehouse or reduce space requirements.
Another aspect to consider, at the time of writing, is the read performance of the document storage, which we expect to improve greatly in the next releases.

Usage

Storage is enabled with the stored_fields index directive, which accepts a comma-separated list of field names.

index myindex {
   type           = rt
   path           = /path/to/myindex
   rt_field       = title
   rt_field       = short_description
   rt_field       = long_description
   rt_attr_string = title
   rt_attr_uint   = group_id
   stored_fields  = short_description, long_description
}

Without stored fields a result would look like:

mysql> SELECT * FROM myindex WHERE ....\G
*************************** 1. row ***************************
      id: 11
   title: A title goes here
group_id: 100
*************************** 2. row ***************************
...

With stored fields:

mysql> SELECT * FROM myindex WHERE ....\G
*************************** 1. row ***************************
               id: 11
            title: A title goes here
short_description: A short text description
goes here
 long_description: A long text description
can be found here
         group_id: 100
*************************** 2. row ***************************
...

A showcase between an index with and without document storage can be seen in our course on docstore.

Advanced tuning

The texts are stored in a separate file on disk. By default, they are compressed with the lz4 algorithm; the docstore_compression index directive can switch to the alternative lz4hc algorithm or disable compression (‘none’). For lz4hc, the compression level can be tuned with the docstore_compression_level index directive, which takes values from 1 to 12 (default is 9). Texts are compressed on disk in blocks. The block size can be tuned with the docstore_block_size directive; the default value is 16k. Increasing the block size improves the compression ratio and saves more disk space, while decreasing it should improve the access speed.
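For example, an index section tuning all three directives could look like this (the values here are illustrative, not recommendations):

index myindex {
   type                       = rt
   path                       = /path/to/myindex
   rt_field                   = long_description
   stored_fields              = long_description
   docstore_compression       = lz4hc
   docstore_compression_level = 12
   docstore_block_size        = 32k
}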

When a search requires a text from the storage, the block containing it is read from disk and uncompressed. Since a block can contain texts from multiple documents, the uncompressed blocks can be cached in memory. By default, a daemon-wide 16MB cache is used; its size can be adjusted with searchd’s docstore_cache_size setting.
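Since the cache is daemon-wide, the setting goes into the searchd section of the configuration rather than into an index definition. For example, to raise it from the default 16MB to 256MB (the listen line is just a placeholder for your existing settings):

searchd {
   listen              = 9306:mysql41
   docstore_cache_size = 256m
}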

Differences from string attributes

Manticore Search already features a data type – the string attribute – that can store texts and return them in result sets, and can also be used for comparison, regex filtering, sorting and aggregation operations. Moreover, in the case of Real-Time indexes it’s possible to use the same name for a full-text field and a string attribute. This raises the question of what purpose the stored fields serve.

String attributes are stored in the blob component of a Manticore Search 3 index, which also holds JSON and MVA attributes. Reading from the blob component is managed with mmap(), and the component can be left on disk and read as needed, read and loaded entirely at index load, or read, loaded and locked in memory. Texts in string attributes are stored uncompressed. If the texts stored this way are long (like descriptions or article contents), they can end up using a lot of disk space and will also increase the memory used by the blob component. For an index that previously did not fit entirely in RAM, adding long texts as string attributes means fewer variable-length attributes (string, JSON, MVA) can be cached in RAM, leading to worse performance. Locking the blob component in memory (access_blob_attrs=mlock) might no longer be possible due to the increased size. String attributes should be used for short texts on which you expect to perform non-full-text filtering, grouping and sorting operations, besides just retrieval.
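As a sketch, the blob access mode mentioned above is set per index with the access_blob_attrs directive (the index name, path and fields here are placeholders):

index myindex {
   type              = rt
   path              = /path/to/myindex
   rt_field          = title
   rt_attr_string    = title
   access_blob_attrs = mlock
}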

On the other hand, stored fields are a separate component. Texts are only read from disk, and if RAM is available, docstore_cache_size can be set to high values (gigabytes) to cache stored fields in RAM. Also, stored fields are compressed by default, using less storage space than string attributes. There are also cases where texts need to be stored in the index but are only used in the output and never for filtering/sorting/grouping. From a resource management point of view, keeping them in the blob is not optimal. Examples of such strings are product codes or UUIDs (which are supposed to be unique, so sorting on them doesn’t make much sense). Moving them to stored fields makes the blob smaller, meaning less RAM used, faster loading at startup and, depending on the situation, the possibility of mlocking. One ‘problem’ that can arise here is that the new fields can play a role in full-text matches that don’t search in specific fields, so it might be a good idea to use the ignore field search operator to exclude them from getting searched.
