Percolate Queries

One of the most requested features is now available in beta in the latest code: percolate queries.

Percolate queries are also known as persistent queries, prospective search, document routing, search in reverse, or inverse search.

The usual way of searching is to store the documents we want to search and run queries against them. However, there are cases when we want to apply a query to an incoming document and signal a match. For example, a monitoring system doesn't just collect data; it also needs to notify the user of various events, such as a metric crossing a threshold or a certain value appearing in the monitored data. Another similar case is news aggregation: you could notify the user about every fresh piece of news, but they may want to be notified only for certain categories or topics. Going further, they might be interested only in certain "keywords".

This is where traditional search is not a good fit, since it would mean running each desired query over the entire collection. Multiply that by the number of users and you end up with many queries running over the whole collection, which can add a lot of extra load. To overcome this, the idea is to store the queries instead and test them against each incoming document or batch of documents.

A workaround was possible by using a Real-Time index into which only new documents were inserted; the queries were tested against it, and the index was then truncated. But this was far from an ideal solution. Finally, we are adding this functionality to Manticore Search (and thus to anyone upgrading from Sphinx Search).

For storing the queries, a new index type is used, based on the Real-Time index. The percolate index allows adding, viewing, and deleting queries.

The matching of queries is done with a new statement that takes a document or list of documents as input and returns the queries that match.

Please note that the 'percolate' index is not a replacement for an existing index; it doesn't store documents. In a typical use case, documents are inserted into a regular index just as before. The percolate queries can be executed after the insert, either by the same worker or by a separate worker running independently of the insert worker (for example, run by a cron job or triggered by the inserts). The latter is preferable so that percolate queries don't delay the insert confirmation to the end user, but it depends on the workflow of the application.
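As a sketch of that workflow, assuming a hypothetical regular RT index named 'articles' alongside a percolate index named 'pq', the insert and the percolate check can be two separate statements:

```sql
-- store the document in the regular index, as before
INSERT INTO articles(id, title, content)
VALUES(100, 'breaking news', 'catch me if you can');

-- separately (same worker or a background job), test the stored
-- percolate queries against the same document text
CALL PQ('pq', 'catch me if you can', 0 AS docs_json);
```

The CALL PQ step could equally run over a batch of recently inserted documents, depending on how the application schedules it.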

Storing Percolate Queries

As stated above, Percolate Queries are stored in a special Real-Time index defined as percolate type.

A simple PQ index requires just defining the type and path:

index pq {
    type = percolate
    path = /var/lib/sphinxsearch/pq
}

To make it available, just issue a RELOAD INDEXES command:

MySQL [(none)]> RELOAD INDEXES; SHOW TABLES LIKE 'pq';
Query OK, 0 rows affected (0.00 sec)

+-------+-----------+
| Index | Type      |
+-------+-----------+
| pq    | percolate |
+-------+-----------+
1 row in set (0.00 sec)

The document ID supports auto-increment: when doing inserts, you don't need to generate the ID as you would for regular RT indexes. Inserting a new query requires the mandatory 'query' text field, which contains the full-text search expression.

MySQL [(none)]> INSERT INTO pq(query) VALUES('catch me');
Query OK, 1 rows affected (0.00 sec)

MySQL [(none)]> SELECT * FROM pq;
+------+----------+------+---------+
| UID  | Query    | Tags | Filters |
+------+----------+------+---------+
|    1 | catch me |      |         |
+------+----------+------+---------+
1 row in set (0.00 sec)

There are two optional attributes with special usage.

The first is 'tags', which holds a bag of strings (think of a string MVA). Tags can be used to filter the PQs in SELECT or DELETE statements; they play no role when performing percolate searches.

MySQL [(none)]> insert into pq(query,tags) values('catch me if','tag1');
Query OK, 1 rows affected (0.00 sec)

MySQL [(none)]> SELECT * FROM pq WHERE tags='tag1';
+------+-------------+------+---------+
| UID  | Query       | Tags | Filters |
+------+-------------+------+---------+
|    2 | catch me if | tag1 |         |
+------+-------------+------+---------+
1 row in set (0.00 sec)
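Tags work the same way for deletions. For example, removing all stored queries carrying a given tag (a sketch, assuming the index state above):

```sql
-- delete every stored percolate query tagged 'tag1'
DELETE FROM pq WHERE tags='tag1';
```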

The second is 'filters', which contains additional query rules: attribute filters expressed in SphinxQL format. The attributes used here must be declared in the index configuration, just like attributes in an RT index. To enable a new attribute, we can simply use ALTER RTINDEX ... RECONFIGURE.

index pq {
    type = percolate
    path = /var/lib/sphinxsearch/pq
    rt_attr_uint = catId
}

MySQL [(none)]> ALTER RTINDEX pq RECONFIGURE;
Query OK, 0 rows affected (0.00 sec)

MySQL [(none)]> INSERT INTO pq(query,tags,filters) VALUES('catch me','tag2,tag1','catId=10');
Query OK, 1 rows affected (0.00 sec)

MySQL [(none)]> select * from pq;
+------+-------------+-----------+-----------+
| UID  | Query       | Tags      | Filters   |
+------+-------------+-----------+-----------+
|    3 | catch me    |           |           |
|    4 | catch me if | tag1      |           |
|    5 | catch me    | tag2,tag1 |  catid=10 |
+------+-------------+-----------+-----------+
3 rows in set (0.00 sec)

If we want to enable per-field text searches, we need to define the full-text fields in the index configuration and issue a reconfigure.

index pq {
    type = percolate
    path = /var/lib/sphinxsearch/pq
    rt_field = subject
    rt_field = content
    rt_attr_uint = catId
}

MySQL [(none)]>  ALTER RTINDEX pq RECONFIGURE;
Query OK, 0 rows affected (0.00 sec)

MySQL [(none)]> INSERT INTO pq(query,tags,filters) VALUES('@subject match by field','tag2,tag3','catId=10');
Query OK, 1 rows affected (0.00 sec)

MySQL [(none)]> INSERT INTO pq(query,tags,filters) VALUES('@subject  match by field','tag2,tag3','catId=20');
Query OK, 1 rows affected (0.00 sec)

Performing Percolate Queries

The CALL PQ function runs against a single percolate index (for now), given as its first parameter. The second, mandatory parameter contains a document or list of documents. The document(s) can be plain text or JSON objects. A JSON object allows passing multiple text fields, enabling per-field matches, as well as attributes that are tested against the expressions defined in the 'filters' attribute of the percolate queries. This means a stored query can be more than a full-text match expression: attribute filters allow more complex matching (think of someone searching for specific keywords who wants to be notified only for certain types or categories of articles).

By default, CALL PQ expects documents in JSON format, so to perform the most basic example we need to pass '0 AS docs_json'. Also, the default result set contains only the query ID, but we can use the 'query' option to show all the stored information.

MySQL [(none)]> CALL PQ('pq','catch me',0 AS docs_json,1 AS query);
+------+----------+------+---------+
| UID  | Query    | Tags | Filters |
+------+----------+------+---------+
|    6 | catch me |      |         |
+------+----------+------+---------+
1 row in set (0.00 sec)

To perform per-field matching or additional attribute filtering, we need to pass the documents as JSON objects.

MySQL [(none)]> CALL PQ('pq','{"subject":"expect to match by field","content":"here is some content","catid":20}',1 AS query);
+------+--------------------------+-----------+-----------+
| UID  | Query                    | Tags      | Filters   |
+------+--------------------------+-----------+-----------+
|   10 | @subject  match by field | tag2,tag3 |  catid=20 |
+------+--------------------------+-----------+-----------+
1 row in set (0.00 sec)

CALL PQ has two more runtime options. The 'docs' option is useful when you pass more than one document, as it tells you which documents matched each query (numbered starting from 1).

MySQL [(none)]> CALL PQ('pq',('catch me if can','catch me'),0 AS docs_json,1 AS docs,1 AS verbose);
+------+-----------+-------------+------+---------+
| UID  | Documents | Query       | Tags | Filters |
+------+-----------+-------------+------+---------+
|    6 | 1,2       | catch me    |      |         |
|    7 | 1         | catch me if | tag1 |         |
+------+-----------+-------------+------+---------+
2 rows in set (0.00 sec)

After running a CALL PQ, SHOW META provides information about the executed function, such as execution time, the number of queries and documents matched, and the total number of stored queries.
The other runtime option is 'verbose': when it is set in a CALL PQ, extra information is displayed, such as the time spent on each query.

MySQL [(none)]> SHOW META;
+-------------------------+-----------+
| Name                    | Value     |
+-------------------------+-----------+
| Total                   | 0.000 sec |
| Setup                   | 0.000 sec |
| Queries matched         | 2         |
| Document matches        | 2         |
| Total queries stored    | 6         |
| Term only queries       | 6         |
| Fast rejected queries   | 3         |
| Time per query          | 32, 21    |
| Time of matched queries | 53        |
+-------------------------+-----------+
9 rows in set (0.00 sec)

Feedback

As this is a new feature that opens the path to new use cases for Manticore Search, we welcome everyone to test it and give us any kind of feedback. Meanwhile, we are continuing to work on it; a few things remain before we can call it production-ready. We'll appreciate any help with testing.
