Relevance scoring in Manticore : part II

In the second part about relevance scoring we talk about how positions can be used for matching and scoring.

Having knowledge of positions of the words in the field is important and can provide better relevancy. Positions allow a wide range of text operators that can perform matches either by position of a word relative to the field or by calculating distance between the found keywords within the field or compared with the input query. One of the most used operators based on positions if the phrase operator – "A B". Phrase matches are restrictive as they enforce the keywords to be match as specified in the query. This means the keywords must be adjacents and in the same order as in the query.

Knowing the position of a word we can perform matches to find if there are fields starting or ending with specific words using the start (^) and end($) operators. The search within a field can also by limited to provide a match only if the word is found among the first N words of the field. This can be achieved by specifying the limit after the field name operator – @myfield [10] word. We can also make sure our input words are found in a specific order using the order operator – word1 << word2 << word3 or if the input words are found in field in a given proximity using the proximity operator: "word1 word2"~10.

The proximity match allows to have no more than N words between the matched keywords in a field. For example "A B C"~4 matches "A D E B F C", but not  "A D E B F G C".

For a less restrictive proximity the NEAR operator can be used. Unlike proximity operator which work with a set of keywords, NEAR works with two operands, which can be words or other subexpressions. For example we can do "A B" NEAR/2 C –  a match is valid if between phrase "A B" and C word there are maximum 2 words. Similar a NOTNEAR operator exists which match only when the operands (words or subexpressions) have a minimum number of words between them.

Several relevance metrics are using word positions to calculate a distance between the query and document.  LCS (Longest Common Subsequence) gives a length of maximum verbatim match between the document and the query. This means we try to find  if we have subsequences of the input  query in our documents. The maximum value LCS is the number of words in the query if we have an exact match of it. Subsequences will provide  inferior LCS values depend of their length and the lowest value – 1 – is when the words are found but don’t form any subsequence greater than 1 word. One thing to consider is that a subsequence doesn’t need to be formed by adjacent matched words. For example if we have A B C at input and in the document we find A D C, this subsequence will receive score 2 because A and C are found in order and in same position as in the query:

mysql> SELECT *,WEIGHT() FROM testrt WHERE MATCH('hello world program') OPTION ranker=expr('top(lcs)');
+------+--------------------------+----------------------------+----------+
| id   | title                    | content                    | weight() |
+------+--------------------------+----------------------------+----------+
|    6 | hello world program      | just some content          |        3 |
|    4 | hello test program       | just some world content    |        2 |
|    5 | hello test world program | just some content          |        2 |
|    9 | hello world              | just program world content |        2 |
|    7 | hello test world         | just program some content  |        1 |
|    8 | test program hello       | just some world content    |        1 |
+------+--------------------------+----------------------------+----------+
6 rows in set (0.00 sec)

The document  4 has LCS on title with value 2 because our query is hello world program and the subsequence is hello test program -even if we have a word that is not on the match list, the  hello and program  words are on their positions (1 and 3) and between.

The document 7 with hello test world while it has 2 of the words (hello and world) in order, their position is 1 and 3 while in the query the positions and 1 and 2.  As the positions don’t match, we can’t talk about the 2 matched words to form a subsequence.

A more restrictive version of LCS is LCCS (Longest Common Contiguous Subsequence). LCCS   scores only the subsequences formed by adjacent words.  To illustrate the difference, let’s take the previous example, but now with lccs:

mysql> select *,weight() from testrt where match('hello world program') option ranker=expr('top(lccs)');
+------+--------------------------+----------------------------+----------+
| id   | title                    | content                    | weight() |
+------+--------------------------+----------------------------+----------+
|    6 | hello  world program     | just some content          |        3 |
|    5 | hello test world program | just some content          |        2 |
|    9 | hello world              | just program world content |        2 |
|    4 | hello test program       | just some world content    |        1 |
|    7 | hello test world         | just program some content  |        1 |
|    8 | test program hello       | just some world content    |        1 |
+------+--------------------------+----------------------------+----------+
6 rows in set (0.00 sec)

Now the document 4 which had LCS=2 it has LCCS=1 as while it has a subsequence and words are in same position, the words are not adjacent.
Both LCS and LCCS count each keyword as 1 in the formula, without considering any information about importance (how rare or common) of the word. A variation of LCCS called WLCCS or Weighted Longest Common Contiguous Subsequence can sum the IDFs of the keywords instead of just a simple count. Unlike LCS and LCCS, WLCCS take float values since IDFs are expressed as float and the output depends on the data itself (as IDF is a measure relative to the current index). WLCCS will give better values to subsequences that include rare words and can rank higher a document that otherwise using lcs or lccs would score lower.

We talked earlier about an operator that can match only when  keywords  starts the field. Position of the first occurence can also be used in calculus of relevance as this denotes the keyword has a greater importance for that field if appears earlier in the field or even better, at the start of it.  min_hit_pos tell the position of the first matched keyword. In ranking expressions we can add a check for min_hit_pos==1  to check if the first matched keywords occurs at the start of fields.

The earliest position can also be applied to subsequences. Imagine that we have multiple documents which gives us subsequences that match the query (partial of full) and we want to know for which documents the subsequences appear earlier in the fields. We can use min_best_span_pos for that, which gives us the first occurence of the best subsequence (calculated as LCS) in a field.

Leave a Reply