How to highlight search results

In this tutorial you will learn how to highlight search results in Manticore Search. Highlighting is useful when you want to make search results easier to read in your application or website.

Highlighting allows you to get snippets of search results with matching keywords highlighted. It helps to improve your application’s search experience.

Introduction


You can highlight keywords in text in Manticore Search using several methods.

  • The CALL SNIPPETS statement returns a list of fragments from documents (called snippets) that contain the matches. It can be used separately from a search query to highlight a string or a list of strings. Here is an example:
    CALL SNIPPETS('my text with keyword', 'index', 'keyword');
  • The SNIPPET() function builds a snippet from the provided data and query, using the specified index settings. It is mostly used in a SELECT statement to highlight a given text, a field value, or text fetched from another source using a UDF (user-defined function). The query to highlight can be the same as in the MATCH clause or a different one. For example:
    SELECT SNIPPET(content,'camera') FROM index WHERE MATCH('camera');
  • The HIGHLIGHT() function, added in Manticore 3.2.2, highlights search results. It makes it easier to highlight keywords when your documents are stored in Manticore rather than only indexed. Here is an example of the call:
    SELECT HIGHLIGHT() FROM index WHERE MATCH('text feature');

The first two, CALL SNIPPETS and SNIPPET(), return a list of document fragments (called snippets) that contain matches for the searched keywords. The last one, HIGHLIGHT(), fetches all available fields from document storage and highlights them against the given query. Unlike SNIPPET(), HIGHLIGHT() supports field syntax in queries.

All three share the same highlighting options, which we’ll discuss in the next steps. In this tutorial, we’ll show examples of using HIGHLIGHT().

Let’s assume you have an index called ‘highlight’ with the following settings:

index highlight
{
        type = rt
        path = highlight
        rt_field = title
        rt_field = content
        rt_attr_uint = gid
        stored_fields = title, content
        index_sp = 1
        html_strip = 1
}

Basic usage


A quick example:

First, add a document:

INSERT INTO highlight(title,content,gid) VALUES('Syntax highlighting','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.',1);

And then run SELECT HIGHLIGHT():

SELECT HIGHLIGHT() AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: Syntax highlighting is a <b>feature</b> of <b>text</b> editors that are used ... , such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in ... of terms.[1] This <b>feature</b> facilitates writing in a structured ... affect the meaning of the <b>text</b> itself; it is intended ...

By default, each matching word is wrapped in the HTML bold tag (<b>), and at most 5 words are picked around each match to form a passage.
Passages are separated with ... by default.

An HTML tag is used by default because snippets are often displayed in HTML content, but you can customise the behaviour with the “before_match”, “after_match”, “around” and “chunk_separator” settings. For example:

SELECT HIGHLIGHT({before_match='*',after_match='*',around=1,chunk_separator='###'}) AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: ### a *feature* of *text*###. The *feature* displays *text*###] This *feature* facilitates### the *text* itself###
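
To make the roles of these four settings more concrete, here is a toy Python sketch that builds passages in a similar spirit. It is not Manticore’s algorithm and will not reproduce its output exactly; the function name toy_highlight and its naive whitespace tokenization are assumptions made purely for illustration.

```python
import string


def toy_highlight(text, keywords, before_match="<b>", after_match="</b>",
                  around=5, chunk_separator=" ... "):
    """Toy passage builder (hypothetical helper, NOT Manticore's algorithm)."""
    words = text.split()
    wanted = {k.lower() for k in keywords}
    # Strip surrounding punctuation so that "text," still matches "text".
    cores = [w.strip(string.punctuation) for w in words]
    hits = [i for i, core in enumerate(cores) if core.lower() in wanted]

    # Build a window of `around` words on each side of every hit,
    # merging windows that touch or overlap into one passage.
    windows = []
    for i in hits:
        start, end = max(0, i - around), min(len(words), i + around + 1)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    hit_set = set(hits)
    passages = []
    for start, end in windows:
        parts = []
        for i in range(start, end):
            if i in hit_set:
                # Wrap only the word itself, leaving punctuation outside the tags.
                marked = before_match + cores[i] + after_match
                parts.append(words[i].replace(cores[i], marked, 1))
            else:
                parts.append(words[i])
        passages.append(" ".join(parts))
    return chunk_separator.join(passages)


print(toy_highlight("one two text three four five six seven feature eight",
                    ["text", "feature"], before_match="*", after_match="*",
                    around=1, chunk_separator="###"))
# -> two *text* three###seven *feature* eight
```

In this sketch, “around” controls how many neighbouring words are kept on each side of a match, and “chunk_separator” is only placed between passages; the real function also places separators where a passage is cut from the surrounding text, as the output above shows.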

Control size of the snippet


By default, the maximum snippet size is 256 characters; it is controlled by the “limit” setting. You can change it like this:

SELECT HIGHLIGHT({limit=10}) AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h:  ...  a <b>feature</b> ...

Another limit that can be changed is the number of words included in the snippet, which is defined by “limit_words”:

SELECT HIGHLIGHT({limit_words=5},'content') AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: ... . The <b>feature</b> displays <b>text</b>, especially ...

It is also possible to limit the number of passages. For example, to get just one passage:

SELECT HIGHLIGHT({limit_passages=1}) AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h:  ...  languages, such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in ...
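
When these options are set from application code, it can be handy to assemble the HIGHLIGHT() expression from a dictionary. Below is a minimal Python sketch; the helper name highlight_expr and its backslash-based quote escaping are assumptions, not part of any Manticore client library, and option names are passed through without checking that Manticore knows them.

```python
def highlight_expr(options=None, field=None):
    """Build a HIGHLIGHT(...) expression string (hypothetical helper)."""

    def fmt(value):
        # Booleans and integers are written as bare numbers, e.g. force_passages=1.
        if isinstance(value, bool):
            return str(int(value))
        if isinstance(value, int):
            return str(value)
        # Everything else becomes a single-quoted string (escaping is an assumption).
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"

    if options is None and field is None:
        return "HIGHLIGHT()"

    pairs = []
    for name, value in (options or {}).items():
        if not name.isidentifier():
            raise ValueError("suspicious option name: %r" % name)
        pairs.append(name + "=" + fmt(value))

    args = ["{" + ",".join(pairs) + "}"]
    if field is not None:
        args.append(fmt(field))
    return "HIGHLIGHT(" + ",".join(args) + ")"


expr = highlight_expr({"limit_words": 5}, "content")
print("SELECT " + expr + " AS h FROM highlight WHERE MATCH('text feature')")
```

Keeping the options in a dictionary makes it easy to adjust “limit”, “limit_words” or “limit_passages” per request without editing query strings by hand.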

By default, the HIGHLIGHT() function returns the passages it finds, joined by the defined separator, within the space allowed by the limit.
Because the limit may not be large enough to hold all passages, we may get only some of them.

To demonstrate this, let’s first add a document with a longer text.

INSERT INTO highlight(title,content) values('wikipedia','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers. Syntax highlighting is a form of secondary notation, since the highlights are not part of the text meaning, but serve to reinforce it. Some editors also integrate syntax highlighting with other features, such as spell-checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in text editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript Syntax highlighting is one strategy to improve the readability and context of the text; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. Syntax highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the text. Brace matching is another important feature of many popular editors. This makes it simple to see if a brace has been left out or locate the match of the brace the cursor is on by highlighting the pair in a different color. 
A study published in the conference PPIG evaluated the effects of syntax highlighting on the comprehension of short programs, finding that the presence of syntax highlighting significantly reduces the time taken for a programmer to internalize the semantics of a program.[2] Additionally, data gathered from an eye-tracker during the study suggested that syntax highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in text editors gedit supports syntax highlighting Some text editors can also export the colored markup in a format that is suitable for printing or for importing into word-processing and other kinds of text-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its syntax highlighting. There are several syntax highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic Syntax Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the text, such as C, LaTeX, HTML, or the text editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs from the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly. 
Syntax elements Most editors with syntax highlighting allow different colors and text styles to be given to dozens of different lexical sub-elements of syntax. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read. ');

Let’s highlight it now:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('syntax')\G
*************************** 1. row ***************************
h: <b>syntax</b> highlighting is a feature of ... language as both structures and <b>syntax</b> errors are visually distinct. Highlighting ...
*************************** 2. row ***************************
h: ... version of its <b>syntax</b> highlighting. There are several <b>syntax</b> highlighting libraries ... highlighted incorrectly. <b>syntax</b> elements Most editors with <b>syntax</b> highlighting allow different ... different lexical sub-elements of <b>syntax</b>. These include keywords, comments, ...

For the newly added document, HIGHLIGHT() doesn’t return all the passages. We can increase the limit to overcome that; the question is by how much. If the value is too high, HIGHLIGHT() returns the full body of the content (with the highlights applied):

SELECT HIGHLIGHT({limit=10000},'content') AS h FROM highlight WHERE MATCH('syntax')\G
*************************** 1. row ***************************
h: <b>syntax</b> highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and <b>syntax</b> errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.
*************************** 2. row ***************************
h: <b>syntax</b> highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and <b>syntax</b> errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers. <b>syntax</b> highlighting is a form of secondary notation, since the highlights are not part of the text meaning, but serve to reinforce it. Some editors also integrate <b>syntax</b> highlighting with other features, such as spell checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in text editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript <b>syntax</b> highlighting is one strategy to improve the readability and context of the text; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. <b>syntax</b> highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the text. Brace matching is another important feature with many popular editors. This makes it simple to see if a brace has been left out or locate the match of the brace the cursor is on by highlighting the pair in a different color. 
A study published in the conference PPIG evaluated the effects of <b>syntax</b> highlighting on the comprehension of short programs, finding that the presence of <b>syntax</b> highlighting significantly reduces the time taken for a programmer to internalise the semantics of a program.[2] Additionally, data gathered from an eye-tracker during the study suggested that <b>syntax</b> highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in text editors gedit supports <b>syntax</b> highlighting Some text editors can also export the coloured markup in a format that is suitable for printing or for importing into word-processing and other kinds of text-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its <b>syntax</b> highlighting. There are several <b>syntax</b> highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic <b>syntax</b> Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the text, such as C, LaTeX, HTML, or the text editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs from the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly. 
<b>syntax</b> elements Most editors with <b>syntax</b> highlighting allow different colors and text styles to be given to dozens of different lexical sub-elements of <b>syntax</b>. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read.

If we want just the passages rather than the whole text highlighted, we need to use the “force_passages” option:

SELECT HIGHLIGHT({limit=10000,force_passages=1},'content') AS h FROM highlight WHERE MATCH('syntax')\G
*************************** 1. row ***************************
h: <b>syntax</b> highlighting is a feature of ... language as both structures and <b>syntax</b> errors are visually distinct. Highlighting ...
*************************** 2. row ***************************
h: <b>syntax</b> highlighting is a feature of ... language as both structures and <b>syntax</b> errors are visually distinct. Highlighting ... intended only for human readers. <b>syntax</b> highlighting is a form of ... it. Some editors also integrate <b>syntax</b> highlighting with other features, such ... (after watch=false) in Javascript <b>syntax</b> highlighting is one strategy to ... what they are looking for. <b>syntax</b> highlighting also helps programmers find ... PPIG evaluated the effects of <b>syntax</b> highlighting on the comprehension of ... , finding that the presence of <b>syntax</b> highlighting significantly reduces the time ... during the study suggested that <b>syntax</b> highlighting enables programmers to pay ... in text editors gedit supports <b>syntax</b> highlighting Some text editors can ... version of its <b>syntax</b> highlighting. There are several <b>syntax</b> highlighting libraries or ... themselves, for example the Generic <b>syntax</b> Highlighter (GeSHi) extension for PHP ... be highlighted incorrectly. <b>syntax</b> elements Most editors with <b>syntax</b> highlighting allow different ... different lexical sub-elements of <b>syntax</b>. These include keywords, comments, control ...

Another possible way to get the whole text with highlights applied is to simply use limit=0:

SELECT HIGHLIGHT({limit=0},'content') AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: Syntax highlighting is a <b>feature</b> of <b>text</b> editors that are used for programming, scripting, or markup languages, such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in different colors and fonts according to the category of terms.[1] This <b>feature</b> facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the <b>text</b> itself; it is intended only for human readers.
*************************** 2. row ***************************
h: Syntax highlighting is a <b>feature</b> of <b>text</b> editors that are used for programming, scripting, or markup languages, such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in different colors and fonts according to the category of terms.[1] This <b>feature</b> facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the <b>text</b> itself; it is intended only for human readers. Syntax highlighting is a form of secondary notation, since the highlights are not part of the <b>text</b> meaning, but serve to reinforce it. Some editors also integrate syntax highlighting with other features, such as spell checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in <b>text</b> editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript Syntax highlighting is one strategy to improve the readability and context of the <b>text</b>; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. Syntax highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the <b>text</b>. Brace matching is another important <b>feature</b> with many popular editors. This makes it simple to see if a brace has been left out or locate the match of the brace the cursor is on by highlighting the pair in a different color. 
A study published in the conference PPIG evaluated the effects of syntax highlighting on the comprehension of short programs, finding that the presence of syntax highlighting significantly reduces the time taken for a programmer to internalise the semantics of a program.[2] Additionally, data gathered from an eye-tracker during the study suggested that syntax highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in <b>text</b> editors gedit supports syntax highlighting Some <b>text</b> editors can also export the coloured markup in a format that is suitable for printing or for importing into word-processing and other kinds of <b>text</b>-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its syntax highlighting. There are several syntax highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic Syntax Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the <b>text</b>, such as C, LaTeX, HTML, or the <b>text</b> editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs from the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly. 
Syntax elements Most editors with syntax highlighting allow different colors and <b>text</b> styles to be given to dozens of different lexical sub-elements of syntax. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read.

HTML Stripping and boundaries


If our index has sentence detection enabled (index_sp = 1 in the settings above), we can configure highlighting so that passages do not cross sentence boundaries. First, let’s look at the default behaviour:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('html text')\G
*************************** 1. row ***************************
h:  ...  highlighting is a feature of <b>text</b> editors that are used for ... markup languages, such as <b>HTML</b>. The feature displays <b>text</b>, especially source code ... affect the meaning of the <b>text</b> itself; it is intended only ...
1 row in set (0.00 sec)

In this example, we see the passage ‘… markup languages, such as HTML. The feature displays text, especially source code …’ which crosses between sentences.

With passage_boundary=sentence this passage will be split into two:

SELECT HIGHLIGHT({passage_boundary='sentence'},'content') AS h FROM highlight WHERE MATCH('html text')\G
*************************** 1. row ***************************
h:  ...  highlighting is a feature of <b>text</b> editors that are used for ... , or markup languages, such as <b>HTML</b>. ... The feature displays <b>text</b>, especially source code, in different ... affect the meaning of the <b>text</b> itself; it is intended only ...
1 row in set (0.05 sec)
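
The effect of a sentence boundary on a single passage can be pictured with a small Python sketch. It only illustrates the idea: the function name split_at_sentences is hypothetical, and the regular expression is a rough stand-in for the sentence detection that Manticore performs when index_sp is enabled.

```python
import re


def split_at_sentences(passage, chunk_separator=" ... "):
    """Cut a passage wherever a sentence ends (toy illustration only)."""
    # A sentence is assumed to end at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    return chunk_separator.join(sentences)


passage = ("markup languages, such as <b>HTML</b>. "
           "The feature displays <b>text</b>, especially source code")
print(split_at_sentences(passage))
```

The single passage that crossed the sentence boundary becomes two passages joined by the separator, which mirrors what passage_boundary='sentence' does in the query above.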

Let’s add a document with HTML content.

INSERT INTO highlight(title,content) values('html content','The ideas of syntax highlighting overlap significantly with those of <a title="Structure editor" href="/wiki/Structure_editor">syntax-directed editors</a> One of the first such class of editors for code was Wilfred Hansens 1969 code editor, Emily.<sup id="cite_ref-hansen_3-0" class="reference"><a href="#cite_note-hansen-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> It provided advanced language-independent <a title="Autocomplete" href="/wiki/Autocomplete">code completion</a> facilities, and unlike modern editors with syntax highlighting, actually made it impossible to create syntactically incorrect programs.');

By default, highlighting processes HTML content according to the index settings. If HTML stripping is enabled in the index (html_strip = 1), the HIGHLIGHT() result will also have the HTML stripped:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('code class')\G
*************************** 1. row ***************************
h:  ...  of the first such <b>class</b> of editors for code was Wilfred Hansens ... 1969 <b>code</b> editor, Emily.[3][4 ... ] It provided advanced language-independent <b>code</b> completion facilities, and unlike modern ...

If we want the highlight to include the HTML tags as well, we need to set “html_strip_mode=none”:

SELECT HIGHLIGHT({html_strip_mode='none'},'content') AS h FROM highlight WHERE MATCH('code class')\G
*************************** 1. row ***************************
h:  ... the first such <b>class</b> of editors for <b>code</b> was Wilfred ... 1969 <b>code</b> editor, Emily. <sup id="cite_ref-hansen_3-0" style="background: #EBE909; color: #000000;">class=" ... sup id="cite_ref-4" <b>class</b>="reference"><a title="Autocomplete" href="#cite_note- ... ="><b>code</b> completion facilities, and ...
1 row in set (0.05 sec)

Please note that html_strip_mode=none can highlight words that are part of the HTML markup itself, like ‘class’.
To protect the HTML markup, the retain mode can be used, but it requires that the snippet has no size limit (limit=0):

SELECT HIGHLIGHT({html_strip_mode='retain',limit=0},'content') AS h FROM highlight WHERE MATCH('code class')\G
*************************** 1. row ***************************
h: <p>The ideas of syntax highlighting overlap significantly with those of <a title="Structure editor" href="/wiki/Structure_editor">syntax-directed editors</a> One of the first such <b>class</b> of editors for <b>code</b> was Wilfred Hansens 1969 <b>code</b> editor, Emily.<sup id="cite_ref-hansen_3-0" class="reference"><a href="#cite_note-hansen-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> It provided advanced language-independent <a title="Autocomplete" href="/wiki/Autocomplete"><b>code</b> completion</a> facilities, and unlike modern editors with syntax highlighting, actually made it impossible to create syntactically incorrect programs</p>
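
The difference between the three ways of treating HTML can be illustrated with a short Python sketch built on the standard html.parser module. It is a conceptual illustration, not Manticore’s implementation; the names mark, highlight_stripped and highlight_retained are hypothetical.

```python
import re
from html.parser import HTMLParser


def mark(text, word):
    """Wrap whole-word occurrences of `word` in <b> tags, ignoring case."""
    return re.sub(rf"\b({re.escape(word)})\b", r"<b>\1</b>", text,
                  flags=re.IGNORECASE)


class _Rewriter(HTMLParser):
    """Re-emits a document, highlighting text nodes only."""

    def __init__(self, word, keep_tags):
        super().__init__(convert_charrefs=False)
        self.word, self.keep_tags, self.out = word, keep_tags, []

    def handle_starttag(self, tag, attrs):
        if self.keep_tags:
            self.out.append(self.get_starttag_text())  # tag copied verbatim

    def handle_startendtag(self, tag, attrs):
        if self.keep_tags:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.keep_tags:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(mark(data, self.word))  # only text is highlighted

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def _rewrite(html, word, keep_tags):
    parser = _Rewriter(word, keep_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def highlight_stripped(html, word):
    """Drop the tags, then highlight (similar in spirit to the stripped result)."""
    return _rewrite(html, word, keep_tags=False)


def highlight_retained(html, word):
    """Keep the tags intact and highlight only the text between them."""
    return _rewrite(html, word, keep_tags=True)


html = 'such <sup class="reference">class</sup> of editors'
print(mark(html, "class"))                # raw markup: the attribute name is hit too
print(highlight_stripped(html, "class"))  # tags removed before highlighting
print(highlight_retained(html, "class"))  # tags kept, attribute left untouched
```

Highlighting the raw markup touches the class attribute, which is the problem described above for html_strip_mode=none, while the tag-aware variants only touch the visible text.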

This tutorial has explained how to highlight search results in Manticore Search using the HIGHLIGHT() function.

Interactive course

This blog post is also available in the form of an interactive course, which features a command line that lets you try the above examples yourself.

Install Manticore Search
