In this tutorial you will learn how to highlight search results in Manticore Search. Highlighting is useful when you want to make search results easier to read in your application or on your website.
Highlighting lets you retrieve fragments of the search results with the matching keywords marked up, which improves the search experience in your application.
Introduction
Manticore Search offers several ways to highlight keywords in text.
- The CALL SNIPPETS statement returns a list of document fragments (called snippets) that contain matches. It can be used independently of a search query to highlight a string or a list of strings. Here is an example:
CALL SNIPPETS('my text with keyword', 'index', 'keyword');
- The SNIPPET() function builds a snippet from the supplied data and query, using the settings of the given index. It is mainly used in a SELECT statement to highlight a given text, a field value, or text fetched from another source with a UDF (user-defined function). The query you highlight can be the same as the one in the MATCH clause or a different one; that is up to you. For example:
SELECT SNIPPET(content,'camera') FROM index WHERE MATCH('camera');
- The HIGHLIGHT() function can be used to highlight search results. It was added in Manticore 3.2.2 and makes it easier to highlight keywords in your documents when you store them in Manticore rather than only indexing them. Here is an example call:
SELECT HIGHLIGHT() FROM index WHERE MATCH('text feature');
The first two, CALL SNIPPETS and SNIPPET(), return a list of document parts (called snippets) that contain matches for the search keyword. The last one, HIGHLIGHT(), fetches all available fields from document storage and highlights them against the given query. Unlike SNIPPET(), HIGHLIGHT() supports field syntax in queries.
All three share the same highlighting options, which we discuss in the following steps. The examples in this tutorial use HIGHLIGHT().
Suppose you have an index named 'highlight' with the following settings:
index highlight
{
type = rt
path = highlight
rt_field = title
rt_field = content
rt_attr_uint = gid
stored_fields = title, content
index_sp = 1
html_strip = 1
}
Basic usage
A quick example:
First, add a document:
INSERT INTO highlight(title,content,gid) VALUES('Syntax highlighting','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.',1);
Then run SELECT HIGHLIGHT():
SELECT HIGHLIGHT() AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: Syntax highlighting is a <b>feature</b> of <b>text</b> editors that are used ... , such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in ... of terms.[1] This <b>feature</b> facilitates writing in a structured ... affect the meaning of the <b>text</b> itself; it is intended ...
By default, every matching word is wrapped in a bold tag (<b>), and up to 5 words around each match are selected to form a passage.
By default, passages are separated by ' ... '.
An HTML tag is used to mark matches because snippets are most often displayed in HTML content, but you can change this behavior with the "before_match", "after_match", "around", and "chunk_separator" options. For example:
SELECT HIGHLIGHT({before_match='*',after_match='*',around=1,chunk_separator='###'}) AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: ### a *feature* of *text*###. The *feature* displays *text*###] This *feature* facilitates### the *text* itself###
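To build intuition for how these four options interact, here is a toy Python sketch. It is not Manticore's implementation: the real engine also applies the index tokenization settings, size limits, and its own passage selection, so its output differs (compare the result above). The name `toy_highlight` and the whitespace-based word splitting are assumptions made purely for illustration.

```python
import re

# A word with optional leading/trailing punctuation, e.g. "text," or "(HTML)".
TOKEN = re.compile(r"(\W*)(\w+)(\W*)")

def toy_highlight(text, keywords, before_match="<b>", after_match="</b>",
                  around=5, chunk_separator=" ... "):
    """Toy approximation of passage building; NOT Manticore's algorithm."""
    words = text.split()
    wanted = {k.lower() for k in keywords}
    parsed = [TOKEN.fullmatch(w) for w in words]
    # Positions of words whose core (punctuation removed) is a keyword.
    hits = [i for i, m in enumerate(parsed)
            if m and m.group(2).lower() in wanted]
    hit_set = set(hits)

    # One [start, end) window of `around` words per hit; merge overlapping ones.
    windows = []
    for i in hits:
        start, end = max(0, i - around), min(len(words), i + around + 1)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])

    passages = []
    for start, end in windows:
        out = []
        for i in range(start, end):
            if i in hit_set:
                lead, core, trail = parsed[i].groups()
                out.append(f"{lead}{before_match}{core}{after_match}{trail}")
            else:
                out.append(words[i])
        passages.append(" ".join(out))
    return chunk_separator.join(passages)

print(toy_highlight("one two text three four five six seven feature, eight",
                    ["text", "feature"], before_match="*", after_match="*",
                    around=1, chunk_separator="###"))
```

With around=1 the two matches are too far apart to share a window, so the sketch produces two passages joined by the custom separator.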
Controlling the snippet size
By default, the maximum snippet size is limited to 256 characters; the option that controls this is called "limit". You can change it like this:
SELECT HIGHLIGHT({limit=10}) AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: ... a <b>feature</b> ...
Another limit you can change is the number of words included in the snippet, which is set by "limit_words":
SELECT HIGHLIGHT({limit_words=5},'content') AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: ... . The <b>feature</b> displays <b>text</b>, especially ...
You can also limit the number of passages, for example to get only one passage:
SELECT HIGHLIGHT({limit_passages=1}) AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: ... languages, such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in ...
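If you assemble such statements in application code, a small helper can render the options map for you. The sketch below is a hypothetical helper (`highlight_query` is not part of any Manticore client library); it only builds the SQL text in the form used in this tutorial and assumes that string values are single-quoted with backslash escaping. The index name is inserted as-is, so it must come from trusted code, not from user input.

```python
def highlight_query(index, match, options=None, fields=None):
    """Render a SELECT HIGHLIGHT(...) statement as text (hypothetical helper)."""

    def render(value):
        # Numbers are written bare; everything else becomes a quoted string.
        if isinstance(value, bool):
            return str(int(value))
        if isinstance(value, (int, float)):
            return str(value)
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"

    opts = ",".join(f"{k}={render(v)}" for k, v in (options or {}).items())
    args = []
    if opts or fields:
        # The options map comes first; it may be empty when only fields are given.
        args.append("{" + opts + "}")
    if fields:
        args.append(render(",".join(fields)))
    call = "HIGHLIGHT(" + ",".join(args) + ")"
    return f"SELECT {call} AS h FROM {index} WHERE MATCH({render(match)})"

print(highlight_query("highlight", "text feature",
                      {"limit_words": 5}, ["content"]))
```

The printed statement has the same shape as the limit_words example above and can be sent through any MySQL-protocol client.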
By default, HIGHLIGHT() returns the passages it finds, joined by the configured separator, within the space allowed by the limit.
Because the limit may not leave room for every passage, we may get only some of the possible passages.
To demonstrate this, let's first add a document with a longer text.
INSERT INTO highlight(title,content) values('wikipedia','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers. Syntax highlighting is a form of secondary notation, since the highlights are not part of the text meaning, but serve to reinforce it. Some editors also integrate syntax highlighting with other features, such as spell-checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in text editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript Syntax highlighting is one strategy to improve the readability and context of the text; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. Syntax highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the text. Brace matching is another important feature of many popular editors. This makes it simple to see if a brace has been left out or locate the match of the brace the cursor is on by highlighting the pair in a different color.
A study published in the conference PPIG evaluated the effects of syntax highlighting on the comprehension of short programs, finding that the presence of syntax highlighting significantly reduces the time taken for a programmer to internalize the semantics of a program.[2] Additionally, data gathered from an eye-tracker during the study suggested that syntax highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in text editors gedit supports syntax highlighting Some text editors can also export the colored markup in a format that is suitable for printing or for importing into word-processing and other kinds of text-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its syntax highlighting. There are several syntax highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic Syntax Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the text, such as C, LaTeX, HTML, or the text editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs from the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly.
Syntax elements Most editors with syntax highlighting allow different colors and text styles to be given to dozens of different lexical sub-elements of syntax. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read. ');
Now let's highlight it:
SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('syntax')\G
*************************** 1. row ***************************
h: <b>syntax</b> highlighting is a feature of ... language as both structures and <b>syntax</b> errors are visually distinct. Highlighting ...
*************************** 2. row ***************************
h: ... version of its <b>syntax</b> highlighting. There are several <b>syntax</b> highlighting libraries ... highlighted incorrectly. <b>syntax</b> elements Most editors with <b>syntax</b> highlighting allow different ... different lexical sub-elements of <b>syntax</b>. These include keywords, comments, ...
For the newly added document, HIGHLIGHT() does not return all of the passages. We can raise the limit to fix that; the question is by how much. If the value is too high, HIGHLIGHT() returns the full content text (with highlighting applied):
SELECT HIGHLIGHT({limit=10000},'content') AS h FROM highlight WHERE MATCH('syntax')\G
*************************** 1. row ***************************
h: <b>syntax</b> highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and <b>syntax</b> errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.
*************************** 2. row ***************************
h: <b>syntax</b> highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and <b>syntax</b> errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers. <b>syntax</b> highlighting is a form of secondary notation, since the highlights are not part of the text meaning, but serve to reinforce it. Some editors also integrate <b>syntax</b> highlighting with other features, such as spell checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in text editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript <b>syntax</b> highlighting is one strategy to improve the readability and context of the text; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. <b>syntax</b> highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the text. Brace matching is another important feature with many popular editors. This makes it simple to see if a brace has been left out or locate the match of the brace the cursor is on by highlighting the pair in a different color.
A study published in the conference PPIG evaluated the effects of <b>syntax</b> highlighting on the comprehension of short programs, finding that the presence of <b>syntax</b> highlighting significantly reduces the time taken for a programmer to internalise the semantics of a program.[2] Additionally, data gathered from an eye-tracker during the study suggested that <b>syntax</b> highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in text editors gedit supports <b>syntax</b> highlighting Some text editors can also export the coloured markup in a format that is suitable for printing or for importing into word-processing and other kinds of text-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its <b>syntax</b> highlighting. There are several <b>syntax</b> highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic <b>syntax</b> Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the text, such as C, LaTeX, HTML, or the text editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs from the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly.
<b>syntax</b> elements Most editors with <b>syntax</b> highlighting allow different colors and text styles to be given to dozens of different lexical sub-elements of <b>syntax</b>. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read.
If we want only the passages rather than the whole highlighted text, we need to use the "force_passages" option:
SELECT HIGHLIGHT({limit=10000,force_passages=1},'content') AS h FROM highlight WHERE MATCH('syntax')\G
*************************** 1. row ***************************
h: <b>syntax</b> highlighting is a feature of ... language as both structures and <b>syntax</b> errors are visually distinct. Highlighting ...
*************************** 2. row ***************************
h: <b>syntax</b> highlighting is a feature of ... language as both structures and <b>syntax</b> errors are visually distinct. Highlighting ... intended only for human readers. <b>syntax</b> highlighting is a form of ... it. Some editors also integrate <b>syntax</b> highlighting with other features, such ... (after watch=false) in Javascript <b>syntax</b> highlighting is one strategy to ... what they are looking for. <b>syntax</b> highlighting also helps programmers find ... PPIG evaluated the effects of <b>syntax</b> highlighting on the comprehension of ... , finding that the presence of <b>syntax</b> highlighting significantly reduces the time ... during the study suggested that <b>syntax</b> highlighting enables programmers to pay ... in text editors gedit supports <b>syntax</b> highlighting Some text editors can ... version of its <b>syntax</b> highlighting. There are several <b>syntax</b> highlighting libraries or ... themselves, for example the Generic <b>syntax</b> Highlighter (GeSHi) extension for PHP ... be highlighted incorrectly. <b>syntax</b> elements Most editors with <b>syntax</b> highlighting allow different ... different lexical sub-elements of <b>syntax</b>. These include keywords, comments, control ...
Another way to get the whole text with highlighting applied is simply to use limit=0:
SELECT HIGHLIGHT({limit=0},'content') AS h FROM highlight WHERE MATCH('text feature')\G
*************************** 1. row ***************************
h: Syntax highlighting is a <b>feature</b> of <b>text</b> editors that are used for programming, scripting, or markup languages, such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in different colors and fonts according to the category of terms.[1] This <b>feature</b> facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the <b>text</b> itself; it is intended only for human readers.
*************************** 2. row ***************************
h: Syntax highlighting is a <b>feature</b> of <b>text</b> editors that are used for programming, scripting, or markup languages, such as HTML. The <b>feature</b> displays <b>text</b>, especially source code, in different colors and fonts according to the category of terms.[1] This <b>feature</b> facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the <b>text</b> itself; it is intended only for human readers. Syntax highlighting is a form of secondary notation, since the highlights are not part of the <b>text</b> meaning, but serve to reinforce it. Some editors also integrate syntax highlighting with other features, such as spell checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in <b>text</b> editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript Syntax highlighting is one strategy to improve the readability and context of the <b>text</b>; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. Syntax highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the <b>text</b>. Brace matching is another important <b>feature</b> with many popular editors. This makes it simple to see if a brace has been left out or locate the match of the brace the cursor is on by highlighting the pair in a different color.
A study published in the conference PPIG evaluated the effects of syntax highlighting on the comprehension of short programs, finding that the presence of syntax highlighting significantly reduces the time taken for a programmer to internalise the semantics of a program.[2] Additionally, data gathered from an eye-tracker during the study suggested that syntax highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in <b>text</b> editors gedit supports syntax highlighting Some <b>text</b> editors can also export the coloured markup in a format that is suitable for printing or for importing into word-processing and other kinds of <b>text</b>-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its syntax highlighting. There are several syntax highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic Syntax Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the <b>text</b>, such as C, LaTeX, HTML, or the <b>text</b> editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs from the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly.
Syntax elements Most editors with syntax highlighting allow different colors and <b>text</b> styles to be given to dozens of different lexical sub-elements of syntax. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read.
HTML stripping and boundaries
If sentence detection is enabled in the index (index_sp = 1 in the configuration above), we can configure highlighting so that it does not produce passages that cross sentence boundaries. First, the default behavior:
SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('html text')\G
*************************** 1. row ***************************
h: ... highlighting is a feature of <b>text</b> editors that are used for ... markup languages, such as <b>HTML</b>. The feature displays <b>text</b>, especially source code ... affect the meaning of the <b>text</b> itself; it is intended only ...
1 row in set (0.00 sec)
In this example, the passage '... markup languages, such as HTML. The feature displays text, especially source code ...' crosses a sentence boundary.
With passage_boundary=sentence, this passage is split in two:
SELECT HIGHLIGHT({passage_boundary='sentence'},'content') AS h FROM highlight WHERE MATCH('html text')\G
*************************** 1. row ***************************
h: ... highlighting is a feature of <b>text</b> editors that are used for ... , or markup languages, such as <b>HTML</b>. ... The feature displays <b>text</b>, especially source code, in different ... affect the meaning of the <b>text</b> itself; it is intended only ...
1 row in set (0.05 sec)
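The effect of a sentence boundary on a passage can be imitated in a few lines of Python. This is only an illustration under a naive assumption (a sentence ends at '.', '!' or '?' followed by whitespace); Manticore's own sentence detection is driven by the index settings and is more involved. The function name `split_at_sentences` is hypothetical.

```python
import re

def split_at_sentences(passage, chunk_separator=" ... "):
    """Split a passage wherever a sentence ends and rejoin the pieces
    with the chunk separator (naive illustration only)."""
    # Zero-width lookbehind keeps the punctuation with the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return chunk_separator.join(parts)

print(split_at_sentences(
    "markup languages, such as <b>HTML</b>. The feature displays <b>text</b>"))
```

The passage that previously crossed the boundary after "HTML." now comes back as two chunks, mirroring the change between the two query results above.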
Let's add a document with HTML content.
INSERT INTO highlight(title,content) values('html content','The ideas of syntax highlighting overlap significantly with those of <a title="Structure editor" href="/wiki/Structure_editor">syntax-directed editors</a> One of the first such class of editors for code was Wilfred Hansens 1969 code editor, Emily.<sup id="cite_ref-hansen_3-0" class="reference"><a href="#cite_note-hansen-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> It provided advanced language-independent <a title="Autocomplete" href="/wiki/Autocomplete">code completion</a> facilities, and unlike modern editors with syntax highlighting, actually made it impossible to create syntactically incorrect programs.');
By default, highlighting handles HTML content according to the index settings. If HTML stripping is enabled in the index, the HIGHLIGHT() result is stripped of HTML as well.
SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('code class')\G
*************************** 1. row ***************************
h: ... of the first such <b>class</b> of editors for code was Wilfred Hansens ... 1969 <b>code</b> editor, Emily.[3][4 ... ] It provided advanced language-independent <b>code</b> completion facilities, and unlike modern ...
If we want the highlighted output to include the HTML tags as well, we need to set "html_strip_mode=none":
SELECT HIGHLIGHT({html_strip_mode='none'},'content') AS h FROM highlight WHERE MATCH('code class')\G
*************************** 1. row ***************************
h: ... the first such <b>class</b> of editors for <b>code</b> was Wilfred ... 1969 <b>code</b> editor, Emily. <sup id="cite_ref-hansen_3-0" <b>class</b>=" ... sup id="cite_ref-4" <b>class</b>="reference"><a title="Autocomplete" href="#cite_note- ... ="><b>code</b> completion facilities, and ...
1 row in set (0.05 sec)
Note that html_strip_mode=none can highlight words that are part of the HTML syntax, such as 'class'.
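The reason is easy to reproduce outside Manticore. The Python sketch below uses a deliberately naive regex highlighter and tag stripper (both hypothetical stand-ins, not Manticore's logic) on a shortened sample string to show why a keyword such as 'class' gets marked inside a tag when the markup is left in place.

```python
import re

def naive_highlight(text, keyword):
    """Wrap every whole-word occurrence of keyword in <b>...</b>."""
    pattern = rf"\b({re.escape(keyword)})\b"
    return re.sub(pattern, r"<b>\1</b>", text, flags=re.IGNORECASE)

def strip_tags(html):
    """Crude tag removal; good enough for this illustration only."""
    return re.sub(r"<[^>]+>", "", html)

raw = 'Emily.<sup class="reference">[3]</sup> was one such class of editors'

# Markup left in place: the attribute name inside the tag is highlighted too.
print(naive_highlight(raw, "class"))

# Markup stripped first: only the word in the visible text is highlighted.
print(naive_highlight(strip_tags(raw), "class"))
```

The first call breaks the `<sup>` tag by inserting `<b>` inside it, which is the same kind of damage visible in the html_strip_mode=none output above.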
To protect the HTML markup from highlighting, you can use retain mode, but it requires that the snippet has no size limit (limit=0):
SELECT HIGHLIGHT({html_strip_mode='retain',limit=0},'content') AS h FROM highlight WHERE MATCH('code class')\G
*************************** 1. row ***************************
h: The ideas of syntax highlighting overlap significantly with those of <a title="Structure editor" href="/wiki/Structure_editor">syntax-directed editors</a> One of the first such <b>class</b> of editors for <b>code</b> was Wilfred Hansens 1969 <b>code</b> editor, Emily.<sup id="cite_ref-hansen_3-0" class="reference"><a href="#cite_note-hansen-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> It provided advanced language-independent <a title="Autocomplete" href="/wiki/Autocomplete"><b>code</b> completion</a> facilities, and unlike modern editors with syntax highlighting, actually made it impossible to create syntactically incorrect programs.
This tutorial explained how to highlight search results in Manticore Search with the HIGHLIGHT() function.
Interactive course
This blog post is also available as an interactive course that includes a command line where you can try the examples above.
