I would like to use Elasticsearch highlight to obtain matched keywords found inside a text. This is my settings/mappings
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => _",
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"description": {
"type": "text",
"analyzer": "my_analyzer",
"fielddata": True
}
}
}
}
I am using a char_filter to search and highligth hypenated words. This my document example:
{
"_index": "test_tokenizer",
"_type": "_doc",
"_id": "DbBIxXEBL7VGAl98vIRl",
"_score": 1.0,
"_source": {
"title": "Best places: New Mexico and Sedro-Woolley",
"description": "This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"
}
}
and this is the query I use
{
"query": {
"query_string" : {
"query" : "\"New York\" OR \"Rome\" OR \"Milton-Freewater\"",
"default_field": "description"
}
},
"highlight" : {
"pre_tags" : ["<key>"],
"post_tags" : ["</key>"],
"fields" : {
"description" : {
"number_of_fragments" : 0
}
}
}
}
and this is the output I have
...
"hits": [
{
"_index": "test_tokenizer",
"_type": "_doc",
"_id": "GrDNz3EBL7VGAl98EITg",
"_score": 0.72928625,
"_source": {
"title": "Best places: New Mexico and Sedro-Woolley",
"description": "This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"
},
"highlight": {
"description": [
"This is an example text containing some cities like <key>New</key> <key>York</key>, Toronto, <key>Rome</key> and many other. So, there are also <key>Milton-Freewater</key> and Las Vegas!"
]
}
}
]
...
Rome and Milton-Freewater are highlighted correctly. New York is not
How can I have <key>New York</key>
instead of <key>New</key>
and <key>York</key>
?
There is an open PR regarding this but I'd suggest the following interim solution:
term_vector
settingPUT test_tokenizer
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => _"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"description": {
"type": "text",
"analyzer": "my_analyzer",
"term_vector": "with_positions_offsets",
"fielddata": true
}
}
}
}
POST test_tokenizer/_doc
{"title":"Best places: New Mexico and Sedro-Woolley","description":"This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"}
query_string
to a bunch of bool-should match_phrases
inside the highlight_query
and use type: fvh
GET test_tokenizer/_search
{
"query": {
"query_string": {
"query": "'New York' OR 'Rome' OR 'Milton-Freewater'",
"default_field": "description"
}
},
"highlight": {
"pre_tags": [
"<key>"
],
"post_tags": [
"</key>"
],
"fields": {
"description": {
"highlight_query": {
"bool": {
"should": [
{
"match_phrase": {
"description": "New York"
}
},
{
"match_phrase": {
"description": "Rome"
}
},
{
"match_phrase": {
"description": "Milton-Freewater"
}
}
]
}
},
"type": "fvh",
"number_of_fragments": 0
}
}
}
}
yielding
{
"highlight":{
"description":[
"This is an example text containing some cities like <key>New York</key>, Toronto, <key>Rome</key> and many other. So, there are also <key>Milton-Freewater</key> and Las Vegas!"
]
}
}