I am trying to support smart-case search
within our application which uses elasticsearch. The use case I want to support is to be able to partially match on any blob of text using smart-case semantics. I managed to configure my index in such a way that I am capable of simulating smart-case search. It uses ngrams of max length 8 to not overload storage requirements.
The way it works is that each document has both a generated case-sensitive
and a case-insensitive
field using copy_to
with their own specific indexing strategy. When searching on a given input, I split the input in parts. This depends on the ngrams length, white spaces and double quote escaping. Each part is checked for capitalized letters. When a capital letter is found, it generates a match filter for that specific part using the case-sensitive field, otherwise it uses the case-insensitive field.
This has proven to work very nicely, however I am having difficulties with getting highlighting to work the way I would like. To better explain the issue, I added an overview of my test setup below.
curl -X DELETE localhost:9200/custom
curl -X PUT localhost:9200/custom -d '
{
"settings": {
"analysis": {
"filter": {
"default_min_length": {
"type": "length",
"min": 1
},
"squash_spaces": {
"type": "pattern_replace",
"pattern": "\\s{2,}",
"replacement": " "
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "8"
}
},
"analyzer": {
"index_raw": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "keyword"
},
"index_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim"],
"tokenizer": "keyword"
},
"index_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim"],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_default_": {
"_all": { "enabled": false },
"date_detection": false,
"dynamic_templates": [
{
"case_insensitive": {
"match_mapping_type": "string",
"match": "case_insensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive"
}
}
},
{
"case_sensitive": {
"match_mapping_type": "string",
"match": "case_sensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive"
}
}
},
{
"text": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer": "index_raw",
"copy_to": ["case_insensitive","case_sensitive"],
"fields": {
"case_insensitive": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive",
"term_vector": "with_positions_offsets"
},
"case_sensitive": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive",
"term_vector": "with_positions_offsets"
}
}
}
}
}
]
}
}
}
'
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis .is a! Test" }'
The user searches for: tHis test
which gets split into two parts as ngrams are maximum 8 in lengths: (1) tHis
and (2) test
. For (1) the case-sensitive field is used and (2) uses the case-insensitive field.
curl -X POST "http://localhost:9200/_search" -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*": {}
}
}
}
'
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534896,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.057534896,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
}
]
}
}
As you can see, the response shows that the smart-case search works very well. However, I also want to give feedback to the user using highlighting. My current setup uses "term_vector": "with_positions_offsets"
to generate highlights. This indeed gives back correct highlights. However, the highlights are returned as both case-sensitive and case-insensitive independently.
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
This requires me to manually zip multiple highlights on the same field into one combined highlight before returning it to the user. This becomes very painful when highlights become more complicated and can overlap.
Is there an alternative setup to actually get back the combined highlight. I.e. I would like to have this as part of my response.
"highlight": {
"text": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
Make use of highlight query to get merged result:
curl -XPOST 'http://localhost:9200_search' -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*.case_insensitive": {
"highlight_query": {
"bool": {
"must": [
{
"match": {
"*.case_insensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"*.case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
}
}
}
}
}
'
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
}
]
}
}
When ingesting the following, note the additional lower-case test
keyword:
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis this .is a! Test" }'
The response to the same query becomes:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis this .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em><em> this</em> .is a!<em> Test</em>"
]
}
}
]
}
}
As you can see, the highlight now also includes the lower-case this
. For such a test example, we do not mind. However, for complicated queries, the user might (and probably will) get confused when and how the smart-case has any effect. Especially when the lower-case match would include a field that only matches on lower-case.
This solution will give you all highlights merged as one, but might include unwanted results.