elasticsearch full-text-search full-text-indexing

Smartcase searches/highlights with ElasticSearch

Context

I am trying to support smart-case search in our application, which uses Elasticsearch. The goal is to partially match any blob of text using smart-case semantics. I managed to configure my index so that it simulates smart-case search. It uses ngrams with a maximum length of 8 to keep storage requirements manageable.

Each document has a generated case-sensitive field and a generated case-insensitive field, both populated with copy_to and each with its own indexing strategy. When searching on a given input, I split the input into parts. The split depends on the ngram length, whitespace and double-quote escaping. Each part is checked for capital letters. If a part contains a capital letter, I generate a match filter for that part on the case-sensitive field; otherwise I use the case-insensitive field.
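To make the splitting and field selection concrete, here is a minimal Python sketch of the idea. It is not my production code: the function names `split_parts` and `smartcase_query` are made up for illustration, and the exact chunking rule (cut every piece into pieces of at most 8 characters, and overlap the tail if it would be shorter than min_gram) is an assumption about one reasonable way to respect the ngram limits.

```python
import re

MIN_GRAM = 2  # min_gram of ngram_tokenizer in the settings
MAX_GRAM = 8  # max_gram of ngram_tokenizer in the settings


def split_parts(user_input):
    """Split on whitespace, keep double-quoted phrases together,
    then cut every piece into chunks of at most MAX_GRAM characters."""
    parts = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', user_input):
        piece = quoted or bare
        for start in range(0, len(piece), MAX_GRAM):
            chunk = piece[start:start + MAX_GRAM]
            # Assumption: a tail shorter than MIN_GRAM can never match an
            # indexed ngram, so overlap it with the previous chunk instead.
            if start > 0 and len(chunk) < MIN_GRAM:
                chunk = piece[-MIN_GRAM:]
            parts.append(chunk)
    return parts


def smartcase_query(user_input):
    """Build a bool query: parts with a capital letter go to the
    case-sensitive field, all other parts to the case-insensitive field."""
    must = []
    for part in split_parts(user_input):
        has_upper = any(ch.isupper() for ch in part)
        field = "case_sensitive" if has_upper else "case_insensitive"
        must.append({"match": {field: {"query": part, "type": "boolean"}}})
    return {"bool": {"must": must}}
```

For the input tHis test this should produce the same two match clauses as the query shown further down: tHis on case_sensitive and test on case_insensitive.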

This has proven to work very nicely. However, I am having difficulty getting highlighting to work the way I would like. To better explain the issue, I have added an overview of my test setup below.

Settings

curl -X DELETE localhost:9200/custom
curl -X PUT    localhost:9200/custom -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "default_min_length": {
          "type": "length",
          "min": 1
        },
        "squash_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s{2,}",
          "replacement": " "
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "8"
        }
      },
      "analyzer": {
        "index_raw": {
          "type": "custom",
          "filter": ["lowercase","squash_spaces","trim","default_min_length"],
          "tokenizer": "keyword"
        },
        "index_case_insensitive": {
          "type": "custom",
          "filter": ["lowercase","squash_spaces","trim","default_min_length"],
          "tokenizer": "ngram_tokenizer"
        },
        "search_case_insensitive": {
          "type": "custom",
          "filter": ["lowercase","squash_spaces","trim"],
          "tokenizer": "keyword"
        },
        "index_case_sensitive": {
          "type": "custom",
          "filter": ["squash_spaces","trim","default_min_length"],
          "tokenizer": "ngram_tokenizer"
        },
        "search_case_sensitive": {
          "type": "custom",
          "filter": ["squash_spaces","trim"],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "date_detection": false,
      "dynamic_templates": [
        {
          "case_insensitive": {
            "match_mapping_type": "string",
            "match": "case_insensitive",
            "mapping": {
              "type": "string",
              "analyzer": "index_case_insensitive",
              "search_analyzer": "search_case_insensitive"
            }
          }
        },
        {
          "case_sensitive": {
            "match_mapping_type": "string",
            "match": "case_sensitive",
            "mapping": {
              "type": "string",
              "analyzer": "index_case_sensitive",
              "search_analyzer": "search_case_sensitive"
            }
          }
        },
        {
          "text": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "analyzer": "index_raw",
              "copy_to": ["case_insensitive","case_sensitive"],
              "fields": {
                "case_insensitive": {
                  "type": "string",
                  "analyzer": "index_case_insensitive",
                  "search_analyzer": "search_case_insensitive",
                  "term_vector": "with_positions_offsets"
                },
                "case_sensitive": {
                  "type": "string",
                  "analyzer": "index_case_sensitive",
                  "search_analyzer": "search_case_sensitive",
                  "term_vector": "with_positions_offsets"
                }
              }
            }
          }
        }
      ]
    }
  }
}
'

Data

curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis .is a! Test" }'

Query

The user searches for: tHis test. The input is split into two parts, because ngrams have a maximum length of 8 and the input contains whitespace: (1) tHis and (2) test. Part (1) is matched against the case-sensitive field and part (2) against the case-insensitive field.

curl -X POST "http://localhost:9200/_search" -d '
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "case_sensitive": {
              "query": "tHis",
              "type": "boolean"
            }
          }
        },
        {
          "match": {
            "case_insensitive": {
              "query": "test",
              "type": "boolean"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<em>"
    ],
    "post_tags": [
      "</em>"
    ],
    "number_of_fragments": 0,
    "require_field_match": false,
    "fields": {
      "*": {}
    }
  }
}
'

Response

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.057534896,
    "hits": [
      {
        "_index": "custom",
        "_type": "test",
        "_id": "1",
        "_score": 0.057534896,
        "_source": {
          "text": "tHis .is a! Test"
        },
        "highlight": {
          "text.case_sensitive": [
            "<em>tHis</em> .is a! Test"
          ],
          "text.case_insensitive": [
            "tHis .is a!<em> Test</em>"
          ]
        }
      }
    ]
  }
}

Problem: highlighting

As you can see, the response shows that the smart-case search works very well. However, I also want to give feedback to the user using highlighting. My current setup uses "term_vector": "with_positions_offsets" to generate highlights. This does give back correct highlights, but they are returned separately for the case-sensitive and the case-insensitive field.

"highlight": {
  "text.case_sensitive": [
    "<em>tHis</em> .is a! Test"
  ],
  "text.case_insensitive": [
    "tHis .is a!<em> Test</em>"
  ]
}

This requires me to manually zip multiple highlights on the same field into one combined highlight before returning it to the user. This becomes very painful when highlights become more complicated and can overlap.
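For reference, the manual zipping looks roughly like the Python sketch below. The helper names `spans` and `merge_highlights` are hypothetical. The sketch assumes that every fragment covers the whole field value (number_of_fragments is 0) and that the text itself never contains the pre/post tags. Overlapping or touching spans are fused into one.

```python
import re


def spans(fragment, pre="<em>", post="</em>"):
    """Strip the tags and return (plain_text, [(start, end), ...]),
    where the offsets refer to positions in the plain text."""
    tag = re.compile("(%s|%s)" % (re.escape(pre), re.escape(post)))
    plain, found, pos, start = [], [], 0, None
    for piece in tag.split(fragment):
        if piece == pre:
            start = pos
        elif piece == post:
            if start is not None:
                found.append((start, pos))
                start = None
        else:
            plain.append(piece)
            pos += len(piece)
    return "".join(plain), found


def merge_highlights(fragments, pre="<em>", post="</em>"):
    """Combine several highlighted versions of the same text into one."""
    text, all_spans = None, []
    for fragment in fragments:
        plain, found = spans(fragment, pre, post)
        if text is None:
            text = plain
        elif plain != text:
            raise ValueError("fragments do not cover the same text")
        all_spans.extend(found)
    if text is None:
        return ""
    # Union of intervals: sort, then fuse spans that overlap or touch.
    merged = []
    for start, end in sorted(all_spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    out, last = [], 0
    for start, end in merged:
        out.append(text[last:start])
        out.append(pre + text[start:end] + post)
        last = end
    out.append(text[last:])
    return "".join(out)
```

Feeding it the two fragments above gives one combined string, but I would have to run this over every highlighted field of every hit, which is exactly the work I would like Elasticsearch to do for me.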

Question

Is there an alternative setup that returns the combined highlight directly? That is, I would like to have the following as part of my response:

"highlight": {
  "text": [
    "<em>tHis</em> .is a!<em> Test</em>"
  ]
}

Solution

Attempt

Use a highlight_query on the case-insensitive sub-field, containing all parts of the search input, to get a merged result:

curl -XPOST 'http://localhost:9200/_search' -d '
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "case_sensitive": {
              "query": "tHis",
              "type": "boolean"
            }
          }
        },
        {
          "match": {
            "case_insensitive": {
              "query": "test",
              "type": "boolean"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<em>"
    ],
    "post_tags": [
      "</em>"
    ],
    "number_of_fragments": 0,
    "require_field_match": false,
    "fields": {
      "*.case_insensitive": {
        "highlight_query": {
          "bool": {
            "must": [
              {
                "match": {
                  "*.case_insensitive": {
                    "query": "tHis",
                    "type": "boolean"
                  }
                }
              },
              {
                "match": {
                  "*.case_insensitive": {
                    "query": "test",
                    "type": "boolean"
                  }
                }
              }
            ]
          }
        }
      }
    }
  }
}
'

Response

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.9364339,
    "hits": [
      {
        "_index": "custom",
        "_type": "test",
        "_id": "1",
        "_score": 0.9364339,
        "_source": {
          "text": "tHis .is a! Test"
        },
        "highlight": {
          "text.case_insensitive": [
            "<em>tHis</em> .is a!<em> Test</em>"
          ]
        }
      }
    ]
  }
}
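If the query is generated from the split input parts, the highlight section can be generated from the same parts. The following Python sketch shows one way to do that; the function name `highlight_section` is hypothetical, and it simply reproduces the structure of the request above by putting every part, upper-case or not, into a match on the case-insensitive sub-field.

```python
import json


def highlight_section(parts, field="*.case_insensitive"):
    """Build the "highlight" part of the request: every input part is
    matched on the case-insensitive sub-field, regardless of its case."""
    must = [
        {"match": {field: {"query": part, "type": "boolean"}}}
        for part in parts
    ]
    return {
        "pre_tags": ["<em>"],
        "post_tags": ["</em>"],
        "number_of_fragments": 0,
        "require_field_match": False,
        "fields": {field: {"highlight_query": {"bool": {"must": must}}}},
    }


if __name__ == "__main__":
    # Prints the highlight section for the example input "tHis test".
    print(json.dumps(highlight_section(["tHis", "test"]), indent=2))
```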

Warning

When ingesting the following document instead, note the additional lower-case this keyword:

curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis this .is a! Test" }'

The response to the same query becomes:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.9364339,
    "hits": [
      {
        "_index": "custom",
        "_type": "test",
        "_id": "1",
        "_score": 0.9364339,
        "_source": {
          "text": "tHis this .is a! Test"
        },
        "highlight": {
          "text.case_insensitive": [
            "<em>tHis</em><em> this</em> .is a!<em> Test</em>"
          ]
        }
      }
    ]
  }
}

As you can see, the highlight now also includes the lower-case this, because the highlight_query matches every part on the case-insensitive field, even the parts that the main query matched case-sensitively. For such a test example, we do not mind. However, for complicated queries, the user might (and probably will) get confused about when and how smart-case has any effect, especially when a highlighted field only matches in its lower-case form.
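The effect can be reproduced outside Elasticsearch with a rough Python simulation of the two analyzers. This is only an approximation under stated assumptions: it generates character ngrams of length 2 to 8, applies lowercase (for the case-insensitive field only) and trim, and ignores squash_spaces, scoring and the actual highlighter. The names `ngrams` and `matching_offsets` are made up.

```python
def ngrams(text, min_gram=2, max_gram=8):
    """Yield (start, end, gram) for every character ngram of the text."""
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                yield start, start + size, text[start:start + size]


def matching_offsets(text, query, lowercase):
    """Offsets of ngrams that equal the query after the (simplified)
    analysis chain: optional lowercase, then trim."""
    def norm(value):
        return value.lower() if lowercase else value

    wanted = norm(query).strip()
    return [(s, e) for s, e, gram in ngrams(text) if norm(gram).strip() == wanted]


if __name__ == "__main__":
    text = "tHis this .is a! Test"
    # Case-sensitive analysis: only ngrams at the start of the text match.
    print(matching_offsets(text, "tHis", lowercase=False))
    # Case-insensitive analysis: ngrams covering the second "this" match too.
    print(matching_offsets(text, "tHis", lowercase=True))
```

With lowercasing, the query tHis becomes this, so ngrams covering the second word match as well. That is why the merged highlight marks both occurrences.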

Conclusion

This solution gives you all highlights merged into one, but it may also highlight text that only matches case-insensitively, even for parts of the query that were meant to be case-sensitive.