How to perform a query string search on a term containing hyphens in Elasticsearch


I know this issue has been discussed in several other posts, but my case is a little bit different since I have the following constraints:

  • I cannot change the mapping of the fields
  • I cannot perform any action that will result in reindexing or creating a new index.

So this is my query:

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {"query_string": {
          "query": "*bop-qa-io-135*",
          "default_field": "errors.message",
          "default_operator": "AND"}},
        {"range": {"updated_at": {"gte": "2023-07-03T00:00:00"}}},
        {"range": {"updated_at": {"lte": "2023-07-05T00:00:00"}}}
      ]
    }
  },
  "from": 0,
  "size": 300
}

The type of errors.message is text. This query doesn't return what I want. I know the standard analyzer is working behind the scenes here, splitting my hyphenated query into separate terms. Is there a way to make this query work under the constraints above? What I've already tried:

  • Adding "analyzer": "keyword" to the query

I think there was something about putting everything in double quotes that should have worked, but I don't know how to do it here, since double quotes are already part of the JSON syntax.

My ES version:

{
  "name": "GYWR05J",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "vFO2BdrzR0OLfPeVO9Rr-g",
  "version": {
    "number": "6.2.2",
    "build_hash": "10b1edd",
    "build_date": "2018-02-16T19:01:30.685723Z",
    "build_snapshot": false,
    "lucene_version": "7.2.1",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Solution

  • I think there was something about putting everything in double-quotes that should've worked but I don't know how to do it here

    You can escape - with \- and put everything in quotes by escaping the quotes with \". You would get something like this: "query": "\"*bop\\-qa\\-io\\-135*\"". It will not help you, though, because the query_string query does not interpret wildcards inside a quoted phrase. You can use either wildcards or a phrase there, but not both.
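To make the two layers of escaping concrete, here is a minimal Python sketch that builds the quoted-phrase variant of the request and lets the JSON serializer add the \" escapes. The helper name phrase_query is made up for illustration, and I'm assuming that inside a quoted phrase only backslashes and double quotes need escaping. The wildcards are left out on purpose, since they can't be combined with a phrase.

```python
import json


def phrase_query(field, phrase):
    """Build a query_string clause that searches for `phrase` as a quoted phrase.

    Hypothetical helper: it escapes backslashes and double quotes for the
    query_string syntax, then wraps the text in double quotes.
    """
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return {"query_string": {"query": f'"{escaped}"', "default_field": field}}


body = {"query": phrase_query("errors.message", "bop-qa-io-135")}

# json.dumps adds the JSON-level escaping (\") around the phrase for us.
print(json.dumps(body, indent=2))
```

Keep in mind that a phrase matches whole tokens only, so this finds bop-qa-io-135 but not foobop-qa-io-135678.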

    Unfortunately, if you cannot reindex, the solution is not going to be simple. First, you need to analyze your request to see which tokens are generated:

    POST test/_analyze
    {
      "field": "errors.message",
      "text": ["bop-qa-io-135"]
    }
    

    Then, from the generated tokens, build a span_near query: wrap a wildcard query in span_multi for the first and the last terms, and use span_term for all the terms in between. The terms must be in the form produced by the _analyze request. So, for *bop-qa-io-135* we get:

    
    POST test/_search
    {
      "query": {
        "span_near": {
          "clauses": [
            {
              "span_multi": {
                "match": {
                  "wildcard": {
                    "errors.message": "*bop"
                  }
                }
              }
            },
            {
              "span_term": {
                "errors.message": "qa"
              }
            },
            {
              "span_term": {
                "errors.message": "io"
              }
            },
            {
              "span_multi": {
                "match": {
                  "wildcard": {
                    "errors.message": "135*"
                  }
                }
              }
            }
          ],
          "in_order": true
        }
      },
      "from": 0,
      "size": 300
    }
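If you need this for arbitrary search terms, the construction can be automated. The following is only a sketch: tokens_from_analyze and build_span_near are hypothetical helper names, the sample response is trimmed to the two fields the code reads (a real _analyze response also carries offsets and a token type), and I set slop to 0 explicitly so the terms must be adjacent, which is my addition and not part of the query above.

```python
import json


def tokens_from_analyze(response):
    """Return the token strings of an _analyze response, ordered by position."""
    ordered = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


def build_span_near(field, tokens):
    """Build a span_near query matching *first ... last* as adjacent terms."""
    if len(tokens) < 2:
        raise ValueError("this sketch needs at least two tokens")

    def wildcard(pattern):
        # span_multi lets a multi-term query (here: wildcard) act as a span clause.
        return {"span_multi": {"match": {"wildcard": {field: pattern}}}}

    clauses = [wildcard("*" + tokens[0])]
    clauses += [{"span_term": {field: tok}} for tok in tokens[1:-1]]
    clauses.append(wildcard(tokens[-1] + "*"))
    return {"span_near": {"clauses": clauses, "slop": 0, "in_order": True}}


# Trimmed stand-in for the _analyze output of "bop-qa-io-135".
sample_response = {
    "tokens": [
        {"token": "bop", "position": 0},
        {"token": "qa", "position": 1},
        {"token": "io", "position": 2},
        {"token": "135", "position": 3},
    ]
}

tokens = tokens_from_analyze(sample_response)
body = {"query": build_span_near("errors.message", tokens), "from": 0, "size": 300}
print(json.dumps(body, indent=2))
```

The range filters from the original request can be kept by putting this span_near clause into the same bool/must array in place of the query_string clause.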
    

    If reindexing is an option, you can use an analyzer that is better suited to your type of text. There are numerous options. You can use the whitespace analyzer, for example, or use a char filter to map - to some character that the analyzer does not split on, such as _. One side effect of this approach is that, because the character filter is applied at both index time and search time, a search for either - or _ matches documents containing either character. The example uses the typeless mapping syntax of Elasticsearch 7 and later; on 6.x the mappings and the bulk request also need a type name:

    PUT test
    {
      "settings": {
        "number_of_replicas": 0,
        "number_of_shards": 1,
        "analysis": {
          "analyzer": {
            "nonsplit_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "nonsplit_char_filter"
              ]
            }
          },
          "char_filter": {
            "nonsplit_char_filter": {
              "type": "mapping",
              "mappings": [
                "- => _"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "text": {
            "type": "text",
            "analyzer":"nonsplit_analyzer"
          }
        }
      }
    }
      
    POST test/_bulk?refresh
    {"index":{}}
    {"text": "bazbop-qa-io-135678"}
    {"index":{}}
    {"text": "foobop_qa_io_135678"}
    {"index":{}}
    {"text": "foobop-qa-io-234567"}
    {"index":{}}
    {"text": "foobop qa io 135678"}
    
    
    POST test/_search
    {
      "query": {
        "query_string": {
          "default_field": "text",
          "query": "*bop-qa-io-135*"
        }
      }
    }
    
    POST test/_search
    {
      "query": {
        "query_string": {
          "default_field": "text",
          "query": "*bop_qa_io_135*"
        }
      }
    }
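To see why both searches return the same documents, here is a rough local simulation in Python. It is not how Elasticsearch or Lucene work internally: I'm assuming the char filter can be approximated by a plain string replace, the standard tokenizer by splitting on non-word characters, and the wildcard match by fnmatch. The function names are made up for illustration.

```python
import re
from fnmatch import fnmatchcase


def analyze(text):
    """Approximate the analyzer: map '-' to '_', then split on non-word characters."""
    mapped = text.replace("-", "_")
    return re.findall(r"\w+", mapped)


def matches(pattern, text):
    """Apply the same mapping to the wildcard pattern, then test every token."""
    normalized = pattern.replace("-", "_")
    return any(fnmatchcase(token, normalized) for token in analyze(text))


docs = [
    "bazbop-qa-io-135678",
    "foobop_qa_io_135678",
    "foobop-qa-io-234567",
    "foobop qa io 135678",
]

hits_hyphen = [d for d in docs if matches("*bop-qa-io-135*", d)]
hits_underscore = [d for d in docs if matches("*bop_qa_io_135*", d)]

print(hits_hyphen)
print(hits_underscore)
```

In this simulation both patterns select the first two documents. The third has different digits, and the fourth is split into separate tokens on the spaces, so no single token matches the pattern.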