elasticsearch, tokenize

How to search omitting whitespace in Elasticsearch


Elasticsearch noob here, trying to understand something.

I have this query:

{
  "size": 10,
  "_source": "pokemon.name",
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "multi_match": {
            "_name": "name-match",
            "type": "phrase",
            "fields": ["pokemon.name"],
            "operator": "or",
            "query": "pika"
          }
        },
        {
          "multi_match": {
            "_name": "weight-match",
            "type": "most_fields",
            "fields": ["pokemon.weight"],
            "query": "10kg"
          }
        }
      ]
    }
  }
}

(For the weight clause I also use multi_match because I'm not sure how to change it to a plain match query.)

The issue is that pokemon.weight has a space between the value and the unit, e.g. "10 Kg", so I need to ignore the whitespace in order to match it against "10kg".

I've tried changing the tokenizer, but sadly it can only decide where to split, not remove a character. In any case I don't know how to use it, and the documentation isn't very helpful: it explains the theory but not how to apply it.

Thanks! Any learning resources would be much appreciated.


Solution

  • You need to define a custom analyzer with a char filter, in which you replace the space character with an empty string, so that the tokens that would otherwise be generated, in your case 10 and g, become 10g. I tried it locally and it works fine for me.

    Bonus links for understanding how analysis works in ES and an example of a custom analyzer with char filters.

    Below is my custom analyzer to achieve the required tokens:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "\\u0020=>"
              ]
            }
          }
        }
      }
    }
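
    For completeness, here is a sketch of how these settings could be applied when creating the index, with the analyzer attached to the pokemon.weight field so it is used at both index and search time. The index name, the nested pokemon object layout and the ES 7+ mapping format are assumptions based on the query in the question, and the lowercase filter is an extra addition on my part, since the stored value "10 Kg" has a capital K and would otherwise not match a lowercase query like "10kg":

    PUT /pokemon_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "\\u0020=>"
              ]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ],
              "filter": [
                "lowercase"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "pokemon": {
            "properties": {
              "name": { "type": "text" },
              "weight": {
                "type": "text",
                "analyzer": "my_analyzer"
              }
            }
          }
        }
      }
    }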
    

    Now, using the same analyzer, it generates the token below, which I confirmed using the analyze API.

    Endpoint: http://{{your_hostname}}:9500/{{your_index_name}}/_analyze

    Body:

    {
        "analyzer" : "my_analyzer",
        "text" : "10 g"
    }
    

    Result:

    {
        "tokens": [
            {
                "token": "10g",
                "start_offset": 0,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 0
            }
        ]
    }
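
    With the analyzer applied to pokemon.weight as in the mapping sketch above, the weight clause from your query should now match documents stored as "10 Kg" when you search for "10kg". It could also be simplified to a plain match query, something like the following (pokemon_index is the index name assumed in the earlier sketch):

    GET /pokemon_index/_search
    {
      "query": {
        "match": {
          "pokemon.weight": "10kg"
        }
      }
    }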