
Elasticsearch word_delimiter_graph split token on specific delimiter only


I want to use an Elasticsearch token filter that acts like word_delimiter_graph but splits tokens on a specific delimiter only (if I am not mistaken, the default word_delimiter_graph does not let you specify a custom delimiter list).

For example, I want to split tokens on the - delimiter only:

i-pod -> [i-pod, i, pod]

i_pod -> [i_pod] (no split, since I only want to split on - and not on any other character.)

How can I achieve that?

Thank you!


Solution

  • I used the type_table parameter: by mapping _ to ALPHA, the underscore is treated as an alphanumeric character, so the filter no longer splits on it and - remains the only delimiter in these examples. Note that type_table only remaps the characters you list, so every non-alphanumeric character you do not want to split on needs its own mapping. The Elasticsearch documentation describes the parameter as follows:

    (Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
    
    For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won’t be treated as delimiters.
    

    Tests:

    i-pad

    GET /_analyze
    {
      "tokenizer": "keyword",
      "filter": [
        {
          "type": "word_delimiter_graph",
          "preserve_original": true,
          "type_table": [ "_ => ALPHA" ]
        }
      ],
      "text": "i-pad"
    }
    

    Tokens:

    {
      "tokens": [
        {
          "token": "i-pad",
          "start_offset": 0,
          "end_offset": 5,
          "type": "word",
          "position": 0,
          "positionLength": 2
        },
        {
          "token": "i",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0
        },
        {
          "token": "pad",
          "start_offset": 2,
          "end_offset": 5,
          "type": "word",
          "position": 1
        }
      ]
    }
    

    i_pad

    GET /_analyze
    {
      "tokenizer": "keyword",
      "filter": [
        {
          "type": "word_delimiter_graph",
          "preserve_original": true,
          "type_table": [ "_ => ALPHA" ]
        }
      ],
      "text": "i_pad"
    }
    

    Tokens:

    {
      "tokens": [
        {
          "token": "i_pad",
          "start_offset": 0,
          "end_offset": 5,
          "type": "word",
          "position": 0
        }
      ]
    }
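
    To use this at index time rather than only in ad-hoc _analyze calls, the same filter can be defined in the index settings and wired into a custom analyzer. A minimal sketch (the index name my-index and the names split_on_hyphen / my_analyzer are placeholders, not anything from the original answer):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "split_on_hyphen": {
              "type": "word_delimiter_graph",
              "preserve_original": true,
              "type_table": [ "_ => ALPHA" ]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "filter": [ "split_on_hyphen" ]
            }
          }
        }
      }
    }

    Any text or search_analyzer field in the mapping can then reference my_analyzer, and i-pod will be indexed as [i-pod, i, pod] while i_pod stays whole.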