elasticsearch tokenize analyzer elasticsearch-6

Elasticsearch custom analyzer with two output tokens


The requirement is to create a custom analyzer that generates two tokens, as shown in the scenario below.

E.g.

Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in

I am able to remove the non-alphanumeric characters, but how can I also retain the original string in the output token list? Below is the custom analyzer that I have created.

       "alphanumericStringAnalyzer": {
            "filter": [
                "lowercase",
                "minLength_filter"],
            "char_filter": [
                "specialCharactersFilter"
            ],
            "type": "custom",
            "tokenizer": "keyword"
        }

      "char_filter": {
        "specialCharactersFilter": {
            "pattern": "[^A-Za-z0-9]",
            "type": "pattern_replace",
            "replacement": ""
        }
      },

This analyzer generates a single token, "btechin", for the input "B.tech in", but I also want the original string, "B.tech in", in the token list.

Thanks!


Solution

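  • You can use the word delimiter token filter (word_delimiter), as described in the Elasticsearch documentation.

    Here is an example of a word delimiter configuration. With catenate_all enabled, the parts split at non-alphanumeric characters are joined into a single token; preserve_original keeps the unmodified token alongside it; and setting generate_word_parts to false suppresses the individual parts: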

    POST _analyze
    {
      "text": "B.tech in",
      "tokenizer": "keyword",
      "filter": [
        "lowercase",
        {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      ]
    }
    

    Result:

    {
      "tokens": [
        {
          "token": "b.tech in",
          "start_offset": 0,
          "end_offset": 9,
          "type": "word",
          "position": 0
        },
        {
          "token": "btechin",
          "start_offset": 0,
          "end_offset": 9,
          "type": "word",
          "position": 0
        }
      ]
    }
    

    I hope this fulfills your requirements!
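
    To apply the same filter at index time rather than only through _analyze, the inline filter definition above could be registered in the index settings and referenced from the analyzer in the question. The following is a minimal sketch of an index-creation request body, not a tested configuration. The filter name catenate_keep_original is made up for illustration, and the minLength_filter from the question is left out because its definition was not shown.

    ```json
    {
      "settings": {
        "analysis": {
          "filter": {
            "catenate_keep_original": {
              "type": "word_delimiter",
              "catenate_all": true,
              "preserve_original": true,
              "generate_word_parts": false
            }
          },
          "analyzer": {
            "alphanumericStringAnalyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "catenate_keep_original"
              ]
            }
          }
        }
      }
    }
    ```

    With this in place, the pattern_replace char filter should no longer be needed, since the token filter produces the concatenated token itself. Running _analyze against the index with "analyzer": "alphanumericStringAnalyzer" is a quick way to check that both tokens come back.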