elasticsearch tokenize analyzer elasticsearch-6

Elasticsearch custom analyzer with two output tokens

I need to create a custom analyzer that generates two tokens for a single input, as in the scenario below.

E.g.

Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in

I am able to remove the non-alphanumeric characters, but how can I also retain the original string in the output token list? Below is the custom analyzer that I have created.

    "alphanumericStringAnalyzer": {
        "filter": [
            "lowercase",
            "minLength_filter"
        ],
        "char_filter": [
            "specialCharactersFilter"
        ],
        "type": "custom",
        "tokenizer": "keyword"
    }

    "char_filter": {
        "specialCharactersFilter": {
            "pattern": "[^A-Za-z0-9]",
            "type": "pattern_replace",
            "replacement": ""
        }
    },

This analyzer generates the single token "btechin" for the input "B.tech in", but I also want the original string, "B.tech in", in the token list.

Thanks!

Solution

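A char_filter rewrites the text before it reaches the tokenizer, so the original string is already gone by the time tokens are produced. Instead, you can use the word_delimiter token filter, as described in the Elasticsearch documentation: it can emit the concatenated form and keep the original token as well.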

Here is an example of a word_delimiter configuration:

POST _analyze
{
  "text": "B.tech in",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "generate_word_parts": false
    }
  ]
}

Result:

{
  "tokens": [
    {
      "token": "b.tech in",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "btechin",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
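
To apply this at index time rather than only through _analyze, the same filter could be registered in the index settings and referenced from the custom analyzer in place of the char_filter. The following is a minimal sketch, not a tested configuration: the index name my_index and the filter name catenate_preserve_filter are placeholders that do not come from the question, and minLength_filter is left out because its definition is not shown.

```
# my_index and catenate_preserve_filter are placeholder names (assumptions)
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "catenate_preserve_filter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      },
      "analyzer": {
        "alphanumericStringAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "catenate_preserve_filter"
          ]
        }
      }
    }
  }
}

# Check the analyzer against the sample input
GET my_index/_analyze
{
  "analyzer": "alphanumericStringAnalyzer",
  "text": "B.tech in"
}
```

If the input can contain digits, it is probably worth also setting generate_number_parts to false and confirming the output with _analyze, since number parts are emitted as separate tokens by default.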

I hope it will fulfill your requirements!