I need to create a custom analyzer that generates two tokens for each input: the lowercased original text, and the same text with all non-alphanumeric characters removed.
E.g.
Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in
I am able to remove the non-alphanumeric characters, but how do I also retain the original text in the output token list? Below is the custom analyzer that I have created.
"alphanumericStringAnalyzer": {
    "filter": [
        "lowercase",
        "minLength_filter"
    ],
    "char_filter": [
        "specialCharactersFilter"
    ],
    "type": "custom",
    "tokenizer": "keyword"
}
"char_filter": {
    "specialCharactersFilter": {
        "pattern": "[^A-Za-z0-9]",
        "type": "pattern_replace",
        "replacement": ""
    }
},
This analyzer generates a single token, "btechin", for the input "B.tech in", but I also want the original text, "B.tech in", in the token list.
Thanks!
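You can use the word delimiter token filter, as described in this documentation. With "preserve_original" it keeps the unsplit token, and with "catenate_all" it also emits the parts joined together without the delimiters.
Here is an example of a word delimiter configuration: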
POST _analyze
{
    "text": "B.tech in",
    "tokenizer": "keyword",
    "filter": [
        "lowercase",
        {
            "type": "word_delimiter",
            "catenate_all": true,
            "preserve_original": true,
            "generate_word_parts": false
        }
    ]
}
Results:
{
    "tokens": [
        {
            "token": "b.tech in",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        },
        {
            "token": "btechin",
            "start_offset": 0,
            "end_offset": 9,
            "type": "word",
            "position": 0
        }
    ]
}
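To use this in the analyzer from the question rather than only in an _analyze call, something like the following index settings body might work. This is only a sketch: the filter name "example_word_delimiter" is a made-up name for illustration, and I have left out "minLength_filter" because its definition is not shown in the question. The "specialCharactersFilter" char filter has to be dropped from the analyzer, since char filters run before the tokenizer and would strip the dot and the space before the original text could be preserved.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "example_word_delimiter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      },
      "analyzer": {
        "alphanumericStringAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "example_word_delimiter"
          ]
        }
      }
    }
  }
}
```

One thing to check against your own data: "generate_number_parts" defaults to true, so inputs containing digits may produce extra number tokens unless you set it to false as well.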
I hope this fulfills your requirements!