How do I define a tokenizer in OpenSearch that keeps the specified delimiters as tokens?
Input: lorem123+ipsum-dolar with delimiters + and -
Output tokens: lorem123, +, ipsum, -, dolar
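I finally accomplished this with a regex pattern that uses lookahead and lookbehind, as described here, for example: https://www.baeldung.com/java-split-string-keep-delimiters#3-positive-lookahead-or-lookbehind
The pattern matches only the zero-width positions directly before and after each delimiter, so the tokenizer splits there without consuming the delimiter itself.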
The tokenizer for my question looks as follows:
"tokenizer": {
  "type": "pattern",
  "pattern": "((?<=[+-])|(?=[+-]))"
}
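For context, here is a minimal sketch of how that tokenizer might be registered in the index settings, for example as the body of a create-index request. The names `keep_delimiters_tokenizer` and `keep_delimiters_analyzer` are hypothetical placeholders chosen for illustration, not anything OpenSearch requires.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "keep_delimiters_tokenizer": {
          "type": "pattern",
          "pattern": "((?<=[+-])|(?=[+-]))"
        }
      },
      "analyzer": {
        "keep_delimiters_analyzer": {
          "type": "custom",
          "tokenizer": "keep_delimiters_tokenizer"
        }
      }
    }
  }
}
```

With settings like these in place, sending the sample text lorem123+ipsum-dolar to the index's _analyze endpoint with "analyzer": "keep_delimiters_analyzer" should show whether the tokens come back as lorem123, +, ipsum, -, dolar.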