Keep delimiter as token when tokenizing in OpenSearch


How do I define a tokenizer in OpenSearch that keeps the specified delimiters as tokens?

Input: lorem123+ipsum-dolar with delimiters +, -

Output Tokens: lorem123, +, ipsum, -, dolar


Solution

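  • I solved this with a regex pattern that uses lookahead and lookbehind, as described here, for example: https://www.baeldung.com/java-split-string-keep-delimiters#3-positive-lookahead-or-lookbehind

    The pattern matches the zero-width positions directly after or directly before a delimiter, so the input is split at those positions and the delimiter itself is not consumed. The tokenizer for my question looks as follows: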

    "tokenizer": {
        "type": "pattern",
        "pattern": "((?<=[+-])|(?=[+-]))"
    }
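
To see why the lookaround pattern keeps the delimiters, here is a minimal sketch that applies the same idea outside OpenSearch. It uses Python's `re` module (3.7 or later) rather than the Java regex engine that the pattern tokenizer runs on, so treat it as an approximation of the tokenizer's behavior. The function name `split_keep_delimiters` is hypothetical, the outer capturing group is dropped because `re.split` would otherwise also return the captured empty strings, and filtering out empty parts is an assumption made here for inputs that start or end with a delimiter.

```python
import re

# Zero-width split points: right after a delimiter (lookbehind)
# or right before one (lookahead). Non-capturing on purpose, because
# re.split would otherwise also return the (empty) captured groups.
SPLIT_PATTERN = re.compile(r"(?<=[+-])|(?=[+-])")


def split_keep_delimiters(text):
    """Split text at + and -, keeping each delimiter as its own token."""
    # Drop empty strings, which appear when the input starts with a delimiter.
    return [part for part in SPLIT_PATTERN.split(text) if part]


print(split_keep_delimiters("lorem123+ipsum-dolar"))
# ['lorem123', '+', 'ipsum', '-', 'dolar']
```

Because the matches are zero-width, nothing is removed from the input; the delimiters end up as separate tokens simply because there is a split point on both sides of each one.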