Tags: elasticsearch, lucene, tokenize, elasticsearch-analyzers

How to tokenize a Roman numeral term in ElasticSearch?


When creating a tokenizer and registering token chars as below, the Roman numeral 'Ⅹ' cannot be registered (tested on ES 6.7 and ES 5.6).

      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 14,
          "token_chars": [
            "Ⅹ"
          ]
        }
      }

The error log looks like this:

{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[node02][192.168.115.x:9300][indices:admin/create]"}],"type":"illegal_argument_exception","reason":"Unknown token type: 'ⅹ', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]"},"status":400}

How can I tokenize Roman numerals into terms?


Solution

  • The error message clearly says that your Roman 'Ⅹ' isn't a valid token type. It also lists the valid options for token_chars, as shown below:

    must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]

    The issue is in your syntax. If you refer to the official ES documentation for the edge_ngram tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html), the description of token_chars explains what the setting means:

    Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

    Below that, it again lists the valid values, such as digit and letter, and the same page has examples that use token_chars with valid values.

    Your issue would be resolved if you replace 'Ⅹ' with a valid character class such as letter in your analyzer setting. Note that plain Latin letters like X fall under letter, while the precomposed Unicode Roman numeral characters such as 'Ⅹ' (U+2169) belong to the Letter Number category, so letter_number is the class that keeps those; both appear in the list of valid values in your error message. A sketch of the corrected settings follows below.
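
    As a rough sketch (the index name my_index, the analyzer name autocomplete_analyzer, and the sample text are made up for illustration), the corrected settings could look like this. Both letter and letter_number are included so that either plain Latin letters or the Unicode Roman numeral characters are kept in tokens:

        PUT my_index
        {
          "settings": {
            "analysis": {
              "tokenizer": {
                "autocomplete": {
                  "type": "edge_ngram",
                  "min_gram": 1,
                  "max_gram": 14,
                  "token_chars": [
                    "letter",
                    "letter_number"
                  ]
                }
              },
              "analyzer": {
                "autocomplete_analyzer": {
                  "type": "custom",
                  "tokenizer": "autocomplete"
                }
              }
            }
          }
        }

        POST my_index/_analyze
        {
          "analyzer": "autocomplete_analyzer",
          "text": "Ⅹ-Wing"
        }

    With these settings the _analyze call should split on the hyphen (it belongs to neither class) and emit edge n-grams along the lines of Ⅹ, W, Wi, Win, Wing.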