elasticsearch

The `max_ngram` setting isn't applied


I want to use the ngram tokenizer in Elasticsearch. (I'm using elasticsearch:7.9.1 and have no possibility of changing it.)

Here are my index settings:

{
  "test" : {
    "settings" : {
      "index" : {
        "max_ngram_diff" : "40",
        "number_of_shards" : "1",
        "provided_name" : "test",
        "creation_date" : "1707595684021",
        "analysis" : {
          "analyzer" : {
            "ngram_analyzer" : {
              "filter" : [
                "lowercase",
                "trim"
              ],
              "type" : "custom",
              "tokenizer" : "ngram_tokenizer"
            },
            "search_analyzer" : {
              "filter" : [
                "lowercase",
                "trim"
              ],
              "type" : "custom",
              "tokenizer" : "keyword"
            }
          },
          "tokenizer" : {
            "ngram_tokenizer" : {
              "type" : "ngram",
              "max_ngram" : "40",
              "min_ngram" : "2"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "pXe_RM-0RvSDIc6gg8O5Gg",
        "version" : {
          "created" : "7090199"
        }
      }
    }
  }
}

But when I try to check the ngram analyzer:

curl --location --request GET 'http://elasticsearch_test:9200/test/_analyze?pretty=true' \
--header 'Content-Type: application/json' \
--data '{
  "analyzer": "ngram_analyzer",
  "text": "cod_1"
}'

I can't get any ngram longer than 2 characters:

{
  "tokens" : [
    {
      "token" : "c",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "co",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "o",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "od",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "d",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d_",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "_",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_1",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "1",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 8
    }
  ]
}

I'm expecting that the tokenizer:

"tokenizer" : {
            "ngram_tokenizer" : {
              "type" : "ngram",
              "max_ngram" : "40",
              "min_ngram" : "2"
            }
          }

will give me ngrams up to 40 characters long.


Solution

  • It's simply because the parameters are called `min_gram` and `max_gram`, not `min_ngram` and `max_ngram`:

          "tokenizer" : {
            "ngram_tokenizer" : {
              "type" : "ngram",
              "max_gram" : "40",
                   ^
              "min_gram" : "2"
            }      ^
          }