I want to use the ngram tokenizer in Elasticsearch.
(I'm using elasticsearch:7.9.1 and have no possibility of changing it.)
Here are my index settings:
{
  "test" : {
    "settings" : {
      "index" : {
        "max_ngram_diff" : "40",
        "number_of_shards" : "1",
        "provided_name" : "test",
        "creation_date" : "1707595684021",
        "analysis" : {
          "analyzer" : {
            "ngram_analyzer" : {
              "filter" : [
                "lowercase",
                "trim"
              ],
              "type" : "custom",
              "tokenizer" : "ngram_tokenizer"
            },
            "search_analyzer" : {
              "filter" : [
                "lowercase",
                "trim"
              ],
              "type" : "custom",
              "tokenizer" : "keyword"
            }
          },
          "tokenizer" : {
            "ngram_tokenizer" : {
              "type" : "ngram",
              "max_ngram" : "40",
              "min_ngram" : "2"
            }
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "pXe_RM-0RvSDIc6gg8O5Gg",
        "version" : {
          "created" : "7090199"
        }
      }
    }
  }
}
But when I try to check the ngram analyzer:
curl --location --request GET 'http://elasticsearch_test:9200/test/_analyze?pretty=true' \
--header 'Content-Type: application/json' \
--data '{
"analyzer": "ngram_analyzer",
"text": "cod_1"
}'
I can't get any ngram longer than 2 chars:
{
  "tokens" : [
    {
      "token" : "c",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "co",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "o",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "od",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "d",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d_",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "_",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "_1",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "1",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 8
    }
  ]
}
I'm expecting that the tokenizer:
"tokenizer" : {
"ngram_tokenizer" : {
"type" : "ngram",
"max_ngram" : "40",
"min_ngram" : "2"
}
}
gives me ngrams up to 40 chars.
It's simply because the parameters are called min_gram and max_gram, not min_ngram and max_ngram:
"tokenizer" : {
"ngram_tokenizer" : {
"type" : "ngram",
"max_gram" : "40",
^
"min_gram" : "2"
} ^
}
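If you can't drop and recreate the index, one way to apply this on 7.x is to briefly close the index, update the analysis settings with the corrected parameter names, and reopen it. This is just a sketch, reusing the test index and the elasticsearch_test:9200 host from your question; closing the index makes it temporarily unavailable:

# Analyzers/tokenizers can only be changed on a closed index
curl --location --request POST 'http://elasticsearch_test:9200/test/_close'

# Redefine the tokenizer with the correct parameter names
curl --location --request PUT 'http://elasticsearch_test:9200/test/_settings' \
--header 'Content-Type: application/json' \
--data '{
  "analysis": {
    "tokenizer": {
      "ngram_tokenizer": {
        "type": "ngram",
        "min_gram": "2",
        "max_gram": "40"
      }
    }
  }
}'

# Reopen the index and re-check the analyzer
curl --location --request POST 'http://elasticsearch_test:9200/test/_open'

curl --location --request GET 'http://elasticsearch_test:9200/test/_analyze?pretty=true' \
--header 'Content-Type: application/json' \
--data '{
  "analyzer": "ngram_analyzer",
  "text": "cod_1"
}'

The _analyze call should now return ngrams up to 40 chars (e.g. "cod", "cod_", "cod_1"). Note that documents indexed before the change were tokenized with the old settings, so you'd still need to reindex them (or run _update_by_query) to get the longer ngrams into the index.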