
How to declare more than one tokenizer in settings for elasticsearch


I want to create a search index with a property for which I want results in the following order:

  1. first, all the results that start with the search term
  2. then, all the results that contain the search term

For this, I want to use the edge_ngram tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html

However, I already have a kuromoji_tokenizer tokenizer in the settings for my index.

So how can I add another tokenizer to the settings (and later use it in an analyzer) so that I can fulfill the above scenario?

For example, in the JSON below, can I add another child to tokenizer, or does tokenizer need to be an array?

"settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  }
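
For context, I plan to attach these analyzers to the property roughly like this (the field name my_field is just a placeholder, not part of my real mapping), using autocomplete at index time and autocomplete_search at search time:

"mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }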

Solution

  • I believe you can, yes. Just add it next to the first one; don't create an array, just give it another name (in my example I called it "my_other_tokenizer"):

    "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete": {
              "tokenizer": "autocomplete",
              "filter": [
                "lowercase"
              ]
            },
            "autocomplete_search": {
              "tokenizer": "lowercase"
            }
          },
          "tokenizer": {
            "autocomplete": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10,
              "token_chars": [
                "letter"
              ]
            },
            "my_other_tokenizer": {
              "type": "kuromoji_tokenizer",
              "mode": "extended",
              "discard_punctuation": "false",
              "user_dictionary": "userdict_ja.txt"
            }
          }
        }
      }
    

    And then just reference it in your analyzer settings, just as you did for the first tokenizer (see the sketch below).
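
    For example (the analyzer name "japanese_text" here is only an illustrative assumption, not something from the question), the analyzer section could get an extra entry that points at the new tokenizer:

        "analyzer": {
          "autocomplete": {
            "tokenizer": "autocomplete",
            "filter": [
              "lowercase"
            ]
          },
          "autocomplete_search": {
            "tokenizer": "lowercase"
          },
          "japanese_text": {
            "tokenizer": "my_other_tokenizer"
          }
        }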