Tags: elasticsearch, lucene, tokenize, stringtokenizer, analyzer

elasticsearch custom tokenizer - split token by length


I am using Elasticsearch version 1.2.1. I have a use case in which I would like to create a custom tokenizer that breaks tokens into fixed-length chunks. For example, with a chunk length of 4, the token "abcdefghij" would be split into "abcd", "efgh", and "ij".

I am wondering if I can implement this logic without having to write a custom Lucene Tokenizer class.

Thanks in advance.


Solution

  • For your requirement, first check whether the built-in pattern tokenizer can do it (see the sketch below); if it can't, you'll need to code up a custom Lucene Tokenizer class yourself and package it as a custom Elasticsearch plugin. You can refer to this for examples of how Elasticsearch plugins for custom analyzers are created.
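
Here is a minimal sketch of the pattern tokenizer approach, assuming its group parameter (group 0 tells the tokenizer to emit the regex matches themselves rather than split on them). The index name my_index and the names chunk_tokenizer and chunk_analyzer are placeholders:

    curl -XPUT 'localhost:9200/my_index' -d '{
      "settings": {
        "analysis": {
          "tokenizer": {
            "chunk_tokenizer": {
              "type": "pattern",
              "pattern": ".{1,4}",
              "group": 0
            }
          },
          "analyzer": {
            "chunk_analyzer": {
              "type": "custom",
              "tokenizer": "chunk_tokenizer"
            }
          }
        }
      }
    }'

The greedy pattern .{1,4} matches four characters at a time, plus whatever remainder is left at the end. You can verify the result with the _analyze API:

    curl -XGET 'localhost:9200/my_index/_analyze?analyzer=chunk_analyzer' -d 'abcdefghij'

which should return the tokens abcd, efgh, and ij. One caveat: the pattern tokenizer runs over the whole field value, so whitespace ends up inside the chunks. If you need each whitespace-separated word chunked individually, that is exactly the case where a custom Lucene Tokenizer becomes necessary.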