Tags: elasticsearch, filter, tokenize

How to tokenize a sentence based on maximum number of words in Elasticsearch?


I have a string like "This is a beautiful day". What tokenizer, or what combination of tokenizer and token filter, should I use to produce terms that contain at most 2 words? Ideally, the output should be: "This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day". So far I have tried all the built-in tokenizers; the `pattern` tokenizer seems like the one I could use, but I don't know how to write a regex pattern for my case. Any help?


Solution

  • It seems you're looking for the shingle token filter; it does exactly what you want.
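
As a sketch of how this could look (the index and filter names here are placeholders): a shingle filter with `min_shingle_size: 2`, `max_shingle_size: 2`, and `output_unigrams: true` emits every single word plus every two-word shingle, which matches the desired output.

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "two_word_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["two_word_shingles"]
        }
      }
    }
  }
}
```

You can verify the token stream with the `_analyze` API:

```json
POST my-index/_analyze
{
  "analyzer": "shingle_analyzer",
  "text": "This is a beautiful day"
}
```

This should produce the tokens `This`, `This is`, `is`, `is a`, `a`, `a beautiful`, `beautiful`, `beautiful day`, `day`.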