
Merge token filter in Elasticsearch


I'm trying to index some tags after stemming them and applying other filters. These tags could be composed of multiple words.

What I haven't managed to do, though, is apply a final token filter that emits a single token from the whole token stream.

So I would like tags made up of multiple words to be stemmed, stopwords removed, but then be joined again in the same token before being saved in the index (sort of what the keyword tokenizer does, but as a filter).

I can't find a way to do this with how token filters are applied in Elasticsearch: if I tokenize on whitespace and then stem, all of the subsequent token filters receive these stemmed tokens one by one, and not the entire token stream, right?

For example I would like the tag

the fox jumps over the fence

to be saved in the index as a whole token as

fox jump over fence

and not

fox,jump,over,fence

Is there any way of doing this without preprocessing the string in my application and then indexing it as a not_analyzed field?
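To make the desired behavior concrete, here is a toy sketch of the analysis chain in Python. The stemmer and stopword list are deliberately naive stand-ins for illustration only; this is not how Elasticsearch implements its filters:

```python
# Toy illustration of the desired analysis chain:
# tokenize -> stem -> remove stopwords -> join back into one token.

STOPWORDS = {"the", "a", "an"}

def toy_stem(word: str) -> str:
    # Naive suffix stripping, purely for illustration (not a real stemmer).
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def analyze_tag(tag: str) -> str:
    tokens = tag.lower().split()                        # whitespace tokenizer
    tokens = [toy_stem(t) for t in tokens]              # stemming filter
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword filter
    return " ".join(tokens)                             # the missing "merge" filter

print(analyze_tag("the fox jumps over the fence"))  # → fox jump over fence
```

The last step is exactly the part that has no built-in token filter: everything up to the `join` is standard analysis, but the final concatenation back into a single token is what the question is asking for.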


Solution

  • After a bit of research I found this thread:

    http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html

    which had the exact solution I was looking for. 

    I created a simple Elasticsearch plugin that only provides the Concatenate Token Filter, which you can find at:

    https://github.com/francesconero/elasticsearch-concatenate-token-filter
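A custom analyzer using the plugin might look something like the following. This is a hypothetical sketch: the filter type name (`concatenate`) and the `token_separator` parameter are assumptions based on the plugin's README, so check the plugin documentation for the exact names supported by your version (note also that a real stemmer such as `porter_stem` may produce stems like `fenc` rather than `fence`):

```shell
# Hypothetical index settings; filter type "concatenate" and the
# "token_separator" option are assumed from the plugin README.
curl -X PUT "localhost:9200/tags" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "concat": {
          "type": "concatenate",
          "token_separator": " "
        }
      },
      "analyzer": {
        "tag_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem", "concat"]
        }
      }
    }
  }
}'
```

With a mapping that uses `tag_analyzer`, a multi-word tag would then be stemmed and stopword-filtered but still indexed as a single token.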