I'm trying to index some tags after stemming them and applying other filters. These tags could be composed of multiple words.
What I can't manage, though, is to apply a final token filter that joins the token stream back into a single token.
So I would like tags made up of multiple words to be stemmed, stopwords removed, but then be joined again in the same token before being saved in the index (sort of what the keyword tokenizer does, but as a filter).
I can't find a way to do this with how token filters are applied in Elasticsearch: if I tokenize on whitespace and then stem, every subsequent token filter receives the stemmed tokens one by one, not the entire token stream, right?
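To make the setup concrete, the analysis chain I have in mind is roughly the following (the analyzer name is just a placeholder; the tokenizer and filters are standard Elasticsearch ones):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tag_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  }
}
```

With filters along these lines, the tag comes out as a stream of separate stemmed tokens, which is exactly the part I want to change at the end of the chain.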
For example I would like the tag
the fox jumps over the fence
to be saved in the index as a whole token as
fox jump over fence
and not
fox,jump,over,fence
Is there any way of doing this without preprocessing the string in my application and then indexing it as a not_analyzed field?
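For reference, the application-side preprocessing I'd rather avoid would look something like this. The toy suffix-stripping stemmer and the tiny stopword set are only stand-ins, not what Elasticsearch's real filters do:

```python
# Sketch of the preprocessing I'd have to do in the application:
# tokenize, drop stopwords, stem, then re-join into one string
# that would be indexed as a single not_analyzed value.

STOPWORDS = {"the", "a", "an"}

def stem(token):
    # crude placeholder stemmer: strip a plural/3rd-person "s"
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess_tag(tag):
    tokens = [stem(t) for t in tag.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess_tag("the fox jumps over the fence"))
# -> fox jump over fence
```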
After a bit of research I found this thread:
http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html
which had the exact solution I was looking for.
I created a simple Elasticsearch plugin that only provides the Concatenate Token Filter, which you can find at:
https://github.com/francesconero/elasticsearch-concatenate-token-filter
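If I recall the plugin's README correctly, you register the filter in the index settings along these lines (check the repo for the exact filter type and parameter names, as I'm writing this from memory):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "concatenate_filter": {
          "type": "concatenate",
          "token_separator": " "
        }
      },
      "analyzer": {
        "tag_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "stop", "porter_stem", "concatenate_filter"]
        }
      }
    }
  }
}
```

Placed last in the filter chain, it consumes all the stemmed, stopword-free tokens and emits them as one joined token, which is exactly the behavior I was after.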