Hibernate Search + Elasticsearch - remove consecutive duplicate characters


I'm using Hibernate Search with Elasticsearch and I need to generate search tokens with consecutive duplicate characters removed. I checked the Elasticsearch documentation but couldn't find anything that does this. I found material on custom analyzers, but as far as I can tell they are always assembled from predefined tokenizers and filters, and none of those seems to do what I need.

Do you have any idea how to achieve this?

The only thing that comes to mind is to add a second database column holding a copy of the original value with the consecutive duplicates removed, and then search both fields.

Example:

  • Person name: Zimmermann
  • Search term: Zimerman

This search term should find the person.

PS: Fuzzy search isn't an option, because in my case it would do more harm than good and return results I don't want.

Thanks for any advice.


Solution

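  • I think the pattern-replace token filter (type "pattern_replace") would work. Just set the "pattern" parameter to "(.)\\1+" ("any character followed by the same character at least once") and the "replacement" parameter to "$1" ("that character, but only once").

    Be careful when copy/pasting these into Java code: the backslashes matter. As written above, the pattern is already escaped for a Java (or JSON) string literal; the regular expression itself is (.)\1+, with a single backslash.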

    Note I'm not sure about the performance of this regexp. Usually I would rather use an ngram filter, but since you don't want fuzzy search...

    Also note that you will still get false positives: searching for "Zimmermann", without any spelling error, may return a person named "Zimermann" higher in the result list than the actual "Zimmermann".

    A common way to solve this problem, or at least mitigate it, is to take advantage of scoring: sort the results by score (relevance), and craft the query so that exact matches get a better score.

    For example, you could add two fields for the person name: "name_exact", with an analyzer that does not apply the pattern-replace filter, and "name_fuzzy", with an analyzer that does. Then in Hibernate Search, build a boolean predicate with two "should" clauses, one on each field. Exact matches satisfy both clauses, so they will naturally get a higher score and rise to the top of the result list.
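
To see what the regexp does to a token before wiring it into an analyzer, here is a minimal sketch in plain Java using java.util.regex. It only approximates the effect of the filter on a single token; the class and method names (ConsecutiveDuplicates, collapse) are made up for illustration and are not part of Hibernate Search or Elasticsearch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: applies the same
// pattern/replacement pair that the answer suggests for the token filter.
class ConsecutiveDuplicates {

    // The regex is (.)\1+ ; the backslash is doubled because this is a Java string literal.
    private static final Pattern REPEATED = Pattern.compile("(.)\\1+");

    // Collapses every run of the same character into a single occurrence.
    static String collapse(String token) {
        return REPEATED.matcher(token).replaceAll("$1");
    }

    public static void main(String[] args) {
        // The indexed name, the search term from the question, and a near-miss spelling.
        for (String name : new String[] {"Zimmermann", "Zimerman", "Zimermann"}) {
            System.out.println(name + " -> " + collapse(name));
        }
    }
}
```

All three inputs collapse to "Zimerman", which is why the search term from the question matches, and also why "Zimermann" and "Zimmermann" become indistinguishable on the field that uses this filter. The match is case-sensitive, so in an analyzer it would presumably sit after a lowercase filter.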