elasticsearch search-suggestion misspelling

Umlaut in Elastic Suggesters


I am currently trying to set up a suggester similar to Google's misspelling correction. I am using the Elasticsearch suggesters with the following query:

{
   "query": {
      "match": {
         "name": "iphone hüle"
      }
   },
   "suggest": {
      "suggest_name": {
         "text": "iphone hüle",
         "term": {
            "field": "name"
         }
      }
   }
}

It returns the following suggestions:

"suggest": {
      "suggest_name": [
         {
            "text": "iphone",
            "offset": 0,
            "length": 6,
            "options": []
         },
         {
            "text": "hule",
            "offset": 7,
            "length": 4,
            "options": [
               {
                  "text": "hulle",
                  "score": 0.75,
                  "freq": 162
               },
               ...
               {
                  "text": "hulk",
                  "score": 0.75,
                  "freq": 38
               }
            ]
         }
      ]
   }

The problem is the text returned in the suggest entries and in their options. The text I submitted should come back as "hüle", not "hule", and the suggested option should be "hülle", not "hulle". Since I use the same field for the query and the suggester, I wonder why the umlauts are missing only in the suggester output and not in the regular query results.

See a query result here:

            "_source": {
               ...
               "name": "Ladegerät für iPhone",
               "manufacturer": "Apple",
            }

Solution

  • The data you get back in your query result, i.e.

    "name": "Ladegerät für iPhone"
    

    is the stored content of the field. It is exactly your source data. Search, and evidently the suggester too, works on the inverted index instead, which contains the tokens produced by the analyzer. You are most likely using an analyzer that folds umlauts, so the suggester only ever sees "hule" and "hulle".

    Oddly enough, I discussed this with a colleague just yesterday. We concluded that we may need a separate field, indexed but not stored, into which we index the non-normalized tokens and from which we fetch suggestion terms. As a bonus, it would let us run exact searches on that field, i.e. searches that distinguish between Müller and Mueller, Foto and Photo, Rene and René.
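
    The separate-field idea could be sketched roughly as follows. This is only an illustration under several assumptions: a recent Elasticsearch version with typeless mappings, a hypothetical index called products, a hypothetical analyzer called folding_analyzer standing in for whatever analyzer the name field really uses, and a hypothetical sub-field name.raw. The built-in standard analyzer lowercases but does not fold umlauts, so the sub-field keeps tokens such as "hülle" intact.

    ```
    # Assumption: "products", "folding_analyzer" and "name.raw" are made-up names.
    # The main field keeps a folding analyzer (ü becomes u); the sub-field does not.
    PUT /products
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "folding_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "asciifolding"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text",
            "analyzer": "folding_analyzer",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "standard"
              }
            }
          }
        }
      }
    }

    # Search on the folded field as before, but take suggestions from the unfolded sub-field.
    POST /products/_search
    {
      "query": {
        "match": {
          "name": "iphone hüle"
        }
      },
      "suggest": {
        "suggest_name": {
          "text": "iphone hüle",
          "term": {
            "field": "name.raw"
          }
        }
      }
    }
    ```

    Existing documents would have to be reindexed before the sub-field contains any terms. To check which tokens a field actually produces, the _analyze API can be called with the field name and a sample text such as "iphone hüle".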