We're using Elasticsearch 8.4.0, if that's relevant.
We've been using the following normalizer on our index.
{
  "analysis": {
    "filter": {
      "preserved_ascii_folding": {
        "preserve_original": true,
        "type": "asciifolding"
      }
    },
    "normalizer": {
      "preserved_ascii_keyword_normalizer": {
        "filter": [
          "lowercase",
          "trim",
          "preserved_ascii_folding"
        ],
        "type": "custom"
      }
    }
  }
}
We specifically only use it on our keyword fields, like this:
"keywords_in_fr_fr": {
  "normalizer": "preserved_ascii_keyword_normalizer",
  "type": "keyword"
},
We're trying to solve an issue where things like "bear bear bear" rank more highly than "bear" for a search on the keyword "bear". To this end, we added the following analyzer:
"analysis": {
  "analyzer": {
    "unique_lowercase": {
      "filter": [
        "lowercase",
        "unique"
      ],
      "tokenizer": "whitespace",
      "type": "custom"
    }
  }
},
Similarly, we applied it only to text fields, like this:
"name_in_fr_fr": {
  "analyzer": "unique_lowercase",
  "type": "text"
},
HOWEVER, this resulted in a bunch of records no longer being indexed, failing with errors like the one below, which objects to keywords containing non-ASCII characters:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [keywords_in_fr_fr] of type [keyword] in document with id '1421'. Preview of field's value: 'mot-clé'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [keywords_in_fr_fr] of type [keyword] in document with id '1421'. Preview of field's value: 'mot-clé'",
    "caused_by": {
      "type": "illegal_state_exception",
      "reason": "The normalization token stream is expected to produce exactly 1 token, but got 2+ for analyzer analyzer name[preserved_ascii_keyword_normalizer], analyzer [org.elasticsearch.index.analysis.CustomAnalyzer@6d8a4c15], analysisMode [ALL] and input \"mot-clé\"\n"
    }
  },
  "status": 400
}
The only useful thing I found while googling suggested that we should drop preserve_original: true, but our search consultant wrote the original code, and I presume he knew what he was doing. (Our contract ended over a year ago, so I can't just reach out to him about this.)
Please let me know if you need any more information/code.
There are a few issues here. First of all, your keyword normalizer has nothing to do with the bear situation, and the error you are getting is not related to it either. As you correctly pointed out, the error is caused by the preserve_original flag. While it can be useful in an analyzer, you cannot use it in a normalizer: a normalizer must produce exactly one token, and preserve_original does exactly the opposite. It emits the original token followed by a second token with the diacritical marks stripped. The preserve_original flag has a few specific use cases, but I cannot tell you whether you need it or not without looking at your queries.
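You can see the two tokens for yourself with the _analyze API, using a transient analyzer built from the same filter (a quick illustration; no index is required):

```
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "asciifolding",
      "preserve_original": true
    }
  ],
  "text": "mot-clé"
}
```

This returns two tokens at the same position, "mot-clé" and "mot-cle", which is exactly what the normalizer is complaining about ("expected to produce exactly 1 token, but got 2+").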
The way you deal with the bear problem is a bit harsh, but efficient. It has a few issues, though. It will fail (match any number of bears) if your users search for "bear bear bear". It also will not deal with "bear? bear. bear!" correctly, since the whitespace tokenizer produces three different "bears", none of which will match a search for "bear". Again, I am not sure what your searches look like, but I would at least replace the whitespace tokenizer with the standard tokenizer.
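The punctuation problem is also easy to demonstrate with _analyze; compare the two tokenizers side by side (illustrative snippets, no index required):

```
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "unique"],
  "text": "bear? bear. bear!"
}

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "unique"],
  "text": "bear? bear. bear!"
}
```

The whitespace version keeps the punctuation attached and emits three distinct tokens ("bear?", "bear.", "bear!"), so the unique filter has nothing to deduplicate and none of them matches "bear". The standard version strips the punctuation, and the unique filter then collapses the three identical "bear" tokens into one.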
Alternatively, you can build your own similarity that simply ignores term frequency altogether. It will be a bit slower, but it will work with phrases that contain repeated words:
DELETE test

PUT test
{
  "settings": {
    "similarity": {
      "scripted_idf": {
        "type": "scripted",
        "weight_script": {
          "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
        },
        "script": {
          "source": "double norm = 1/Math.sqrt(doc.length); return weight * norm;"
        }
      }
    },
    "analysis": {
      "filter": {
        "ascii_folding": {
          "type": "asciifolding"
        }
      },
      "normalizer": {
        "preserved_ascii_keyword_normalizer": {
          "filter": [
            "lowercase",
            "trim",
            "ascii_folding"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords_in_fr_fr": {
        "normalizer": "preserved_ascii_keyword_normalizer",
        "type": "keyword"
      },
      "name_in_fr_fr": {
        "analyzer": "standard",
        "similarity": "scripted_idf",
        "type": "text"
      }
    }
  }
}

POST test/_doc
{
  "keywords_in_fr_fr": "mot-clé",
  "name_in_fr_fr": "bear? bear. bear!"
}

POST test/_doc?refresh
{
  "keywords_in_fr_fr": "mot-clé",
  "name_in_fr_fr": "just a bear"
}

POST test/_search
{
  "query": {
    "match": {
      "name_in_fr_fr": "bear"
    }
  }
}

POST test/_search
{
  "query": {
    "match": {
      "name_in_fr_fr": "just bear"
    }
  }
}

POST test/_search
{
  "query": {
    "match_phrase": {
      "name_in_fr_fr": "bear bear bear"
    }
  }
}