Score difference after upgrading Elasticsearch

I am upgrading Elasticsearch from 5.6 to 8.9. I have a query that sorts by the weight field and then by _score. The scores assigned to the documents differ between the two versions, which changes the order of the results.

Can anyone help me find the cause and a fix?

Query -

POST /auto-complete/_search?typed_keys=true
{
    "size": 5,
    "query": {
        "bool": {
            "should": [
                {
                    "match_phrase_prefix": {
                        "suggestion": {
                            "query": "the"
                        }
                    }
                },
                {
                    "match": {
                        "suggestion.analyzed": {
                            "fuzziness": "AUTO",
                            "operator": "and",
                            "query": "the"
                        }
                    }
                }
            ]
        }
    },
    "sort": [
        {
            "weight": {
                "order": "desc"
            }
        },
        {
            "_score": {
                "order": "desc"
            }
        }
    ]
}

On Elasticsearch 5.6 the response was -

{
  "took": 129,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4858,
    "max_score": null,
    "hits": [
      {
        "_index": "auto-complete",
        "_type": "default",
        "_id": "The Thelma Hoop Earrings",
        "_score": 6.3522644,
        "_source": {
          "suggestion": "The Thelma Hoop Earrings",
          "weight": 1
        },
        "sort": [
          1,
          6.3522644
        ]
      },
      {
        "_index": "auto-complete",
        "_type": "default",
        "_id": "The Theresa Ring",
        "_score": 6.3522644,
        "_source": {
          "suggestion": "The Theresa Ring",
          "weight": 1
        },
        "sort": [
          1,
          6.3522644
        ]
      },
      {
        "_index": "auto-complete",
        "_type": "default",
        "_id": "The Theodora Ring",
        "_score": 6.337865,
        "_source": {
          "suggestion": "The Theodora Ring",
          "weight": 1
        },
        "sort": [
          1,
          6.337865
        ]
      },
      {
        "_index": "auto-complete",
        "_type": "default",
        "_id": "The Thea Ring",
        "_score": 6.337865,
        "_source": {
          "suggestion": "The Thea Ring",
          "weight": 1
        },
        "sort": [
          1,
          6.337865
        ]
      },
      {
        "_index": "auto-complete",
        "_type": "default",
        "_id": "The Theor Band For Him",
        "_score": 5.7033815,
        "_source": {
          "suggestion": "The Theor Band For Him",
          "weight": 1
        },
        "sort": [
          1,
          5.7033815
        ]
      }
    ]
  }
}

While on Elasticsearch 8.9 it was -

{
  "took": 28,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4874,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "auto-complete",
        "_id": "The Theodora Ring",
        "_score": 8.927014,
        "_source": {
          "suggestion": "The Theodora Ring",
          "weight": 1
        },
        "sort": [
          1,
          8.927014
        ]
      },
      {
        "_index": "auto-complete",
        "_id": "The Theresa Ring",
        "_score": 8.927014,
        "_source": {
          "suggestion": "The Theresa Ring",
          "weight": 1
        },
        "sort": [
          1,
          8.927014
        ]
      },
      {
        "_index": "auto-complete",
        "_id": "The Thea Ring",
        "_score": 8.927014,
        "_source": {
          "suggestion": "The Thea Ring",
          "weight": 1
        },
        "sort": [
          1,
          8.927014
        ]
      },
      {
        "_index": "auto-complete",
        "_id": "The Thelma Hoop Earrings",
        "_score": 7.9907713,
        "_source": {
          "suggestion": "The Thelma Hoop Earrings",
          "weight": 1
        },
        "sort": [
          1,
          7.9907713
        ]
      }
    ]
  }
}

Mapping file for Elasticsearch 5.6 is -

curl -X PUT "localhost:9201/auto-complete?pretty" -H 'Content-Type: application/json' -d'
{ 
"mappings" : 
{
  "default": {
    "properties": {
      "suggestion": {
        "type": "text",
        "fields": {
          "analyzed": {
            "type": "text",
            "analyzer": "nGram_analyzer",
            "search_analyzer": "whitespace"
          }
        }
      },
      "weight": {
        "type": "integer"
      }
    }
  }
}, 
"settings" : 
{
  "number_of_shards": 1,
  "number_of_replicas": 1,
  "index": {
    "analysis": {
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        }
      },
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  }
}
}'

Mapping for Elasticsearch 8.9 is -

curl -X PUT "localhost:9201/auto-complete?pretty" -H 'Content-Type: application/json' -d'
{ 
"mappings" : 
{
  "properties": {
    "suggestion": {
      "type": "text",
      "fields": {
        "analyzed": {
          "type": "text",
          "analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace"
        }
      }
    },
    "weight": {
      "type": "integer"
    }
  }
}, 
"settings" : 
{
  "number_of_shards": 1,
  "number_of_replicas": 1,
  "max_ngram_diff" : 18,
  "index": {
    "analysis": {
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        }
      },
      "filter": {
        "nGram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  }
}}'

Solution

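Here is the likely reason: the two clusters do not compute relevance scores in the same way. Older Elasticsearch releases used the TF/IDF similarity model by default, while current releases, including 8.x, use BM25. The default changed in Elasticsearch 5.0, so a 5.6 index scores with TF/IDF only if it was explicitly configured with the classic similarity. The two models calculate relevance scores differently, which can lead to a different ordering of results.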

The abstract of a 2016 Elastic conference talk, written before the switch, describes the change:

"Today the default scoring algorithm in Elasticsearch is TF/IDF. This default will change to BM25 once Elasticsearch switches to Lucene 6. In this talk, Britta will tell you all about BM25 – what it is, how it differs from TF/IDF and other scoring techniques, and why it might be the better default going forward."
https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25

You can continue to read for more information.

What is the difference between them?

BM25 is an extension of the TF/IDF model with some modifications to improve its performance. It keeps the concepts of Term Frequency (TF) and Inverse Document Frequency (IDF), but it changes how two factors are handled:

Term Frequency Saturation: In TF/IDF, the term frequency component keeps growing as the term appears more often. In BM25, its growth slows down once the term appears "enough" times and approaches an upper bound. This is known as term frequency saturation. The idea is that after a certain point, additional occurrences of a term do not make a document more relevant.
Field Length Normalization: Both models take the length of the field (or document) into account, but they do it differently. Lucene's TF/IDF scales the score by the field's own length only. BM25 compares the field's length with the average length of that field across the index, and exposes a parameter (b) that controls how strongly length affects the score, so shorter fields do not get too much weight.

These modifications generally make BM25 perform better than TF/IDF in ranking the relevance of documents for a given query.
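
The two effects described above can be sketched in a few lines of Python. This is only an illustration of the textbook formulas, not Lucene's actual code: the function names are made up for this example, the length penalty of TF/IDF is left out, and k1 = 1.2 and b = 0.75 are assumed because they are the Elasticsearch defaults for BM25. Lucene's real implementation differs in details, so do not expect these numbers to match your _score values.

```python
import math


def classic_tf(freq: float) -> float:
    """TF factor of Lucene's classic TF/IDF similarity: sqrt(freq), unbounded."""
    return math.sqrt(freq)


def bm25_tf(freq: float, field_len: float, avg_field_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """TF factor of textbook BM25; it saturates below k1 + 1."""
    # Length normalization: fields longer than average get a larger
    # denominator (lower score), shorter fields a smaller one.
    norm = k1 * (1.0 - b + b * field_len / avg_field_len)
    return freq * (k1 + 1.0) / (freq + norm)


if __name__ == "__main__":
    # Compare growth of both factors for a field of average length.
    for freq in (1, 2, 5, 20, 100):
        print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq, 10, 10), 3))
```

With a field of average length, the BM25 factor starts at 1.0 for a single occurrence and never exceeds k1 + 1 = 2.2, while the TF/IDF factor keeps growing with the square root of the frequency.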

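To test with your data, run the query with "explain": true in the request body, or use the explain API (documented for both Elasticsearch 5.6 and 8.0), and compare how each score is computed on the two clusters.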
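
As a rough sketch of that comparison, the snippet below rebuilds the search body from the question with "explain": true added and prints it as JSON. The function name build_explain_search is hypothetical, and the snippet does not contact a cluster. The printed body is meant to be sent as a POST to /auto-complete/_search on each cluster yourself.

```python
import json


def build_explain_search(text: str, size: int = 5) -> dict:
    """Return the question's search body with score explanations enabled."""
    return {
        "size": size,
        # Ask Elasticsearch to attach an _explanation to every hit.
        "explain": True,
        "query": {
            "bool": {
                "should": [
                    {"match_phrase_prefix": {"suggestion": {"query": text}}},
                    {
                        "match": {
                            "suggestion.analyzed": {
                                "fuzziness": "AUTO",
                                "operator": "and",
                                "query": text,
                            }
                        }
                    },
                ]
            }
        },
        "sort": [
            {"weight": {"order": "desc"}},
            {"_score": {"order": "desc"}},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_explain_search("the"), indent=2))
```

Each hit in the response then carries an _explanation field that breaks the score down per clause, so the 5.6 and 8.9 output for the same document (for example "The Theresa Ring") can be compared side by side.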