Elasticsearch ranking shorter/less relevant titles first

I'm working on a product search with Elasticsearch 7.3. The product titles are not formatted the same but there is nothing I can do about this.

Some titles might look like this:

Ford Hub Bearing

And others like this:

Hub bearing for a Chevrolet Z71 - model number 5528923-01

If someone searches for "Chevrolet Hub Bearing" the "Ford Hub Bearing" product ranks #1 and the Chevrolet part ranks #2. If I remove all the extra text (model number 5528923-01) from the product title, the Chevrolet part ranks #1 as desired.

Unfortunately I am unable to fix the product titles, so I need to be able to rank the Chevrolet part as #1 when someone searches Chevrolet Hub Bearing. I have simply set the type of name to text and applied the standard analyzer in my index. Here is my query code:

{
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "fields": ["name"],
                        "query": "Chevrolet Hub Bearing"
                    }
                }
            ]
        }
    }
}

Solution

Elasticsearch's default BM25 similarity factors the field length into the score: a match in a short field counts for more than the same match in a long field. That's why the longer document ends up in second position even though it matches more terms.
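
To see the effect in isolation, here is a rough sketch that scores a toy corpus by hand. It is not Elasticsearch's implementation. It uses the textbook Okapi BM25 formula, a regex tokenizer as a stand-in for the standard analyzer, and four made-up Chevrolet titles (hypothetical, not from the question) so that "chevrolet" is a common term, as it probably is in a real catalog. The function names are mine.

```python
import math
import re

# The two titles from the question plus four hypothetical ones (assumption:
# a real catalog contains many Chevrolet parts, so "chevrolet" is common).
TITLES = [
    "Ford Hub Bearing",
    "Hub bearing for a Chevrolet Z71 - model number 5528923-01",
    "Chevrolet brake pad",
    "Chevrolet oil filter",
    "Chevrolet spark plug",
    "Chevrolet air filter",
]


def tokenize(text):
    # Crude approximation of the standard analyzer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every doc against the query with textbook Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in tokenize(query):
            tf = tokens.count(term)
            if tf == 0:
                continue
            df = sum(1 for t in tokenized if term in t)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # b controls how strongly the field length scales the score.
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def rank(query, docs, **params):
    # Highest score first.
    scored = zip(docs, bm25_scores(query, docs, **params))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for b in (0.75, 0.0):
        top_title, top_score = rank("Chevrolet Hub Bearing", TITLES, b=b)[0]
        print(f"b={b}: top hit is {top_title!r} ({top_score:.3f})")
```

In this toy corpus the short Ford title comes out on top with the default b of 0.75, and the Chevrolet title comes out on top once b is 0. Real scores will differ, since Elasticsearch computes term statistics per shard and stores field lengths approximately.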

I recommend reading these two blog posts about BM25: how-shards-affect-relevance-scoring-in-elasticsearch and the-bm25-algorithm-and-its-variables.

You can tweak the BM25 parameters to avoid this behavior, though. The Elasticsearch documentation for the similarity module describes BM25 as follows:

TF/IDF based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi_BM25 for more details. This similarity has the following options:

k1 => Controls non-linear term frequency normalization (saturation). The default value is 1.2.

b => Controls to what degree document length normalizes tf values. The default value is 0.75.

discount_overlaps => Determines whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
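
For reference, this is where k1 and b sit in the textbook Okapi BM25 formula (the form given on the Okapi_BM25 page; Lucene's implementation may differ by constant factors). Here f(q_i, D) is the frequency of query term q_i in field D, |D| is the field length in tokens, and avgdl is the average field length in the index:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
```

With b = 0 the bracket reduces to 1, so |D| drops out of the score entirely. With b = 1 the term frequency is fully normalized by the relative field length.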

So you should configure a new similarity in your index settings like this (in Elasticsearch 7.x the properties go directly under mappings, without a type name):

PUT <index>
{
  "settings": {
    "index": {
      "number_of_shards": 1
    },
    "similarity": {
      "my_bm25_without_length_normalization": {
        "type": "BM25",
        "b": 0
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "my_bm25_without_length_normalization"
      }
    }
  }
}

With b set to 0, the scoring will stop penalizing longer names. Because the custom similarity is only assigned to the name field, length normalization is kept for the other fields.