
Elasticsearch ranking shorter/less relevant titles first


I'm working on a product search with Elasticsearch 7.3. The product titles are not formatted the same but there is nothing I can do about this.

Some titles might look like this:

Ford Hub Bearing

And others like this:

Hub bearing for a Chevrolet Z71 - model number 5528923-01

If someone searches for "Chevrolet Hub Bearing" the "Ford Hub Bearing" product ranks #1 and the Chevrolet part ranks #2. If I remove all the extra text (model number 5528923-01) from the product title, the Chevrolet part ranks #1 as desired.

Unfortunately I am unable to fix the product titles, so I need the Chevrolet part to rank #1 when someone searches for "Chevrolet Hub Bearing". The name field is simply mapped as text with the standard analyzer. Here is my query code:

{
    query: {
        bool: {
            must: [
                {
                    multi_match: {
                        fields: ['name'],
                        query: "Chevrolet Hub Bearing"
                    }
                }
            ]
        }
    }
}

Solution

  • Elasticsearch's default BM25 scoring takes field length into account: a match in a short field counts for more than the same match in a long one. That's why the longer document ends up in second position even though it matches more terms.

    I recommend reading these blog posts about BM25: how-shards-affect-relevance-scoring-in-elasticsearch and the-bm25-algorithm-and-its-variables.

    You can, however, tune the BM25 parameters to avoid this behavior. The Elasticsearch similarity documentation describes BM25 as follows:

    TF/IDF based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi_BM25 for more details. This similarity has the following options:

    k1 => Controls non-linear term frequency normalization (saturation). The default value is 1.2.

    b => Controls to what degree document length normalizes tf values. The default value is 0.75.

    discount_overlaps => Determines whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
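
To make the effect of b concrete, here is a minimal sketch of the textbook BM25 formula applied to the two titles from the question. The IDF values and the average field length are made-up assumptions, not statistics from a real index, and the token counts (3 and 10) assume the standard analyzer splits the titles on whitespace and punctuation. Lucene's actual implementation may differ by constant factors, which should not change the ordering.

```ruby
# Textbook BM25 contribution of one query term to one document's score.
# b controls how strongly the field length (doc_len / avg_len) dampens the term frequency.
def bm25_term(idf:, tf:, doc_len:, avg_len:, k1: 1.2, b: 0.75)
  length_norm = 1.0 - b + b * doc_len.to_f / avg_len
  idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
end

# Hypothetical statistics (assumptions for illustration only).
ASSUMED_IDF = { 'chevrolet' => 1.0, 'hub' => 1.0, 'bearing' => 1.0 }.freeze
ASSUMED_AVG_LEN = 4.0

# Query terms matched by each title, and the assumed token count of the title.
FORD  = { matched: %w[hub bearing], len: 3 }.freeze
CHEVY = { matched: %w[chevrolet hub bearing], len: 10 }.freeze

# Sum the contributions of every matched query term (each occurs once in the title).
def score(doc, b)
  doc[:matched].sum do |term|
    bm25_term(idf: ASSUMED_IDF.fetch(term), tf: 1,
              doc_len: doc[:len], avg_len: ASSUMED_AVG_LEN, b: b)
  end
end

[0.75, 0].each do |b|
  puts format('b=%.2f  ford=%.3f  chevrolet=%.3f', b, score(FORD, b), score(CHEVY, b))
end
```

Under these assumed numbers the short Ford title outscores the longer Chevrolet title at the default b of 0.75, and the order flips once b is 0, because every matched term then contributes the same amount regardless of field length.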

    So you should configure a custom similarity in your index settings, like this:

    PUT <index>
    {
      "settings": {
        "index": {
          "number_of_shards": 1
        },
        "similarity": {
          "my_bm25_without_length_normalization": {
            "type": "BM25",
            "b": 0
          }
        }
      },
      "mappings": {
        "properties": {
          "name": {
            "type": "text",
            "similarity": "my_bm25_without_length_normalization"
          }
        }
      }
    }
    

    Then it will stop penalizing longer names in the scoring. Length normalization is still applied to the other fields.
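
Since the question uses elasticsearch-ruby, the same index definition can be sketched as a Ruby hash. This is only an illustration: the index name products and the URL are placeholders, the mapping is written without a type name as Elasticsearch 7 expects, and the client call is left in comments because it needs the elasticsearch gem and a reachable cluster.

```ruby
require 'json'

# Name of the custom similarity, reused in the settings and in the field mapping.
SIMILARITY_NAME = 'my_bm25_without_length_normalization'

# Builds the create-index body: BM25 with b = 0 applied only to the name field.
def index_body(similarity_name = SIMILARITY_NAME)
  {
    settings: {
      index: { number_of_shards: 1 },
      similarity: {
        similarity_name => { type: 'BM25', b: 0 }
      }
    },
    mappings: {
      properties: {
        name: { type: 'text', similarity: similarity_name }
      }
    }
  }
end

# Print the JSON that would be sent to Elasticsearch.
puts JSON.pretty_generate(index_body)

# With the elasticsearch gem installed (placeholder URL and index name, not run here):
#   client = Elasticsearch::Client.new(url: 'http://localhost:9200')
#   client.indices.create(index: 'products', body: index_body)
```

Existing documents would presumably need to be indexed into the new index before the changed scoring shows up in search results.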