I'm working on a product search with Elasticsearch 7.3. The product titles are not formatted the same but there is nothing I can do about this.
Some titles might look like this:
Ford Hub Bearing
And others like this:
Hub bearing for a Chevrolet Z71 - model number 5528923-01
If someone searches for "Chevrolet Hub Bearing" the "Ford Hub Bearing" product ranks #1 and the Chevrolet part ranks #2. If I remove all the extra text (model number 5528923-01) from the product title, the Chevrolet part ranks #1 as desired.
Unfortunately I am unable to fix the product titles, so I need to be able to rank the Chevrolet part as #1 when someone searches for "Chevrolet Hub Bearing". I have simply set the type of name to text and applied the standard analyzer in my index. Here is my query code:
{
  query: {
    bool: {
      must: [
        {
          multi_match: {
            fields: ['name'],
            query: "Chevrolet Hub Bearing"
          }
        }
      ]
    }
  }
}
Elasticsearch scores matches with the BM25 algorithm, which takes the field length into account. That's why the longer document ends up in second position even though it matches more terms.
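For reference, here is the textbook form of Okapi BM25, to show where the field length enters. This is the generic formula, not a transcription of Lucene's code, which may differ by constant factors that do not change the ordering. Here f(t, D) is how often term t occurs in the field, |D| is the field length in tokens, and avgdl is the average field length in the index.

```latex
\[
\operatorname{score}(D, Q) \;=\; \sum_{t \in Q} \operatorname{IDF}(t)\,
\frac{f(t, D)\,(k_1 + 1)}
     {f(t, D) + k_1\left(1 - b + b\,\dfrac{|D|}{\mathrm{avgdl}}\right)}
\]
```

A field longer than average makes the denominator larger and so lowers each term's contribution. With b = 0 the length ratio drops out entirely.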
I recommend reading these blog posts about BM25: how-shards-affect-relevance-scoring-in-elasticsearch and the-bm25-algorithm-and-its-variables.
You can tune the BM25 parameters to avoid this behavior. Here is the BM25 documentation for Elasticsearch, and here is a post explaining how to do it. The documentation describes the BM25 similarity as follows:
TF/IDF based similarity that has built-in tf normalization and is supposed to work better for short fields (like names). See Okapi_BM25 for more details. This similarity has the following options:
k1 => Controls non-linear term frequency normalization (saturation). The default value is 1.2.
b => Controls to what degree document length normalizes tf values. The default value is 0.75.
discount_overlaps => Determines whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
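To see what b does to this particular search, here is a minimal Python sketch of BM25 scoring. It is an illustration under assumptions, not Elasticsearch's implementation: the catalog is hypothetical (the two titles from the question plus made-up Chevrolet products, so that "chevrolet" is a common term), the tokenizer is a rough stand-in for the standard analyzer, and the IDF uses the log(1 + ...) variant.

```python
import math
import re
from collections import Counter

# Hypothetical mini-catalog: the two titles from the question plus made-up
# Chevrolet products, so that "chevrolet" is a common (low-IDF) term.
TITLES = [
    "Ford Hub Bearing",
    "Hub bearing for a Chevrolet Z71 - model number 5528923-01",
    "Chevrolet brake pad",
    "Chevrolet oil filter",
    "Chevrolet spark plug",
    "Chevrolet air filter",
]


def tokenize(text):
    # Rough stand-in for the standard analyzer: lowercase, split on
    # anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, titles, k1=1.2, b=0.75):
    """Return (score, title) pairs sorted from best to worst match."""
    docs = [tokenize(t) for t in titles]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scored = []
    for title, doc in zip(titles, docs):
        freqs = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            f = freqs[term]
            if f == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # b controls how strongly the field length scales the penalty.
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scored.append((score, title))
    return sorted(scored, reverse=True)


for b in (0.75, 0.0):
    ranking = bm25_rank("Chevrolet Hub Bearing", TITLES, b=b)
    print(f"b={b}: top hit is {ranking[0][1]!r}")
```

On this toy catalog the Ford title comes first with the default b = 0.75, and the Chevrolet title comes first with b = 0. Real rankings depend on the term statistics of your own index.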
So you should configure a new similarity in your index settings and assign it to the name field, like this:
PUT <index>
{
  "settings": {
    "index": {
      "number_of_shards": 1
    },
    "similarity": {
      "my_bm25_without_length_normalization": {
        "type": "BM25",
        "b": 0
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "my_bm25_without_length_normalization"
      }
    }
  }
}
Scoring will then stop penalizing longer names. Length normalization is kept for the other fields.