elasticsearch, normalization, relevance

How to make Elasticsearch scoring take field length into account


I created a very simple test index consisting of the following 5 entries:

{ "tags": [ { "topics": "music festival dance techno germany" } ] }
{ "tags": [ { "topics": "music festival dance techno" } ] }
{ "tags": [ { "topics": "music festival dance" } ] }
{ "tags": [ { "topics": "music festival" } ] }
{ "tags": [ { "topics": "music" } ] }

Then I performed the following query:

{
  "query": { 
    "bool": { 
      "should": [
        { "match": { "tags.topics": "music festival"}}
      ]
    }
  }
}

I expected to obtain the following order in the results:

1) "music festival"

2) "music festival dance"

3) "music festival dance techno"

4) "music festival dance techno germany"

5) "music"

This is the order I would expect once field-length normalization is accounted for.

However, I got the following:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 5,
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival dance techno germany"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "3",
                "_score": 0.5753642,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival dance"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "4",
                "_score": 0.42221835,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "2",
                "_score": 0.32088596,
                "_source": {
                    "tags": [
                        {
                            "topics": "music festival dance techno"
                        }
                    ]
                }
            },
            {
                "_index": "testindex",
                "_type": "entry",
                "_id": "5",
                "_score": 0.2876821,
                "_source": {
                    "tags": [
                        {
                            "topics": "music"
                        }
                    ]
                }
            }
        ]
    }
}

The order seems completely random, except that the lowest-scoring entry is the one that matched only one word.

What could be causing this, and what could I change (during mapping, indexing or searching) to get the expected order?

Note: The same goes for queries that do not match perfectly. Searching for "music dance" should still return the 3-word entry as the first result, so using or boosting term queries seems out of the question.


Solution

  • As I described in this answer, scoring/relevance is not the easiest topic in Elasticsearch.

    I tried to work out a solution for you, and this is what I have so far.

    Documents:

    { "tags": [ { "topics": ["music", "festival", "dance", "techno", "germany"]} ], "topics_count": 5 }
    { "tags": [ { "topics": ["music", "festival", "dance", "techno"]} ], "topics_count": 4 }
    { "tags": [ { "topics": ["music", "festival", "dance"] } ], "topics_count": 3 }
    { "tags": [ { "topics": ["music", "festival"]} ], "topics_count": 2 }
    { "tags": [ { "topics": ["music"]} ], "topics_count": 1 }
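The documents above differ from the ones in the question in two ways: each topics string has become an array of single terms, and a topics_count field records how many terms there are. A minimal sketch of that conversion, assuming the original documents hold space-separated topics (the restructure helper is hypothetical and not part of any Elasticsearch client):

```python
def restructure(doc):
    """Split each space-separated topics string into a list and count the topics."""
    tags = [{"topics": tag["topics"].split()} for tag in doc["tags"]]
    # topics_count is the total number of topics across all entries in "tags"
    topics_count = sum(len(tag["topics"]) for tag in tags)
    return {"tags": tags, "topics_count": topics_count}


original = {"tags": [{"topics": "music festival dance"}]}
converted = restructure(original)
print(converted)
```

Computing topics_count at index time keeps it available to the scripts in the query, which read it through doc['topics_count'].value.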
    

    and the query:

    {
      "query": {
        "bool": {
          "should": [
            {
              "function_score": {
                "query": {
                  "terms_set": {
                    "tags.topics" : {
                      "terms" : ["music", "festival"],
                      "minimum_should_match_script": {
                        "source": "params.num_terms"
                      }
                    }
                  }
                },
                "script_score" : {
                  "script" : {
                    "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
                  }
                }
              }
            },
            {
              "function_score": {
                "query": {
                  "terms_set": {
                    "tags.topics" : {
                      "terms" : ["music", "festival"],
                      "minimum_should_match_script": {
                        "source": "doc['topics_count'].value"
                      }
                    }
                  }
                },
                "script_score" : {
                  "script" : {
                    "source": "_score * Math.sqrt(1.0 / doc['topics_count'].value)"
                  }
                }
              }
            }
          ]
        }
      }
    }
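To make the two should clauses easier to follow, here is a rough Python sketch of which documents each terms_set clause selects and the length factor that script_score applies. It only models the matching rule and the sqrt(1 / topics_count) multiplier. It does not reproduce the _score that Elasticsearch computes for the terms_set query itself, so it says nothing about the final ranking. All function names are hypothetical.

```python
import math

DOCS = [
    ["music", "festival", "dance", "techno", "germany"],
    ["music", "festival", "dance", "techno"],
    ["music", "festival", "dance"],
    ["music", "festival"],
    ["music"],
]


def matched(topics, terms):
    """Number of query terms that occur in the document's topics."""
    return len(set(topics) & set(terms))


def matches_all_query_terms(topics, terms):
    # First clause: the script returns params.num_terms,
    # so every query term has to be present in the document.
    return matched(topics, terms) >= len(terms)


def matches_all_doc_topics(topics, terms):
    # Second clause: the script returns doc['topics_count'].value,
    # so the query has to cover every topic of the document.
    return matched(topics, terms) >= len(topics)


def length_factor(topics):
    # Same expression as the script_score multiplier.
    return math.sqrt(1.0 / len(topics))


def explain(terms):
    """One row per document: (topics_count, clause 1, clause 2, factor)."""
    return [
        (
            len(topics),
            matches_all_query_terms(topics, terms),
            matches_all_doc_topics(topics, terms),
            round(length_factor(topics), 3),
        )
        for topics in DOCS
    ]


for row in explain(["music", "festival"]):
    print(row)
```

For ["music", "festival"], only the two-topic document satisfies both clauses. Since a bool query adds up the scores of its matching should clauses, that document collects two contributions, while longer documents receive one contribution scaled down by a smaller factor.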
    

    It is not perfect and still needs some improvement. It works well (tested on ES 6.2) for ["music", "festival"] and ["music", "dance"] on this example, but I suspect that on other data it will not behave exactly as you expect, mostly because of the complexity of relevance scoring. You can now read more about the features I used and try to improve it.
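Part of that scoring complexity can be seen in the question's own numbers. The response reports 5 shards for 5 documents, and by default each shard scores with its own term statistics. The sketch below is a simplified BM25 model, not Elasticsearch code. It assumes the default parameters k1 = 1.2 and b = 0.75, a term frequency of 1, and a hypothetical shard layout in which documents 2 and 4 share a shard while the other three each sit alone. Under those assumptions it reproduces the reported scores.

```python
import math

K1, B = 1.2, 0.75  # assumed: default BM25 parameters


def bm25(doc_len, shard_lens, df_per_term):
    """Score one document using statistics from a single shard only.

    shard_lens  -- field lengths of all documents on that shard
    df_per_term -- for each query term the document contains, the number
                   of documents on the shard that contain it
    """
    n = len(shard_lens)
    avg_len = sum(shard_lens) / n
    # Every term occurs once per document here, so tf = 1.
    tf_norm = (K1 + 1) / (1 + K1 * (1 - B + B * doc_len / avg_len))
    return sum(
        math.log(1 + (n - df + 0.5) / (df + 0.5)) * tf_norm
        for df in df_per_term
    )


# Assumed layout: documents 2 and 4 on one shard, the rest alone.
per_shard = {
    "music festival dance techno germany": bm25(5, [5], [1, 1]),
    "music festival dance": bm25(3, [3], [1, 1]),
    "music festival": bm25(2, [2, 4], [2, 2]),
    "music festival dance techno": bm25(4, [2, 4], [2, 2]),
    "music": bm25(1, [1], [1]),
}

# Same model with all five documents on one shard:
# "music" occurs in 5 documents, "festival" in 4.
all_lens = [5, 4, 3, 2, 1]
single_shard = {
    "music festival dance techno germany": bm25(5, all_lens, [5, 4]),
    "music festival dance techno": bm25(4, all_lens, [5, 4]),
    "music festival dance": bm25(3, all_lens, [5, 4]),
    "music festival": bm25(2, all_lens, [5, 4]),
    "music": bm25(1, all_lens, [5]),
}
ranking = sorted(single_shard, key=single_shard.get, reverse=True)

for text, score in per_shard.items():
    print(f"{score:.7f}  {text}")
print(ranking)
```

In this model a document that is alone on its shard always has exactly the average field length, so length normalization cannot affect it. With shared statistics, the same formula ranks the entries in the order the question expected. If per-shard statistics are indeed the cause, creating the test index with a single shard or searching with search_type=dfs_query_then_fetch would be worth trying before changing the query.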