Tags: elasticsearch, elasticsearch-aggregation, elasticsearch-dsl

Order Elasticsearch results by percentage of nested field matches


I would like to order Elasticsearch query results by the percentage of a document's nested field entries that match the query.

For example, suppose I have an Elasticsearch index structured as follows:

{
    "properties": {
        "name": {
            "type": "text"
        },
        "jobs": {
            "type": "nested",
            "properties": {
                "id": {
                    "type": "long"
                }
            }
        }
    }
}

With the following documents:

{
    "name": "Alice",
    "jobs": [
        { "id": 1 },
        { "id": 2 },
        { "id": 3 },
        { "id": 4 }
    ]
}
{
    "name": "Bob",
    "jobs": [
        { "id": 1 },
        { "id": 2 },
        { "id": 3 }
    ]
}
{
    "name": "Charles",
    "jobs": [
        { "id": 2 },
        { "id": 3 }
    ]
}

Now, I would like to perform a query to find which documents have specific jobs, ordered by the percentage of matched jobs. For example:

  • Searching for jobs 1 and 2, I would expect the order to be:
    1. Bob (66% jobs matched)
    2. Alice (50% jobs matched)
    3. Charles (50% jobs matched)
  • Searching for job 2, I would expect the order to be:
    1. Charles (50% jobs matched)
    2. Bob (33% jobs matched)
    3. Alice (25% jobs matched)
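
The expected orderings above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not an Elasticsearch query; the `docs` dictionary and the `match_ratio` and `rank` helpers are illustrative names that mirror the sample documents.

```python
# Sample documents from the question: name -> list of job ids.
docs = {
    "Alice": [1, 2, 3, 4],
    "Bob": [1, 2, 3],
    "Charles": [2, 3],
}


def match_ratio(job_ids, wanted):
    """Fraction of a document's jobs that appear in the searched ids."""
    wanted = set(wanted)
    return sum(1 for job_id in job_ids if job_id in wanted) / len(job_ids)


def rank(wanted):
    """Names sorted by descending match ratio; ties keep insertion order."""
    return sorted(
        docs,
        key=lambda name: match_ratio(docs[name], wanted),
        reverse=True,
    )


print(rank([1, 2]))  # ['Bob', 'Alice', 'Charles']
print(rank([2]))     # ['Charles', 'Bob', 'Alice']
```

Alice and Charles tie at 50% for jobs 1 and 2, so their relative order depends on the tie-break; the sketch simply keeps the order in which the documents were listed.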

So far, I'm using the following query, but it sorts by number of matches, not the percentage:

{
    "query": {
        "nested": {
            "path": "jobs",
            "query": {
                "bool": {
                    "should": [
                        {
                            "match": {
                                "jobs.id": "1"
                            }
                        },
                        {
                            "match": {
                                "jobs.id": "2"
                            }
                        }
                    ]
                }
            },
            "score_mode": "sum"
        }
    }
}

Solution

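  • A function_score query with a script_score function seems to do the job. The script divides the summed nested score (here, the number of matched jobs) by the total number of jobs read from _source: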

    {
      "query": {
        "function_score": {
          "query": {
            "nested": {
              "path": "jobs",
              "query": {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "jobs.id": "1"
                      }
                    },
                    {
                      "match": {
                        "jobs.id": "2"
                      }
                    }
                  ]
                }
              },
              "score_mode": "sum"
            }
          },
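          "script_score": {
            "script": {
              "source": "_score / params['_source']['jobs'].length"
            }
          },
          "boost_mode": "replace"
        }
      }
    }

    Note that boost_mode is set to replace so that the script result is used as the final score. With the default (multiply), function_score would multiply the ratio by the query score again, which skews the order toward documents with more matches.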
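
For searches over an arbitrary list of job ids, the request body can be generated instead of written by hand. The following is a minimal sketch that only builds the JSON and does not talk to a cluster; `build_ratio_query` is a hypothetical helper name, and the `"boost_mode": "replace"` entry is an assumption about the intended behavior (use the script result as the final score rather than multiplying it with the query score).

```python
import json


def build_ratio_query(job_ids):
    """Build a function_score request body scoring by matched-job ratio."""
    # One "match" clause per requested job id, as in the hand-written query.
    should = [{"match": {"jobs.id": str(job_id)}} for job_id in job_ids]
    return {
        "query": {
            "function_score": {
                "query": {
                    "nested": {
                        "path": "jobs",
                        "query": {"bool": {"should": should}},
                        # Sum the per-match scores of the nested documents.
                        "score_mode": "sum",
                    }
                },
                "script_score": {
                    "script": {
                        # Same script as in the answer above.
                        "source": "_score / params['_source']['jobs'].length"
                    }
                },
                # Assumption: replace the query score with the script result.
                "boost_mode": "replace",
            }
        }
    }


print(json.dumps(build_ratio_query([1, 2]), indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the index.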