elasticsearch, elasticsearch-aggregation, elasticsearch-7

Diversifying search results in Elasticsearch 7.5


I have a search index containing products from different catalogs. When I search for a given term, results like the following are returned quite often:

Catalog 1 - Product 1
Catalog 1 - Product 2
Catalog 1 - Product 3
...
Catalog 1 - Product x
Catalog 2 - Product 1
...

This is not optimal: I also want to point the user to other catalogs without making them browse through multiple pages of results that all come from the same catalog. So I tried the diversified_sampler aggregation which, combined with a child top_hits aggregation, seemed to be exactly the solution I wanted:

POST /myIndex/_search?typed_keys=true
{
  "query": {
    "query_string": {
      "fields": [
        "title^2",
        "description^2",
        "descriptionOriginal^0.01"
      ],
      "query": "*someSearchTerm*"
    }
  },
  "size": 0,
  "aggs": {
    "aggDiversifiedSampler": {
      "diversified_sampler": {
        "shard_size": 100000,
        "field": "catalogId",
        "max_docs_per_value": 3
      },
      "aggs": {
        "aggTopHits": {
          "top_hits": {
            "from": 0,
            "size": 50,
            "sort": [
              {
                "_score": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Paging is done through the "size" and "from" properties of the inner top_hits aggregation. The search results are read from the result of the inner top_hits aggregation, which is why the size of the query itself is set to 0.
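
As a minimal sketch of that paging scheme, the request body for a given page could be built like this in Python. The helper name `build_page_request` and its parameters are hypothetical and only wrap the query shown above. One caveat to keep in mind: `from` + `size` of a top_hits aggregation is bounded by the index setting `index.max_inner_result_window` (100 by default), so pages beyond that window would need the setting raised.

```python
import json


def build_page_request(search_term, page, page_size=50, docs_per_catalog=3):
    """Build the search body for one result page (page index is zero-based)."""
    return {
        "query": {
            "query_string": {
                "fields": ["title^2", "description^2", "descriptionOriginal^0.01"],
                "query": f"*{search_term}*",
            }
        },
        "size": 0,  # regular hits are skipped; results come from top_hits
        "aggs": {
            "aggDiversifiedSampler": {
                "diversified_sampler": {
                    "shard_size": 100000,
                    "field": "catalogId",
                    "max_docs_per_value": docs_per_catalog,
                },
                "aggs": {
                    "aggTopHits": {
                        "top_hits": {
                            "from": page * page_size,  # offset of this page
                            "size": page_size,
                            "sort": [{"_score": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = build_page_request("someSearchTerm", page=1)
top_hits = body["aggs"]["aggDiversifiedSampler"]["aggs"]["aggTopHits"]["top_hits"]
print(json.dumps(top_hits, indent=2))
```

The resulting dictionary can be sent as the JSON body of the `POST /myIndex/_search` request.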

This seems to work at first glance, but a closer look at the results reveals that not all search results are returned. The results now look like this:

Catalog 1 - Product 1
Catalog 1 - Product 2
Catalog 1 - Product 3
Catalog 2 - Product 1
Catalog 2 - Product 2
Catalog 2 - Product 3
...
Catalog x - Product 1
Catalog x - Product 2
Catalog x - Product 3

...and then it ends.

It seems that the diversified_sampler does not wrap around after reaching the last catalog, so further results from the individual catalogs never appear. What I want is something like this:

Catalog 1 - Product 1
Catalog 1 - Product 2
Catalog 1 - Product 3
Catalog 2 - Product 1
Catalog 2 - Product 2
Catalog 2 - Product 3
...
Catalog x - Product 1
Catalog x - Product 2
Catalog x - Product 3
Catalog 1 - Product 4
Catalog 1 - Product 5
Catalog 1 - Product 6
Catalog 2 - Product 4
Catalog 2 - Product 5
Catalog 2 - Product 6
...

Any ideas? My approach using the diversified_sampler is not set in stone, but I couldn't come up with anything else. Some fancy script-based sorting of the query, maybe? I don't know. Client-side reordering is not an option, because I don't want to break Elasticsearch's paging. I need the paging to keep performance up: the search index is about 18 GB and contains 900k documents.
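
To pin down the target order precisely, here is a small Python sketch of the wrap-around interleaving. It is only a specification of the desired result on an in-memory list, not a proposal to reorder on the client; the function name `interleave_by_catalog` and the `(catalog, product)` tuples are made up for illustration.

```python
from collections import defaultdict


def interleave_by_catalog(hits, group_size=3):
    """Reorder (catalog, product) pairs into rounds of `group_size` per catalog.

    `hits` is assumed to be sorted by relevance already; catalogs keep the
    order in which they first appear.
    """
    per_catalog = defaultdict(list)
    for catalog, product in hits:
        per_catalog[catalog].append((catalog, product))

    ordered = []
    round_no = 0
    while len(ordered) < len(hits):
        start = round_no * group_size
        # Take the next `group_size` items of every catalog, then wrap around.
        for items in per_catalog.values():
            ordered.extend(items[start:start + group_size])
        round_no += 1
    return ordered


# Two catalogs with five products each, grouped by catalog as in the first listing.
hits = [(catalog, product) for catalog in (1, 2) for product in range(1, 6)]
for catalog, product in interleave_by_catalog(hits):
    print(f"Catalog {catalog} - Product {product}")
```

With this input, products 1 to 3 of catalog 1 come first, then products 1 to 3 of catalog 2, followed by products 4 and 5 of each catalog.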


Solution

  • I think I found a solution without the diversified_sampler aggregation, using scripted sorting:

    POST /myIndex/_search?typed_keys=true
    {
      "query": {
        "query_string": {
          "fields": [
            "title^2",
            "description^2",
            "descriptionOriginal^0.01"
          ],
          "query": "*someSearchTerm*"
        }
      },
      "sort": [{
          "_script": {
            "script": {
              "source": "Math.round(_score / params.fuzziness) * params.fuzziness",
              "params": {
                "fuzziness": 2
              }
            },
            "type": "number",
            "order": "desc"
          }
        }, {
          "_script": {
            "script": {
              "source": "if(doc['catalogId'].value != params.cid) {params.cid=doc['catalogId'].value;params.sort=0;return params.count=0;} else {return (++params.count % params.grpSize == 0) ?++params.sort : params.sort;}",
              "params": {
                "cid": 0,
                "sort": 0,
                "count": 0,
                "grpSize": 3
              }
            },
            "type": "number",
            "order": "asc"
          }
        }, {
          "_score": {
            "order": "desc"
          }
        }
      ]
    }
    

    The first script sort pre-sorts the documents so that results within a certain _score range fall together; the width of that range is controlled by the fuzziness parameter. Within these ranges, a second script sort assigns a sort value that increases by one after every 3 documents (controlled by the grpSize parameter) of the same catalog ID. (I don't know whether it is dangerous to use script params as "global" variables; I feel a little uncomfortable with that.)
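
To illustrate what the first sort key does, the following Python sketch mirrors the expression `Math.round(_score / params.fuzziness) * params.fuzziness`. The function name `score_bucket` and the sample scores are invented for the example.

```python
import math


def score_bucket(score, fuzziness=2.0):
    """Mirror `Math.round(_score / fuzziness) * fuzziness` from the first sort.

    Java's Math.round rounds halves up, so floor(x + 0.5) is used here rather
    than Python's round(), which rounds halves to the nearest even number.
    """
    return math.floor(score / fuzziness + 0.5) * fuzziness


scores = [7.9, 7.1, 6.4, 5.0, 4.9, 3.2]  # made-up relevance scores
buckets = [score_bucket(s) for s in scores]
print(buckets)  # [8.0, 8.0, 6.0, 6.0, 4.0, 4.0]
```

Documents whose scores land in the same bucket tie on the first sort key, which leaves the second key free to interleave catalogs within that bucket. A larger fuzziness widens the buckets and gives more room for diversification at the cost of strict relevance order.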

    Here is the script in a more readable representation:

    if(doc['catalogId'].value != params.cid) {
      params.cid = doc['catalogId'].value;
      params.sort = 0;
      return params.count = 0;
    } else {
      return (++params.count % params.grpSize == 0) ? ++params.sort : params.sort;
    }
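
The effect of this script can be replayed outside Elasticsearch with a small Python simulation. It assumes the script is evaluated once per document in the order given and that the mutated params persist between evaluations; both are assumptions about runtime behavior, not something the script itself guarantees. The function name `group_ranks` is hypothetical.

```python
def group_ranks(catalog_ids, group_size=3):
    """Replay the second sort script over catalog ids in visiting order."""
    state = {"cid": 0, "sort": 0, "count": 0}  # same initial values as `params`
    ranks = []
    for catalog_id in catalog_ids:
        if catalog_id != state["cid"]:
            # A different catalog than the previous document: reset counters.
            state["cid"] = catalog_id
            state["sort"] = 0
            state["count"] = 0
            ranks.append(0)
        else:
            # Same catalog again: bump the rank after every `group_size` docs.
            state["count"] += 1
            if state["count"] % group_size == 0:
                state["sort"] += 1
            ranks.append(state["sort"])
    return ranks


# Documents of one catalog visited consecutively: ranks grow in steps of three.
print(group_ranks([1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]))
# -> [0, 0, 0, 1, 1, 1, 2, 0, 0, 0, 1]

# Catalogs visited alternately: the counters reset every time.
print(group_ranks([1, 2, 1, 2, 1, 2]))
# -> [0, 0, 0, 0, 0, 0]
```

The second call shows why the shared state is fragile: the grouping only works when documents of the same catalog are evaluated one after another.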
    

    Last but not least, documents with the same _score range and sort order are sorted by their real _score.

    The solution has no noticeable performance impact (at least on my index) and delivers pretty much the results I wanted.

    Please feel free to post ideas and optimizations!