I have a search index containing products from different catalogs. When I search for a given term, the results quite often look like this:
Catalog 1 - Product 1
Catalog 1 - Product 2
Catalog 1 - Product 3
...
Catalog 1 - Product x
Catalog 2 - Product 1
...
This is not optimal: I want to point the user to other catalogs as well, without making them browse through multiple pages of search results that all come from the same catalog. So I tried the diversified_sampler aggregation, which, in conjunction with a child top_hits aggregation, seemed to be exactly the solution I wanted:
POST /myIndex/_search?typed_keys=true
{
  "query": {
    "query_string": {
      "fields": [
        "title^2",
        "description^2",
        "descriptionOriginal^0.01"
      ],
      "query": "*someSearchTerm*"
    }
  },
  "size": 0,
  "aggs": {
    "aggDiversifiedSampler": {
      "diversified_sampler": {
        "shard_size": 100000,
        "field": "catalogId",
        "max_docs_per_value": 3
      },
      "aggs": {
        "aggTopHits": {
          "top_hits": {
            "from": 0,
            "size": 50,
            "sort": [
              {
                "_score": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Paging is done through the "from" and "size" properties of the inner top_hits aggregation. The search results are read from the result of that inner top_hits aggregation, which is why I set the size of the query itself to 0.
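To make the paging explicit, here is a minimal sketch of how the request body for a given page could be built. The helper name build_search_body and its parameters are my own invention for illustration; only the body it returns mirrors the request above.

```python
def build_search_body(term, page, page_size=50, max_docs_per_value=3):
    """Build the request body for one page of sampled results (page is 0-based)."""
    return {
        "query": {
            "query_string": {
                "fields": ["title^2", "description^2", "descriptionOriginal^0.01"],
                "query": f"*{term}*",
            }
        },
        # The hits are read from the aggregation, not from the query itself.
        "size": 0,
        "aggs": {
            "aggDiversifiedSampler": {
                "diversified_sampler": {
                    "shard_size": 100000,
                    "field": "catalogId",
                    "max_docs_per_value": max_docs_per_value,
                },
                "aggs": {
                    "aggTopHits": {
                        "top_hits": {
                            # Paging happens here, inside the top_hits aggregation.
                            "from": page * page_size,
                            "size": page_size,
                            "sort": [{"_score": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


body = build_search_body("someSearchTerm", page=2)
```

One caveat I am aware of: as far as I know, from + size in a top_hits aggregation is bounded by the index setting index.max_inner_result_window (100 by default), so deeper pages may require raising that setting.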
This seems to work at first glance, but a closer look at the results reveals that not all search results are returned. The results now look like this:
Catalog 1 - Product 1
Catalog 1 - Product 2
Catalog 1 - Product 3
Catalog 2 - Product 1
Catalog 2 - Product 2
Catalog 2 - Product 3
...
Catalog x - Product 1
Catalog x - Product 2
Catalog x - Product 3
...and then it ends.
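My understanding of why it ends can be illustrated with a toy model in Python. This is only a sketch of the max_docs_per_value cap, not of the real sampler (it ignores shards and the sampler's internal ordering), and the function name and sample data are made up:

```python
from collections import defaultdict


def diversified_sample(hits, max_docs_per_value):
    """Keep the best-scoring hits, but at most max_docs_per_value per catalog.

    hits: list of (catalog_id, product_no, score) tuples.
    """
    kept = []
    seen = defaultdict(int)  # number of hits already kept per catalog
    for hit in sorted(hits, key=lambda h: -h[2]):
        catalog_id = hit[0]
        if seen[catalog_id] < max_docs_per_value:
            seen[catalog_id] += 1
            kept.append(hit)
    return kept


# Two catalogs with five matching products each (made-up scores).
hits = [(c, p, 10 - p - 0.1 * c) for c in (1, 2) for p in range(1, 6)]
sample = diversified_sample(hits, max_docs_per_value=3)
```

With a cap of 3, products 4 and 5 of each catalog never make it into the sample, so no amount of paging inside top_hits can bring them back.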
It seems that the diversified_sampler does not wrap around after reaching the last catalog, so further results from the individual catalogs never appear. What I want is something like this:
Catalog 1 - Product 1
Catalog 1 - Product 2
Catalog 1 - Product 3
Catalog 2 - Product 1
Catalog 2 - Product 2
Catalog 2 - Product 3
...
Catalog x - Product 1
Catalog x - Product 2
Catalog x - Product 3
Catalog 1 - Product 4
Catalog 1 - Product 5
Catalog 1 - Product 6
Catalog 2 - Product 4
Catalog 2 - Product 5
Catalog 2 - Product 6
...
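To pin down the ordering I am after, here is a small Python sketch of it. It only describes the desired result, not a way to get Elasticsearch to produce it; the function name is hypothetical, and it assumes the hits are already ordered by relevance within each catalog:

```python
def interleave_by_catalog(hits, group_size=3):
    """Round-robin over catalogs, taking group_size hits per catalog per round.

    hits: list of (catalog_id, product_no) tuples, best hit first per catalog.
    """
    # Group hits per catalog; dicts keep insertion order, so catalogs stay
    # in the order in which they first appear.
    per_catalog = {}
    for catalog_id, product_no in hits:
        per_catalog.setdefault(catalog_id, []).append((catalog_id, product_no))

    ordered = []
    round_no = 0
    # Keep going as long as at least one catalog still has unvisited hits.
    while any(len(items) > round_no * group_size for items in per_catalog.values()):
        start = round_no * group_size
        for items in per_catalog.values():
            ordered.extend(items[start:start + group_size])
        round_no += 1
    return ordered


hits = [(c, p) for c in (1, 2) for p in range(1, 6)]
result = interleave_by_catalog(hits)
```

Catalogs that run out of products simply drop out of later rounds.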
Any ideas? My approach using the diversified_sampler is not set in stone, but I couldn't come up with anything else. Some fancy script-based sorting of the query, maybe? I don't know. Client-side reordering is not an option, because I don't want to break Elasticsearch's paging. I need the paging to keep performance up: the search index is about 18 GB and contains 900k documents.
I think I found a solution without the diversified_sampler aggregation, using scripted sorting:
POST /myIndex/_search?typed_keys=true
{
  "query": {
    "query_string": {
      "fields": [
        "title^2",
        "description^2",
        "descriptionOriginal^0.01"
      ],
      "query": "*someSearchTerm*"
    }
  },
  "sort": [
    {
      "_script": {
        "script": {
          "source": "Math.round(_score / params.fuzziness) * params.fuzziness",
          "params": {
            "fuzziness": 2
          }
        },
        "type": "number",
        "order": "desc"
      }
    },
    {
      "_script": {
        "script": {
          "source": "if(doc['catalogId'].value != params.cid) {params.cid=doc['catalogId'].value;params.sort=0;return params.count=0;} else {return (++params.count % params.grpSize == 0) ?++params.sort : params.sort;}",
          "params": {
            "cid": 0,
            "sort": 0,
            "count": 0,
            "grpSize": 3
          }
        },
        "type": "number",
        "order": "asc"
      }
    },
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
In the first script sort I pre-sort the documents so that results within a certain _score range fall together; the width of that range is controlled by the fuzziness parameter. Within these ranges, the second script sort takes the next 3 documents per catalog ID (controlled by the grpSize parameter) and then increments the sort order. (I don't know whether it is dangerous to use script params as "global" variables; I feel a little uncomfortable with that.)
Here is the script in a more readable form:
if (doc['catalogId'].value != params.cid) {
    params.cid = doc['catalogId'].value;
    params.sort = 0;
    return params.count = 0;
} else {
    return (++params.count % params.grpSize == 0) ? ++params.sort : params.sort;
}
Last but not least, documents with the same _score range and sort order are sorted by their real _score.
The solution has no noticeable performance impact (at least on my index) and delivers pretty much the results I wanted.
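For reference, the three sort levels can be modelled in Python as follows. This is a sketch of the ordering I intend, not of how Elasticsearch actually executes the scripts: it computes each document's rank within its catalog explicitly instead of relying on mutable params, and it assumes that rank is counted per score range. The mutable-params version only matches this model if the script sees the documents of one catalog consecutively and in score order, which I have not verified. The function names and sample data are made up:

```python
import math
from collections import defaultdict


def java_round(x):
    # Mirrors Java's Math.round, which rounds halves up.
    return math.floor(x + 0.5)


def diversified_order(hits, fuzziness=2, grp_size=3):
    """Order hits by (score range desc, block within catalog asc, score desc).

    hits: list of (catalog_id, product_no, score) tuples.
    """
    seen = defaultdict(int)  # hits seen so far per (score range, catalog)
    keyed = []
    for catalog_id, product_no, score in sorted(hits, key=lambda h: -h[2]):
        # Level 1: coarse score range, as in the first script sort.
        bucket = java_round(score / fuzziness) * fuzziness
        # Level 2: every grp_size hits of a catalog move to the next block.
        rank = seen[(bucket, catalog_id)]
        seen[(bucket, catalog_id)] += 1
        block = rank // grp_size
        # Level 3: the real score breaks ties inside a block.
        keyed.append(((-bucket, block, -score), (catalog_id, product_no, score)))
    keyed.sort(key=lambda item: item[0])
    return [hit for _, hit in keyed]


hits = [
    (1, 1, 4.4), (1, 2, 4.3), (1, 3, 4.2), (1, 4, 4.1), (1, 5, 4.0),
    (2, 1, 3.9), (2, 2, 3.8),
]
ordered = [(c, p) for c, p, _ in diversified_order(hits)]
```

In this made-up data all scores fall into the same range, so catalog 2 is pulled ahead of products 4 and 5 of catalog 1 even though its scores are lower.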
Please feel free to post ideas and optimizations!