sorting elasticsearch lucene ranking scoring

Elasticsearch decay score based on occurrence


I'm trying to find a way to keep multiple posts by the same author from appearing in search results. So far I've tried random scoring, which lets me keep pagination. However, a page of 10 results can still contain up to 4 posts by the same author.

Is there any way to score a document based on how many times a certain field value occurs in the result set? As far as I'm aware, you cannot persist a variable or object in a scoring script.

I've looked into several ways of accomplishing this, but most of them have significant drawbacks. One is to remove the duplicates and query again for a new set of results that excludes the authors already shown. However, that second query can itself return several posts by the same author, so I'm left querying one result at a time to replace the duplicate authors in a result set. This breaks deep pagination, because the result set used to replace duplicates eventually runs out of pages before the standard search does. I've also tried aggregations, which are not pageable.

Is there any functionality to spread out results, or to reduce a document's score, based on how many times a document by the same author (or with the same field value) occurs?


Solution

  • You cannot diversify Elasticsearch sorting. You can only score the documents with a seeded random_score and hope for the best. You can use something like a top_hits aggregation to bucket results per author, but you cannot paginate over a group of buckets, so this breaks pagination.

    See here for more information
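
For illustration, here is a rough sketch of the two request bodies mentioned above, built as plain Python dicts so they can be sent with whatever client you use. The helper names, the `author_id` field, and the `title` match query are assumptions made for the example, not anything from the original post. Newer Elasticsearch versions also expect a `field` next to `seed` in `random_score`, so that is left as an optional argument.

```python
import json


def seeded_random_query(query, seed, page, page_size=10, field=None):
    """Build a search body that scores matches with a seeded random_score.

    Reusing the same seed on every page keeps the ordering stable,
    which is what makes from/size pagination workable.
    """
    random_score = {"seed": seed}
    if field is not None:
        # Assumption: newer versions want a field alongside the seed.
        random_score["field"] = field
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {
            "function_score": {
                "query": query,
                "functions": [{"random_score": random_score}],
                # Replace the relevance score with the random one.
                "boost_mode": "replace",
            }
        },
    }


def per_author_top_hits(query, author_field="author_id",
                        max_authors=10, posts_per_author=1):
    """Build a body that buckets matches per author with top_hits.

    The terms aggregation accepts a size but no offset, so there is
    no way to ask for "the next page" of author buckets.
    """
    return {
        "size": 0,  # only the aggregation result is wanted
        "query": query,
        "aggs": {
            "by_author": {
                "terms": {"field": author_field, "size": max_authors},
                "aggs": {
                    "top_posts": {"top_hits": {"size": posts_per_author}}
                },
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical query and field names, for demonstration only.
    match = {"match": {"title": "search"}}
    print(json.dumps(seeded_random_query(match, seed=42, page=2), indent=2))
    print(json.dumps(per_author_top_hits(match), indent=2))
```

The first body can be paged but gives no guarantee about how many posts per author land on a page. The second guarantees one post per author but returns a single fixed set of buckets, which is the trade-off described in the answer.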