Search code examples
elasticsearchelasticsearch-aggregation

Composite aggregation query with bucket_sort does not work properly


I have an index to store financial transactions:

{
  "mappings": {
    "_doc": {
      "properties": {
        "amount": {
          "type": "long"
        },
        "currencyCode": {
          "type": "keyword"
        },
        "merchantId": {
          "type": "keyword"
        },
        "merchantName": {
          "type": "text"
        },
        "partnerId": {
          "type": "keyword"
        },
        "transactionDate": {
          "type": "date"
        },
        "userId": {
          "type": "keyword"
        }
      }
    }
  }
}

Here's my query:

GET /transactions/_search
{
  "aggs": {
    "date_merchant": {
      "aggs": {
        "amount": {
          "sum": {
            "field": "amount"
          }
        },
        "amount_sort": {
          "bucket_sort": {
            "sort": [
              {
                "amount": {
                  "order": "desc"
                }
              }
            ]
          }
        },
        "top_hit": {
          "top_hits": {
            "_source": {
              "includes": [
                "merchantName",
                "currencyCode"
              ]
            },
            "size": 1
          }
        }
      },
      "composite": {
        "size": 1,
        "sources": [
          {
            "date": {
              "date_histogram": {
                "calendar_interval": "day",
                "field": "transactionDate"
              }
            }
          },
          {
            "merchant": {
              "terms": {
                "field": "merchantId"
              }
            }
          }
        ]
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "userId": "AAA"
          }
        },
        {
          "term": {
            "partnerId": "BBB"
          }
        },
        {
          "range": {
            "transactionDate": {
              "gte": "2022-07-01"
            }
          }
        },
        {
          "term": {
            "currencyCode": "EUR"
          }
        }
      ]
    }
  },
  "size": 0
}

Please note the "size": 1 in the composite aggregation.

If I change it to 3 (based on my data)... I get different results!

That means the bucket_sort operation doesn't work on the whole list of buckets, but just on the returned ones (if it's just one, that means it's not going to be sorted at all!)

How can I sort on ALL the buckets instead?

EDIT

Based on Benjamin's answer I changed my query to use normal aggregations instead of composite, and a large bucket size for merchant IDs (default is 10, while for date histogram there's no limit)

enter image description here


Solution

  • Composite agg design

    The composite aggregation is designed to iterate all buckets in the most efficient way possible.

    How can I sort on ALL the buckets instead?

    To fully sort over ALL buckets, all buckets would have to be enumerated ahead of time, defeating the design of the composite aggregation.

    So, how to actually sort over all buckets?

    Then aggregate over all buckets in a single call. Set your size to the largest number of buckets available within your query.

    The number of buckets will be the cardinality of merchantId and the number of days in the date histogram.

    Another option is to paginate over all the composite buckets and then sort them client side. If you choose this path, it may be good to have each page of the composite aggregation be sorted so that sorting them client side will be faster.