elasticsearch scroll aggregate bucket pagination

ES: Bucket agg + top_hits + scroll? How to return all hits (more than `size+from` max) in buckets?

I'm running an Elasticsearch filter that matches a large number of documents (~10 million hits). My `from + size` limit is the default (10,000 hits). I'd like to aggregate on a field and return all the hits matching the filter in every bucket, not just the counts.

I know that I can use top_hits to get the actual docs in each bucket (ElasticSearch: retriving documents belonging to buckets), but I think I need to scroll to get them all (to get more than the first 10,000 hits). Can I scroll and aggregate? The scroll API fails when I run it with an aggregation.

Currently I have two solutions, and neither seems great:

1. Run multiple filter queries, say one for each bucket (then I don't need the aggregation + top_hits at all). This is too slow for my application.
2. Run one big filter query without aggregating, and use the scroll API to get all the hits. Then I put them in their respective buckets on my own host. This works, but it seems like ES is set up to aggregate these into buckets for me and has more resources to do this work.

Are there better ways to deal with this?

This seems related to "Paging elasticsearch aggregation results", although the scroll API isn't mentioned there (unless that is what they mean by paging?).

Solution

I believe your use case isn't supported. Aggregations specifically "throw out" the other information in documents. The top_hits aggregation is just meant to return the most relevant hits in each bucket that match your query. It is more of a scoring feature than a document retrieval feature, i.e. top_hits isn't meant to return all the documents in a bucket.

If you need all the documents anyway, why don't you aggregate the results yourself? This is your option #2 and it seems like the best option to me.
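As a rough illustration of option #2, here is a minimal Python sketch of the client-side bucketing. It assumes each hit is a dict shaped like an entry in the `hits.hits` array of a search response (with a `_source` object). The function name `bucket_hits`, the field name `category`, and the sample data are hypothetical; in a real application the hits would come from the scroll API, for instance via `elasticsearch.helpers.scan` in the official Python client.

```python
from collections import defaultdict


def bucket_hits(hits, field):
    """Group hits by the value of `field` in their _source."""
    buckets = defaultdict(list)
    for hit in hits:
        # Documents missing the field are collected under the key None.
        key = hit["_source"].get(field)
        buckets[key].append(hit)
    return dict(buckets)


# Stand-in for hits streamed from a scroll; any iterable of hit dicts works.
sample_hits = [
    {"_id": "1", "_source": {"category": "a"}},
    {"_id": "2", "_source": {"category": "b"}},
    {"_id": "3", "_source": {"category": "a"}},
    {"_id": "4", "_source": {}},
]

buckets = bucket_hits(sample_hits, "category")
# Exact per-bucket counts fall out of the grouping for free.
counts = {key: len(docs) for key, docs in buckets.items()}
```

Because the function consumes any iterable, it can process a scroll one page at a time without first loading every hit into a list, although the buckets themselves still have to fit in memory.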

The SO post you referenced describes a workaround for paging through an aggregation by using the `exclude` value filter in terms aggregations. It doesn't use the scroll API. I also don't think it helps you.
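For reference, a paging workaround of that kind presumably looks something like the sketch below: each request asks for the next batch of terms while excluding the ones already seen. This only builds the request body as a plain dict. The helper name `terms_page_body` and the aggregation name `by_field` are made up for illustration, and the sketch only pages through bucket keys and counts, not the documents inside them.

```python
def terms_page_body(field, page_size, seen_terms):
    """Build a search body asking for up to `page_size` terms not yet seen."""
    terms = {"field": field, "size": page_size}
    if seen_terms:
        # `exclude` accepts a list of exact term values to leave out.
        terms["exclude"] = sorted(seen_terms)
    # "size": 0 suppresses regular hits; only the buckets are returned.
    return {"size": 0, "aggs": {"by_field": {"terms": terms}}}


first_page = terms_page_body("category", 2, set())
second_page = terms_page_body("category", 2, {"b", "a"})
```

The exclude list grows with every page, so this is only practical for a modest number of distinct terms.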

Lastly, Elasticsearch terms aggregations can return approximate results when an index is spread over multiple shards, because each shard reports only its own top terms (controlled by `shard_size`). If you need the documents anyway, you can get completely accurate buckets by performing the bucketing in your application. You'll have to visit every document, which might be slower than what ES can do, but you also get an exact result instead of an approximate one.
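To see why per-shard truncation can skew the result, here is a toy simulation in Python. It is a simplified model, not Elasticsearch's actual implementation: each "shard" is just a dict of term counts, and the function names and numbers are invented for illustration.

```python
from collections import Counter


def approximate_terms(shards, shard_size, size):
    """Merge only each shard's top `shard_size` terms, then take the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


def exact_terms(shards, size):
    """Count every term on every shard, as client-side bucketing would."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(size)


# Hypothetical per-shard term counts.
shards = [
    {"a": 10, "b": 8, "c": 7},
    {"c": 9, "d": 6, "a": 1},
]

approx = approximate_terms(shards, shard_size=2, size=2)  # [("a", 10), ("c", 9)]
exact = exact_terms(shards, size=2)                       # [("c", 16), ("a", 11)]
```

In this toy data the first shard never reports `c` and the second never reports `a`, so the merged view undercounts both and ranks them in the wrong order.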

If you have more details on your use case, perhaps one of us can give better advice. For example, why do you need all the documents as well as the bucket counts?