Subaggregation leads to missing data

Question in short: When executing a query with a subaggregation, why does the inner aggregation miss data in some cases?

Question in detail: I have a search query with a subaggregation (buckets in buckets) as follows:

{
    "size": 0,
    "aggs": {
        "outer_docs": {
            "terms": {"size": 20, "field": "field_1_to_aggregate_on"},
            "aggs": {
                "inner_docs": {
                    "terms": {"size": 10000, "field": "field_2_to_aggregate_on"},
                    "aggs": "things to display here"
                }
            }
        }
    }
}

If I execute this query, for some outer_docs I do not receive all of the inner_docs associated with them. In the output below, there are three inner docs for outer doc key_1.

{
    "hits": {
        "total": 9853,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "outer_docs": {
            "doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
            "buckets": [
                {
                    "key": "key_1", "doc_count": 3,
                    "inner_docs": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": "1", "doc_count": 1, "some": "data here"},
                            ...
                            {"key": "3", "doc_count": 1, "some": "data here"},
                        ]
                    }
                },
                ...
            ]
        }
    }
}

Now I add a query that selects a single outer_doc, one that would have been among the first 20 anyway.

"query": {"bool": {"must": [{"term": {"field_1_to_aggregate_on": "key_1"}}]}}

In this case I do get all inner_docs: the output below shows seven inner docs for outer doc key_1.

{
    "hits": {
        "total": 8,
        "max_score": 0.0,
        "hits": []
    },
    "aggregations": {
        "outer_docs": {
            "doc_count_error_upper_bound": -1, "sum_other_doc_count": 9801,
            "buckets": [
                {
                    "key": "key_1", "doc_count": 8,
                    "inner_docs": {
                        "doc_count_error_upper_bound": 0,
                        "sum_other_doc_count": 0,
                        "buckets": [
                            {"key": "1", "doc_count": 1, "some": "data here"},
                            ...
                            {"key": "7", "doc_count": 2, "some": "data here"},
                        ]
                    }
                },
                ...
            ]
        }
    }
}

I have specified explicitly that I want 10,000 inner_docs per outer_doc. What is preventing me from getting all data?

This is my version information:

{
    'build_date': '2018-09-26T13:34:09.098244Z',
    'build_flavor': 'default',
    'build_hash': '04711c2',
    'build_snapshot': False,
    'build_type': 'deb',
    'lucene_version': '7.4.0',
    'minimum_index_compatibility_version': '5.0.0',
    'minimum_wire_compatibility_version': '5.6.0',
    'number': '6.4.2'
}

EDIT: After digging a bit more, I found out that the issue is not related to subaggregation, but to aggregation itself and the use of shards. I have opened a bug report with Elastic about it.

Solution

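It turned out that the problem was not caused by the subaggregation, and that this is intended behaviour of Elasticsearch rather than a bug. We are using 5 shards, and when an index is spread over more than one shard, terms aggregations only return approximate results: each shard reports just its own top terms, and the coordinating node merges these partial lists.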
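To illustrate why merging per-shard top lists loses documents, here is a minimal sketch in plain Python. It is a simplified model, not Elasticsearch's actual implementation: the function sharded_terms and the sample data are hypothetical, chosen so that key_1 is reported with 3 documents although 8 exist, mirroring the outputs in the question.

```python
from collections import Counter

def sharded_terms(shards, size, shard_size):
    """Simplified model of a terms aggregation over several shards.

    Each shard reports only its own top `shard_size` terms; the
    coordinating step sums those partial counts and keeps the top `size`.
    """
    merged = Counter()
    for docs in shards:
        local = Counter(docs)
        # Terms outside the shard-local top `shard_size` are never reported.
        for term, count in local.most_common(shard_size):
            merged[term] += count
    return dict(merged.most_common(size))

# Hypothetical data: key_1 has 8 documents in total, spread over three shards.
shards = [
    ["key_1"] * 3 + ["c"],      # key_1 is the top term on this shard
    ["a"] * 4 + ["key_1"] * 3,  # key_1 is only second here
    ["b"] * 5 + ["key_1"] * 2,  # key_1 is only second here
]

approx = sharded_terms(shards, size=3, shard_size=1)
exact = sharded_terms(shards, size=3, shard_size=2)

print(approx["key_1"])  # 3: only the first shard reported key_1
print(exact["key_1"])   # 8: every shard reported key_1
```

In this model, filtering on key_1 first (as in the second query of the question) has the same effect as a larger shard_size: key_1 becomes the top term on every shard, so no shard drops it.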

We made the problem reproducible and posted it on the Elastic discuss forum. There we learned that aggregations do not always return all data, and we were pointed to the documentation where this is explained in more detail.

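We also learned that using only 1 shard solves the issue. When that is not possible, the shard_size parameter of the terms aggregation can alleviate the problem, because it makes each shard return more candidate buckets before the results are merged.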
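As a sketch of the second option, the query from the question could be extended with shard_size on the outer terms aggregation. The helper build_query and the value 1000 are assumptions for illustration only; a suitable value depends on how many distinct terms each shard holds. As far as the 6.x documentation states, the default is size * 1.5 + 10, which would be 40 for size 20. The placeholder inner "aggs" from the question is left out here.

```python
import json

def build_query(shard_size):
    """Build the aggregation query from the question with an explicit shard_size.

    `shard_size` controls how many outer buckets each shard returns to the
    coordinating node before the final top `size` buckets are selected.
    """
    return {
        "size": 0,
        "aggs": {
            "outer_docs": {
                "terms": {
                    "field": "field_1_to_aggregate_on",
                    "size": 20,
                    "shard_size": shard_size,
                },
                "aggs": {
                    "inner_docs": {
                        "terms": {
                            "field": "field_2_to_aggregate_on",
                            "size": 10000,
                        }
                    }
                },
            }
        },
    }

# 1000 is an arbitrary example value, not a recommendation.
query = build_query(shard_size=1000)
print(json.dumps(query, indent=4))
```

A larger shard_size makes the result more accurate at the cost of more data being sent from each shard to the coordinating node, so the results can still be approximate if the value is smaller than the number of distinct terms on a shard.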