ElasticSearch 2.3 _search for more than 10,000 paged items

In ElasticSearch 2.3 (and in the latest releases) there is an index.max_result_window setting which restricts a search query so that from + size is at most 10,000 (the default value). For example:

from: 0 size: 10,000 is ok
from: 0 size: 10,001 is not ok
from: 9,000 size: 1,001 is not ok
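
To make the rule concrete, here is a minimal sketch of the check as I understand it, assuming the default window of 10,000. The function name within_result_window is mine for illustration and is not part of any Elasticsearch client.

```python
# Default value of index.max_result_window (assumed unchanged on the index).
DEFAULT_MAX_RESULT_WINDOW = 10_000


def within_result_window(from_, size, max_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return True when a from/size page fits inside the result window."""
    return from_ + size <= max_window


# The three cases listed above.
print(within_result_window(0, 10_000))
print(within_result_window(0, 10_001))
print(within_result_window(9_000, 1_001))
```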

In the latest release, 7.10, the documentation says this can be worked around by using search_after. However, due to legacy data, I need something similar in ES 2.3. Are there any good options?

Why do I need this? Our data has a child / parent hierarchy. One query we run against this data determines all the unique parents over a certain date range. Currently we retrieve this information using an aggregation query, e.g.

{
  "query": { "match_all_in_date_range": {} },
  "aggs": {
    "parents": {
      "terms": {
        "field": "parentId"
      }
    }
  }
}

Interestingly, this returns all the parents even if there are more than 10,000, i.e. it does not appear to be affected by the index.max_result_window limit.

But this aggregation is expensive and time consuming. As a result I'm evaluating whether it's possible to remove it and "aggregate" the data in our own code, i.e. retrieve all the objects, read their parentId field, and record the unique ids.

But, unless I'm mistaken, the index.max_result_window limit may break that idea. Two ideas I had to work around this:

1. Rather than paging, modify the query to exclude the parentIds I've already retrieved (the downside being that it could take longer to run, and the query will keep growing until the end).
2. Move over to the more heavy-duty scroll API (which may be more suitable for other usages).
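
For the scroll idea, the client-side "aggregation" could look roughly like the sketch below. It assumes each scroll response has the usual shape, with a _scroll_id and the documents under hits.hits[]._source. The two callables start_scroll and next_page are hypothetical wrappers around whatever HTTP client you use (the initial _search?scroll=... request and the follow-up _search/scroll request); they are not real library functions.

```python
def collect_unique_parent_ids(start_scroll, next_page, field="parentId"):
    """Walk a scroll to the end and return the set of distinct parent ids.

    start_scroll()       -> first scroll response as a dict (hypothetical wrapper)
    next_page(scroll_id) -> next scroll response as a dict (hypothetical wrapper)
    """
    unique_ids = set()
    response = start_scroll()
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            # An empty page means the scroll is exhausted.
            break
        for hit in hits:
            parent_id = hit.get("_source", {}).get(field)
            if parent_id is not None:
                unique_ids.add(parent_id)
        # Always pass on the most recent scroll id; it may change between pages.
        response = next_page(response["_scroll_id"])
    return unique_ids
```

Since only parentId is needed, restricting the returned source to that one field in the search body should cut down the amount of data transferred per page.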

But I'd be curious to hear if there are other options available to me?

Solution

You could divide the search into smaller ones, for example one per hour, or split on some other field, so that each individual search returns no more than 10,000 results and stays within index.max_result_window.
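
A rough sketch of the time-slicing approach, assuming the documents carry a date field (called "timestamp" here purely for illustration; substitute your own) and that a standard range query is acceptable. The helper names time_slices and slice_query are my own.

```python
from datetime import datetime, timedelta


def time_slices(start, end, step=timedelta(hours=1)):
    """Yield half-open (slice_start, slice_end) windows covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper


def slice_query(field, lower, upper, size=10_000):
    """Build a search body restricted to one time slice (gte lower, lt upper)."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "range": {
                # "field" is whatever date field your documents use (assumption).
                field: {"gte": lower.isoformat(), "lt": upper.isoformat()}
            }
        },
    }


# Example: one search body per hour of a single day.
day_start = datetime(2021, 1, 1)
bodies = [
    slice_query("timestamp", lower, upper)
    for lower, upper in time_slices(day_start, day_start + timedelta(days=1))
]
```

If hits.total for a slice comes back larger than the requested size, that slice is still too big and would need to be split again (for example into 10-minute windows) before its results can be trusted to be complete.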