I want to fetch elasticsearch hits using the sort+search_after paging mechanism.
The elasticsearch documentation states:
_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort
field is 0
for one hit, and some specific number for the other.
This obviously breaks the paging as it relies on the value returned in sorting
to be later fed into sort_after
for the next query.
No data is being written to the index while I am querying it, so this is not because of refreshes.
My questions are therefore:
The data was written to the index in parallel using Spark. I thought the problem might have been the parallel write combined with the "index order" sorting, however I did not manage to replicate this behavior with other indicies which were also written to in Spark.
es 7, index contains 2 shards, one primary and one replica
cheers.
The reason this happened is that the index consists of 2 shards. One primary and one replica. The documents were not indexed in the same order. Thus, the order of the results depends on the shard they were returned from. This is fine when using scrolling because Elasticsearch keeps an inner state of the results, but not with paging, which is stateless.