Our ES is fairly slow and we have not optimized it (or the query) yet, but according to this link, request rejection from Elastic is a form of feedback asking the client to slow down and adapt the bulk size.
We built a form of back pressure where the size of a blocking bulk (a list of individual requests sent at the same time; we do not use MSearch yet) depends on how many requests were rejected in the previous bulk. We wait for the current bulk to finish before starting a new one. All rejected requests are re-injected into the request queue (as the data needed to reconstruct the query). For example, if our Elastic can handle 500 simultaneous requests and we send 600, some of them will be rejected and the next bulk size will be reduced to 480 (20% off).
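A minimal sketch of this loop, assuming a hypothetical QueryData payload and a hypothetical sendBatch function that fires one blocking bulk and returns the items ES rejected (the real code uses elastic4s):

    import scala.collection.mutable

    // Hypothetical stand-ins: QueryData is whatever is needed to rebuild a
    // query; sendBatch sends one blocking bulk of individual search requests
    // and returns the subset that ES rejected (search queue full).
    final case class QueryData(id: String)
    def sendBatch(batch: Seq[QueryData]): Seq[QueryData] = ??? // your ES client call

    def drain(queue: mutable.Queue[QueryData], initialBulkSize: Int): Unit = {
      var bulkSize = initialBulkSize
      while (queue.nonEmpty) {
        val batch = Seq.fill(math.min(bulkSize, queue.size))(queue.dequeue())
        val rejected = sendBatch(batch)   // blocks until the whole bulk is done
        queue ++= rejected                // re-inject rejected requests
        if (rejected.nonEmpty)            // shrink on rejection, e.g. 600 -> 480
          bulkSize = math.max(1, (bulkSize * 0.8).toInt)
      }
    }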
What we found out was that ES returns different results for the previously rejected requests once they are retried. For example, it may return something resembling the expected result, but with an offset of 2. We also get missing results: an address that should have 1 result has none due to this bug.
If the bulk size stays below the threshold that ES can handle, everything goes as expected and we get the expected results.
It does not look like a problem in the library (elastic4s).
Elastic configuration: 2 nodes with 5 shards each.
Per node: 2 CPUs, 32 GB RAM, 16 GB heap. Everything else is default.
I couldn't find any information about this on the internet. Has anyone had this problem? What was the solution?
What we tried so far:

- Thread.sleep between bulks, as the link above suggests.
- Removing the cache at the query level as well as on the index.
- Running the same index on different (slower) hardware.
- Verifying that it is not a race condition in our code.
Update:
What the query looks like:
Thread pool for search:

    "search" : {
      "type" : "fixed",
      "min" : 4,
      "max" : 4,
      "queue_size" : 1000
    },

With a fixed pool of 4 threads and a queue of 1000 per node, any shard-level search task that arrives while the queue is full is rejected, which is where the rejections described above come from.
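On ES 5.x and later, the cat thread-pool API shows these rejections accumulating per node, e.g.:

    GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed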
2nd UPDATE:
We also tried setting a preference on our query (thinking it was a problem with shards): .preference(Preference.Primary),
with no positive result; the results were even more random than before. Two consecutive runs with this setting give different "random" results, so it is not even consistently wrong.
The reason for the inconsistent results was that Elastic replies with Success as long as at least 1 shard had a result. So if only one of our 5 shards succeeded, the request would still return a successful response containing only 20% of the data.
As seen here and here, this is not a bug, it is a feature: Elastic prefers to return some (albeit inconsistent) result rather than returning nothing at all.
The solution to this problem is either to use only one shard or to treat more than 0 failed shards as a failure of the whole request, using the following object that every ES response carries (a sketch of that check follows the snippet):
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
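A minimal sketch of that check, with hypothetical Shards and SearchResult types standing in for whatever your client (elastic4s in our case) exposes:

    // Hypothetical types mirroring the "_shards" object above; your client
    // will have its own representation of the search response.
    final case class Shards(total: Int, successful: Int, failed: Int)
    final case class SearchResult(shards: Shards /*, hits, etc. */)

    // Accept a response only when every shard reported success; otherwise
    // treat it like a rejection and re-inject the request into the queue.
    def isComplete(result: SearchResult): Boolean =
      result.shards.failed == 0 &&
        result.shards.successful == result.shards.total

With this check in place, a partially failed response is retried just like an explicitly rejected request, so overloaded shards can no longer silently drop hits.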