The problem:
Since upgrading from ES 5.4 to ES 7.2, I have been getting "data too large" errors when sending concurrent bulk requests (and/or search requests) from my multi-threaded Java application (using the elasticsearch-rest-high-level-client-7.2.0.jar Java client) to an ES cluster of 2-4 nodes.
My ES configuration:
Elasticsearch version: 7.2
custom configuration in elasticsearch.yml:
thread_pool.search.queue_size = 20000
thread_pool.write.queue_size = 500
I use only the default 7.x circuit-breaker values, such as:
indices.breaker.total.limit = 95%
indices.breaker.total.use_real_memory = true
network.breaker.inflight_requests.limit = 100%
network.breaker.inflight_requests.overhead = 2
The error from elasticsearch.log:
{
  "error": {
    "root_cause": [
      {
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      }
    ],
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
    "bytes_wanted": 3144831050,
    "bytes_limit": 3060164198,
    "durability": "PERMANENT"
  },
  "status": 429
}
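The numbers in this error can be checked by hand, which helps show what the parent breaker is comparing. The sketch below is only an illustration of the arithmetic, not Elasticsearch code: the class and method names are hypothetical, and the 3 GiB heap is an inference from the reported limit (95% of 3 GiB is 3060164198 bytes), not something taken from the node's actual JVM settings.

```java
// Illustration only: reproduces the arithmetic behind the error message.
// ParentBreakerCheck, limitBytes and wouldTrip are hypothetical names.
final class ParentBreakerCheck {

    // Limit = heap size * indices.breaker.total.limit, truncated to whole bytes.
    static long limitBytes(long heapBytes, double fraction) {
        return (long) (heapBytes * fraction);
    }

    // With use_real_memory = true, the check is: current heap usage + newly reserved bytes > limit.
    static boolean wouldTrip(long realUsageBytes, long newBytesReserved, long limitBytes) {
        return realUsageBytes + newBytesReserved > limitBytes;
    }

    public static void main(String[] args) {
        long heap = 3L * 1024 * 1024 * 1024;   // assumption: 3 GiB heap, inferred from bytes_limit
        long limit = limitBytes(heap, 0.95);   // default 7.x limit of 95%
        long realUsage = 3144829848L;          // "real usage" from the error
        long newBytes = 1202L;                 // "new bytes reserved" from the error

        System.out.println("limit        = " + limit);                  // 3060164198
        System.out.println("bytes wanted = " + (realUsage + newBytes)); // 3144831050
        System.out.println("trips        = " + wouldTrip(realUsage, newBytes, limit)); // true
    }
}
```

Note that the request itself reserved only 1202 bytes. The heap was already above the limit before the request arrived, so the rejection reflects overall heap usage on the node rather than the size of this particular request.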
Thoughts:
I'm having a hard time pinpointing the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb VM), the problem becomes very visible, so one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.
Questions:
I would like to understand which scenarios could have led to this error,
and what actions I can take to handle it properly
(change circuit-breaker values, change the elasticsearch.yml configuration, change/limit my ES requests).
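For the last option, limiting requests on the application side, one possible approach in a multi-threaded Java client is to cap the number of requests in flight with a semaphore. This is a minimal sketch under stated assumptions, not a confirmed fix for this error: InFlightLimiter and peakConcurrency are hypothetical names, the cap of 2 is arbitrary, and the Callable stands in for whatever call the application makes through the high-level client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: allows at most maxInFlight requests to run at the same time.
final class InFlightLimiter {
    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Blocks until a permit is free, runs the request, and always releases the permit.
    <T> T call(Callable<T> request) throws Exception {
        permits.acquire();
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }

    // Demo: runs `threads` simulated requests and returns the highest concurrency observed.
    static int peakConcurrency(int threads, int maxInFlight) throws Exception {
        InFlightLimiter limiter = new InFlightLimiter(maxInFlight);
        AtomicInteger current = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        // Stand-in for a real request (for example a bulk or search call).
        Callable<Integer> simulatedRequest = () -> {
            int now = current.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.sleep(20);
            current.decrementAndGet();
            return now;
        };
        Callable<Integer> limited = () -> limiter.call(simulatedRequest);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(limited));
            }
            for (Future<Integer> result : results) {
                result.get(); // propagate any failure from the worker threads
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        // 8 worker threads, but never more than 2 simulated requests at once.
        System.out.println("peak in-flight requests: " + peakConcurrency(8, 2));
    }
}
```

A cap like this reduces the pressure the client puts on the cluster, but it does not change how much heap each node is already using, so it may need to be combined with the other options listed above.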
So I've spent some time researching exactly how ES implemented the new circuit-breaker mechanism, and tried to understand why we are suddenly getting these errors.
How to fix: