Search code examples
elasticsearchelasticsearch-java-apielasticsearch-7

Elasticsearch 7.x circuit breaker - data too large - troubleshoot


The problem:
Since the upgrading from ES-5.4 to ES-7.2 I started getting "data too large" errors, when trying to write concurrent bulk request (or/and search requests) from my multi-threaded Java application (using elasticsearch-rest-high-level-client-7.2.0.jar java client) to an ES cluster of 2-4 nodes.

My ES configuration:

Elasticsearch version: 7.2

custom configuration in elasticsearch.yml:   
    thread_pool.search.queue_size = 20000  
    thread_pool.write.queue_size = 500

I use only the default 7.x circuit-breaker values, such as:  
    indices.breaker.total.limit = 95%  
    indices.breaker.total.use_real_memory = true  
    network.breaker.inflight_requests.limit = 100%  
    network.breaker.inflight_requests.overhead = 2  

The error from elasticsearch.log:

    {
      "error": {
        "root_cause": [
          {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
            "bytes_wanted": 3144831050,
            "bytes_limit": 3060164198,
            "durability": "PERMANENT"
          }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      },
      "status": 429
    }

Thoughts:
I'm having hard time to pin point the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb vm), the problem become very visible, so, one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.

Questions:
I would like to understand what scenarios could have led to this error?
and what action can I take in order to handle it properly?
(change circuit-breaker values, change es.yml configuration, change/limit my ES requests)


Solution

  • So I've spent some time researching how exactly ES implemented the new circuit breaker mechanism, and tried to understand why we are suddenly getting those errors?

    1. the circuit breaker mechanism exists since the very first versions.
    2. we started experience issues around it when moving from version 5.4 to 7.2
    3. in version 7.2 ES introduced a new way for calculating circuit-break: Circuit-break based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767)
    4. In our internal upgrade of ES to version 7.2, we changed the jdk from 8 to 11.
    5. also as part of our internal upgrade we changed the jvm.options default configuration, switching the official recommended CMS GC with the G1GC GC which have a fairly new support by elasticsearch.
    6. considering all the above, I found this bug that was fixed in version 7.4 regarding the use of circuit-breaker together with the G1GC GC: https://github.com/elastic/elasticsearch/pull/46169

    How to fix:

    1. change the configuration back to CMS GC.
    2. or, take the fix. the fix for the bug is just a configuration change that can be easily changed and tested in your deployment.