Producer failing to publish; broker shows NotEnoughReplicasException


I have started a Strimzi Kafka cluster in an AWS EKS environment, using the following broker configuration from examples/metrics/kafka-metrics.yaml, with persistent-claim storage of 10G.

replicas: 3
...
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
default.replication.factor: 3
min.insync.replicas: 2

The issue is that producers are failing to publish to TEST-TOPIC, which has a single partition, and the broker logs the following errors.

Error processing append operation on partition TEST-TOPIC-0 (kafka.server.ReplicaManager) [data-plane-kafka-request-handler-4]
org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the current ISR Set(1) is insufficient to satisfy the min.isr requirement of 2 for partition TEST-TOPIC-0
  1. Although all brokers are online, what could cause the ISR to shrink to 1, below the min.insync.replicas of 2?
  2. Will the state of the cluster recover automatically? If not, is there a way to recover without impacting the clients?
  3. Ours is a mission critical application hence at any cost producer should publish. Is there a recommended Kafka configuration to achieve this?

[Image: dashboard]


Solution

  • There are many reasons why replicas might fall out of sync. For example, slow networking or storage may prevent followers from keeping up, your producers may not be configured to wait for messages to be replicated (acks=all), or your brokers may be poorly balanced, leaving some of them overloaded. Without detailed knowledge of your environment and the logs, it is hard to say why the replicas are out of sync or whether they will resync on their own.


    In any case, "Ours is a mission-critical application hence at any cost producer should publish" is either the wrong approach or suggests that something is misconfigured. Your topic has a replication factor of 3 and a minimum of 2 in-sync replicas, which suggests you want reliability and availability. If that is the case, you should not want your producers to publish at any cost; you need them to publish at a rate the brokers can keep up with. If you really do want to publish at any cost, even at the cost of losing messages or availability (not common, but there are use cases like that), then you should configure your topic differently, for example by not setting the minimum number of in-sync replicas to 2.
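The trade-off described above can be sketched as a small model of the broker-side check. This is an illustration only, not Kafka's actual code: the class and method names are hypothetical, and it assumes the min.insync.replicas check applies only to writes sent with acks=all (represented here as -1).

```java
// Hypothetical model of the broker's min.insync.replicas check; not Kafka source code.
final class MinIsrModel {

    // acks: 0, 1, or -1 (meaning "all"). Assumption: the ISR-size check
    // is enforced only when the producer asks for acks=all.
    static boolean brokerAcceptsWrite(int acks, int isrSize, int minInsyncReplicas) {
        return acks != -1 || isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        // The situation from the question: ISR of 1, min.insync.replicas of 2, acks=all.
        System.out.println(brokerAcceptsWrite(-1, 1, 2)); // false (NotEnoughReplicasException)
        // Same ISR with min.insync.replicas lowered to 1: the write is accepted.
        System.out.println(brokerAcceptsWrite(-1, 1, 1)); // true
        // acks=1 skips the check entirely, so only the leader holds the message.
        System.out.println(brokerAcceptsWrite(1, 1, 2));  // true
    }
}
```

In the two accepted cases an acknowledged message may exist on a single broker only, which is the loss of reliability the answer warns about.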

    If you want both at the same time, then you have to make sure all the components work in sync. That may mean producers are sometimes slowed down while waiting for replication to happen, or blocked when there are not enough in-sync replicas, as in the case above.
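For the "both at the same time" case, a producer can be set up to wait and retry while the ISR recovers instead of failing fast. The following is a minimal sketch, not a recommendation from the answer: the class name, the bootstrap address, and the timeout values are assumptions chosen for illustration, and the settings are built as plain strings so the example runs without the Kafka client jars. Serializers and other required settings are left out.

```java
import java.util.Properties;

// Hypothetical helper; the keys are standard Kafka producer settings, the values are illustrative.
final class ReliableProducerConfig {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Wait for all in-sync replicas to acknowledge each write.
        props.setProperty("acks", "all");
        // Avoid duplicates when a send is retried.
        props.setProperty("enable.idempotence", "true");
        // Keep retrying retriable errors (NotEnoughReplicasException is one)
        // until delivery.timeout.ms expires.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Assumed upper bound on how long a record may wait for the ISR to recover.
        props.setProperty("delivery.timeout.ms", "300000");
        // Assumed pause between retries, to avoid hammering a struggling broker.
        props.setProperty("retry.backoff.ms", "500");
        return props;
    }

    public static void main(String[] args) {
        // "my-cluster-kafka-bootstrap:9092" is a placeholder address, not taken from the question.
        build("my-cluster-kafka-bootstrap:9092")
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With settings like these, a send that hits an under-sized ISR is retried in the background rather than reported as failed immediately, but it still fails once the delivery timeout runs out, so the application needs to handle that error.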