go kubernetes apache-kafka confluent-platform

kafka commit during rebalancing

The scenario:

Kafka version 2.4.1.
Kafka partitions are processing messages actively.
CPU usage is low, memory usage is moderate, and no throttling is observed.
Golang applications are deployed on k8s using Confluent's Go client, version 1.7.0.
When k8s deletes some of the pods, the Kafka consumer group goes into rebalancing.
The message that was being processed during this rebalance gets stuck midway and takes around 17 minutes to complete; the usual processing time is 3-4 seconds at most.
No DB throttling, load is actually not even 10% of our peak.
k8s pods have 1 core and 1gb of memory.
Messages are consumed and processed in the same thread.
Earlier we found that one of the brokers in the 6-node cluster was unhealthy and replaced it; we started facing the issue after that.

Question - Why did the message get stuck? Is it because rebalancing made the processing thread hang, or something else?

Thanks in advance for your answers!

Solution

Messages are stuck because of the rebalancing that is happening for your consumer group (CG). Rebalancing is a normal procedure in Kafka and is always triggered when a member joins or leaves the CG. During a rebalance, consumers stop processing messages for some period of time, so events from the topic are processed with some delay. But if the CG is stuck in PreparingRebalance, you will not process any data at all.

You can identify the CG state with the Kafka command-line tools, for example:

kafka-consumer-groups.sh --bootstrap-server $BROKERS:$PORT --group $CG --describe --state

It should show you the state of the CG, for example:

GROUP                     COORDINATOR (ID)          ASSIGNMENT-STRATEGY  STATE           #MEMBERS
name-of-consumer-group brokerX.com:9092 (1)                      Empty           0

In the example above, STATE is Empty.
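If you want to check the state from a script rather than by eye, here is a minimal Go sketch that pulls the STATE and #MEMBERS columns out of that output. It assumes the layout shown above (a header line starting with GROUP, followed by one data row) and simply takes the last two whitespace-separated fields, since the ASSIGNMENT-STRATEGY column may be blank. The names `GroupStatus` and `parseGroupState` are made up for this example; they are not part of any Kafka tooling.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// GroupStatus holds the two columns of interest from the tool's output.
// (Hypothetical type, defined only for this sketch.)
type GroupStatus struct {
	State   string
	Members int
}

// parseGroupState scans the text printed by
// `kafka-consumer-groups.sh --describe --state`, finds the header line,
// and reads STATE and #MEMBERS from the first data row after it.
// It relies on those two columns being the last two fields of the row.
func parseGroupState(output string) (GroupStatus, error) {
	lines := strings.Split(output, "\n")
	for i, line := range lines {
		if !strings.HasPrefix(strings.TrimSpace(line), "GROUP") {
			continue // not the header yet
		}
		for _, row := range lines[i+1:] {
			fields := strings.Fields(row)
			if len(fields) < 2 {
				continue // skip blank lines
			}
			members, err := strconv.Atoi(fields[len(fields)-1])
			if err != nil {
				return GroupStatus{}, fmt.Errorf("parsing #MEMBERS: %w", err)
			}
			return GroupStatus{State: fields[len(fields)-2], Members: members}, nil
		}
	}
	return GroupStatus{}, errors.New("no consumer group row found")
}

func main() {
	// Sample copied from the output shown above.
	sample := `
GROUP                     COORDINATOR (ID)          ASSIGNMENT-STRATEGY  STATE           #MEMBERS
name-of-consumer-group brokerX.com:9092 (1)                      Empty           0
`
	st, err := parseGroupState(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("state=%s members=%d\n", st.State, st.Members) // prints "state=Empty members=0"
}
```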

A consumer group can be in one of 5 states:

Stable - the CG is stable and all members have connected successfully.

Empty - there are no members in the group (this usually means the module is down or has crashed).

PreparingRebalance - members are connecting to the CG (this may indicate an issue with the client if members keep crashing, but it is also the state a CG passes through before it becomes Stable).

CompletingRebalance - the members have joined and the rebalance that began in PreparingRebalance is finishing, with the group waiting for the partition assignment.

Dead - the consumer group does not have any members and its metadata has been removed.

To determine whether a CG stuck in PreparingRebalance is a cluster-side or a client-side issue, stop the clients and run the command again to check the CG state. If the CG still shows members, you have to restart the broker that the command output lists as the Coordinator of that CG (brokerX.com:9092 in the example). If the CG becomes Empty once you stop all clients connected to it, something is off with the client code or data that causes members to keep leaving and rejoining the CG; as a result the CG always appears to be in PreparingRebalance, and you will need to investigate why this is happening.

From what I recall, there was a bug in Kafka 2.4.1 that was fixed in 2.4.1.1; you can read about it here:

My troubleshooting steps should show you how to verify whether you are hitting this bug or whether it is just bad client code.
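The check described above can be written down as a small decision function. This is only a sketch of the reasoning in this answer, not an official diagnostic: `diagnose` and the verdict constants are hypothetical names, and the inputs are the STATE and #MEMBERS values you observe after all clients of the group have been stopped.

```go
package main

import "fmt"

// Verdicts returned by diagnose. (Hypothetical names for this sketch.)
const (
	SuspectBroker = "broker side: restart the broker listed as Coordinator of the CG"
	SuspectClient = "client side: investigate why members keep leaving/rejoining the CG"
	Inconclusive  = "inconclusive: re-run the describe command and check again"
)

// diagnose applies the rule of thumb from the troubleshooting steps.
// state and members must be read AFTER every client of the group is stopped.
func diagnose(state string, members int) string {
	switch {
	case members > 0:
		// The coordinator still reports members although no client is running.
		return SuspectBroker
	case state == "Empty":
		// The group drained cleanly, so the rebalance loop came from the clients.
		return SuspectClient
	default:
		// Any other combination is not covered by the steps above.
		return Inconclusive
	}
}

func main() {
	// Example: clients stopped, but the CG still lists two members.
	fmt.Println(diagnose("PreparingRebalance", 2))
	// Example: clients stopped and the CG drained to Empty.
	fmt.Println(diagnose("Empty", 0))
}
```

Feed it the values from `kafka-consumer-groups.sh --describe --state` once the pods are scaled down; the first case points at the coordinator broker, the second at your consumer code.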