The scenario:
Question: why did the message get stuck? Is it because rebalancing made the processing thread hang, or is it something else?
Thanks in advance for your answers!
The messages are stuck because of the rebalancing that is happening in your consumer group (CG). Rebalancing is a normal Kafka procedure and is triggered when a member joins or leaves the CG. During a rebalance, consumers stop processing messages for some time, so events from the topic are processed with a delay. But if the CG is stuck in PreparingRebalance, you will not process any data at all.
You can check the CG state with the Kafka command-line tools, for example:
kafka-consumer-groups.sh --bootstrap-server $BROKERS:$PORT --group $CG --describe --state
It should show you the state of the CG, for example:
GROUP                   COORDINATOR (ID)      ASSIGNMENT-STRATEGY  STATE  #MEMBERS
name-of-consumer-group  brokerX.com:9092 (1)                       Empty  0
In the example above, the STATE is Empty.
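If you want to check the state from a script, here is a minimal sketch. The function name cg_state is my own (hypothetical), and it assumes the output looks like the example above: one header line starting with GROUP, then one data line where STATE is the second-to-last column. Extra warning lines in the output would confuse it, so treat it as a starting point only.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read the output of
#   kafka-consumer-groups.sh ... --describe --state
# on stdin and print the STATE column of the first data line.
# ASSIGNMENT-STRATEGY can be empty, so the column is counted from the
# right: #MEMBERS is the last field, STATE is the one before it.
cg_state() {
  awk 'NF >= 2 && $1 != "GROUP" { print $(NF-1); exit }'
}
```

You would pipe the command from above into it, e.g. `kafka-consumer-groups.sh --bootstrap-server $BROKERS:$PORT --group $CG --describe --state | cg_state`.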
A consumer group can be in one of five states:
Stable - the CG is stable and all members have joined successfully
Empty - there are no members in the group (this usually means the module is down or has crashed)
PreparingRebalance - members are joining the CG (it can indicate a client problem when members keep crashing, but it is also the state the CG passes through before it becomes stable)
CompletingRebalance - the PreparingRebalance phase is over and the rebalance is being completed (members are waiting for their partition assignments)
Dead - the consumer group has no members and its metadata has been removed
To tell whether a CG stuck in PreparingRebalance is a cluster issue or a client issue, stop the clients and run the command again to check the CG state.
If the CG still shows members, you have to restart the broker that the command output lists as the coordinator of that CG (brokerX.com:9092 in the example).
If the CG becomes Empty once you have stopped all clients connected to it, something is off with the client code or data that makes members keep leaving and rejoining the CG. As a result the CG is always in PreparingRebalance, and you will need to investigate why this is happening.
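The decision in the troubleshooting steps can be sketched as a small shell function. This is only an illustration of the reasoning, not an official tool: the name diagnose_after_stop and the labels it prints are made up, and it expects the STATE and #MEMBERS values you read from the command output after stopping every client.

```shell
#!/usr/bin/env bash

# Hypothetical helper: interpret the CG state observed AFTER all
# clients of the group have been stopped.
#   $1 = value of the STATE column
#   $2 = value of the #MEMBERS column
diagnose_after_stop() {
  local state="$1" members="$2"
  if [ "$members" -gt 0 ]; then
    # Members are still listed although no client is running:
    # suspect the coordinator broker shown in the output.
    echo "cluster"
  elif [ "$state" = "Empty" ]; then
    # The group emptied as expected: look at the client code/data
    # that makes members leave and rejoin.
    echo "client"
  else
    # Anything else (e.g. Dead) does not fit either case.
    echo "inconclusive"
  fi
}
```

For the example output shown earlier, `diagnose_after_stop Empty 0` prints `client`.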
From what I recall, there was a bug in Kafka version 2.4.1 that was fixed in 2.4.1.1. You can read about it here:
The troubleshooting steps above should show you how to verify whether you are hitting that bug or it is just bad code.