Imagine you have a Kafka consumer group with 3 members (M1, M2, and M3). Each member runs in its own process, and each currently has one partition assigned (P1, P2, and P3).
M1 receives a poison message from P1, crafted so that it triggers a stack overflow exception, killing M1. This eventually triggers a rebalance, and M2 is assigned P1.
M2 now receives the same poison message from P1 and also dies, triggering a rebalance that gives P1 to M3.
Finally, M3 will receive the same message and die.
At this point you have taken out your entire set of consumers, and any new ones you spin up will also die until you have fixed the message in Kafka directly.
My question is: how does one prevent this cascading failure? I'm happy for the affected partition to be ignored until the issue is resolved, and I can see how I would use the pause functionality to achieve this in the case of a handled exception. However, I can't handle a stack overflow, so I'm not able to easily pause the partition.
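For what it's worth, in the Java client a stack overflow surfaces as java.lang.StackOverflowError, which extends Error rather than Exception, so an ordinary catch (Exception) handler around my processing code never sees it. A minimal stdlib-only sketch (class and method names are mine, not from any Kafka API):

```java
public class StackOverflowDemo {
    // Unbounded recursion: guaranteed to blow the stack.
    static int depth(int n) { return depth(n + 1); }

    static String tryProcess() {
        try {
            return "depth " + depth(0);
        } catch (Exception e) {
            return "caught Exception";  // never reached: StackOverflowError is not an Exception
        } catch (Throwable t) {
            return "caught " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryProcess());  // prints "caught StackOverflowError"
    }
}
```

(Catching Throwable does intercept it in Java, but recovering from an Error mid-processing is generally considered unsafe, so I don't want to rely on that.)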
Does Kafka have any mechanisms for handling this type of cascading failure?
One of the best questions on Apache Kafka.
You can use the assign(Collection&lt;TopicPartition&gt; partitions) method to avoid this scenario. A manually assigned consumer does not participate in group rebalancing, so when one member dies its partition is not handed over to the survivors. In this particular case you could do the following:
M1
import java.util.Arrays;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;

Consumer<K, V> m1 = getConsumer(); // getConsumer(): your own consumer factory
TopicPartition tp = new TopicPartition("topic", 0);
m1.assign(Arrays.asList(tp)); // static assignment: no group rebalancing
M2
Consumer<K, V> m2 = getConsumer();
TopicPartition tp = new TopicPartition("topic", 1);
m2.assign(Arrays.asList(tp));
M3
Consumer<K, V> m3 = getConsumer();
TopicPartition tp = new TopicPartition("topic", 2);
m3.assign(Arrays.asList(tp));
NOTE: The code above is just a sketch; getConsumer() stands in for however you construct your consumers.
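The difference can be made concrete with a toy simulation (plain Java, no Kafka client involved; all names here are invented for illustration): under group-managed rebalancing the poison partition keeps moving to a surviving member, which then crashes in turn, whereas under static assignment only the partition's owner is lost.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the cascade described in the question (no real Kafka).
// A "poison" partition kills whichever member currently owns it.
public class CascadeToy {

    static int survivorsAfterPoison(boolean rebalanceOnFailure, int memberCount) {
        Deque<String> liveMembers = new ArrayDeque<>();
        for (int i = 1; i <= memberCount; i++) liveMembers.add("M" + i);

        // The member owning the poison partition crashes.
        liveMembers.poll();
        // With rebalancing, the partition is handed to the next live member,
        // which then crashes too, until nobody is left.
        while (rebalanceOnFailure && !liveMembers.isEmpty()) {
            liveMembers.poll();
        }
        return liveMembers.size();
    }

    public static void main(String[] args) {
        System.out.println(survivorsAfterPoison(true, 3));  // subscribe()-style: 0 survivors
        System.out.println(survivorsAfterPoison(false, 3)); // assign()-style: 2 survivors
    }
}
```

The trade-off is that with assign() you give up automatic failover entirely: the poison partition simply goes unconsumed until you restart its owner.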
You can find a detailed explanation here.
If you need any further help, let me know. Happy to help.