c# · apache-kafka · kafka-consumer-api · reliability

Prevent cascading failures in Kafka consumers


Imagine you have a Kafka consumer group with 3 members (M1, M2, and M3). Each member is running in its own process, and each currently has one partition assigned (P1, P2, and P3).

M1 receives a poison message from P1 which is crafted such that it triggers a stack overflow exception, killing M1. This will eventually trigger a rebalance, and P1 is reassigned to M2.

M2 will now receive the same poison message from P1 and also die, triggering another rebalance and giving P1 to M3.

Finally, M3 will receive the same message and die.

At this point you have taken out your entire set of processors - and any new ones you spin up will also die until you have fixed the message in Kafka directly.

My question is: how does one prevent this cascading failure? I'm happy for the affected partition to be ignored until the issue is resolved, and I can see how I would use the Pause functionality to achieve this in the case of a handled exception. However, I can't handle a stack overflow, so I am not able to easily pause the partition.
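For the handled-exception case the question mentions, the idea can be modelled without a broker. The sketch below strips out the Kafka specifics (all class and method names here are illustrative, not the real client API): a handler failure pauses only the offending partition, and the remaining partitions keep being consumed.

```java
import java.util.*;

// Minimal model of "pause the partition on a handled exception".
// Names are hypothetical; this is not the Kafka client API.
class PausingConsumer {
    private final Map<Integer, Deque<String>> partitions;
    private final Set<Integer> paused = new HashSet<>();

    PausingConsumer(Map<Integer, Deque<String>> partitions) {
        this.partitions = partitions;
    }

    // Drains all unpaused partitions, pausing any whose handler throws.
    List<String> consumeAll() {
        List<String> handled = new ArrayList<>();
        for (Map.Entry<Integer, Deque<String>> e : partitions.entrySet()) {
            if (paused.contains(e.getKey())) continue;
            Deque<String> queue = e.getValue();
            while (!queue.isEmpty()) {
                try {
                    handle(queue.peek());       // may throw on a poison message
                    handled.add(queue.poll());  // "commit" only on success
                } catch (RuntimeException ex) {
                    paused.add(e.getKey());     // isolate just this partition
                    break;
                }
            }
        }
        return handled;
    }

    Set<Integer> pausedPartitions() { return paused; }

    private void handle(String msg) {
        if (msg.equals("POISON")) throw new IllegalStateException("bad record");
    }
}
```

As the question notes, this only works while the exception is catchable; a stack overflow takes the whole process down before any pause can happen.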

Does Kafka have any mechanisms for handling this type of cascading failure?


Solution

  • One of the best questions on Apache Kafka.

    Well, we can use the assign(Collection<TopicPartition> partitions) method to avoid such scenarios. With manual assignment the consumers bypass group management entirely, so a crash never triggers a rebalance and the poison partition is never handed to a healthy member. The consumer that owns the bad partition will still die, but the failure no longer cascades. In this particular case we can do the following:

    M1

        // Statically assign partition 0; no subscribe(), so no group rebalance
        Consumer<K, V> m1 = getConsumer();
        TopicPartition tp = new TopicPartition("topic", 0);
        m1.assign(Arrays.asList(tp));
    

    M2

        Consumer<K, V> m2 = getConsumer();
        TopicPartition tp = new TopicPartition("topic", 1);
        m2.assign(Arrays.asList(tp));
    

    M3

        Consumer<K, V> m3 = getConsumer();
        TopicPartition tp = new TopicPartition("topic", 2);
        m3.assign(Arrays.asList(tp));
    

    NOTE: The code above is just an example.

    A detailed explanation of manual partition assignment can be found in the KafkaConsumer Javadoc.

    If you need any further help let me know. Happy to help.