Tags: apache-kafka, kafka-consumer-api, apache-kafka-streams, rebalancing

Partitions processing stuck until state store is rebuilt during rebalancing in Kafka Streams


Let's assume I have a stateful Kafka Streams application that consumes data from a topic with 3 partitions. At the moment I have 2 instances of this application running. Let's say instance1 has partitions part1 and part2 assigned, and instance2 has part3.

Now I want to add a new instance to make full use of the available parallelism.

As I understand it, as soon as I start a new instance, a rebalance occurs: one of the partitions part1 or part2, together with its local state store, will be migrated from the existing instance to the newly added one. In this example, let's imagine that part1 migrates to instance3.

At the same time, I realize that the new instance, instance3, will not start processing new data until it has restored the local state store from the changelog topic, which may take a long time.

During the period between starting the new instance and the state store being restored:

  • does it mean that the data from part1 is not being processed and is stuck until instance3 finishes starting up?
  • if yes, what approaches are there to estimate how long it will take instance3 to build the local state store?
  • during this time, are the other instances unaffected by the rebalance, and do they keep processing data with no downtime (instance1 - part2, instance2 - part3)?

Solution

  • Rebalancing behavior has changed in recent releases:

    from version 2.4.0 with KIP-429

    • incremental cooperative rebalancing was added, replacing the stop-the-world (eager) rebalancing protocol
    • optimized for the cloud, in the sense of better rebalance behavior for members that drop out (e.g. when a Pod dies and restarts)
    • a consumer does not need to revoke a partition if the group coordinator assigns the same partition to it again

    => part2 and part3 are not stuck and continue to be processed
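
    To make the difference concrete, here is a toy model of which partitions pause under each protocol. It is only a sketch for this question's example: the class and method names (RebalanceSketch, pausedEager, pausedCooperative) are made up for illustration, and it does not use or reproduce the real Kafka rebalance protocol.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these names are hypothetical and not part of any Kafka API.
public class RebalanceSketch {

    // Eager ("stop-the-world") protocol: every member gives up every partition
    // before the new assignment is handed out, so all partitions pause.
    public static Set<String> pausedEager(Map<String, String> oldOwners) {
        return new TreeSet<>(oldOwners.keySet());
    }

    // Cooperative protocol: only partitions whose owner changes are revoked;
    // partitions that stay with the same member keep being processed.
    public static Set<String> pausedCooperative(Map<String, String> oldOwners,
                                                Map<String, String> newOwners) {
        Set<String> moved = new TreeSet<>();
        for (Map.Entry<String, String> entry : oldOwners.entrySet()) {
            String newOwner = newOwners.get(entry.getKey());
            if (!entry.getValue().equals(newOwner)) {
                moved.add(entry.getKey());
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Assignment from the question, before instance3 joins.
        Map<String, String> before = new HashMap<>();
        before.put("part1", "instance1");
        before.put("part2", "instance1");
        before.put("part3", "instance2");

        // After the rebalance, part1 moves to instance3.
        Map<String, String> after = new HashMap<>(before);
        after.put("part1", "instance3");

        System.out.println("eager pauses:       " + pausedEager(before));
        System.out.println("cooperative pauses: " + pausedCooperative(before, after));
    }
}
```

    In this model the eager protocol pauses part1, part2 and part3, while the cooperative one touches only part1.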

    from version 2.6.0 with KIP-441

    • improves Kafka Streams scale-out behavior, especially for stateful tasks
    • previously, some tasks were blocked from processing until the state store was rebuilt, which could take hours
    • now the new instance first catches up the state store from the changelog and only then takes over the task as active
    • no downtime during scale-out

    => part1 continues to be processed on instance1 until instance3 has rebuilt the state store for part1 and is ready to take over its processing
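
    The warm-up behavior from KIP-441 can be tuned through Streams configuration. Below is a minimal sketch using plain java.util.Properties so it runs without the Kafka jars. The config keys are, as far as I know, the ones introduced in 2.6.0; the application id, broker address and the class name WarmupConfigSketch are placeholders, and the values are illustrative, so check them against the documentation for your version.

```java
import java.util.Properties;

// Sketch only: the class name, application id and broker address are
// placeholders, and the values are illustrative rather than recommendations.
public class WarmupConfigSketch {

    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put("application.id", "stateful-app");      // hypothetical id
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker

        // Maximum changelog lag (in records) at which an instance still
        // counts as caught up and may receive the task as active.
        props.put("acceptable.recovery.lag", "10000");

        // How many extra warm-up task copies may restore state in the
        // background at the same time.
        props.put("max.warmup.replicas", "2");

        // How often a probing rebalance checks whether the warm-up copies
        // have caught up and the task can be handed over.
        props.put("probing.rebalance.interval.ms", "600000");
        return props;
    }

    public static void main(String[] args) {
        streamsProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

    In a real application these properties would be passed to the KafkaStreams constructor together with the topology.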