java, apache-kafka, spring-kafka, spring-cloud-stream

Spring Cloud Stream Kafka consumer stuck with long-running job and a large value for max.poll.interval.ms


We have some long-running jobs that are also implemented with Spring Cloud Stream and the Kafka binder. The issue we are facing is that the default values for max.poll.interval.ms and max.poll.records are not appropriate for our use case: we need a relatively large max.poll.interval.ms (a few hours) and a relatively small max.poll.records (e.g. 1), aligned with the longest-running job the consumer could ever pick up. This keeps the consumer out of a rebalance loop, but it creates operational problems of its own: sometimes the consumer gets stuck on restart and does not consume any messages until max.poll.interval.ms has elapsed.
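For reference, the two settings above end up as plain Kafka consumer properties. We actually set them through the Spring Cloud Stream Kafka binder's consumer configuration, but the raw-client equivalent of the setup described here looks roughly like this (the broker address and the three-hour interval are illustrative):

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class LongJobConsumerConfig {

    // Illustrative values: max.poll.interval.ms must exceed the longest job a
    // single record can trigger, and max.poll.records = 1 keeps one poll()
    // from claiming more work than one job.
    static Properties longJobConsumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "foo");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG,
                  (int) Duration.ofHours(3).toMillis());
        return props;
    }
}
```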

Is this because of the way the Spring Cloud Stream poll loop is implemented? Would it help if I used a plain synchronous consumer and managed the poll() myself?

The consumer logs the loss of its heartbeat, and this is the message I can see in the Kafka broker log when the consumer is stuck:

[GroupCoordinator 11]: Member consumer-3-f46e14b4-5998-4083-b7ec-bed4e3f374eb in group foo has failed, removing it from the group

Solution

  • Spring Cloud Stream (message-driven) is not a good fit for this application. It would be better to manage the consumer yourself: close it after the poll(), process the job, then create a new consumer, commit the offset, and poll() again (see the sketch below).
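A minimal sketch of such a loop, assuming a single consumer instance reading one known partition via assign() so that group membership and rebalancing never come into play; the topic jobs, partition 0, group foo, broker address, and runJob() are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LongJobConsumerLoop {

    // Placeholder topic/partition and group id for illustration only.
    private static final TopicPartition PARTITION = new TopicPartition("jobs", 0);
    private static final String GROUP_ID = "foo";

    public static void main(String[] args) {
        while (true) {
            ConsumerRecord<String, String> record = fetchNextRecord();
            if (record == null) {
                continue;                       // nothing to consume yet
            }
            runJob(record.value());             // may take hours; no consumer is open meanwhile
            commitOffset(record.offset() + 1);  // commit only after the job succeeded
        }
    }

    // Open a consumer just long enough to read one record from the last committed offset.
    private static ConsumerRecord<String, String> fetchNextRecord() {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps())) {
            consumer.assign(List.of(PARTITION));   // manual assignment: no rebalancing
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(30));
            return records.isEmpty() ? null : records.iterator().next();
        }
    }

    // Re-create a consumer only to record progress for the group, then close it again.
    private static void commitOffset(long nextOffset) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps())) {
            consumer.assign(List.of(PARTITION));
            consumer.commitSync(Map.of(PARTITION, new OffsetAndMetadata(nextOffset)));
        }
    }

    private static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, GROUP_ID);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);
        return props;
    }

    private static void runJob(String payload) {
        // Placeholder for the long-running work.
    }
}
```

With this shape, max.poll.interval.ms can stay at its default because no consumer is alive while the job runs; the trade-off is at-least-once delivery, since a crash between runJob() and the commit means the same record is picked up again on restart.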