I am unclear why we need both session.timeout.ms
and max.poll.interval.ms
and when would we use one or the other or both? It seems like both settings indicate the upper bound on the time the coordinator will wait to get the heartbeat from a consumer before assuming it's dead.
Also how does it behave for versions 0.10.1.0+ based on KIP-62?
Before KIP-62, there was only session.timeout.ms
(ie, Kafka 0.10.0
and earlier). max.poll.interval.ms
was introduced via KIP-62 (part of Kafka 0.10.1
).
KIP-62, decouples heartbeats from calls to poll()
via a background heartbeat thread, allowing for a longer processing time (ie, time between two consecutive poll()
) than heartbeat interval.
Assume processing a message takes 1 minute. If heartbeat and poll were coupled (ie, before KIP-62), you will need to set session.timeout.ms
larger than 1 minute to prevent consumer to time out. However, if a consumer dies, it also takes longer than 1 minute to detect the failed consumer.
KIP-62 decouples polling and heartbeat allowing to send heartbeats between two consecutive polls. Now you have two threads running, the heartbeat thread and the processing thread and thus, KIP-62 introduced a timeout for each. session.timeout.ms
is for the heartbeat thread while max.poll.interval.ms
is for the processing thread.
Assume, you set session.timeout.ms=30000
, thus, the consumer heartbeat thread must sent a heartbeat to the broker before this time expires. On the other hand, if processing of a single message takes 1 minutes, you can set max.poll.interval.ms
larger than one minute to give the processing thread more time to process a message.
If the processing thread dies, it takes max.poll.interval.ms
to detect this. However, if the whole consumer dies (and a dying processing thread most likely crashes the whole consumer including the heartbeat thread), it takes only session.timeout.ms
to detect it.
The idea is, to allow for a quick detection of a failing consumer even if processing itself takes quite long.
Implementation Detail
The new timeout max.poll.interval.ms
is mainly a client side concept: if poll()
is not called within max.poll.interval.ms
, the heartbeat thread will detect this case and send a leave-group request to the broker. -- max.poll.interval.ms
is still relevant for consumer group rebalances: if a rebalance is triggered, consumers have max.poll.interval.ms
time to re-join the group by calling poll()
client side which triggers a join-group request.