Search code examples
apache-kafkakafka-consumer-api

Kafka Consumer keeps rebalancing


My Kafka Consumer client keeps rebalancing between frequent polls even if no source records are being processed.

Also, I have consumer.pause() and consumer.resume() implemented on the right place to hold the consumer from polling until the polled number of consumer records are processed on isolation with the backend APIs. Not sure what is causing the issue.

Consumer logs

2022-07-21 10:27:08; LOG_LEVEL="INFO"; SOURCE="MyKAFKAConsumer"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="**Polling...**"

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="MyKAFKAConsumer"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="**Polling...**"

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Group coordinator brokerHost:9093 (id: 2147444144 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted."

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Discovered group coordinator brokerHost:9093 (id: 2147444144 rack: null)"

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Discovered group coordinator brokerHost:9093 (id: 2147444144 rack: null)"

2022-07-21 10:27:25; LOG_LEVEL="ERROR"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Offset commit failed on partition MySourceTopic-0 at offset 38572: The coordinator is not aware of this member."

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] OffsetCommit failed with Generation{generationId=18661, memberId='MyKAFKAConsumerClientId1-d410deb4-d6f9-4f7e-b789-f866b89817e6', protocol='range'}: The coordinator is not aware of this member."

2022-07-21 10:27:25; LOG_LEVEL="WARN"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Asynchronous auto-commit of offsets {MySourceTopic-0=OffsetAndMetadata{offset=38572, leaderEpoch=132, metadata=''}, MySourceTopic-2=OffsetAndMetadata{offset=38566, leaderEpoch=122, metadata=''}, MySourceTopic-1=OffsetAndMetadata{offset=38779, leaderEpoch=118, metadata=''}, MySourceTopic-3=OffsetAndMetadata{offset=38585, leaderEpoch=121, metadata=''}} failed: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records."

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Failing OffsetCommit request since the consumer is not part of an active group"

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Attempt to heartbeat with stale Generation{generationId=18661, memberId='MyKAFKAConsumerClientId1-d410deb4-d6f9-4f7e-b789-f866b89817e6', protocol='range'} and group instance id Optional.empty failed due to UNKNOWN_MEMBER_ID, ignoring the error"

2022-07-21 10:27:25; LOG_LEVEL="WARN"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Synchronous auto-commit of offsets {MySourceTopic-0=OffsetAndMetadata{offset=38572, leaderEpoch=132, metadata=''}, MySourceTopic-2=OffsetAndMetadata{offset=38566, leaderEpoch=122, metadata=''}, MySourceTopic-1=OffsetAndMetadata{offset=38779, leaderEpoch=118, metadata=''}, MySourceTopic-3=OffsetAndMetadata{offset=38585, leaderEpoch=121, metadata=''}} failed: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group."

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Giving away all assigned partitions as lost since generation has been reset,indicating that consumer is no longer part of the group"

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Lost previously assigned partitions MySourceTopic-0, MySourceTopic-2, MySourceTopic-1, MySourceTopic-3"

2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] (Re-)joining group"
2022-07-21 10:27:25; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] (Re-)joining group"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Successfully joined group with generation Generation{generationId=18663, memberId='MyKAFKAConsumerClientId1-f55c7204-c932-40cd-a9b3-0351f9904026', protocol='range'}"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.AbstractCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Successfully synced group in generation Generation{generationId=18663, memberId='MyKAFKAConsumerClientId1-f55c7204-c932-40cd-a9b3-0351f9904026', protocol='range'}"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Notifying assignor about the new Assignment(partitions=[MySourceTopic-0, MySourceTopic-1, MySourceTopic-2, MySourceTopic-3])"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Adding newly assigned partitions: MySourceTopic-0, MySourceTopic-2, MySourceTopic-1, MySourceTopic-3"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Setting offset for partition MySourceTopic-0 to the committed offset FetchPosition{offset=38572, offsetEpoch=Optional[132], currentLeader=LeaderAndEpoch{leader=Optional[brokerHost:9093 (id: 33708 rack: null)], epoch=132}}"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Setting offset for partition MySourceTopic-2 to the committed offset FetchPosition{offset=38566, offsetEpoch=Optional[122], currentLeader=LeaderAndEpoch{leader=Optional[brokerHost:9093 (id: 39217 rack: null)], epoch=122}}"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Setting offset for partition MySourceTopic-1 to the committed offset FetchPosition{offset=38779, offsetEpoch=Optional[118], currentLeader=LeaderAndEpoch{leader=Optional[brokerHost:9093 (id: 30678 rack: null)], epoch=118}}"

2022-07-21 10:27:26; LOG_LEVEL="INFO"; SOURCE="org.apache.kafka.clients.consumer.internals.ConsumerCoordinator"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="[Consumer clientId=MyKAFKAConsumerClientId1, groupId=MyKAFKAConsumeGroupId1] Setting offset for partition MySourceTopic-3 to the committed offset FetchPosition{offset=38585, offsetEpoch=Optional[121], currentLeader=LeaderAndEpoch{leader=Optional[brokerHost:9093 (id: 38300 rack: null)], epoch=121}}"

2022-07-21 10:27:30; LOG_LEVEL="INFO"; SOURCE="MyKAFKAConsumer"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="**Polling...**"

2022-07-21 10:27:35; LOG_LEVEL="INFO"; SOURCE="MyKAFKAConsumer"; PLATFORM="Retail_Products"; SERVICE="MyKAFKAConsumer"; EVENT_MESSAGE="**Polling...**"

Consumer Configuration

    auto.commit.interval.ms = 5000
    auto.offset.reset = earliest
    enable.auto.commit = true
    fetch.max.bytes = 52428800
    fetch.max.wait.ms = 500
    fetch.min.bytes = 1
    heartbeat.interval.ms = 3000
    max.poll.interval.ms = 300000
    max.poll.records = 100

Consumer poll() implementation

        while (true) {
            try {
                ConsumerRecords<String, GenericRecord> records = null;
                LOGGER.info("Polling...{}");
                records = consumer.poll(Duration.ofSeconds(5));

                if (records != null && records.count() > 0) {
                    consumer.pause(consumer.assignment());
                    recordExecutor.execute(records);
                }

                consumer.resume(consumer.assignment());
                consumer.commitSync();

            } catch (SerializationException e) {
                continue;
            } catch (Exception e) {
                continue;
            }
        }

Thanks in advance.


Solution

  • After an extensive analysis, we have identified the root cause of these frequent rebalancing, which was due to the session.timeout between the consumers/streamers and the Group Coordinator (Leader Broker) . The session.timeout is keep happening due to the server capacity issue (100 % CPU Utilization) where these consumers/streamers are hosted. When the CPU is saturated to 100 %, all of these KAFKA Clients are getting into a Not Responding state and eventually it's session timing out.

    Event log

    group-MyKAFKAConsumeGroupId1] Group coordinator brokerHost:9093 (id: 21474 rack: null) is unavailable or invalid due to cause: session timed out without receiving a heartbeat response.isDisconnected: false. Rediscovery will be attempted.