apache-kafkaconsumer

What are the impacts of a Kafka broker being inactive for a long time and starting up after many days?


We are dealing with a production issue that might take a few days to fix. The majority of our Kafka nodes are active, but one node is down. We will bring it back up after the bugs are fixed. Our Kafka version is 2.1.x.

I would like to know the impact of starting an inactive broker after a few days.

Are there any issues we might observe? (Especially the impact on consumers while replicas are catching up on the restarted broker.)

What contingencies should be in place to roll out safely?


Solution

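  • Whenever a broker is down, it's recommended to restore it as quickly as you can. In an active cluster, committed consumer offsets expire (after offsets.retention.minutes) and old log segments are regularly deleted by retention, so the longer the broker stays down, the further its replicas fall behind.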

    We were able to restore the node after 4 days, but it wasn't an easy operation. We restored the Kafka cluster by enabling unclean leader election. We had been seeing controlled shutdowns due to bad leader assignments. After the inactive node was restored, we disabled unclean leader election again.
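Unclean leader election (the unclean.leader.election.enable setting) only matters for partitions where no in-sync replica is alive but an out-of-sync one is. The sketch below is one way to list such partitions before deciding to enable it. The input shape (a dict of replicas and ISR per partition) and the function name are assumptions made for illustration, not something Kafka returns in this form.

```python
def needs_unclean_election(partition_state, live_brokers):
    """Return partitions that can only get a leader via unclean election.

    partition_state maps (topic, partition) -> {"replicas": [...], "isr": [...]}
    (a hypothetical shape, e.g. collected from a topic describe).
    A partition qualifies when none of its in-sync replicas is alive
    but at least one of its other replicas is.
    """
    live = set(live_brokers)
    stuck = []
    for tp, state in sorted(partition_state.items()):
        isr_alive = live.intersection(state["isr"])
        any_alive = live.intersection(state["replicas"])
        if not isr_alive and any_alive:
            stuck.append(tp)
    return stuck


if __name__ == "__main__":
    state = {
        ("orders", 0): {"replicas": [1, 2, 3], "isr": [3]},     # only the dead broker was in sync
        ("orders", 1): {"replicas": [1, 2, 3], "isr": [1, 2]},  # still has live in-sync replicas
    }
    print(needs_unclean_election(state, live_brokers=[1, 2]))
```

Keeping this list small and explicit makes it easier to enable the setting only where it is needed, since an unclean election can lose messages that the out-of-sync replica never received.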

    Things to take into account:

    • In production, clients usually can't tolerate any downtime. Monitor consumer groups for long rebalances or for commits lagging beyond your SLAs.

    • Run a preferred replica election once the replicas on the restored node have caught up.

    • Reset offsets on the consumer group. This does require a short downtime, since the group must be inactive while its offsets are reset.
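The first two checks above can be sketched roughly as follows: flag partitions whose consumer lag exceeds a threshold, and build the partition list for a preferred replica election restricted to partitions where the restored broker is back in the ISR. The input dictionaries, the function names, and the threshold are assumptions for illustration. The JSON layout is the one accepted by kafka-preferred-replica-election.sh through --path-to-json-file, as far as I know for the 2.1 tooling.

```python
import json


def lagging_partitions(log_end_offsets, committed_offsets, max_lag):
    """Return {(topic, partition): lag} for partitions whose lag exceeds max_lag.

    Both inputs map (topic, partition) -> offset. Assumption: a partition
    with no committed offset is treated as lagging by its full log-end offset.
    """
    lagging = {}
    for tp, end in log_end_offsets.items():
        lag = max(end - committed_offsets.get(tp, 0), 0)
        if lag > max_lag:
            lagging[tp] = lag
    return lagging


def preferred_election_json(partition_state, restored_broker):
    """Build the election JSON for partitions that can return to the restored broker.

    partition_state maps (topic, partition) -> {"replicas": [...], "isr": [...]}.
    A partition qualifies only if the restored broker is its preferred
    (first-listed) replica and has already rejoined the ISR.
    """
    eligible = [
        {"topic": topic, "partition": partition}
        for (topic, partition), state in sorted(partition_state.items())
        if state["replicas"][0] == restored_broker
        and restored_broker in state["isr"]
    ]
    return json.dumps({"partitions": eligible})


if __name__ == "__main__":
    ends = {("orders", 0): 1500, ("orders", 1): 900}
    committed = {("orders", 0): 1480, ("orders", 1): 100}
    print(lagging_partitions(ends, committed, max_lag=500))

    state = {
        ("orders", 0): {"replicas": [3, 1, 2], "isr": [1, 2, 3]},  # caught up
        ("orders", 1): {"replicas": [3, 2, 1], "isr": [2, 1]},     # still catching up
    }
    print(preferred_election_json(state, restored_broker=3))
```

Limiting the election to partitions where the restored broker is already in sync avoids requesting leadership moves that cannot succeed yet.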

    Rollback:

    You can roll back topic partitions using the reassignment tool, but there is no easy rollback.
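As a rough illustration of the reassignment-based rollback, the sketch below turns a previously saved replica layout into the JSON that kafka-reassign-partitions.sh takes through --reassignment-json-file. The saved layout, the topic name, and the function name are hypothetical. The tool prints the current assignment before it executes a plan, and that output is what you would keep for a rollback.

```python
import json


def rollback_plan(original_assignment):
    """Build a reassignment plan that restores a previously saved replica layout.

    original_assignment maps (topic, partition) -> list of broker ids in the
    order recorded before the change (the first id is the preferred leader).
    """
    partitions = [
        {"topic": topic, "partition": partition, "replicas": list(replicas)}
        for (topic, partition), replicas in sorted(original_assignment.items())
    ]
    return json.dumps({"version": 1, "partitions": partitions}, indent=2)


if __name__ == "__main__":
    # Hypothetical layout saved before the restored broker was brought back.
    saved = {
        ("orders", 0): [3, 1, 2],
        ("orders", 1): [2, 3, 1],
    }
    print(rollback_plan(saved))
```

Moving replicas back still copies data between brokers, which is part of why this rollback is not quick.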