apache-kafka upgrade

During a rolling upgrade/restart, how to detect when a Kafka broker is "done"?


I need to automate a rolling restart of a Kafka cluster (3 brokers). I can easily do it manually: restart one broker after the other, checking the log to see when each one is fine (e.g., when the new process has joined the cluster).

What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, with all topics up to date, and so on? In my restart script I have access to the metrics, but to be frank, I have not found one that gives me a clear picture.

Put another way: what would be a good "readiness" probe that does not simply check some TCP/IP port, but looks at the actual state of the server?


Solution

  • I would suggest exposing JMX metrics and tracking the following for cluster health:

    • the active controller count (must sum to exactly 1 across the whole cluster)
    • under-replicated partitions (should be zero on a healthy cluster)
    • unclean leader elections (unless you have disabled them with unclean.leader.election.enable=false in server.properties, make sure the metric shows none)
    • ISR shrinks over a reasonable time window, e.g. 10 minutes (there should be none)

    Also, Yelp has tooling for rolling restarts, implemented in Python. It requires Jolokia JMX agents to be installed on the brokers, and it polls the metrics to make sure some of the above conditions hold before moving on.
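
The four checks above could be scripted roughly as follows. This is only a sketch, not how the Yelp tooling works internally. It assumes each broker runs a Jolokia agent reachable at a base URL such as http://broker1:8778/jolokia (8778 is the Jolokia JVM agent's default port; the host names are made up), and that every broker exposes the kafka.controller MBeans, which may not be the case in every deployment mode. The function names are hypothetical. The two "PerSec" metrics are read through their cumulative Count attribute, so "none in the window" is checked by comparing two snapshots taken some time apart.

```python
import json
import urllib.request

# Kafka MBeans for the four checks; "Value" is a gauge, "Count" a cumulative counter.
MBEANS = {
    "active_controller": ("kafka.controller:type=KafkaController,name=ActiveControllerCount", "Value"),
    "under_replicated": ("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions", "Value"),
    "unclean_elections": ("kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec", "Count"),
    "isr_shrinks": ("kafka.server:type=ReplicaManager,name=IsrShrinksPerSec", "Count"),
}
COUNTERS = ("unclean_elections", "isr_shrinks")


def read_metric(base_url, mbean, attribute, timeout=5.0):
    """Read one JMX attribute through Jolokia's HTTP 'read' operation."""
    url = f"{base_url}/read/{mbean}/{attribute}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != 200:
        raise RuntimeError(f"Jolokia error for {mbean}: {payload.get('error')}")
    return payload["value"]


def broker_snapshot(base_url):
    """Collect all four metrics from one broker (base_url is an assumed Jolokia endpoint)."""
    return {name: read_metric(base_url, mbean, attr)
            for name, (mbean, attr) in MBEANS.items()}


def cluster_problems(current, previous=None):
    """Return a list of human-readable problems; an empty list means 'looks healthy'.

    current / previous map a broker name to its snapshot dict. previous is an
    earlier snapshot from the start of the observation window (optional).
    """
    problems = []
    controllers = sum(m["active_controller"] for m in current.values())
    if controllers != 1:
        problems.append(f"expected exactly 1 active controller, found {controllers}")
    for broker, metrics in current.items():
        if metrics["under_replicated"] > 0:
            problems.append(f"{broker}: {metrics['under_replicated']} under-replicated partitions")
        before = (previous or {}).get(broker)
        if before is None:
            continue  # no earlier snapshot, so the counters cannot be compared
        for name in COUNTERS:
            delta = metrics[name] - before[name]
            if delta < 0:
                # The counter went backwards, i.e. the broker restarted in between.
                delta = metrics[name]
            if delta > 0:
                problems.append(f"{broker}: {name} increased by {delta}")
    return problems
```

A restart script could take a snapshot of every broker once the restarted one is reachable again, wait for the chosen window, take a second snapshot, and only move on to the next broker when cluster_problems(second, first) returns an empty list. Note that ISR shrinks are expected on the other brokers while one broker is down, which is why the window should start after the restarted broker is back.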