When a kafka-streams app is running and Kafka suddenly goes down, the app enters a "waiting" mode: the consumer and producer threads log warnings that they cannot connect, and when Kafka is back, everything should (theoretically) go back to normal. I'm trying to get an alert on this situation, but I can't find a place to catch it and send a log/metric. I tried the following:
streams.setUncaughtExceptionHandler: this fires only when an exception is thrown, which is not the case here.
ProductionExceptionHandler: I set the default.production.exception.handler property to my class, which implements this interface. Again, as with setUncaughtExceptionHandler, no exception is thrown here, so nothing really happens.
I know Kafka has its own metrics that I can listen to in order to find out whether a broker is down. But there can be situations where the Kafka brokers are just fine and my kafka-streams app is still not able to connect (e.g. a bad authentication configuration or VPN/VPC issues).
What can I do to catch those issues and log/report them?
Update: here are the consumer/producer logs when Kafka is not available:
2020-08-24 21:41:32,055 [my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1] WARN o.apache.kafka.clients.NetworkClient - [] [Consumer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-consumer, groupId=my-kafka-streams-app] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected
2020-08-24 21:41:32,186 [kafka-admin-client-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] WARN o.apache.kafka.clients.NetworkClient - [] [AdminClient clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
2020-08-24 21:41:32,250 [kafka-producer-network-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] WARN o.apache.kafka.clients.NetworkClient - [] [Producer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
This case is not easy to detect programmatically. The problem is that the clients don't really expose their state to Kafka Streams, and thus Kafka Streams does not really know about the disconnect. There is a KIP that proposes adding a DISCONNECTED state, but it's not easy to implement (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-457%3A+Add+DISCONNECTED+status+to+Kafka+Streams).
The exception handlers you mention don't help in this situation, as no exception is thrown (at least not within the Kafka Streams code base).
What you can try is to monitor consumer lag or some Kafka Streams metrics (such as the processing rate). They might provide a good enough proxy. Cf. https://docs.confluent.io/current/streams/monitoring.html
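As a rough illustration of the "metric as a proxy" idea, here is a minimal sketch of a watchdog that raises a flag when a rate metric has stayed at zero for too long. StallWatchdog, observe and maxIdle are hypothetical names made up for this example, not part of Kafka Streams, and the threshold is an assumption you would tune. The class is plain Java, so it does not care where the samples come from.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical helper (not a Kafka Streams API): reports a stall when a
 * sampled rate metric has not been above zero for longer than maxIdle.
 */
final class StallWatchdog {
    private final Duration maxIdle;
    // Last time a non-zero rate was seen (or the time of the first sample).
    private Instant lastProgress;

    StallWatchdog(Duration maxIdle) {
        this.maxIdle = maxIdle;
    }

    /**
     * Feed one metric sample taken at time 'now'.
     * Returns true if no progress has been seen for longer than maxIdle.
     * A NaN rate is treated like zero, i.e. as "no progress".
     */
    boolean observe(double rate, Instant now) {
        if (lastProgress == null || rate > 0.0) {
            // First sample starts the clock; any positive rate resets it.
            lastProgress = now;
        }
        return Duration.between(lastProgress, now).compareTo(maxIdle) > 0;
    }
}
```

To wire it up, you could poll streams.metrics() on a schedule (for example with a ScheduledExecutorService), pick the entry whose MetricName.name() is "process-rate", and pass its metricValue() to observe(...), logging or emitting an alert when it returns true. The exact metric name and group depend on your Kafka version, so treat "process-rate" as an assumption and check the monitoring docs linked above. Keep in mind that this is only a proxy: the rate is also zero when the input topics are simply idle, so the alert cannot tell "disconnected" apart from "nothing to process".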