apache-kafka, devops, confluent-platform, prometheus-alertmanager, aws-msk

How to monitor disk space usage for Kafka brokers in an AWS MSK cluster


We need to monitor disk space usage for Kafka brokers running in an AWS MSK cluster.

Kafka emits several metrics that can be used to monitor various aspects of a cluster, but I was unable to find a specific metric that reports disk usage for each broker.

Disk usage depends on the message and log retention policy and on the rate at which new events arrive on the various topics. Given that, how can we predict whether our brokers will run out of disk within the next day (or whatever duration we choose as a safe threshold)?

If we could monitor the average size of the event payload and the number of events per minute (or hour), that would help with this calculation. I looked through the Apache Kafka documentation for the available metrics, but I was unable to find these either.
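One possible way to approximate both numbers is sketched below. It assumes the JMX exporter names series the same way as the query in this question, and that a `MessagesInPerSec` series is exported alongside `BytesInPerSec` (the underlying Kafka MBean exists, but the exported name is an assumption here). The `instance` label and the `topic=""` matcher, which keeps only the all-topics aggregate, are also assumptions about the scrape configuration.

```promql
# Approximate average message size in bytes, per broker:
# bytes per second divided by messages per second.
# topic="" keeps only series without a topic label (assumed to be the broker-wide aggregate).
  sum by (instance) (kafka_server_BrokerTopicMetrics_FifteenMinuteRate{name="BytesInPerSec", topic=""})
/
  sum by (instance) (kafka_server_BrokerTopicMetrics_FifteenMinuteRate{name="MessagesInPerSec", topic=""})

# Approximate events per minute, per broker (the rate is per second, so multiply by 60).
sum by (instance) (kafka_server_BrokerTopicMetrics_FifteenMinuteRate{name="MessagesInPerSec", topic=""}) * 60
```

These are two separate queries. The `FifteenMinuteRate` values are already per-second moving averages, so they are used directly here rather than wrapped in `rate()`.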

avg(rate(kafka_server_BrokerTopicMetrics_FifteenMinuteRate{ name="BytesInPerSec"}[1h]))/avg(rate(kafka_server_BrokerTopicMetrics_FifteenMinuteRate{ name="BytesOutPerSec"}[1h]))

I tried the PromQL query above. If anyone can suggest a healthy range for the BytesIn/BytesOut ratio, it could be used with confidence.

All pointers are highly appreciated.


Solution

  • The existing node filesystem metrics can be used directly; Kafka does not expose any specific metric for this purpose. So I reused the following metrics, which we already use for our EKS cluster:

    node_filesystem_free_bytes / node_filesystem_size_bytes < 0.2
    

    We use similar metrics for EKS cluster node filesystem monitoring. The expression matches any filesystem with less than 20% free space, so it serves the same purpose here and gives an idea of the available disk space on any Kafka broker in the MSK cluster (just add label filters inside each metric selector).
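To address the "will a broker run out of disk within a day" part of the question, the same filesystem metric could be combined with PromQL's `predict_linear` function. The following is a minimal sketch, not a tested alert rule: the `job="msk-node"` label and the `mountpoint="/kafka/datalogs"` filter are assumptions and should be replaced with whatever labels your node exporter scrape actually produces.

```promql
# Matches when, extrapolating the last 6 hours linearly,
# free space on the filesystem would reach zero within the next 24 hours.
# job and mountpoint values are hypothetical; adjust to your scrape config.
predict_linear(
  node_filesystem_free_bytes{job="msk-node", mountpoint="/kafka/datalogs"}[6h],
  24 * 3600
) < 0
```

The 6h lookback and 24h horizon are arbitrary starting points. Because log retention deletes segments periodically, free space tends to follow a sawtooth pattern, so a linear extrapolation is only a rough estimate and is probably best paired with the static 20% threshold above.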