apache-kafka, jmx

Kafka Replication factor JMX


I'm trying to collect some Kafka telemetry while my messages are being replicated.

I suspect my bottleneck is the network when each record is replicated across 3 instances (RF=3).

I need data to support my theory. Are there any JMX metrics I can chart in Grafana that show how long it takes a record to be replicated to the three machines?

Regards.


Solution

  • Take a look at the kafka.network:type=RequestMetrics metrics. A few of them track the time spent processing produce requests on the leader and the time spent waiting for followers to replicate records. They are listed in the Monitoring section of the Kafka docs:

    • kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce: Total time the broker takes to serve a produce request
    • kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce: Time the leader spends processing the produce request
    • kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce: Time the request spends waiting for followers to replicate (non-zero for produce requests only when the producer uses acks=all)

    There are a few other metrics, including RequestQueueTimeMs, ResponseQueueTimeMs, and ResponseSendTimeMs, each measuring one of the other steps a broker takes when handling a request.

    All of these metrics expose attributes such as percentiles, min, max, and mean. Monitor those to identify potential bottlenecks in your cluster.
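To check the numbers before wiring up Grafana, the three produce metrics above can be read straight from a broker over JMX. The following is a minimal sketch, not a definitive tool. It assumes the broker was started with remote JMX enabled (for example through the JMX_PORT environment variable), that no JMX authentication is configured, and that the histogram attributes are named Mean and 99thPercentile; confirm the attribute names with jconsole on your version. The class name ProduceLatencyProbe and its helper methods are made up for illustration.

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: reads produce-request timings from one broker over JMX.
public class ProduceLatencyProbe {

    // Builds the MBean name for one produce-request timing metric.
    static ObjectName metricName(String name) throws MalformedObjectNameException {
        return new ObjectName(
            "kafka.network:type=RequestMetrics,name=" + name + ",request=Produce");
    }

    // Standard JMX-over-RMI URL; host and port are whatever your broker exposes.
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Fraction of the total request time spent waiting for followers.
    static double remoteShare(double remoteMs, double totalMs) {
        return totalMs <= 0.0 ? 0.0 : remoteMs / totalMs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: ProduceLatencyProbe <broker-host> <jmx-port>");
            return;
        }
        JMXServiceURL url =
            new JMXServiceURL(serviceUrl(args[0], Integer.parseInt(args[1])));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            double total = 0.0;
            double remote = 0.0;
            for (String metric : new String[] {"TotalTimeMs", "LocalTimeMs", "RemoteTimeMs"}) {
                ObjectName bean = metricName(metric);
                // Attribute names are an assumption; verify them on your broker.
                double mean = ((Number) mbeans.getAttribute(bean, "Mean")).doubleValue();
                Object p99 = mbeans.getAttribute(bean, "99thPercentile");
                System.out.printf("%s mean=%.2f p99=%s%n", metric, mean, p99);
                if (metric.equals("TotalTimeMs")) total = mean;
                if (metric.equals("RemoteTimeMs")) remote = mean;
            }
            System.out.printf("share of time waiting for followers: %.2f%n",
                remoteShare(remote, total));
        }
    }
}
```

Run it against each broker in turn, for example `java ProduceLatencyProbe broker1 9999` (host and port are placeholders). A RemoteTimeMs share close to 1 suggests most of the produce latency is spent waiting on followers, which is consistent with a replication or network bottleneck but does not by itself prove the network is the cause.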