I'm trying to collect some Kafka telemetry while my messages are being replicated.
I suspect my bottleneck is the network when each record is replicated to 3 instances (RF=3).
I need data to support this theory. Is there any JMX data I can chart in Grafana
that shows how long it takes a record to be replicated across the three machines?
Regards.
Take a look at the kafka.network:type=RequestMetrics metrics. There are a few metrics that track the time spent processing produce requests on the leader and the time spent waiting for followers to replicate records. They are highlighted in the Monitoring section in the Kafka docs:

kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce: Total time spent processing produce requests
kafka.network:type=RequestMetrics,name=LocalTimeMs,request=Produce: Time spent by the leader processing produce requests
kafka.network:type=RequestMetrics,name=RemoteTimeMs,request=Produce: Time spent waiting for followers

There are a few other metrics, including RequestQueueTimeMs, ResponseQueueTimeMs and ResponseSendTimeMs, each measuring a different step brokers take when handling requests.
All of these metrics have a few attributes such as various percentiles, min, max, etc, that you should monitor to identify potential bottlenecks in your clusters.
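
To test the network theory, it can help to see what share of TotalTimeMs each step accounts for. Below is a minimal Python sketch, not an official tool: the helper names `mbean_name` and `breakdown` are made up for illustration, and the sample values are invented. It assumes you have already scraped the Mean attribute of each metric (for example through whatever JMX exporter feeds your Grafana). Means are used because percentiles of the individual steps do not add up to the percentile of the total. Also note that, as far as I know, RemoteTimeMs only reflects follower replication when the producer waits for it (acks=all).

```python
# Step metrics named in the answer above; each exists per request type.
STEPS = [
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
]


def mbean_name(metric, request="Produce"):
    """Build the MBean name for a request metric (hypothetical helper)."""
    return f"kafka.network:type=RequestMetrics,name={metric},request={request}"


def breakdown(means):
    """Return each step's share of TotalTimeMs.

    `means` maps metric name -> Mean attribute in milliseconds.
    Shares may not sum to 1, since the total can include time that
    is not covered by the steps listed here (e.g. throttling).
    """
    total = means.get("TotalTimeMs", 0.0)
    if total <= 0:
        return {}
    return {step: means[step] / total for step in STEPS if step in means}


if __name__ == "__main__":
    # Invented sample values, only to show the shape of the output.
    sample = {"TotalTimeMs": 10.0, "LocalTimeMs": 2.0, "RemoteTimeMs": 7.0}
    for step, share in breakdown(sample).items():
        print(f"{mbean_name(step)}: {share:.0%} of total")
```

If RemoteTimeMs dominates the total, that supports the idea that waiting for followers is the slow part, although it does not by itself distinguish network latency from slow follower disks or fetch settings.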