websphere prometheus grafana promql

Calculating average wait time per message in a topic with PromQL


We collect the following two Prometheus metrics over time. The first is:

# HELP was_sib_durableSubscription_messageWait_time_seconds_total Total amount of time (in seconds) spent on the bus by messages consumed from this subscription.
# TYPE was_sib_durableSubscription_messageWait_time_seconds_total gauge

The second is:

# HELP was_sib_durableSubscription_messageWait_total The number of messages that waited on the bus.
# TYPE was_sib_durableSubscription_messageWait_total counter

We tried dividing the two metrics but the result does not seem to be right.

How can we create a graph that shows the average wait time per message that waited on the bus?


Solution

  • The following PromQL query returns the average wait time per message over the last 10 minutes:

    increase(was_sib_durableSubscription_messageWait_time_seconds_total[10m])
      /
    increase(was_sib_durableSubscription_messageWait_total[10m])
    

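    It uses the increase() function to calculate how much each metric grew over the last 10 minutes: the total wait time in the numerator and the number of messages that waited in the denominator. Both metrics are cumulative totals (such metrics are called counters), so dividing their raw values gives the average since the counters started rather than the recent average, which is probably why plain division did not look right. The first metric is declared as a gauge, but increase() does not check the declared type.

    The / operator then divides the wait time accumulated over the last 10 minutes by the number of messages that waited during the same 10 minutes.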

    If you need to calculate the average wait time over another lookbehind window, replace 10m in the query above with the needed duration. See the list of supported durations in PromQL.

    P.S. If the was_sib_durableSubscription_messageWait_total counter changes slowly over time, then increase() over this counter in Prometheus may return unexpected results because of extrapolation (see this issue for details). The workaround is to use the increase() function from VictoriaMetrics, a Prometheus-like monitoring solution I work on. It doesn't use extrapolation in the increase() function. See these docs for more details.
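
To see why the windowed query differs from dividing the raw metrics, here is a rough Python sketch of the arithmetic. The sample values and the helper names (simple_increase, average_wait) are made up for illustration. The increase here is a plain last-minus-first delta with no extrapolation and no counter-reset handling, so it is a simplified model and not what Prometheus actually computes.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative value)

# Hypothetical scrapes of the two metrics, taken at t = 0 s, 300 s and 600 s.
TIME_SAMPLES: List[Sample] = [(0, 5000.0), (300, 5010.0), (600, 5030.0)]
COUNT_SAMPLES: List[Sample] = [(0, 1000.0), (300, 1005.0), (600, 1010.0)]


def simple_increase(samples: List[Sample], now: float, window: float) -> float:
    """Last-minus-first delta of the samples with now - window <= t <= now.

    Simplified model: no extrapolation and no counter-reset handling.
    """
    in_window = [value for t, value in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def average_wait(
    time_samples: List[Sample],
    count_samples: List[Sample],
    now: float,
    window: float,
) -> Optional[float]:
    """Average wait per message in the window, or None if no messages waited."""
    messages = simple_increase(count_samples, now, window)
    if messages == 0:
        return None
    return simple_increase(time_samples, now, window) / messages


# Dividing the raw values averages over the whole lifetime of the counters.
lifetime_avg = TIME_SAMPLES[-1][1] / COUNT_SAMPLES[-1][1]

# Dividing the increases averages only over the chosen window (here 10 minutes).
recent_avg = average_wait(TIME_SAMPLES, COUNT_SAMPLES, now=600, window=600)
```

With these made-up numbers the lifetime average is about 4.98 seconds per message, while the last 10 minutes alone give 30 seconds of wait over 10 messages, i.e. 3.0 seconds per message. Only the second number reflects recent behaviour.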