prometheus, promql

Calculate stability by increase() of total value in Grafana


I'm getting metrics from Prometheus. One of the metrics that indicates stability is the transaction count. By default it only exposes the total count of transactions (50, 55, 70, and so on), so I have a chart that shows the difference between counts (increase(total_transaction[1m])). That way I can be sure that transactions keep flowing; if the difference is zero, processing is stuck and something is wrong.

So I want to calculate this stability as a percentage, the same way uptime is usually calculated. In my case, if the program is running but transactions don't go through, that counts as downtime. Any idea how I can calculate the percentage of aliveness from the total count?

Example: the total transaction count increases steadily for 5 minutes, but for 10 seconds of that time it gets stuck and keeps showing the same value. That means 10/300 ≈ 3% downtime and 97% uptime, which is the result I want to get.

Any suggestions?


Solution

  • Try the following query:

    avg_over_time(
      (
        increase(total_transaction[1m]) >bool 0
      )[1h:1m]
    )
    

    It returns the share of "uptime" as a fraction in the range [0..1], where 0 means 100% downtime and 1 means 100% "uptime".
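
    If a percentage in the range [0..100] is preferred (for example, 97 instead of 0.97), the same query can be scaled by 100. This is only a minimal sketch that reuses the total_transaction metric name and the 1h/1m durations from the query above:

    ```promql
    # Same query as above, multiplied by 100 to get percent instead of a fraction
    100 * avg_over_time(
      (
        increase(total_transaction[1m]) >bool 0
      )[1h:1m]
    )
    ```

    In Grafana the same effect can usually be achieved without changing the query, by leaving the result in [0..1] and selecting a percent unit for the panel.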

    The definition of 100% "uptime" in this case is the following: the per-minute total_transaction increase was always non-zero during the last hour (see the 1h duration in the square brackets above).

    The definition of 100% downtime is the following: the per-minute total_transaction increase was always zero during the last hour.
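
    The 1h window in these definitions is just the first duration of the subquery, so it can be changed to any other period. As a hedged sketch for Grafana: assuming the panel uses a Prometheus data source and runs the expression as an instant query, the `$__range` dashboard variable can stand in for the fixed 1h, so that "uptime" is calculated over whatever time range is selected in the dashboard:

    ```promql
    # $__range is substituted by Grafana with the dashboard's selected time range
    # (an assumption about the panel setup; plain Prometheus does not know this variable)
    avg_over_time(
      (
        increase(total_transaction[1m]) >bool 0
      )[$__range:1m]
    )
    ```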

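    The query above uses the following PromQL features: the increase() function, the bool modifier on the > comparison operator, a subquery (the [1h:1m] part), and the avg_over_time() function.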

    Note that the query above may return lower than expected "uptime" values because of increase() implementation specifics in Prometheus - it ignores counter increase between the last raw sample just before the lookbehind window specified in square brackets (1m in the query above) and the first raw sample inside the lookbehind window. This issue is going to be fixed eventually according to this design doc.

    In the meantime it is possible to use VictoriaMetrics - an alternative Prometheus-like monitoring solution (I'm the core developer of VictoriaMetrics), which provides increase() function free of the issue mentioned above.