
Prometheus - Percentage of gauge values below a certain threshold


I'm using the blackbox exporter to gather metrics from various endpoints, and I want to set an SLI to determine the number of GET requests that are slower than 300ms and 1s per service.
The exporter provides a gauge metric called probe_duration_seconds.
I'm trying to run a PromQL query to calculate the percentage of probe_duration_seconds samples that are below 300ms over the last 5 hours.

My current query probe_duration_seconds{}[5h] < 0.3
returns an error:

Error executing query: invalid parameter "query": 1:1: parse error: binary expression must contain only scalar and instant vector types.

I have also tried:
100 - sum(rate(probe_success{}[5h]) * 100) by (instance)
which gives me the overall success/failure rate, but I want to quantify it based on response time as well.


Solution

  • Prometheus doesn't provide a function that returns the percentage of raw samples with values smaller than a given threshold over a given lookbehind window. (The query in the question fails because probe_duration_seconds{}[5h] is a range vector, and comparison operators work only on instant vectors and scalars.) This functionality can be emulated via the subquery feature. For example, the following query returns the share of probe_duration_seconds samples with values smaller than 0.3 during the last 5 hours:

    count_over_time((probe_duration_seconds < 0.3)[5h:1m])
      /
    count_over_time((probe_duration_seconds)[5h:1m])
    

    This query expects that the raw samples are collected by Prometheus every minute (see the 1m after the colon in the square brackets). Set it to your actual scrape interval for more accurate results.
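
    The expression above returns a share in the range 0..1 per series. A minimal sketch of turning it into a percentage, assuming a 1m scrape interval and that a per-target breakdown is wanted:

    100 * (
        count_over_time((probe_duration_seconds < 0.3)[5h:1m])
          /
        count_over_time((probe_duration_seconds)[5h:1m])
    )

    Replace 0.3 with 1 for the 1s threshold. The result keeps the original probe labels, so it is already broken down per probed target; to roll it up per service, sum the numerator and the denominator separately (for example with sum by (job)(...)) before dividing. Note that targets with no samples below the threshold drop out of the numerator and therefore out of the result.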

    P.S. VictoriaMetrics, an alternative Prometheus-like solution I work on, provides the share_le_over_time() function, which can be used instead of the query above:

    share_le_over_time(probe_duration_seconds[5h], 0.3)
    

    This approach has the following advantages over the subquery-based approach:

    • It is easier to write and maintain.
    • It works with any scrape_interval between raw samples - there is no need to adjust the query for different scrape intervals.
    • It works faster than the subquery-based approach and consumes less memory during execution, since the subquery may allocate large amounts of memory for small scrape intervals and large lookbehind windows.
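
    Like the subquery-based query, share_le_over_time() returns a share in the range 0..1 per series, so multiply by 100 for a percentage. A short usage sketch in MetricsQL, assuming the same 300ms and 1s SLI thresholds from the question:

    100 * share_le_over_time(probe_duration_seconds[5h], 0.3)
    100 * share_le_over_time(probe_duration_seconds[5h], 1)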