prometheus grafana promql

Why is the moving average higher than the actual series in Prometheus?


Given a gauge called gin_in_flight_requests, we have two queries in Prometheus:

green line:

sum(avg_over_time(gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"}[1m]))

yellow line:

sum(gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"})

At 14:35 the green line has a higher peak than any individual point of the yellow (plain sum) line. How can the sum of averages over time produce a higher result than the maximum of the sum itself?

[Figure: sum of average over time vs plain sum]

The graph was made with Grafana 9 Explore.


Solution

  • By default, Prometheus evaluates a time series selector that isn't wrapped in any rollup function as if it were wrapped in the last_over_time() rollup function with a 5-minute lookbehind window in square brackets. So the sum(gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"}) query is effectively executed as the following query:

    sum(
      last_over_time(
        gin_in_flight_requests{app="my-service",cluster="prod", url="/api/v1/url1"}[5m]
      )
    )
    

    See these docs for more details.

    In other words, this query takes into account only a subset of the raw samples: the last raw sample just before each point displayed on the graph. It ignores the remaining raw samples, so it may return values smaller than the sum(avg_over_time(...)) query. If you want to take the maximum raw samples into account, use the max_over_time function.
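
The effect can be reproduced outside Prometheus with a small simulation. This is only a sketch under stated assumptions: the sample values, the 15-second scrape interval and the 60-second graph step are made up, and the helper names are hypothetical. The window is treated as left-open, (end - size, end].

```python
from statistics import mean

SCRAPE_INTERVAL = 15  # seconds between raw samples (assumed)
STEP = 60             # seconds between points drawn on the graph (assumed)
LOOKBACK = 300        # 5-minute lookbehind used for a bare selector

# Hypothetical raw samples of one gauge: (timestamp, value).
# A short burst of in-flight requests falls between two graph points.
raw = [(t, 2) for t in range(0, 241, SCRAPE_INTERVAL)]
raw = [(t, 40 if t in (75, 90, 105) else v) for t, v in raw]

def window(samples, end, size):
    """Values of the samples whose timestamps fall in (end - size, end]."""
    return [v for t, v in samples if end - size < t <= end]

def last_sample(samples, end):
    """Newest sample within the lookbehind window, like a bare selector."""
    vals = window(samples, end, LOOKBACK)
    return vals[-1] if vals else None

def avg_over_time(samples, end, size):
    """Mean of every raw sample inside the window."""
    vals = window(samples, end, size)
    return mean(vals) if vals else None

steps = list(range(60, 241, STEP))
plain = [last_sample(raw, t) for t in steps]
averaged = [avg_over_time(raw, t, 60) for t in steps]

print("plain:   ", plain)     # the burst is never sampled
print("averaged:", averaged)  # the burst raises one averaged point
```

The plain series only sees the samples at 60, 120, 180 and 240 seconds, so the burst at 75 to 105 seconds never appears in it, while the averaged series includes those three samples in the point drawn at 120 seconds.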

    P.S. If you want to capture all the raw sample maximums and minimums on the selected time range in Grafana, just use max_over_time() and min_over_time() queries with the $__interval lookbehind window in square brackets:

    sum(max_over_time(...[$__interval]))
    

    and

    sum(min_over_time(...[$__interval]))
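
As a rough illustration of why a window equal to the graph step captures every extreme, here is a sketch that assumes $__interval matches the 60-second step between graph points. The sample data and the helper name are hypothetical.

```python
STEP = 60  # stands in for $__interval (assumed equal to the graph step)

# Hypothetical raw samples: (timestamp, value), one every 15 seconds,
# with a short burst (40) and a short dip (0) between graph points.
raw = [(t, 2) for t in range(0, 241, 15)]
raw = [(t, 40 if t in (75, 90, 105) else v) for t, v in raw]
raw = [(t, 0 if t == 150 else v) for t, v in raw]

def window(samples, end, size):
    """Values of the samples whose timestamps fall in (end - size, end]."""
    return [v for t, v in samples if end - size < t <= end]

steps = list(range(60, 241, STEP))

# Consecutive windows touch without overlapping, so every raw sample
# after the start of the graph lands in exactly one of them.
maxima = [max(window(raw, t, STEP)) for t in steps]
minima = [min(window(raw, t, STEP)) for t in steps]

print("max_over_time:", maxima)
print("min_over_time:", minima)
```

Because the windows tile the time range, the burst and the dip each show up in one graph point, unlike with the plain query.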
    

    P.P.S. FYI, an alternative Prometheus-like monitoring solution I work on, VictoriaMetrics, provides a rollup function that simultaneously returns min, max and avg values on the selected time range. I.e. it can be used instead of three queries with the min_over_time(), max_over_time() and avg_over_time() functions.