prometheus grafana promql

How to properly show accumulative API response time in Grafana


I have a service (named sdk-backend) written in Scala + Cats Effect.
Kamon.io is used to publish response time metrics to Prometheus.
There is an API for token retrieval.
If I make two API calls, the following metrics are tracked:

# TYPE api_v1_sdk_token_seconds histogram
api_v1_sdk_token_seconds_bucket{le="0.005"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.01"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.025"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.05"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.075"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.1"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.25"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.5"} 0.0
api_v1_sdk_token_seconds_bucket{le="0.75"} 0.0
api_v1_sdk_token_seconds_bucket{le="1.0"} 0.0
api_v1_sdk_token_seconds_bucket{le="2.5"} 0.0
api_v1_sdk_token_seconds_bucket{le="5.0"} 0.0
api_v1_sdk_token_seconds_bucket{le="7.5"} 1.0
api_v1_sdk_token_seconds_bucket{le="10.0"} 2.0
api_v1_sdk_token_seconds_bucket{le="+Inf"} 2.0
api_v1_sdk_token_seconds_count 2.0
api_v1_sdk_token_seconds_sum 16.978542592 

api_v1_sdk_token_seconds_count shows that there were 2 requests to the API, which took 16.97 sec in total (api_v1_sdk_token_seconds_sum). Yes, the API is quite slow.
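
To make the relationship between the two values concrete, the mean response time for these two requests is the sum divided by the count. A minimal Python sketch using the values from the metrics output above (the variable names are only illustrative):

```python
# Values copied from the metrics output above.
total_seconds = 16.978542592  # api_v1_sdk_token_seconds_sum
request_count = 2.0           # api_v1_sdk_token_seconds_count

# Mean response time = total time spent / number of requests.
mean_seconds = total_seconds / request_count
print(f"mean response time: {mean_seconds:.3f} s")  # prints "mean response time: 8.489 s"
```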

The metrics are published to Prometheus without issues.
Next, I'd like to display them in Grafana.
The expression I'm using to show response time over time is as follows:

avg by(app) (sum by(app) (increase(api_v1_sdk_token_seconds_sum{app="sdk-backend"}[$__rate_interval])))

[Image: Grafana graph of the expression above]

The spikes in the picture are the result of a load test I ran.
The load testing report looks like this:
[Image: load testing report]

As you can see from the report, the mean response time is 1338 ms.
What I'd like to see in Grafana at the peak is a value around the mean response time (1.3 sec), rather than the ~3000 sec that is currently shown. Moreover, 44467 requests were made during the load test, with a mean of 148.23 requests per second.

Questions:

  1. Is this the correct formula for displaying mean response time over time?
avg by(app) (sum by(app) (increase(api_v1_sdk_token_seconds_sum{app="sdk-backend"}[$__rate_interval])))/avg by(app) (sum by(app) (increase(api_v1_sdk_token_seconds_count{app="sdk-backend"}[$__rate_interval])))
  2. How do I write a formula for displaying requests per second, given that there is a cumulative metric (api_v1_sdk_token_seconds_count) that stands for the number of requests made so far?

Solution

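  • Note that avg by(app) applied after sum by(app) does nothing: sum by(app) already returns a single series per app, and the average of a single value is that value.

    Additionally, your initial query does not account for the number of requests: it shows the total time spent on all requests within the interval, not the time per request.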

    Is this the correct formula for displaying mean response time over time?

    avg by(app) (sum by(app) (increase(api_v1_sdk_token_seconds_sum{app="sdk-backend"}[$__rate_interval])))/avg by(app) (sum by(app) (increase(api_v1_sdk_token_seconds_count{app="sdk-backend"}[$__rate_interval])))
    

    It is not ideal (you should remove the redundant avg by), but it should return the correct result. A simplified version:

    sum by(app) (increase(api_v1_sdk_token_seconds_sum{app="sdk-backend"}[$__rate_interval]))
     / sum by(app) (increase(api_v1_sdk_token_seconds_count{app="sdk-backend"}[$__rate_interval]))
    

    How do I write a formula for displaying requests per second?

    This can be accomplished with a simple use of the rate function:

    rate(api_v1_sdk_token_seconds_count[$__rate_interval])
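
Both answers can be checked with plain arithmetic. The Python sketch below takes the figures from the load test report and a hypothetical 15-second window (an assumption: the actual value of $__rate_interval is not given in the question) and estimates what each expression would roughly evaluate to at peak load. It only approximates the idea behind increase and rate; it does not reproduce Prometheus's extrapolation logic.

```python
# Figures from the load test report in the question.
requests_per_second = 148.23
mean_response_seconds = 1.338

# Hypothetical window length; the real $__rate_interval is not stated.
window_seconds = 15.0

# Roughly what increase(..._count[window]) reports: requests in the window.
count_increase = requests_per_second * window_seconds

# Roughly what increase(..._sum[window]) reports: total time over all
# requests in the window, which grows with the number of requests.
sum_increase = count_increase * mean_response_seconds

# Mean response time: increase of _sum divided by increase of _count.
mean_seconds = sum_increase / count_increase

# Requests per second: what rate(..._count[window]) approximates.
rate_per_second = count_increase / window_seconds

print(f"increase of _sum alone: {sum_increase:.0f} s")
print(f"_sum / _count:          {mean_seconds:.3f} s")
print(f"rate of _count:         {rate_per_second:.2f} req/s")
```

Under these assumptions the increase of _sum alone comes to about 2975 s, which would be consistent with the ~3000 sec seen on the graph, while the ratio stays at the mean response time of about 1.3 sec regardless of how many requests fall into the window.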