Tags: prometheus, grafana, amazon-cloudwatch, metrics

Is it possible to get accurate requests-per-minute metrics via Prometheus?


Goal

Track requests per minute (RPM) and uptime via Grafana and Prometheus.

Situation

We are using

django-prometheus -> emits the metrics
fluent-bit -> scrapes the Django metrics every 15s and pushes them to Prometheus
prometheus -> 2 shards running via the Prometheus Operator on k8s

Problem

When we compare the Grafana dashboard with the AWS target group request metrics, the numbers do not match. We tried all of the options below:

Expr: sum by(service) (irate(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
Expr: sum by(service) (increase(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
Expr: sum by(service) (rate(django_http_requests_before_middlewares_total{namespace="name"}[5m]))
django_http_requests_before_middlewares_total -> this is a Counter.
The counter never resets because each series has unique dimensions:
- container_id
- service_name
- namespace   

Q. Is it possible to create a Grafana dashboard that matches the AWS target group metrics?

Ideally increase() should work, but it computes the difference continuously, and that might be giving an incorrect result.

Thanks in advance.


Solution

  • In theory the following query should return the exact number of per-service requests for the last minute:

    sum(
      increase(django_http_requests_before_middlewares_total[1m])
    ) by (service)
    

    But in practice Prometheus may return unexpected results for this query:

    • It can return fractional results for an integer counter because of extrapolation. See this issue for details.
    • It can return lower than expected results, since Prometheus ignores the counter increase between the last raw sample just before the lookbehind window specified in square brackets (e.g. [1m] in the query above) and the first raw sample inside the lookbehind window.
    • It can return an empty result if the lookbehind window in square brackets contains fewer than two raw samples. For example, if the interval between raw samples is one minute or longer, then increase(m[d]) may return empty results for d <= 1m.
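The three effects listed above can be reproduced with a small simulation. The sketch below is a simplified approximation of how Prometheus extrapolates increase() over a window, not its actual implementation. The function name `prometheus_like_increase`, the sample data, and the 15-second scrape interval are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def prometheus_like_increase(samples: List[Sample], start: float, end: float) -> Optional[float]:
    """Simplified approximation of increase() over the window [start, end].

    Assumes samples are sorted by timestamp and timestamps are distinct.
    """
    window = [(t, v) for t, v in samples if start <= t <= end]
    if len(window) < 2:
        return None  # fewer than two raw samples in the window: empty result

    first_t, first_v = window[0]
    last_t, last_v = window[-1]

    # Raw delta between the first and last sample INSIDE the window,
    # corrected for counter resets. Anything that happened before the
    # first in-window sample is not counted.
    delta = last_v - first_v
    prev = first_v
    for _, v in window[1:]:
        if v < prev:
            delta += prev
        prev = v

    sampled = last_t - first_t
    avg_gap = sampled / (len(window) - 1)
    to_start = first_t - start
    to_end = end - last_t

    # A counter cannot be extrapolated below zero.
    if delta > 0 and first_v >= 0:
        to_start = min(to_start, sampled * first_v / delta)

    # Extrapolate to the window edges when the samples are close to them,
    # otherwise by half of the average gap between samples.
    threshold = avg_gap * 1.1
    covered = sampled
    covered += to_start if to_start < threshold else avg_gap / 2
    covered += to_end if to_end < threshold else avg_gap / 2
    return delta * covered / sampled


# Hypothetical counter scraped every 15 seconds. The counter really grew
# by 5 between t=-10 and t=50, but 3 of those requests arrived before the
# first sample inside the window [0, 60].
scraped = [(-10, 100), (5, 103), (20, 104), (35, 104), (50, 105)]
fractional = prometheus_like_increase(scraped, 0, 60)
print(fractional)  # fractional, and lower than the real increase of 5

# Hypothetical counter scraped every 60 seconds: a one-minute window
# can contain a single sample, so there is nothing to compute.
sparse = [(0, 10), (60, 20), (120, 30)]
empty = prometheus_like_increase(sparse, 70, 130)
print(empty)
```

In the first case only the delta of 2 between the in-window samples is counted and then stretched to cover the full minute, which yields a non-integer value. In the second case the window holds one sample, so the sketch returns nothing, mirroring the empty result described above.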

    Prometheus developers are aware of these issues and are going to fix them - see this design doc.

    In the meantime, you can try the increase() function in VictoriaMetrics, a Prometheus-like monitoring solution I work on. Its increase() function is free from the issues mentioned above.

    An important note: both Prometheus and VictoriaMetrics calculate query results independently for each point displayed on the graph. So, to display the per-minute number of requests using the query above, set the interval between points on the graph (aka step) to one minute.
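As an illustration of the step setting, here is a minimal sketch that builds a range query against the Prometheus HTTP API (`/api/v1/query_range`) with a 60-second step. The server address, the timestamps, and the helper name `build_range_query_url` are assumptions for the example. The code only constructs the URL and does not send a request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

PROMETHEUS_URL = "http://localhost:9090"  # hypothetical server address

QUERY = "sum(increase(django_http_requests_before_middlewares_total[1m])) by (service)"


def build_range_query_url(base_url, query, start, end, step_seconds=60):
    """Build a query_range URL; step_seconds is the interval between points."""
    params = {"query": query, "start": start, "end": end, "step": step_seconds}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


# One hour of data (hypothetical Unix timestamps), one point per minute,
# so that each point covers exactly one [1m] lookbehind window.
url = build_range_query_url(PROMETHEUS_URL, QUERY, 1700000000, 1700003600)
print(url)

# Sanity check: decode the URL again to confirm what would be sent.
parsed = parse_qs(urlparse(url).query)
print(parsed["step"], parsed["query"])
```

With a step equal to the lookbehind window, adjacent points cover adjacent, non-overlapping minutes. A smaller step would make the windows overlap, and a larger one would leave gaps between them.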