
Can someone explain this PromQL query to me?


I'm new to PromQL and I'm using it to build a Grafana dashboard that visualizes various API metrics such as throughput and latency.

For measuring latency, I came across these two queries being used together. Can someone explain how they work?

histogram_quantile(0.99, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))

histogram_quantile(0.95, sum(irate(http_request_duration_seconds_bucket{path="<API Endpoint>"}[2m])*30) by (path,le))

I also want to write a query that shows the number of API calls with latency greater than 4 seconds. Can someone please help me with that as well?


Solution

  • The provided queries are designed to return the 99th and 95th percentiles of the http_request_duration_seconds{path="..."} histogram metric over requests received during the last 2 minutes (see 2m in square brackets).

    Unfortunately the provided queries have some issues:

    • They use the irate() function to calculate the per-second increase rate of every bucket defined in the http_request_duration_seconds histogram. This function isn't recommended in the general case: it looks only at the last two samples in the lookbehind window, so it tends to return jumpy results on repeated queries - see this article for details. It is better to use rate() or increase() instead when calculating histogram_quantile().
    • They multiply the calculated irate() by 30. This has no effect on the query results, since histogram_quantile() depends only on the ratios between the per-bucket values, not on their absolute scale.
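
To make the irate() point concrete, here is a minimal Python sketch of why a rate taken from only the last two samples jumps around more than one taken across the whole window. It is a simplified model, not Prometheus code: simple_irate, simple_rate and the sample data are hypothetical names and values chosen for illustration, counter resets are ignored, and the boundary extrapolation that the real rate() performs is left out.

```python
# Simplified models of PromQL's irate() and rate() over one lookbehind window.
# Assumptions: no counter resets, and no boundary extrapolation
# (the real rate() extrapolates to the window edges).

def simple_irate(samples):
    """Per-second rate computed from the last two (timestamp, value) samples only."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)

def simple_rate(samples):
    """Per-second rate computed from the first and last samples in the window."""
    (t1, v1), (t2, v2) = samples[0], samples[-1]
    return (v2 - v1) / (t2 - t1)

# Hypothetical bucket counter scraped every 15 s.
# A burst of 90 requests lands in the final scrape interval.
window = [(0, 100), (15, 102), (30, 104), (45, 106),
          (60, 108), (75, 110), (90, 112), (105, 202)]
# The samples visible one scrape earlier, before the burst was recorded.
earlier = window[:-1]

print("before burst:", simple_irate(earlier), simple_rate(earlier))
print("after burst: ", simple_irate(window), simple_rate(window))
```

In this made-up data the two-sample rate changes by a factor of 45 between consecutive evaluations, while the whole-window rate changes far less, which is the kind of jumpiness that carries over into the percentile estimate.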

    So it is recommended to use the following query instead:

    histogram_quantile(0.99,
      sum(
        increase(http_request_duration_seconds_bucket{path="..."}[2m])
      ) by (le)
    )
    

    This query works in the following way:

    1. Prometheus selects all the time series matching the http_request_duration_seconds_bucket{path="..."} selector over the time range selected on the graph. These time series represent the buckets of the http_request_duration_seconds histogram. Each bucket is a counter that counts the number of requests with a duration not exceeding the value specified in the le label.

    2. Prometheus calculates the increase over the last 2 minutes for each selected time series, i.e. how many requests hit every bucket during the last 2 minutes.

    3. Prometheus calculates per-le sums over bucket values calculated at step 2 - see sum() function docs for details.

    4. Prometheus calculates the estimated 99th percentile for the bucket results returned at step 3 by executing the histogram_quantile function. The error of the estimate depends on the number of buckets and the le values. More buckets with a better le distribution usually give a lower error for the estimated percentile.
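
Steps 2 to 4 can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the Prometheus implementation: sum_by_le and estimate_quantile are hypothetical helper names, the per-bucket increases are made-up numbers for two imaginary series, and the estimate assumes linear interpolation inside the matching bucket, a lowest bucket starting at 0, a non-empty histogram with a +Inf bucket, and 0 < q <= 1.

```python
import math

def sum_by_le(series):
    """Step 3: add up per-bucket increases from several series, grouped by le."""
    totals = {}
    for buckets in series:
        for le, increase in buckets.items():
            totals[le] = totals.get(le, 0.0) + increase
    return totals

def estimate_quantile(q, buckets):
    """Step 4: estimate the q-th quantile from cumulative per-le counts.

    Assumes linear interpolation inside the matching bucket, a lowest bucket
    that starts at 0, a non-empty histogram with a +Inf bucket, and 0 < q <= 1.
    """
    rank = q * buckets[math.inf]          # how many requests lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound in sorted(buckets):
        count = buckets[bound]
        if count >= rank:
            if bound == math.inf:
                # The quantile falls in the open-ended bucket:
                # fall back to the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# Step 2 (hypothetical result): increase over 2 minutes for every bucket of two series.
series_a = {0.1: 40, 0.5: 70, 1.0: 90, 4.0: 99, math.inf: 100}
series_b = {0.1: 60, 0.5: 90, 1.0: 96, 4.0: 100, math.inf: 100}

summed = sum_by_le([series_a, series_b])
p99 = estimate_quantile(0.99, summed)

# Multiplying every bucket by 30 leaves the estimate unchanged (up to float rounding).
scaled = {le: value * 30 for le, value in summed.items()}
p99_scaled = estimate_quantile(0.99, scaled)

# Requests slower than 4 s; this assumes the histogram has a bucket bound exactly at 4.
slower_than_4s = summed[math.inf] - summed[4.0]

print(p99, p99_scaled, slower_than_4s)
```

Because the buckets are cumulative, subtracting the le=4 bucket from the +Inf bucket gives the number of requests that took longer than 4 seconds in the window, provided the histogram actually defines a bucket at that bound.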