
How to join Prometheus queries to gain their aggregate results


I need to join two PromQL queries to combine their results, but I've run into trouble. We have two metrics:

  • http_server_requests_total - a counter of requests. Labels: app, path, method, response_status_code. Produces an integer number of requests.
  • http_server_request_duration_seconds_bucket - a histogram that measures the response time of HTTP requests. Labels are the same: app, path, method, response_status_code, plus the "le" label that splits the time series into buckets.

I have two queries for these metrics:

endpoints with highest number of requests

topk(40, 
    sum(http_server_requests_total{app=~"web_api_service", response_status_code=~"2..|4..|5.."}) 
        by (app, path, method, response_status_code)
)

endpoints with highest execution time

histogram_quantile(
  0.95,
  sum by (le, app, path, method, response_status_code) 
      (rate(http_server_request_duration_seconds_bucket{
          app=~"web_api_service",response_status_code=~"2..|4..|5.."
      }[$__rate_interval]))
)

Now I want to combine the results of these two queries to find the slowest endpoints among those with the highest number of requests. I've tried several methods from this article to join the queries:

  • A plain "and" between the two queries returns the number of requests from the first query as the Value (I run these queries in the Grafana Explore tab, since we have no direct access to the Prometheus server).
  • The same happens with +, with + on(app, path, method, response_status_code), and with + on(app, path, method, response_status_code) group_left: I get only the number of requests as the Value (plus some NaN values).
  • + on(app, path, method, response_status_code, le) group_right returns slightly different results, but there are still no float request-duration values from the second metric.

My questions are:

  • What is the correct way to join the results of two metrics to get the intersection of request counts and request durations?
  • How can I do something similar to SQL's ORDER BY first_metric, second_metric DESC in PromQL?

Solution

  • Vector matching

    So you need to combine the results of two PromQL queries into one, e.g. a topk aggregation and a histogram_quantile() function, to get something like "the latency of the N most frequent requests".

    The right way to combine metrics in PromQL is vector matching, which can be one-to-one or many-to-one depending on how the labels match.

    The first query returns the N most frequent requests:

    topk(40, sum(http_server_requests_total{}) by(app, path, method, response_status_code))
    

    The second query returns the latency:

    histogram_quantile(
      0.95,
      sum(rate(http_server_request_duration_seconds_bucket{}[1m])) by(le, app, path, method, response_status_code)
    )
    

    But how to combine them? You have two vectors with identical label sets: both by() clauses group on the same labels, and histogram_quantile() consumes the extra le label. So this is one-to-one vector matching.
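
    To make the matching rule concrete, here is a minimal Python sketch of one-to-one matching. It is a toy model, not Prometheus code: the labels() and match_one_to_one() helpers and the sample values are made up for illustration. It only shows that samples are paired when their label sets are identical and that unpaired samples drop out of the result.

```python
# Toy model of PromQL one-to-one vector matching (not Prometheus code).
# An instant vector is modelled as a dict: frozen label set -> sample value.

def labels(**kv):
    """Build a hashable label set from keyword arguments (hypothetical helper)."""
    return frozenset(kv.items())

def match_one_to_one(left, right, op):
    """Apply op to every pair of samples whose label sets are identical.

    Samples without a partner on the other side are dropped.
    """
    return {ls: op(left[ls], right[ls]) for ls in left if ls in right}

# Made-up samples: request counts and p95 latencies per endpoint.
requests = {
    labels(path="/users", method="GET"): 1200.0,
    labels(path="/orders", method="POST"): 300.0,
}
latency_p95 = {
    labels(path="/users", method="GET"): 0.25,
    labels(path="/health", method="GET"): 0.01,
}

# Only GET /users exists on both sides, so only that sample survives.
paired = match_one_to_one(requests, latency_p95, lambda a, b: a * b)
print(paired)
```

    Note that the surviving value is the product of both sides, which is why the raw request count still has to be neutralised before multiplying.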

    The value you want in the result (the 95th-percentile latency) comes from the second query, so the trick is to discard the value of the first one. You can do this by turning the first value into 1 and multiplying the second value by it. How do you turn it into 1? Any number raised to the power of 0 is 1, and Prometheus supports arithmetic operators:

    topk(40, sum(http_server_requests_total{}) by(app, path, method, response_status_code)) ^ 0
    *
    histogram_quantile(
      0.95,
      sum(rate(http_server_request_duration_seconds_bucket{}[1m])) by(le, app, path, method, response_status_code)
    )
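
    The effect of the ^ 0 trick can be sketched in Python as follows. This is an illustrative model under simplifying assumptions: each label set is collapsed into a single string key, and the top_k() and latency_of_top_k() helpers and the sample values are invented for the example.

```python
# Toy model of `topk(k, requests) ^ 0 * latency` (not Prometheus code).
# Each label set is collapsed into one string key to keep the sketch short.

def top_k(vector, k):
    """Keep the k samples with the largest values (a stand-in for topk)."""
    best = sorted(vector.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(best)

def latency_of_top_k(requests, latency, k):
    """Return the latency of the k most requested endpoints."""
    result = {}
    for label_set, count in top_k(requests, k).items():
        if label_set in latency:  # one-to-one match on identical labels
            # count ** 0 is 1.0, so the product is just the latency value.
            result[label_set] = (count ** 0) * latency[label_set]
    return result

# Made-up sample values.
requests = {"GET /users": 1200.0, "POST /orders": 300.0, "GET /health": 50.0}
latency = {"GET /users": 0.25, "POST /orders": 0.8, "GET /health": 0.01}

top2 = latency_of_top_k(requests, latency, 2)
print(top2)
```

    The request count only decides which endpoints survive; the value that comes out is the latency. The PromQL "and" operator has a similar filtering effect and keeps the value of its left-hand operand, so placing the latency query on the left of "and" may be worth trying as an alternative.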
    

  • sort(), sort_desc()

    To get sorted results, you can apply one of the sorting functions either to the final result or to one of the sub-results.

    topk already returns its results sorted in descending order, so you only need to sort the final latency vector:

    sort_desc(
      topk(40, sum(http_server_requests_total{}) by(app, path, method, response_status_code)) ^ 0
      *
      histogram_quantile(
        0.95,
        sum(rate(http_server_request_duration_seconds_bucket{}[1m])) by(le, app, path, method, response_status_code)
      )
    )
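
    sort() and sort_desc() order a vector by its sample value only, so a two-key ordering like the SQL one in the question would presumably have to happen outside the query. The Python sketch below models both cases. The helper names and sample values are made up, and ordering both keys in descending order is an assumption about what the question intends.

```python
# Toy model of sort_desc plus a client-side two-key ordering (not Prometheus code).
# Each label set is collapsed into one string key to keep the sketch short.

def sort_desc(vector):
    """Return (label_set, value) pairs ordered by value, largest first."""
    return sorted(vector.items(), key=lambda item: item[1], reverse=True)

def order_by_both(requests, latency):
    """Client-side analogue of ORDER BY requests DESC, latency DESC."""
    common = requests.keys() & latency.keys()
    rows = [(ls, requests[ls], latency[ls]) for ls in common]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)

# Made-up sample values; two endpoints tie on request count.
requests = {"GET /users": 1200.0, "POST /orders": 1200.0, "GET /health": 50.0}
latency = {"GET /users": 0.25, "POST /orders": 0.8, "GET /health": 0.01}

by_latency = sort_desc(latency)
by_both = order_by_both(requests, latency)
print(by_latency)
print(by_both)
```

    In by_both the two endpoints that tie on request count are ordered by latency, which a single sort_desc() over one vector cannot express.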