
Prometheus - calculate percentage of 503 error count, per API using PromQL


Let's say I have the following time series, each followed by its total status code count:

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myItemService.200"} -> 1

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myItemService.400"} -> 4

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myItemService.404"} -> 1

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myItemService.500"} -> 3

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myItemService.503"} -> 3

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myShopService.200"} -> 2

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myShopService.400"} -> 4

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myShopService.404"} -> 1

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myShopService.500"} -> 2

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myShopService.503"} -> 6

As the metric shows, there are two APIs, named myItemService and myShopService, and I have the count of each status code. I want to calculate the percentage of 503 errors per API and trigger an alert if that percentage is greater than a specific threshold, using PromQL.

Let's say the threshold is 30. I want to trigger an alert if the percentage of 503 responses for any API is greater than 30.

According to the metrics above:

total requests in myItemService = 12

total 503 in myItemService = 3

myItemService percentage = 3/12*100 = 25

total requests in myShopService = 15

total 503 in myShopService = 6

myShopService percentage = 6/15*100 = 40

In this scenario, the alert should trigger for the myShopService API.
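The arithmetic above can be checked with a small Python sketch. It is not PromQL and not part of the original question; it only replays the sample values to confirm which API crosses the threshold. The names `SAMPLES`, `THRESHOLD` and `percent_503_per_api` are made up for this illustration.

```python
from collections import defaultdict

# Sample values copied from the question: (service label, count).
SAMPLES = [
    ("myItemService.200", 1),
    ("myItemService.400", 4),
    ("myItemService.404", 1),
    ("myItemService.500", 3),
    ("myItemService.503", 3),
    ("myShopService.200", 2),
    ("myShopService.400", 4),
    ("myShopService.404", 1),
    ("myShopService.500", 2),
    ("myShopService.503", 6),
]

THRESHOLD = 30  # percent, as stated in the question


def percent_503_per_api(samples):
    """Return {api: percentage of requests answered with 503}."""
    total = defaultdict(int)
    errors = defaultdict(int)
    for service, count in samples:
        # The label has the form "<api>.<status code>".
        api, status = service.rsplit(".", 1)
        total[api] += count
        if status == "503":
            errors[api] += count
    return {api: 100 * errors[api] / total[api] for api in total}


percentages = percent_503_per_api(SAMPLES)
# APIs whose 503 percentage is strictly greater than the threshold.
alerting = sorted(api for api, pct in percentages.items() if pct > THRESHOLD)
print(percentages)
print(alerting)
```

With the sample data, only myShopService ends up in `alerting`, which matches the expectation stated above.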

Questions:

  1. Given my metric structure, can I calculate the percentage of 503 per API? If yes, what would the PromQL query be? I prepared the following query, but it gives me the result across all APIs combined, not per API:

    sum(app_interface_statusCode{metricType="Count", service=~".*503"}) / sum(app_interface_statusCode{metricType="Count"}) * 100

  2. Can I split the API name and the status code into separate labels? For example, my existing time series has:

service="myItemService.503"

Can I make the time series look like this instead?

app_interface_statusCode{instance="localhost:5555", job="prometheus", metricType="Count", service="myItemService", statusCode="503"}

If yes, how can I do it?


Solution

  • The following query extracts the status code from the service label into a new statusCode label by using the label_replace function:

    label_replace(app_interface_statusCode, "statusCode", "$1", "service", "[^.]+\\.([0-9]+)")
    

    The following query removes the status code part from the service label:

    label_replace(app_interface_statusCode, "service", "$1", "service", "([^.]+)\\.[0-9]+")
    

    Now let's combine these two queries to obtain a service label without the status code, plus the status code in the statusCode label:

    label_replace(
      label_replace(app_interface_statusCode, "statusCode", "$1", "service", "[^.]+\\.([0-9]+)"),
      "service", "$1", "service", "([^.]+)\\.[0-9]+"
    )
    

    The following query calculates the per-service share of requests with a 503 status code, assuming q returns time series with properly set service and statusCode labels. The result is a fraction between 0 and 1; multiply it by 100 to get a percentage:

    sum(q{statusCode="503"}) by (service) / sum(q) by (service)
    

    This query uses the sum() function to add up the time series that share the same service label.

    Unfortunately, q cannot simply be replaced with the big label_replace() query above, since PromQL label filters can only be applied to series selectors, not to time series returned from functions :(

    So we have to rewrite the statusCode="503" filter as a filter on the original service label, which contains both the service name and the status code:

    label_replace(
      label_replace(app_interface_statusCode{service=~"[^.]+\\.503"}, "statusCode", "$1", "service", "[^.]+\\.([0-9]+)"),
      "service", "$1", "service", "([^.]+)\\.[0-9]+"
    )
    

    Then the final query turns into the following monster. To use it as the alert expression from the question, append "* 100 > 30" to the end of it:

    sum(
      label_replace(
        label_replace(app_interface_statusCode{service=~"[^.]+\\.503"}, "statusCode", "$1", "service", "[^.]+\\.([0-9]+)"),
        "service", "$1", "service", "([^.]+)\\.[0-9]+"
      )
    ) by (service)
      /
    sum(
      label_replace(app_interface_statusCode, "service", "$1", "service", "([^.]+)\\.[0-9]+")
    ) by (service)
    

    P.S. As you can see, it is better to store time series with properly formatted labels than to rely on the label_replace() function for label formatting at query time, since this can significantly simplify the final query. That's why relabeling at data ingestion time is recommended. See also this relabeling cookbook.
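To see what the nested label_replace() calls and the final division do to the sample data, here is a rough Python emulation. It is only a sketch: the helper names `label_replace`, `sum_by_service` and `SERIES` are hypothetical, the helper supports only "$N" group references, and it assumes (as Prometheus does) that the regex must match the whole source label value and that non-matching series pass through unchanged.

```python
import re
from collections import defaultdict

# Series from the question, reduced to the one label that matters here.
SERIES = [
    ({"service": "myItemService.200"}, 1),
    ({"service": "myItemService.400"}, 4),
    ({"service": "myItemService.404"}, 1),
    ({"service": "myItemService.500"}, 3),
    ({"service": "myItemService.503"}, 3),
    ({"service": "myShopService.200"}, 2),
    ({"service": "myShopService.400"}, 4),
    ({"service": "myShopService.404"}, 1),
    ({"service": "myShopService.500"}, 2),
    ({"service": "myShopService.503"}, 6),
]


def label_replace(series, dst, replacement, src, regex):
    """Rough emulation of PromQL label_replace() on (labels, value) pairs."""
    pattern = re.compile(regex)
    out = []
    for labels, value in series:
        # Full match only; series that do not match are kept unchanged.
        match = pattern.fullmatch(labels.get(src, ""))
        if match:
            new_value = re.sub(
                r"\$(\d+)", lambda g: match.group(int(g.group(1))), replacement
            )
            labels = {**labels, dst: new_value}
        out.append((labels, value))
    return out


def sum_by_service(series):
    """Emulate sum(...) by (service)."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels["service"]] += value
    return dict(totals)


# Numerator: filter on the original label first, then rewrite the labels.
only_503 = [s for s in SERIES if re.fullmatch(r"[^.]+\.503", s[0]["service"])]
relabeled_503 = label_replace(
    label_replace(only_503, "statusCode", "$1", "service", r"[^.]+\.([0-9]+)"),
    "service", "$1", "service", r"([^.]+)\.[0-9]+",
)
numerator = sum_by_service(relabeled_503)

# Denominator: all status codes, with the code stripped from the service label.
denominator = sum_by_service(
    label_replace(SERIES, "service", "$1", "service", r"([^.]+)\.[0-9]+")
)

# The "/" operator pairs series with identical label sets, here just {service}.
ratio = {svc: numerator[svc] / denominator[svc] for svc in numerator if svc in denominator}
print(ratio)
```

The resulting fractions correspond to the 25 and 40 percent computed by hand in the question, so comparing them with 0.3 (or with 30 after multiplying by 100) singles out myShopService.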