Tags: kubernetes, prometheus, azure-aks

How do you limit a Prometheus alerting rule to a Kubernetes Service?


I am currently using this Prometheus alerting rule, which works fine, but is too general:

sum (rate (container_cpu_usage_seconds_total{id="/"}[1m])) / sum (machine_cpu_cores) * 100 > 50

I would like to change it in two ways:

  1. Make the 'container_cpu_usage_seconds_total{id="/"}[1m]' part specific to one Kubernetes Service whose pods execute a calculation.

  2. Divide the value from point 1 by the sum of the CPU cores required by the calculation pods. Right now this is 500 millicores.

How do I do this?

I found this thread, in which someone uses the following rule, but I am not quite sure how to adapt it to fit my criteria.

sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (pod_name)

Solution

  • This is how I solved my problem:

    ((sum (rate (container_cpu_usage_seconds_total{container="test-app"}[1m]))) / ((avg(container_spec_cpu_quota{container="test-app"})/100000)*count(container_spec_cpu_quota{container="test-app"}))) * 100 > 50
    

    The first part is the number of cores that the containers named "test-app" are using. This is then divided by the number of cores that were assigned to them on creation.

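    The division by 100000 is necessary to compare the two: container_spec_cpu_quota is expressed in microseconds of CPU time per scheduling period, and dividing by the default period of 100000 microseconds converts it to cores. If the final value is greater than 50, i.e. if the containers use more than 50% of their assigned CPU, the alert fires.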

    Explanation of the different parts of the formula:

    This term is the total CPU usage, in cores, of all "test-app" containers.

    sum (rate (container_cpu_usage_seconds_total{container="test-app"}[1m]))
    

    This term is the average CPU assigned to each "test-app" container, in cores.

    avg(container_spec_cpu_quota{container="test-app"})/100000
    

    This term is the number of "test-app" containers present.

    count(container_spec_cpu_quota{container="test-app"})
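
To make the arithmetic of the rule concrete, here is a minimal Python sketch that reproduces it with made-up sample values. The function name `cpu_usage_percent`, the constant `CFS_PERIOD_US`, and the sample numbers are hypothetical and only serve the illustration. It assumes two "test-app" containers, each with a CPU limit of 500 millicores (a quota of 50000 microseconds), and the default scheduling period of 100000 microseconds.

```python
# Worked example of the alert arithmetic, using made-up sample values.
# Assumption: the CFS period is the default 100000 microseconds,
# which is why the rule divides the quota by 100000.
CFS_PERIOD_US = 100_000


def cpu_usage_percent(usage_cores, quotas_us, period_us=CFS_PERIOD_US):
    """Return CPU usage as a percentage of the total assigned CPU.

    usage_cores: the summed rate() of container_cpu_usage_seconds_total, in cores
    quotas_us:   one container_spec_cpu_quota sample per container, in microseconds
    """
    # avg(container_spec_cpu_quota) / 100000
    avg_limit_cores = (sum(quotas_us) / len(quotas_us)) / period_us
    # count(container_spec_cpu_quota)
    container_count = len(quotas_us)
    return usage_cores / (avg_limit_cores * container_count) * 100


# Two hypothetical test-app containers, each limited to 500 millicores.
percent = cpu_usage_percent(usage_cores=0.6, quotas_us=[50_000, 50_000])
alert_fires = percent > 50  # same threshold as the rule
print(round(percent, 1), alert_fires)
```

With these values the two containers are assigned 1 core in total and use 0.6 cores, so usage is 60% of the assigned CPU and the condition `> 50` holds.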