
Prometheus: converting process CPU time total to a percentage


We started using Prometheus and Grafana as the main tools for monitoring our Service Fabric cluster. As the Prometheus target we use wmi_exporter with a predefined set of parameters: CPU, system, process, service, memory, etc. Our main goal is to monitor our product services on each instance of the node group in Azure Service Fabric.

For instance, we are using this PromQL query to calculate total CPU usage in %:

    100 - (avg by (hostname) (irate(wmi_cpu_time_total{scaleset="name",mode="idle"}[5m])) * 100)

The results looked more or less realistic, until we started writing queries for individual services.

For services, we use:

    sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100

The results sometimes look unrealistic, which is especially obvious when you compare them with the total CPU usage in %. I found an article that recommends multiplying by 100 to turn CPU time into a percentage, but in this case I get values of around 170% or more. Perhaps I need to divide by the number of CPU cores?

Regarding the query, I use sum by (process,hostname) because the exporter returns two separate series for each process, one per mode: user and privileged.

Can anyone please help me with the correct calculation for the process CPU time total metric and with converting it to a percentage?

Thank you, I would be grateful for any help!


Solution

  • I hope this helps! The result is pretty much the same as what the Windows performance manager shows. So, for the CPU % of running services (tasks, processes):

    sum by (process,hostname)(irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])) * 100 / 2  # 2 = number of CPU cores
    

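    First, sum all series for the running process: the exporter reports user-mode and kernel-mode (privileged) CPU time separately for the same process ID, so the two need to be added together. The same grouping must be applied to hostname (instance, etc.); in my case, I have Azure scale sets with 2 to 5 instances. Then multiply by 100 to get a percentage and divide by the number of CPU cores. The division is needed because each core can contribute up to one CPU-second per second, so on a 2-core machine the unnormalized value can reach 200%, which explains readings such as 170%.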
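
    If the instances do not all have the same number of cores, the hard-coded divisor could be replaced by a count derived from the exporter itself. The following is only a sketch, not something from the original setup: it assumes that wmi_cpu_time_total exposes one series per logical core and mode, and that both metrics carry the same hostname and scaleset labels used in the queries above.

    ```promql
    # Per-process CPU time across both modes (user + privileged), as a percentage
    sum by (process, hostname) (
      irate(wmi_process_cpu_time_total{scaleset="name", process=~"processes"}[5m])
    ) * 100
    # Divide by the per-host core count; group_left keeps the process label
    / on (hostname) group_left
    # Assumption: one mode="idle" series per logical core, so the count is the core count
    count by (hostname) (wmi_cpu_time_total{scaleset="name", mode="idle"})
    ```

    With this form, 100% means that the process is using every core on its host, whatever the instance size.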

    Cheers!