I am new to Grafana and Prometheus. I have read a lot of documentation, and now I'm trying to work backwards by reviewing some existing queries and making sure I understand them.
I have downloaded the Node Exporter Full dashboard (https://grafana.com/grafana/dashboards/1860). I have been reviewing the CPU Busy query and I'm a bit confused. I am quoting it below, spaced out so we can see the nested sections better:
In this query, job is node-exporter while instance is the IP and port of the server. This is my base understanding of the query:
node_cpu_seconds_total is a counter of the number of seconds the CPU took at a given sample.
My questions are as follows:
If the selector already filters on mode=idle, then does adding the by (mode) add anything? There is only one mode anyway. My understanding is that by (something) is more relevant when there are multiple values and we group the values by that category (as we do by cpu in this query).

Both of these count functions return the number of CPU cores. If you take them out of this long query and execute them on their own, it immediately makes sense:
count by (cpu) (node_cpu_seconds_total{instance="foo:9100"})
# result:
{cpu="0"} 8
{cpu="1"} 8
By putting the above into another count() function, you will get a value of 2, because there are just 2 series in that result. At this point, we can simplify the original query to this:
(
  (
    NUM_CPU
    -
    avg(
      sum by(mode) (
        rate(node_cpu_seconds_total{mode="idle",instance="foo:9100"}[1m])
      )
    )
  )
  * 100
)
/ NUM_CPU
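To make the NUM_CPU step concrete, here is a minimal Python sketch of what the nested count does. The label sets are hypothetical (2 cores with 8 modes each, chosen to mirror the example output above), and the code only imitates the grouping logic; it is not how Prometheus evaluates the query.

```python
from collections import Counter

# Hypothetical label sets: 2 cores x 8 modes, mirroring the example output.
MODES = ["idle", "iowait", "irq", "nice", "softirq", "steal", "system", "user"]
series = [{"cpu": str(cpu), "mode": mode} for cpu in range(2) for mode in MODES]

# Inner step, like `count by (cpu) (...)`: number of series per cpu label value.
per_cpu = Counter(s["cpu"] for s in series)

# Outer step, like `count(...)`: number of series the inner step returned.
num_cpu = len(per_cpu)

print(dict(per_cpu))  # {'0': 8, '1': 8}
print(num_cpu)        # 2
```

The inner count collapses the modes into one series per core, and the outer count then counts those series, which is why the result is the number of cores.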
The rest, however, is somewhat complicated. This:
sum by(mode) (
rate(node_cpu_seconds_total{mode="idle",instance="foo:9100"}[1m])
)
... is essentially the sum of idle time of all CPU cores (I'm intentionally skipping the context of time to make it simpler). It's not clear why there is by (mode), since the selector inside rate() already filters on mode="idle", so only the idle mode can appear. With or without by (mode), it returns just one value:
# with by (mode)
{mode="idle"} 0.99
# without
{} 0.99
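The same point can be sketched in Python. The per-core idle rates below are invented for illustration, and sum_by is a hypothetical helper that only imitates PromQL grouping: with a single mode left after filtering, grouping by mode changes the label on the result but not its value, and averaging a single value returns that value.

```python
from collections import defaultdict

# Hypothetical per-core idle rates after the mode="idle" filter (values invented).
idle_rates = [
    ({"cpu": "0", "mode": "idle"}, 0.25),
    ({"cpu": "1", "mode": "idle"}, 0.5),
]

def sum_by(samples, label=None):
    """Sum sample values per value of `label`, or into one group if label is None."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = (labels[label],) if label else ()
        groups[key] += value
    return dict(groups)

with_by = sum_by(idle_rates, "mode")  # one group, keyed by ("idle",)
without_by = sum_by(idle_rates)       # one group, keyed by ()

# Averaging over the single resulting group leaves the value unchanged.
values = list(with_by.values())
avg_of_sum = sum(values) / len(values)

print(with_by, without_by, avg_of_sum)
```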
avg() on top of that makes no sense at all, because there is only one value to average. I assume the intention was to get the amount of idle time per CPU (by (cpu), that is). In that case it starts to make sense, although it is still unnecessarily complex. Thus, at this point we can simplify the query to this:
((NUM_CPU - IDLE_TIME_TOTAL) * 100) / NUM_CPU
I don't know why it is so complicated; you can get the same result with a simple query like this:
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="foo:9100"}[1m])))
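As a quick sanity check of that equivalence, here is a small Python sketch. It assumes the long form is read as (NUM_CPU - total idle rate) * 100 / NUM_CPU, and the per-core idle rates are made-up values, not real measurements.

```python
# Hypothetical per-core idle rates (fraction of each second spent idle).
idle_rate_per_core = [0.5, 0.25, 0.75, 0.5]

num_cpu = len(idle_rate_per_core)     # what the nested count() yields
idle_total = sum(idle_rate_per_core)  # what sum(rate(...{mode="idle"})) yields

# Long form: subtract total idle from the core count, scale, divide by cores.
busy_long = (num_cpu - idle_total) * 100 / num_cpu

# Short form: 100 * (1 - avg(idle rate per core)).
busy_short = 100 * (1 - idle_total / num_cpu)

print(busy_long, busy_short)  # 50.0 50.0
```

Both forms reduce to 100 * (1 - idle_total / num_cpu), so they agree for any set of per-core rates, not just these sample values.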