I have a fairly typical show CPU usage query
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80
which results in data which looks a bit like this:
{instance="opus143.domain.com:9182"} 94.07140535559513
{instance="opus162.domain.com:9182"} 90.00755315803018
{instance="opus163.domain.com:9182"} 85.48084077380952
but I would like to query only for values for machines which don't appear in another list
opus_local_slaves_count > 0
opus_local_slaves_count{instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
I think I've been able to get part of the way there using label_replace to give me the host in each case
(label_replace((100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80), "host", "$1","instance","(.*?)[.].*"))
{host="opus143",instance="opus143.domain.com:9182"} 94.07140535559513
{host="opus162",instance="opus162.domain.com:9182"} 90.00755315803018
{host="opus163",instance="opus163.domain.com:9182"} 85.48084077380952
label_replace((opus_local_slaves_count > 0), "host", "$1","instance","(.*?)[.].*")
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
But now I am really stuck trying to exclude hosts in the second list, from the first. Is this even possible in PromQL? In SQL it would be a simple NOT IN subquery
UPDATE: For context, what I'm trying to achieve is to be able to alert for high-CPU on servers, except servers in the second list, which should have high-CPU utilisation. Maybe there's a better way?
Solved it!
For anyone who finds this looking to do something similar... The saliant keyword is UNLESS!
I first simplified things by creating recording rules:
groups:
- name: custom_rules
rules:
- record: wmi_cpu_time_total_instance
expr: 100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
- record: wmi_cpu_time_total_instance_host
expr: label_replace(wmi_cpu_time_total_instance, "host", "$1", "instance","(.*?)[.].*")
- record: opus_local_slaves_count_instance_host
expr: label_replace(opus_local_slaves_count, "host", "$1", "instance","(.*?)[.].*")
which encapsulated much of the complexity in calculating and adding the host label, and I then found this blog (thank you Chris Siebenmann) https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusFindUnpairedMetrics which pointed me at the UNLESS keyword so I could write the simple query
wmi_cpu_time_total_instance_host unless on(host) (opus_local_slaves_count_instance_host > 0)
which gives a list of hosts which don't have the opus_local_slaves_count label, or for which the opus_local_slaves_count = 0
Voila!