Search code examples
promql

showing only data from machines which do not show up in another query


I have a fairly typical show CPU usage query

100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80

which results in data which looks a bit like this:

{instance="opus143.domain.com:9182"} 94.07140535559513 
{instance="opus162.domain.com:9182"} 90.00755315803018 
{instance="opus163.domain.com:9182"} 85.48084077380952 

but I would like to query only for values for machines which don't appear in another list

opus_local_slaves_count > 0

opus_local_slaves_count{instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 

I think I've been able to get part of the way there using label_replace to give me the host in each case

(label_replace((100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80), "host", "$1","instance","(.*?)[.].*"))

{host="opus143",instance="opus143.domain.com:9182"} 94.07140535559513 
{host="opus162",instance="opus162.domain.com:9182"} 90.00755315803018 
{host="opus163",instance="opus163.domain.com:9182"} 85.48084077380952 

label_replace((opus_local_slaves_count > 0), "host", "$1","instance","(.*?)[.].*")

opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54 
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54 

But now I am really stuck trying to exclude hosts in the second list, from the first. Is this even possible in PromQL? In SQL it would be a simple NOT IN subquery

UPDATE: For context, what I'm trying to achieve is to be able to alert for high-CPU on servers, except servers in the second list, which should have high-CPU utilisation. Maybe there's a better way?


Solution

  • Solved it!

    For anyone who finds this looking to do something similar... The saliant keyword is UNLESS!

    I first simplified things by creating recording rules:

    groups:
    - name: custom_rules
      rules:
      - record: wmi_cpu_time_total_instance
        expr: 100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
      - record: wmi_cpu_time_total_instance_host
        expr: label_replace(wmi_cpu_time_total_instance, "host", "$1", "instance","(.*?)[.].*")
      - record: opus_local_slaves_count_instance_host
        expr: label_replace(opus_local_slaves_count, "host", "$1", "instance","(.*?)[.].*")
    

    which encapsulated much of the complexity in calculating and adding the host label, and I then found this blog (thank you Chris Siebenmann) https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusFindUnpairedMetrics which pointed me at the UNLESS keyword so I could write the simple query

    wmi_cpu_time_total_instance_host unless on(host) (opus_local_slaves_count_instance_host > 0)

    which gives a list of hosts which don't have the opus_local_slaves_count label, or for which the opus_local_slaves_count = 0

    Voila!