I am using Prometheus and Grafana to monitor some servers. One of the metrics I have exposed is called recent_tables, which contains the number of assets that have written to SQL tables in the past 15 minutes (machines automatically post to SQL). Its labels are table, job, and status_code. I also have a metric online_assets, which holds the number of assets that are online. Its labels are cluster_id, db_host, and job.
I am trying to make an alert for when fewer than 90% of online assets have written to SQL tables recently. Before I write the alert, I am trying to get a panel in Grafana to show the data, and eventually move the query into the expr of an alerting rule. The following queries do not work, and I don't understand why:
recent_tables < online_assets * 0.9
sum(recent_tables) by (table) < online_assets * 0.9
However, the following query works:
sum(recent_tables{table="<table>"}) - sum(online_assets)
I would rather not create a separate alert for every table (although this is possible through Ansible), so I would like to understand whether there is a way to get multiple series, one per table, out of the same query.
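As Michael Doubez pointed out, you cannot have unbalanced label dimensions when making queries: a binary operator between two instant vectors only pairs up elements whose label sets are identical. recent_tables and online_assets share only the job label, so the first two queries match nothing and return an empty result. The third query works because sum() without a by clause strips all labels from both sides.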
I ended up with the following:
sum(recent_tables) by (table) - ignoring(table) group_left() sum(online_assets) * 0.9 < 0
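This accounts for the mismatch in dimensions: ignoring(table) leaves the table label out of the matching, and group_left() lets the many per-table series on the left match the single total on the right. There may be a cleaner way.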
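For comparison, here are two possible shorter forms of the same check, written as a sketch rather than a tested rule. Both assume the metric names from the question (recent_tables and online_assets) and that sum(online_assets) produces exactly one series. Each returns one series per table whose count is below 90% of the online total.

```promql
# Variant 1: match on no labels at all, so the single total on the right
# is paired with every per-table series on the left.
sum by (table) (recent_tables) < on() group_left() (sum(online_assets) * 0.9)

# Variant 2: convert the single-series total to a scalar,
# so no vector matching is involved.
sum by (table) (recent_tables) < scalar(sum(online_assets)) * 0.9
```

One difference to keep in mind: scalar() returns NaN when its input does not contain exactly one series, and a comparison against NaN is never true, so Variant 2 returns nothing if online_assets is missing.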