I am running Prometheus (bitnami/prometheus:2.25.2) with Prometheus Alertmanager (bitnami/alertmanager:0.21.0) on AKS 1.19.9.
Alerting is handled by Alertmanager, which in turn routes the alerts to Slack channels.
I have noticed that lately certain alerts have been ending up in the “Not grouped” section of the Alertmanager WebUI instead of making it into the Slack channel.
I am unable to explain this, as they are grouped by [cluster, alertname] and do contain these labels (blurred in the screenshot, but cluster contains the same value).
To make matters even more confusing (for me, anyway), there are certain alerts that also carry these labels and are delivered correctly.
The Alertmanager routing tree in the config:
```yaml
spec:
  route:
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 3h
    receiver: fallback
    routes:
      - matchers:
          - name: team
            value: platform-engineering
        groupBy: [cluster, alertname]
        receiver: fallback
        routes:
          - matchers:
              - name: severity
                value: critical
            groupBy: [cluster, alertname]
            receiver: alerts-critical
          - matchers:
              - name: severity
                value: warning
            groupBy: [cluster, alertname]
            receiver: alerts-warning
```
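For reference, an alert carrying labels like the following (all values here are hypothetical) should descend into the team route and then match the critical child route:

```yaml
# Hypothetical alert labels; cluster and alertname values are examples
labels:
  team: platform-engineering
  severity: critical
  cluster: prod-aks
  alertname: KubePodCrashLooping
```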
Does anybody care to take a stab at what is wrong here? I am obviously missing something :-)
Many thanks in advance!
I think I found the issue.
The Prometheus running on the cluster is provisioned by the Prometheus operator: bitnami/prometheus-operator:0.53.1
The routing tree listed above is what you see when looking at the Alertmanager configuration before deployment.
However, when you visit the Alertmanager WebUI after deployment and click the “Status” tab at the top of the page, it tells a different story: the operator was injecting an extra namespace matcher into the routing tree during deployment.
This obviously has consequences for the matching and grouping of alerts, in particular when the alerts hitting the routing tree do not originate from the monitoring namespace (the namespace the AlertmanagerConfig resource lives in).
In my case only monitoring workloads reside here and the bulk of the workloads comes from namespaces outside monitoring.
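The effective config in the Status tab looked roughly like the routing tree above with a namespace matcher added, along these lines (a sketch, not the exact operator output):

```yaml
routes:
  - matchers:
      - name: namespace
        value: monitoring   # injected by the operator
      - name: team
        value: platform-engineering
    groupBy: [cluster, alertname]
    receiver: fallback
```

Any alert whose namespace label is not monitoring therefore never enters this branch, which would explain why those alerts end up under “Not grouped” instead of reaching the Slack receivers.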
Reading GitHub issue 3737 on the Prometheus Operator repo confirmed this suspicion.
As a workaround I tried Till Adam’s suggestion:
```yaml
kind: clustermanagement
namespace: prometheus
source_namespace: '{{ $labels.namespace }}'
```
With this, we have one AlertmanagerConfig in the prometheus namespace responsible for all cluster-related alerts, no matter what namespace label the metrics originally had.
Note that your alerting rules should also be adjusted when using this! The actual namespace the alert stems from will now be in source_namespace.
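Applied to a rule, the override looks roughly like this (the rule name, alert name, and expression are made up for illustration; only the labels block carries the actual workaround):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules        # hypothetical
  namespace: prometheus
spec:
  groups:
    - name: example
      rules:
        - alert: PodNotReady                             # hypothetical alert
          expr: kube_pod_status_ready{condition="false"} == 1
          for: 15m
          labels:
            kind: clustermanagement
            severity: warning
            namespace: prometheus                        # force a match on the injected matcher
            source_namespace: '{{ $labels.namespace }}'  # preserve the real namespace
```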
The only edge cases I have encountered are alerts that end up losing the namespace label entirely. This seems to occur when the alert expression uses aggregation operators (count, for example).
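For example, an expression like the first one below drops every label, including namespace, so the source_namespace template has nothing to copy; aggregating per namespace keeps it (the metric selector is a made-up illustration):

```promql
# Drops all labels, including namespace:
count(kube_pod_container_status_restarts_total > 5)

# Keeps the namespace label by aggregating per namespace:
count by (namespace) (kube_pod_container_status_restarts_total > 5)
```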
If I am not mistaken, PR 3821 will introduce the fix (a global AlertmanagerConfig) for this challenge.