Tags: kubernetes, prometheus, azure-aks, prometheus-alertmanager

Certain Prometheus alerts end up as "Not grouped"


I am running Prometheus (bitnami/prometheus:2.25.2) with Prometheus Alertmanager (bitnami/alertmanager:0.21.0) on AKS 1.19.9.
Alerting is handled by the Alertmanager, which in turn routes the alerts to Slack channels.

I have noticed that lately certain alerts have been ending up in the “Not grouped” section within the Prometheus Alertmanager WebUI, and not making it into the Slack channel.

[Screenshot: alerts listed under “Not grouped” in the Alertmanager web UI]

I am unable to explain this, as the alerts are grouped by [cluster, alertname] and do contain these labels (blurred in the screenshot, but cluster contains the same value).

To make matters even more confusing (for me anyway), there are other alerts that carry the same labels and are routed correctly.

[Screenshot: similar alerts that are grouped and delivered to Slack correctly]

The Alertmanager routing tree in the config:

spec:
  route:
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 3h
    receiver: fallback
    routes:
    - matchers:
      - name: team
        value: platform-engineering
      groupBy: [cluster, alertname]
      receiver: fallback
      routes:
      - matchers:
        - name: severity
          value: critical
        groupBy: [cluster, alertname]
        receiver: alerts-critical
      - matchers:
        - name: severity
          value: warning
        groupBy: [cluster, alertname]
        receiver: alerts-warning
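
For reference, an alert carrying labels like these (values are just examples) is what I would expect to reach the alerts-critical receiver, grouped by cluster and alertname:

labels:
  team: platform-engineering    # matched by the first route
  severity: critical            # matched by the nested route -> alerts-critical
  cluster: my-aks-cluster       # example value; used for grouping
  alertname: KubePodNotReady    # example value; used for grouping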

Does anybody care to take a stab at what is wrong here? I am obviously missing something :-)
Many thanks in advance!


Solution

  • Think I found the issue. The Prometheus running on the cluster is provisioned by the Prometheus Operator (bitnami/prometheus-operator:0.53.1). The routing tree listed above is what you see when looking at the Alertmanager configuration before deployment.

    However, when you visit the Alertmanager web UI after deployment and click the “Status” tab at the top of the page, it tells a different story. What I found was that the operator injects an extra matcher on the namespace label into the routing tree during deployment.

    [Screenshot: rendered routing tree on the Status tab, showing the injected matcher]

    This obviously has consequences for the matching and grouping of alerts, in particular when the alerts hitting the routing tree do not originate from the monitoring namespace.
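
    Roughly sketched (not copied verbatim from the Status tab, and with the operator-generated receiver name prefixes left out), the sub-tree generated from the AlertmanagerConfig looks something like this:

    # Sub-route generated from the AlertmanagerConfig (simplified sketch)
    - receiver: fallback
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      match:
        namespace: monitoring            # <- extra matcher injected by the operator
      routes:
      - receiver: fallback
        group_by: [cluster, alertname]
        match:
          team: platform-engineering
        routes:
        - receiver: alerts-critical
          group_by: [cluster, alertname]
          match:
            severity: critical
        - receiver: alerts-warning
          group_by: [cluster, alertname]
          match:
            severity: warning

    Alerts from other namespaces never match the injected namespace matcher, so they presumably fall through to the Alertmanager’s default route (which typically has no group_by configured), which would explain the “Not grouped” entries that never reach Slack.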

    In my case only monitoring workloads reside in that namespace, and the bulk of the workloads come from namespaces outside monitoring.

    Reading GitHub issue 3737 on the Prometheus Operator repo confirmed this suspicion.

    As a workaround I tried Till Adam’s suggestion of adding these labels to the alerting rules:

    kind: clustermanagement                         # label identifying cluster-level alerts
    namespace: prometheus                           # overrides the namespace label on the alert
    source_namespace: '{{ $labels.namespace }}'     # preserves the original namespace
    

    With this, we have one AlertmanagerConfig in the prometheus namespace that is responsible for all cluster-related alerts, no matter what namespace label the metrics originally had. Because the rules force namespace to prometheus, the matcher that the operator injects for that AlertmanagerConfig now matches.
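
    A minimal sketch of such an AlertmanagerConfig (the name is illustrative, the Slack settings are omitted, and matching on the kind label from the snippet above is an assumption about how you route these alerts):

    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: cluster-alerts                 # illustrative name
      namespace: prometheus                # injected matcher becomes namespace="prometheus"
    spec:
      receivers:
      - name: alerts-critical              # Slack configuration omitted for brevity
      route:
        receiver: alerts-critical
        groupBy: [cluster, alertname]
        matchers:
        - name: kind                       # assumption: route on the label set in the
          value: clustermanagement         # workaround snippet above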

    Note that your alerting rules also need to be adjusted when using this! The actual namespace the alert stems from will now be in source_namespace.
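
    For example, the adjusted labels end up on each alerting rule roughly like this (the rule name, expression, and threshold are just placeholders):

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: cluster-rules                                # placeholder name
      namespace: prometheus
    spec:
      groups:
      - name: cluster.rules
        rules:
        - alert: KubePodNotReady                         # placeholder alert
          expr: kube_pod_status_ready{condition="false"} == 1
          for: 15m
          labels:
            kind: clustermanagement
            namespace: prometheus                        # overrides the metric's namespace label
            source_namespace: '{{ $labels.namespace }}'  # keeps the original namespace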

    The only edge cases I have encountered are alerts where you end up losing the namespace label. This seems to occur when the alert expression uses aggregation operators (count, for example).
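
    A quick illustration (the metric and threshold are just examples): an aggregation without a by clause drops the namespace label, so the source_namespace template has nothing to pick up; keeping the label in the aggregation avoids that:

    # namespace label dropped by the aggregation -> source_namespace ends up empty
    expr: count(kube_pod_container_status_restarts_total > 5)

    # namespace label preserved -> source_namespace is filled in as intended
    expr: count by (namespace) (kube_pod_container_status_restarts_total > 5)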

    If I am not mistaken, PR 3821 will introduce the fix (a global AlertmanagerConfig) for this challenge.