kubernetes prometheus grafana grafana-alerts

Alerts in K8s for Pod failing


I want to create alerts in Grafana for my Kubernetes clusters. I have configured Prometheus, node exporter, kube-state-metrics, and Alertmanager in my k8s cluster. I want to set up alerting on unschedulable or failed pods:

  1. Cause of unschedulable or failed pods
  2. Generating an alert after a while
  3. Creating another alert to notify us when pods fail

Can you guide me on how to achieve this?

Solution

  • Based on the comment from Suresh Vishnoi:

    it might be helpful awesome-prometheus-alerts.grep.to/rules.html#kubernetes

    Yes, this could be very helpful. That site has templates for failed (not healthy) pods:

    Pod has been in a non-ready state for longer than 15 minutes.

      - alert: KubernetesPodNotHealthy
        expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
          description: "Pod has been in a non-ready state for longer than 15 minutes.\n  V
    

    or for crash looping:

    Pod {{ $labels.pod }} is crash looping

      - alert: KubernetesPodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          description: "Pod {{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
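
    For the cause of a failure, kube-state-metrics also exposes the reason a container is stuck waiting, through kube_pod_container_status_waiting_reason. The rule below is only a sketch in the style of the templates above, not a rule copied from that site: the alert name, the 5m duration, and the list of reasons are assumptions to adjust for your cluster.

    ```yaml
    # Sketch: fire when a container stays in a known error waiting state.
    # Alert name, reason list, and "for" duration are illustrative assumptions.
    - alert: KubernetesContainerWaitingWithError
      expr: sum by (namespace, pod, container, reason) (kube_pod_container_status_waiting_reason{reason=~"CrashLoopBackOff|ImagePullBackOff|ErrImagePull|CreateContainerConfigError"}) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Container waiting ({{ $labels.namespace }}/{{ $labels.pod }})
        description: "Container {{ $labels.container }} is waiting with reason {{ $labels.reason }}"
    ```

    Because the expression keeps the reason label, the notification itself says why the pod is not starting.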
    

    See also this good guide about monitoring a Kubernetes cluster with Prometheus:

    The Kubernetes API and the kube-state-metrics (which natively uses prometheus metrics) solve part of this problem by exposing Kubernetes internal data, such as the number of desired / running replicas in a deployment, unschedulable nodes, etc.

    Prometheus is a good fit for microservices because you just need to expose a metrics port, and don’t need to add too much complexity or run additional services. Often, the service itself is already presenting a HTTP interface, and the developer just needs to add an additional path like /metrics.

    When it comes to unschedulable nodes, you can use the metric kube_node_spec_unschedulable. It is described as follows: kube_node_spec_unschedulable - Whether a node can schedule new pods or not.
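
    An alert on that metric could look roughly like this. It is a sketch under assumptions: the alert name, the 15m duration, and the severity are made up for illustration, and a node that is cordoned on purpose for maintenance will also trigger it.

    ```yaml
    # Sketch: fire when a node has been unschedulable (e.g. cordoned) for a while.
    # Alert name, duration, and severity are illustrative assumptions.
    - alert: KubernetesNodeUnschedulable
      expr: kube_node_spec_unschedulable == 1
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes node unschedulable (node {{ $labels.node }})
        description: "Node {{ $labels.node }} has been unschedulable for more than 15 minutes"
    ```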

    Also look at this guide. Basically, you need to find the metric you want to monitor and write an alerting rule for it in Prometheus. Alternatively, you can use the templates shown at the beginning of this answer.
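
    For unschedulable pods specifically, kube-state-metrics has the metric kube_pod_status_unschedulable. As a sketch of how such a rule fits into a complete rules file, assuming a plain Prometheus setup that loads files listed under rule_files in prometheus.yml (the group name, alert name, and 10m duration are hypothetical):

    ```yaml
    # Sketch of a standalone rules file, e.g. a hypothetical pod-alerts.yml
    # referenced from rule_files in prometheus.yml.
    groups:
      - name: kubernetes-pods   # illustrative group name
        rules:
          # Fire when the scheduler has been unable to place a pod for 10 minutes.
          - alert: KubernetesPodUnschedulable
            expr: sum by (namespace, pod) (kube_pod_status_unschedulable) > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Kubernetes pod unschedulable ({{ $labels.namespace }}/{{ $labels.pod }})
              description: "Pod {{ $labels.pod }} could not be scheduled for more than 10 minutes"
    ```

    The for: clause is what gives you "an alert after a while": the expression has to stay true for the whole duration before the alert fires. You can validate the file with promtool check rules before reloading Prometheus.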