
Prometheus: How to create alerts based on the result of any Airflow DAG instead of a specific Airflow DAG


I installed Airflow with StatsD support using "pip install 'apache-airflow[statsd]'", and I installed statsd_exporter. I can now see the Airflow metrics in Prometheus, but all of them have the dag_id and task_id as part of the metric name.

For example, for DAG ID "dag1" with task ID "task1", the metric for the time taken to finish a task is airflow_dag_dag1_task1_duration. For DAG ID "dag2" with task ID "task2", the metric is airflow_dag_dag2_task2_duration.

What I am interested in is something like: trigger an alert if any DAG fails, or trigger an alert if a DAG takes more than XXX seconds to complete. In other words, I do not want to create a rule and alert for each individual DAG or task; I want to alert on the generic situation.

How can I create a rule/alert in Prometheus for the generic case?


Solution

  • You should be able to match all of these metrics with a regex on the metric name, something like the selector below, and base your alerting on that:

    {__name__=~"airflow_dag.+_duration"}
    

    Be warned, though: if you use just {__name__=~".+"}, it matches every series, so it is an expensive query and could potentially break the instance, requiring a restart of the Prometheus service.

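    Updated to include an example alert (this one matches Windows CPU metrics; substitute the Airflow selector above and your own threshold):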

    - alert: Saf_Test
      expr: ({__name__=~"windows_cpu.+_total"} > 5.8281319e+07)
      for: 5m
      labels:
        severity: warning
      annotations:
        description: 'Alert text here'
        summary: 'Summary here'
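
    Applied to the Airflow metrics from the question, the same pattern might look like the sketch below. The group name, alert name, the 300 threshold, and the "metric" label are all placeholders that do not come from the original setup. The label_replace call copies the metric name into a regular label: Prometheus drops the metric name when it builds alert labels, so without it two matching series that differ only by name could collide, and the label also lets the annotation say which DAG and task fired. The unit of the threshold depends on how your statsd_exporter reports timers, so check a sample value before choosing it.

    ```yaml
    groups:
      - name: airflow-generic              # hypothetical group name
        rules:
          - alert: AirflowTaskDurationHigh # hypothetical alert name
            # Match every airflow_dag_<dag_id>_<task_id>_duration series and
            # copy its metric name into a "metric" label (hypothetical label name).
            expr: |
              label_replace(
                {__name__=~"airflow_dag.+_duration"},
                "metric", "$1", "__name__", "(.+)"
              ) > 300
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: 'Airflow task duration above threshold'
              description: '{{ $labels.metric }} has been above the placeholder threshold for 5 minutes'
    ```

    One rule of this shape covers every DAG and task whose metric name matches the regex, so new DAGs are picked up without adding rules.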