I installed airflow[statsd] using "pip install 'apache-airflow[statsd]'" and I installed statsd_exporter. Now I can see Airflow metrics in Prometheus, but all of the Airflow metrics have the dag_id and task_id as part of the metric name.
For example, for dag id "dag1" with task id "task1", the metric for the time taken to finish a task is: airflow_dag_dag1_task1_duration. For dag id "dag2" with task id "task2", the metric is: airflow_dag_dag2_task2_duration.
What I am interested in is something like: trigger an alert if any dag fails, or trigger an alert if a dag takes more than XXX sec to complete. In other words, I do not want to create a rule and alert for each individual dag or task. I want to alert on the generic situation.
How can I create a rule/alert in Prometheus for the generic case?
You should be able to match on the metric name with a regex, something like the below, and base your alerting off that:
{__name__=~"airflow_dag.+_duration"}
Be warned, though: if you use just {__name__=~".+"}, the matcher selects every series on the server. It's an expensive query and could potentially break the instance, requiring a restart of the Prometheus service.
Updated to include example alert:
- alert: Saf_Test
  expr: ({__name__=~"windows_cpu.+_total"} > 5.8281319e+07)
  for: 5m
  labels:
    severity: warning
  annotations:
    description: 'Alert text here'
    summary: 'Summary here'
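Adapted to the Airflow metric names from the question, a rule along these lines might work. Treat it as a minimal sketch rather than a tested configuration: the group name, the alert name, the 300 threshold, and the "metric" label are all placeholders made up for illustration, and it assumes statsd_exporter exposes the timers in seconds. The label_replace call copies the metric name into an ordinary label, so the annotation can say which dag/task series fired and series from different dags stay distinguishable.

```yaml
groups:
  - name: airflow_generic                  # hypothetical group name
    rules:
      - alert: AirflowTaskDurationHigh     # hypothetical alert name
        # Copy the metric name into a "metric" label (assumed label name),
        # then keep only the series above the placeholder threshold.
        expr: |
          label_replace(
            {__name__=~"airflow_dag.+_duration"},
            "metric", "$1", "__name__", "(.+)"
          ) > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Airflow duration above threshold'
          description: '{{ $labels.metric }} is at {{ $value }}'
```

If the exporter publishes these timers as summaries, each quantile series is evaluated separately, so you may want to add a quantile matcher inside the selector to avoid duplicate alerts.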