airflow airflow-scheduler service-level-agreement

How to add SLAs to ETL jobs running in Airflow?

I am new to Apache Airflow. I have some DAGs already running in Airflow. Now I want to add SLAs to them so that I can track and monitor the tasks and get an alert if something breaks.

I know how to add an SLA to a DAG's default_args using timedelta(), like below:

from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'sla': timedelta(minutes=30)
}

But I have the following questions:

Can we specify an SLA for the whole DAG, or only for individual tasks?
What would be an appropriate SLA for a DAG that runs for 30 minutes?
What would be an appropriate SLA for a task that runs for 5 minutes?
Do we need to take retry_delay into account when specifying an SLA?

Solution

Can we specify an SLA for the whole DAG, or only for individual tasks?

I believe SLAs can be set only on individual tasks, not on a DAG as a whole. But I think the same effect is achievable for an entire DAG (can't say for sure, though) by creating a task at the end (e.g. a DummyOperator) that depends on all the other tasks of your DAG and setting an SLA on that closing task.
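The closing-task idea could look roughly like the DAG definition fragment below. This is only a sketch: the DAG id, the task ids, the daily schedule and the 45-minute SLA are made up for illustration, and the import path assumes an Airflow 1.10-style installation (newer releases rename the module and the operator).

```python
from datetime import datetime, timedelta

from airflow import DAG
# Assumption: Airflow 1.10-style import path; adjust for your version.
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('etl_with_sla',  # hypothetical DAG id
         default_args=default_args,
         schedule_interval=timedelta(days=1)) as dag:
    # Placeholder tasks standing in for the real ETL steps.
    extract = DummyOperator(task_id='extract')
    transform = DummyOperator(task_id='transform')
    load = DummyOperator(task_id='load')

    # Closing task: it can only succeed after every other task has
    # succeeded, so its SLA behaves like a deadline for the whole run.
    all_done = DummyOperator(task_id='all_done', sla=timedelta(minutes=45))

    extract >> transform >> load
    [extract, transform, load] >> all_done
```

Only all_done carries an SLA here. Putting 'sla' in default_args instead would apply the same value to every task in the DAG.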

What would be an appropriate SLA for a DAG that runs for 30 minutes?

This depends entirely on factors like the criticality of your task, its failure rate, etc. But I would suggest that you begin with a 'strict-enough' timedelta (like 5 minutes) and then tune it (increase or decrease) from there.

What would be an appropriate SLA for a task that runs for 5 minutes?

Same as above: start with 1 minute and tune from there.

Do we need to take retry_delay into account when specifying an SLA?

Going by the docs, I'd say yes:

:param sla: time by which the job is expected to succeed. Note that
        this represents the ``timedelta`` after the period is closed. For
        example if you set an SLA of 1 hour, the scheduler would send an email
        soon after 1:00AM on the ``2016-01-02`` if the ``2016-01-01`` instance
        has not succeeded yet.
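
To make that concrete, here is a small worked calculation in plain Python (no Airflow needed). It uses the retries, retry_delay and sla values from the question. The daily schedule, the 5-minute task duration, and the idea that the task starts right when the period closes (no queueing or upstream wait) are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Values from the question's default_args.
retries = 1
retry_delay = timedelta(minutes=5)
sla = timedelta(minutes=30)

# Assumptions for illustration: a daily schedule and a 5-minute task.
schedule_interval = timedelta(days=1)
task_duration = timedelta(minutes=5)
execution_date = datetime(2016, 1, 1)

# Per the docstring, the SLA is counted from the moment the period closes.
period_close = execution_date + schedule_interval
sla_deadline = period_close + sla

# Worst case: the task succeeds only on its last attempt, so every
# attempt runs in full, with one retry_delay before each retry.
attempts = retries + 1
worst_case_runtime = attempts * task_duration + retries * retry_delay
worst_case_finish = period_close + worst_case_runtime

print(sla_deadline)                       # 2016-01-02 00:30:00
print(worst_case_runtime)                 # 0:15:00
print(worst_case_finish <= sla_deadline)  # True
```

With these numbers, one failed attempt plus the retry still fits inside the 30-minute SLA. A tighter SLA (say 10 minutes) would be missed whenever a retry is needed, which is why retries and retry_delay belong in the budget.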