airflow, airflow-scheduler, service-level-agreement

How to add SLAs to ETL jobs running in Airflow?


I am new to Apache Airflow. I have some DAGs already running in Airflow. Now I want to add SLAs to them so that I can track and monitor the tasks and get an alert if something breaks.

I know how to add an SLA to a DAG's default_args using timedelta(), like below:

from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # Passed to every task in the DAG that does not set its own sla
    'sla': timedelta(minutes=30),
}

But I have the following questions:

  1. Can we specify an SLA for the whole DAG, or only for individual tasks?

  2. What would be an appropriate SLA for a DAG that runs for 30 minutes?

  3. What would be an appropriate SLA for a task that runs for 5 minutes?

  4. Do we need to take retry_delay into account when specifying the SLA?


Solution

  • Can we specify an SLA for the whole DAG, or only for individual tasks?

    I believe SLAs can be set only on individual tasks, not on the DAG as a whole. But I think the same effect is achievable for the entire DAG (can't say for sure, though) by creating a closing task (a DummyOperator) that depends on all the other tasks in your DAG and setting an SLA on that closing task.
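The closing-task idea could look roughly like the sketch below. It is untested against any particular release, and the `dag_id`, the task names, and the `@daily` schedule are hypothetical. The import path assumes an Airflow 1.10-era install; newer releases move or rename this operator.

```python
from datetime import datetime, timedelta

from airflow import DAG
# Assumption: Airflow 1.10-style import path for DummyOperator.
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="etl_with_sla",                 # hypothetical name
    start_date=datetime(2015, 6, 1),
    schedule_interval="@daily",            # hypothetical schedule
) as dag:
    # Stand-ins for the real ETL tasks.
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    # A per-task SLA is set directly on the operator.
    load = DummyOperator(task_id="load", sla=timedelta(minutes=20))

    extract.set_downstream(transform)
    transform.set_downstream(load)

    # Closing task: depends on every other task, so its SLA
    # acts as an approximate SLA for the DAG run as a whole.
    all_done = DummyOperator(task_id="all_done", sla=timedelta(minutes=30))
    all_done.set_upstream([extract, transform, load])
```

Whether an SLA miss on a no-op closing task is reported exactly like one on a real task is worth checking on your own Airflow version before relying on it.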


  • What would be an appropriate SLA for a DAG that runs for 30 minutes?

    This depends entirely on factors like the criticality of your task, its failure rate, etc. But I would suggest that you begin with a 'strict-enough' timedelta (like 5 minutes) and then tune it (increase or decrease) from there.


  • What would be an appropriate SLA for a task that runs for 5 minutes?

    Same as above: start with 1 minute and tune from there.


  • Do we need to take retry_delay into account when specifying the SLA?

    Going by the docs, I'd say yes:

    :param sla: time by which the job is expected to succeed. Note that
            this represents the ``timedelta`` after the period is closed. For
            example if you set an SLA of 1 hour, the scheduler would send an email
            soon after 1:00AM on the ``2016-01-02`` if the ``2016-01-01`` instance
            has not succeeded yet.
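
To make the docstring's arithmetic concrete, here is a small plain-Python sketch (no Airflow needed). The helper names `sla_deadline` and `worst_case_duration` are made up for illustration and are not part of Airflow. The worst-case figure assumes every attempt takes the full runtime and every allowed retry is used.

```python
from datetime import datetime, timedelta


def sla_deadline(period_end: datetime, sla: timedelta) -> datetime:
    """Moment after which the run counts as having missed its SLA."""
    return period_end + sla


def worst_case_duration(runtime: timedelta, retries: int,
                        retry_delay: timedelta) -> timedelta:
    """Upper bound on a task's duration if all retries are used.

    Assumes each attempt takes `runtime` and that `retry_delay`
    is waited before each retry.
    """
    attempts = retries + 1
    return runtime * attempts + retry_delay * retries


# Docstring example: the 2016-01-01 daily period closes at
# midnight on 2016-01-02, and the SLA is 1 hour.
deadline = sla_deadline(datetime(2016, 1, 2), timedelta(hours=1))
print(deadline)  # 2016-01-02 01:00:00

# Values from the question: a 5-minute task with retries=1
# and retry_delay of 5 minutes.
budget = worst_case_duration(
    timedelta(minutes=5), retries=1, retry_delay=timedelta(minutes=5)
)
print(budget)  # 0:15:00
print(budget <= timedelta(minutes=30))  # True
```

Since the SLA clock starts when the period closes rather than when the task starts, the time spent in upstream tasks also eats into the same budget.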