python, airflow, etl, airflow-scheduler

How to make a DAG in Apache Airflow run like a simple cron job?


The Airflow scheduler has left me scratching my head for the past few days: it seems to backfill DAG runs even with catchup=False. My timezone-aware DAG has a start date of 13-04-2021 19:30 PST (14-04-2021 02:30 UTC) and the following configuration:

import pendulum
from airflow import DAG

# define DAG and its parameters (default_args is not shown here)
dag = DAG(
    'backup_dag',
    default_args=default_args,
    start_date=pendulum.datetime(2021, 4, 13, 19, 30, tz='US/Pacific'),  # start_date in the US/Pacific time zone
    description='A data backup pipeline',
    schedule_interval="30 19 * * *",  # 7:30 PM every day
    catchup=False,
    is_paused_upon_creation=False,
)

This DAG runs on an edge device that is sometimes on and sometimes off. I want the DAG to run at 19:30 PST (02:30 UTC) only if the edge device is on at that time, and otherwise not run at all. The weird thing is that when I deploy the container with the DAG to the edge device, the DAG automatically starts its first run outside the scheduled time, for an interval that has already passed!

[Screenshot from 2021-04-19 16-50-54]

What am I missing here? I can't wrap my head around why the scheduler is doing this.

The following is my understanding after reading the documentation; please correct me if I'm wrong.

The DAG was picked up by the scheduler at 2021-04-19T11:30:00+00:00 UTC; according to the DAG config, it should ideally first run at 2021-04-20T02:30:00+00:00 UTC. All times below are in UTC.

Event                            Time (UTC)
DAG start_date                   2021-04-14T02:30:00+00:00
1st run (skip, catchup=False)    2021-04-15T02:30:00+00:00
2nd run (skip, catchup=False)    2021-04-16T02:30:00+00:00
3rd run (skip, catchup=False)    2021-04-17T02:30:00+00:00
4th run (skip, catchup=False)    2021-04-18T02:30:00+00:00
5th run (skip, catchup=False)    2021-04-19T02:30:00+00:00
6th run (should execute)         2021-04-20T02:30:00+00:00

So why is the 5th run taking place, for the interval 2021-04-18T02:30:00+00:00 to 2021-04-19T02:30:00+00:00, even though that interval has already passed?
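For reference, the interval in question can be reproduced with plain datetime arithmetic. This is only a sketch of the date math, not Airflow's scheduling code: `latest_completed_interval` is a hypothetical helper, the pickup time is read as 11:30 UTC, and the schedule is assumed to be a fixed 24-hour period in UTC (which ignores daylight-saving shifts in US/Pacific).

```python
from datetime import datetime, timedelta, timezone


def latest_completed_interval(now, first_start, period=timedelta(days=1)):
    """Return (start, end) of the most recent interval that has fully ended by `now`.

    Hypothetical helper for illustration only; not part of Airflow.
    Returns None if no interval has completed yet.
    """
    completed = (now - first_start) // period  # whole periods finished so far
    if completed < 1:
        return None
    end = first_start + completed * period
    return end - period, end


# start_date from the question: 19:30 US/Pacific on 2021-04-13, expressed in UTC
start_date = datetime(2021, 4, 14, 2, 30, tzinfo=timezone.utc)
# moment the scheduler first picked up the DAG (assumed to be 11:30 UTC)
picked_up = datetime(2021, 4, 19, 11, 30, tzinfo=timezone.utc)

interval = latest_completed_interval(picked_up, start_date)
print(interval[0].isoformat(), "->", interval[1].isoformat())
```

With these inputs the script prints 2021-04-18T02:30:00+00:00 -> 2021-04-19T02:30:00+00:00, which is exactly the interval the unexpected run covers.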

I want the DAG to run only when its scheduled time arrives.


Solution

  • This is expected Airflow behavior:

    turn catchup off. [...] When turned off, the scheduler creates a DAG run only for the latest interval.

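    The corresponding example in the Catchup section is similar to yours and explains the behavior in more detail. In your case, the latest interval at deploy time is 2021-04-18T02:30 to 2021-04-19T02:30 UTC; since it has already ended, its run is created and started right away.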

    A dirty workaround I can think of is to set schedule_interval=None and trigger the DAG from cron using the Airflow CLI.
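A minimal sketch of that workaround, reusing the DAG from the question. It assumes the Airflow 2.x CLI (`airflow dags trigger`) and a crontab on the edge device; the crontab line is hypothetical, and default_args is left out so the fragment stands alone.

```python
import pendulum
from airflow import DAG

# With schedule_interval=None the scheduler never creates runs for this DAG on
# its own, so nothing fires when the device boots after a missed slot.
dag = DAG(
    'backup_dag',
    start_date=pendulum.datetime(2021, 4, 13, 19, 30, tz='US/Pacific'),
    description='A data backup pipeline',
    schedule_interval=None,  # no schedule: only externally triggered runs
    catchup=False,
    is_paused_upon_creation=False,
)

# Hypothetical crontab entry on the edge device. Cron uses the system's local
# time zone, assumed here to be US/Pacific:
#   30 19 * * * airflow dags trigger backup_dag
```

Plain cron does not replay jobs that were missed while the machine was off, which is the behavior asked for here. The Airflow scheduler still has to be running on the device for the triggered run to execute.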