airflow, airflow-scheduler

Run DAG at specific time each day


I've read multiple examples about schedule_interval and start_date, and the Airflow docs multiple times as well, and I still can't wrap my head around this:

How do I get my DAG to execute at a specific time each day? E.g. say it's now 9:30 (AM), I deploy my DAG, and I want it to be executed at 10:30.

I have tried


with DAG(
    "test",
    default_args=default_args,
    description= "test",
    schedule_interval = "0 10 * * *",
    start_date = days_ago(0),
    tags = ["goodie"]) as dag:

but for some reason it wasn't run today. I have also tried different start_dates, e.g. start_date = datetime.datetime(2021,6,23), but it does not get executed.

If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time i.e it does not get run today but did run yesterday

Isn't there an easy way to say "I deploy my DAG now, and I want to get it executed with this cron-syntax" (which I assume is what most people want) instead of calculating an execution time, based on start_date, schedule_interval and figuring out, how to interpret it?


Solution

  • If I replace days_ago(0) with days_ago(1) it is behind 1 day all the time

    It's not behind. You are simply confusing Airflow's scheduling mechanism with cron jobs. With cron you just provide a cron expression and the job is scheduled accordingly. This is not how it works in Airflow.

    In Airflow, run times are calculated from start_date + schedule_interval, and Airflow executes the job at the END of the interval. This is consistent with how data pipelines usually work: today you are processing yesterday's data, so a run that covers a given day can only start once that day is over.

    As a rule: NEVER use a dynamic start_date.
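To see why a moving start_date is a problem, here is a minimal sketch in plain Python. It is a simplified model of the "run at the end of the interval" rule, not Airflow's actual scheduler code. The helpers `first_trigger` and `midnight_days_ago` are hypothetical names made up for this illustration; `midnight_days_ago` only mimics what days_ago(n) is assumed to return (midnight, n days back).

```python
from datetime import datetime, timedelta


def first_trigger(start_date, hour=10):
    """Simplified model: take the first daily slot at `hour`:00 on or after
    start_date, then return the END of that one-day interval, which is when
    the first run would be triggered."""
    slot = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if slot < start_date:
        slot += timedelta(days=1)
    return slot + timedelta(days=1)


def midnight_days_ago(now, n):
    """Hypothetical stand-in for days_ago(n): midnight, n days before `now`."""
    return (now - timedelta(days=n)).replace(hour=0, minute=0, second=0, microsecond=0)


# Evaluate the DAG definition at 10:00 on three consecutive days.
for day in (23, 24, 25):
    now = datetime(2021, 6, day, 10, 0)
    dynamic = first_trigger(midnight_days_ago(now, 0))   # start_date moves with `now`
    fixed = first_trigger(datetime(2021, 6, 23, 10, 0))  # start_date stays put
    print(now.date(), "dynamic run due:", dynamic <= now, "| fixed run due:", fixed <= now)
```

In this model the dynamic start_date keeps pushing the first trigger one day into the future, so the run is never due, while the fixed start_date becomes due from 2021-06-24 10:00 onwards.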

    Setting:

    with DAG(
        "test",
        default_args=default_args,
        description= "test",
        schedule_interval = "0 10 * * *",
        start_date = datetime(2021, 6, 23, 10, 0), # 2021-06-23 10:00
        tags = ["goodie"]) as dag:
    

    Means that the first run will start on 2021-06-24 10:00, and this run's execution_date will be 2021-06-23 10:00. The second run will start on 2021-06-25 10:00, and its execution_date will be 2021-06-24 10:00.
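The arithmetic behind those two runs can be sketched with the standard library alone. This is only an illustration of the rule "a run starts one interval after its execution_date"; `daily_runs` is a hypothetical helper, and it assumes a daily schedule whose start_date already falls on a schedule slot (10:00 here).

```python
from datetime import datetime, timedelta


def daily_runs(start_date, count):
    """Return (execution_date, trigger_time) pairs for the first `count` runs
    of a daily schedule, assuming start_date sits exactly on a schedule slot."""
    interval = timedelta(days=1)
    runs = []
    for i in range(count):
        execution_date = start_date + i * interval   # start of the interval
        trigger_time = execution_date + interval     # end of the interval
        runs.append((execution_date, trigger_time))
    return runs


for execution_date, trigger_time in daily_runs(datetime(2021, 6, 23, 10, 0), 2):
    print(f"execution_date={execution_date}  starts at {trigger_time}")
```

The two pairs it produces match the runs described above: the execution_date is always one day earlier than the moment the run actually starts.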

    Since this is a source of confusion for many new users, there is an architecture change in progress, AIP-39 Richer scheduler_interval, which will decouple WHEN to run from WHAT interval to consider for that run. It will be available in Airflow 2.3.0.

    UPDATE for Airflow>=2.3.0: AIP-39 Richer scheduler_interval has been completed and released. It added Timetable support, so you can customize DAG scheduling yourself (see "Customizing DAG Scheduling with Timetables" in the Airflow docs).