I recently started using Apache Airflow and one of its new concept Taskflow API. I have a DAG with multiple decorated tasks where each task has 50+ lines of code. So I decided to move each task into a separate file.
After referring stackoverflow I could somehow move the tasks in the DAG into separate file per task. Now, my question is:
All the code samples I see in the web (and in official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime
default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}
@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
# Task 1
@task()
def Task_A():
logging.info(f"Task A: Received param None")
# Some 100 lines of code
return "A"
# Task 2
@task()
def Task_B(a):
logging.info(f"Task B: Received param {a}")
# Some 100 lines of code
return str(a + "B")
a = Task_A()
ab = Task_B(a)
No_Import_Tasks = No_Import_Tasks()
Sample 2 Folder structure:
- dags
- tasks
- Task_A.py
- Task_B.py
- Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task
@task()
def Task_A():
logging.info(f"Task A: Received param None")
# Some 100 lines of code
return "A"
File Task_B.py
import logging
from airflow.decorators import task
@task()
def Task_B(a):
logging.info(f"Task B: Received param {a}")
# Some 100 lines of code
return str(a + "B")
File Main_Dag.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B
default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}
@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
a = Task_A()
ab = Task_B(a)
Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches - neither from logic nor performance point of view.
The tasks in Airflow share the data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) effectively exchanging data via database (or other external storage). The two tasks in Airflow - does not matter if they are defined in one or many files - can be executed anyway on completely different machines (there is no task affinity in airflow - each task execution is totally separated from other tasks. So it does not matter - again - if they are in one or many Python files.
Performance should be similar. Maybe splitting into several files is very, very little slower but it should totally negligible and possibly even not there at all - depends on the deployment you have the way you distribute files etc. etc., but I cannot imagine this can have any observable impact.