Tags: airflow, airflow-scheduler

Use Airflow for batch processing to dynamically start multiple tasks based on the output of a parent task


I am trying to figure out whether Airflow can express a workflow in which multiple instances of the same task must be started based on the output of a parent task. Since Airflow supports multiple workers, I naively expected that it could orchestrate workflows involving batch processing, but so far I have failed to find any recipe or direction that fits this model. What is the right way to leverage Airflow for a batch-processing workflow like the one below? Assume there is a pool of Airflow workers.

Example of a workflow:

 1. Start Task A to produce multiple files
 2. For each file, start an instance of Task B (which might be another workflow)
 3. Wait for all instances of Task B, then start Task C
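Outside of Airflow, the fan-out/fan-in pattern described above can be sketched in plain Python — this is only an illustration of the dependency structure, not an Airflow DAG; the task names and the fixed file list are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def task_a():
    # Task A: produce multiple "files" (here just their names);
    # the count is only known at runtime.
    return [f"part-{i}.csv" for i in range(4)]

def task_b(filename):
    # Task B: process a single file; one instance runs per file.
    return f"processed:{filename}"

def task_c(results):
    # Task C: runs only after every instance of Task B has finished.
    return sorted(results)

files = task_a()
with ThreadPoolExecutor() as pool:
    # Fan out: one Task B instance per file, executed in parallel.
    results = list(pool.map(task_b, files))
# Fan in: Task C consumes all Task B outputs.
final = task_c(results)
```

The hard part in Airflow is that the number of Task B instances is not known when the DAG is authored, which is exactly what the workaround below addresses.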


Solution

  • As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator is replicated once per partition, and the results can be merged again if needed. For local files, the operator runs the `split` command. On Kubernetes, this works nicely with cluster autoscaling.
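The core of this workaround is that the partition count is fixed when the DAG is defined, so the downstream operator can be replicated statically. A minimal sketch of the split and merge steps, with hypothetical helper names (for local files the actual operator shells out to `split` instead):

```python
def partition(items, num_partitions):
    """Split the input into a fixed number of roughly equal chunks
    (round-robin), analogous to splitting a local file with `split`."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def merge(parts):
    """Recombine per-partition results into a single list
    (the optional merge step downstream)."""
    return [item for part in parts for item in part]

# Each chunk would feed one replica of the downstream operator.
parts = partition(list(range(10)), 3)
combined = merge(parts)
```

Note that round-robin partitioning does not preserve the original ordering after the merge, so the merge step should sort or otherwise reorder if ordering matters.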