Tags: kubernetes, celery, airflow

Is it possible in Airflow to run a single task on multiple worker nodes, i.e., to run a task in a distributed way?


I am using Spring Batch to create a workflow of batch jobs. A single batch job takes about 2 hours to complete (data to be processed: ~1 million), so I decided to run it in a distributed way, with one task spread across multiple worker nodes so it can finish in less time. The other jobs in the workflow (all of which also run in a distributed manner) need to run sequentially, one after another. In short, the jobs are multi-node distributed jobs (master/slave architecture) that must run one after the other.

Now I am considering deploying the workflow on Airflow. While exploring it, I could not find any way to run a single task distributed across multiple machines. Is this possible in Airflow?


Solution

  • Yes, you can create such a task using the Spark framework. Spark lets you process the data across multiple nodes in a distributed fashion.

    You can then use the SparkSubmitOperator to add that task to your DAG, as sketched below.
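
    For illustration, here is a minimal DAG sketch. It assumes a recent Airflow 2.x install with the apache-airflow-providers-apache-spark package, an existing Spark connection named spark_default, and placeholder application paths for your own Spark jobs.

    ```python
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="distributed_batch_workflow",
        start_date=datetime(2023, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        # Each SparkSubmitOperator is a single Airflow task, but the Spark job it
        # submits runs distributed across the executors of your Spark cluster.
        job_one = SparkSubmitOperator(
            task_id="job_one",
            application="/path/to/job_one.py",       # placeholder path
            conn_id="spark_default",
            conf={"spark.executor.instances": "4"},  # fan out across worker nodes
        )

        job_two = SparkSubmitOperator(
            task_id="job_two",
            application="/path/to/job_two.py",       # placeholder path
            conn_id="spark_default",
        )

        # Each job runs in parallel internally on the Spark cluster, while the
        # DAG dependency runs them sequentially, one after another.
        job_one >> job_two
    ```

    Airflow itself only orchestrates here: the distribution happens inside Spark, while the DAG edge (`job_one >> job_two`) gives you the sequential ordering between the jobs that you described.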