Tags: python, python-3.x, airflow, airflow-scheduler

How to run an Airflow DAG with more than 100,000 tasks?


I have an Airflow DAG that has over 100,000 tasks, but I am only able to run up to 1,000 tasks. Beyond that, the scheduler hangs, and the webserver cannot render the tasks and becomes extremely slow in the UI.

I have tried increasing the min_file_process_interval and processor_poll_interval config params.

I have set run_duration to 3600 so that the scheduler restarts every hour.
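
For illustration, this is roughly what that looks like in the `[scheduler]` section of `airflow.cfg` (the values below are just examples of the knobs I am tuning; note that `run_duration` exists in Airflow 1.10.x but was removed in 2.0):

```ini
[scheduler]
# Seconds to wait before re-parsing an unchanged DAG file
min_file_process_interval = 300
# Seconds the scheduler sleeps between scheduling loops
processor_poll_interval = 10
# Restart the scheduler loop every hour (1.10.x only)
run_duration = 3600
```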

Are there any limits I'm hitting on the webserver or scheduler? In general, how do you deal with a large number of tasks in Airflow? Any config settings, etc. would be very helpful.

Also, should I be using SubDagOperator at this scale or not? Please advise.

Thanks,


Solution

  • I was able to run more than 165,000 Airflow tasks!

    But there's a catch: not all of the tasks were scheduled and rendered in a single Airflow DAG.

    The problems I faced when trying to schedule more and more tasks were with the scheduler and the webserver: their memory and CPU consumption increased dramatically as more tasks were scheduled (which is obvious and makes sense). It got to a point where the node couldn't handle it anymore (the scheduler was using over 80 GB of memory for 16,000+ tasks).

    I split the single DAG into two DAGs: one is the leader/master, and the second is the worker DAG.

    I have an Airflow Variable that says how many tasks to process at once (for example, num_tasks=10,000). Since I have over 165,000 tasks, the worker DAG processes 10k tasks at a time, in 17 batches (a sketch of this worker DAG is at the end of this answer).

    All the leader DAG does is trigger the same worker DAG over and over with different sets of 10k tasks and monitor the worker DAG's run status. The first trigger operator triggers the worker DAG for the first set of 10k tasks and waits until that worker run completes; once it's complete, it triggers the same worker DAG with the next batch of 10k tasks, and so on (see the leader-DAG sketch at the end of this answer).

    This way, the worker DAG keeps being reused and never has to schedule more than num_tasks tasks at a time.

    The bottom line is: figure out the maximum number of tasks your Airflow setup can handle, then launch the DAGs in leader/worker fashion, max_tasks at a time, over and over until all the tasks are done.

    Hope this was helpful.
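
For concreteness, here is a minimal sketch of what the worker DAG could look like, assuming Airflow 2.x. The names `worker_dag`, `do_work`, and the `batch_start` conf key are hypothetical, for illustration only. The task count is fixed at parse time by the `num_tasks` Variable, and each run maps its tasks onto the current batch via an offset passed in through `dag_run.conf`:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Read the batch size at parse time; the DAG always has this many tasks.
NUM_TASKS = int(Variable.get("num_tasks", default_var=10_000))

def do_work(index, **context):
    # Map this task's index onto the current batch using the offset
    # that the leader DAG passes in via dag_run.conf.
    offset = (context["dag_run"].conf or {}).get("batch_start", 0)
    item_id = offset + index
    print(f"processing item {item_id}")  # real per-item work goes here

with DAG(
    dag_id="worker_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # only runs when the leader triggers it
    catchup=False,
) as dag:
    for i in range(NUM_TASKS):
        PythonOperator(
            task_id=f"task_{i}",
            python_callable=do_work,
            op_kwargs={"index": i},
        )
```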
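
And a matching sketch of the leader DAG, assuming Airflow 2.1+ where `TriggerDagRunOperator` supports `wait_for_completion` (on older versions you would pair the trigger with a sensor instead). Because each trigger blocks until the previous worker run finishes, at most `num_tasks` tasks are ever scheduled at once:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

TOTAL_TASKS = 165_000  # total units of work to get through
NUM_TASKS = int(Variable.get("num_tasks", default_var=10_000))

with DAG(
    dag_id="leader_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # kicked off manually
    catchup=False,
) as dag:
    previous = None
    # One trigger per batch; each waits for its worker run to finish,
    # so the batches execute strictly one after another.
    for batch_no, start in enumerate(range(0, TOTAL_TASKS, NUM_TASKS)):
        trigger = TriggerDagRunOperator(
            task_id=f"trigger_batch_{batch_no}",
            trigger_dag_id="worker_dag",  # hypothetical worker dag_id
            conf={"batch_start": start},
            wait_for_completion=True,  # block until the worker run completes
            poke_interval=60,  # seconds between status checks
        )
        if previous:
            previous >> trigger
        previous = trigger
```

Note that each waiting trigger occupies a worker slot while it polls, which is fine here since the batches run strictly one at a time.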