I would like to use Airflow for the following use case:
- Compute a daily report for a given website (~150 websites to handle). Each report will be computed as follows:
- A set of tasks that should be run at site level,
- A set of tasks that should be run at page level, each website containing ~10k pages.
- Once both sets of tasks above are performed, a third set of tasks is run to aggregate the results and generate the report.
Note: each Airflow task described here is in fact a simple call to a remote micro-service (a gRPC call).
The design I have in mind so far:
- I initially wanted to perform all of the page-related processing in a single task, in order to keep a simple, well-defined DAG with only a few tasks.
But the processing that needs to be performed on each page is complex, with external dependencies and queues (the next task should only be triggered once notifications are received from external systems, and those notifications may arrive several hours later) => I would like Airflow to handle this process as well.
- Given the point above, I'm now inclined towards a model whereby all the processing for one website is embedded in one DAG, including the page-level tasks. Ideally I would like to use a SubDAG for the page-related tasks, but from what I have read so far this feature is not yet stable.
Each website will generate its own DAG, with its own set of tasks (because the structure of the DAG depends on the number of pages).
The number of tasks per DAG will therefore be quite large (~10k). A rough sketch of this design is shown below.
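To make that concrete, here is a minimal sketch of the per-website DAG generation I have in mind (Airflow 1.x-style imports; `get_websites()`, `get_page_ids()` and `call_service()` are just placeholders for my own helpers and the gRPC calls):

```python
# Rough sketch: one dynamically generated DAG per website, with one task per page.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def get_websites():
    # Placeholder: the real list (~150 sites) would come from a DB or config.
    return ["example_site"]


def get_page_ids(website):
    # Placeholder: ~10k page ids per website in the real case.
    return range(3)


def call_service(endpoint, **kwargs):
    # Placeholder for the gRPC call to the remote micro-service.
    pass


for website in get_websites():
    dag_id = "report_{}".format(website)
    dag = DAG(dag_id, start_date=datetime(2018, 1, 1), schedule_interval="@daily")
    globals()[dag_id] = dag  # register the generated DAG with the scheduler

    site_level = PythonOperator(
        task_id="site_level",
        python_callable=call_service,
        op_kwargs={"endpoint": "site/{}".format(website)},
        dag=dag)

    pages_done = DummyOperator(task_id="pages_done", dag=dag)

    for page_id in get_page_ids(website):
        page_task = PythonOperator(
            task_id="page_{}".format(page_id),
            python_callable=call_service,
            op_kwargs={"endpoint": "page/{}/{}".format(website, page_id)},
            dag=dag)
        page_task >> pages_done

    report = PythonOperator(
        task_id="aggregate_report",
        python_callable=call_service,
        op_kwargs={"endpoint": "report/{}".format(website)},
        dag=dag)

    report.set_upstream([site_level, pages_done])
```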
My questions:
- Is Airflow an acceptable framework for this use case (i.e. have you run similar use cases), or do alternative frameworks such as Luigi, Oozie, etc. present clear advantages in this context?
- Is the approach above (one DAG per website, no SubDAG, page tasks included in the DAG) a sound one? Do you foresee any issues with it?
- Is the web UI still usable with that number of tasks? I did a quick test with a few hundred tasks and got several timeouts; I'm wondering whether that is linked to my configuration or not.
- Is Celery the correct executor backend for this? I'm wondering whether the LocalExecutor would in fact be more appropriate for this use case, given that no computation is performed directly by the Airflow workers (they only call remote services).
Your initial idea was the one I would go with. Having 150 different workflows with 10k tasks each leads to a fully dynamic and unmanageable scenario. On the one hand you say that each task is just a simple gRPC call, but at the same time you mention that the page-level processing is too complex to encapsulate behind a single task, with external dependencies that may cause flow bottlenecks measured in hours.
If I were you I'd redesign the solution and transfer the page-level reporting to a different layer. For example, creating a service that performs all of these complex calculations would be a better option than trying to implement them in Airflow. That way you could probably cut down the number of page-level tasks significantly.
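To illustrate the idea: Airflow would only kick off the page batch for a site and then poll the external service with a sensor until it reports completion, instead of tracking ~10k page tasks itself. A minimal sketch, assuming a hypothetical start_batch()/check_batch_status() gRPC pair (Airflow 1.x-style imports; in 1.10 BaseSensorOperator lives in airflow.sensors.base_sensor_operator):

```python
# Sketch: trigger the page-level batch on an external service, then poll it
# with a sensor instead of modelling ~10k page tasks inside Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


def start_batch(website, **kwargs):
    # Placeholder: gRPC call that enqueues all pages of the site on the service.
    pass


def check_batch_status(website):
    # Placeholder: gRPC call asking the service whether the batch is finished.
    return True


class PageBatchSensor(BaseSensorOperator):
    @apply_defaults
    def __init__(self, website, *args, **kwargs):
        super(PageBatchSensor, self).__init__(*args, **kwargs)
        self.website = website

    def poke(self, context):
        # Re-checked every poke_interval until it returns True or times out.
        return check_batch_status(self.website)


dag = DAG("report_example_site", start_date=datetime(2018, 1, 1),
          schedule_interval="@daily")

trigger = PythonOperator(
    task_id="start_page_batch",
    python_callable=start_batch,
    op_kwargs={"website": "example_site"},
    dag=dag)

wait = PageBatchSensor(
    task_id="wait_for_page_batch",
    website="example_site",
    poke_interval=300,      # poll every 5 minutes
    timeout=12 * 60 * 60,   # external notifications may arrive hours later
    dag=dag)

trigger >> wait
```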
Regarding your specific questions:
- Airflow is use-case agnostic - almost any scenario can work well if the design is right. Oozie is really old-school and cumbersome and lacks the plethora of integration features that Airflow offers. Luigi I haven't used.
- As mentioned earlier, this approach is both unpredictable and unmanageable. I foresee mayhem :)
- Getting a hanging UI is a great indicator that you should revisit your design. But usability should be your #1 concern here: how can you monitor and manage 10,000 tasks in a single workflow? Correct - you can't. And then multiply that by 150.
- I read an article a while ago from a company that experienced issues scaling out with Celery and decided to scale up instead, running many scheduler processes in parallel on the same VM. I'm not quite sure whether that setup would significantly benefit your scenario.
If I were you I'd have a single workflow for all 150 sites. I'd create a SubDAG for each website (by the way, there is no mention of the word 'unstable' in the official docs) and try to offload the complex calculation work to a different layer in order to cut down the number of page-level tasks as much as possible. A minimal sketch of that layout is shown below.
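Something along these lines is what I mean (Airflow 1.x style; `call_service()` is a placeholder for your gRPC calls, and each site's page work is collapsed into a single call to that external layer):

```python
# Sketch: one parent DAG, one SubDagOperator per website, page-level heavy
# lifting delegated to an external service so each subdag stays small.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator

START_DATE = datetime(2018, 1, 1)
SCHEDULE = "@daily"
WEBSITES = ["site_a", "site_b"]  # ~150 in practice


def call_service(endpoint, **kwargs):
    # Placeholder for the gRPC calls.
    pass


def build_site_subdag(parent_dag_id, website):
    # The subdag's dag_id must be "<parent_dag_id>.<task_id>".
    subdag = DAG("{}.{}".format(parent_dag_id, website),
                 start_date=START_DATE, schedule_interval=SCHEDULE)
    site = PythonOperator(task_id="site_level", python_callable=call_service,
                          op_kwargs={"endpoint": "site/" + website}, dag=subdag)
    pages = PythonOperator(task_id="page_batch", python_callable=call_service,
                           op_kwargs={"endpoint": "pages/" + website}, dag=subdag)
    report = PythonOperator(task_id="report", python_callable=call_service,
                            op_kwargs={"endpoint": "report/" + website}, dag=subdag)
    report.set_upstream([site, pages])
    return subdag


main_dag = DAG("daily_reports", start_date=START_DATE, schedule_interval=SCHEDULE)

for website in WEBSITES:
    SubDagOperator(
        task_id=website,
        subdag=build_site_subdag("daily_reports", website),
        dag=main_dag)
```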