My team has a monolithic service for a small-scale project, but as part of a re-architecture and scaling effort we are planning to move to AWS. For orchestration, we are evaluating whether to run Luigi as a container task or to use AWS Step Functions instead. I don't have experience with either of them, especially Luigi. Can anyone point out issues they have seen with Luigi, or ways in which it might prove better than Step Functions, if at all? Any other suggestions are welcome.
Thanks in advance.
I can't speak to how AWS handles orchestration, but if you are planning to scale to at least thousands of jobs at any point, I would not recommend investing in Luigi. Luigi is extremely useful for small to medium(ish) projects. It provides a fantastic interface for defining jobs and for ensuring job completion through atomic filesystem actions. However, the problem with Luigi is the framework for actually running jobs. Luigi requires constant communication with its workers for them to run, which in my own experience saturated my network bandwidth when I tried to scale.
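To illustrate that interface, here is a minimal sketch of a Luigi task (the task and file names are made up); the atomic write-then-rename on the output target is what gives you the completion guarantee:

```python
import luigi

class CountLines(luigi.Task):
    """Made-up example task: counts the lines in an input file."""
    path = luigi.Parameter()

    def output(self):
        # LocalTarget.open('w') writes to a temp file and atomically
        # renames it on close, so the output exists only if run()
        # finished completely.
        return luigi.LocalTarget(self.path + '.count')

    def run(self):
        with open(self.path) as infile:
            n = sum(1 for _ in infile)
        with self.output().open('w') as outfile:
            outfile.write(str(n))

if __name__ == '__main__':
    luigi.build([CountLines(path='data.txt')], local_scheduler=True)
```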
For my research, I generate networks of 10,000 tasks in a light-to-medium workflow on my university's cluster computing grid, which runs SLURM. None of my tasks takes long to complete, maybe 5 minutes at most. I tried the following three methods to use Luigi efficiently.
The first was SciLuigi's slurm task, used to submit jobs to SLURM from a single central Luigi worker (not using the central scheduler). This method works well if your jobs are accepted and run quickly. However, it uses an unreasonable amount of resources on the scheduling node, since each worker is a separate process, and it destroys any scheduling priority you would otherwise have in the system. A better method is to allocate many workers first and then have them work on jobs continuously.
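For reference, that first setup looked roughly like this; a sketch assuming SciLuigi's SlurmTask/SlurmInfo interface as I remember it, with made-up task names and SLURM settings:

```python
import sciluigi as sl

class ProcessChunk(sl.SlurmTask):
    """Made-up analysis step, executed through SLURM."""
    chunk_id = sl.Parameter()

    def out_result(self):
        return sl.TargetInfo(self, 'results/chunk_%s.txt' % self.chunk_id)

    def run(self):
        # self.ex() runs the command through SLURM in HPC runmode
        self.ex('analyze --chunk %s > %s'
                % (self.chunk_id, self.out_result().path))

class Experiment(sl.WorkflowTask):
    def workflow(self):
        return [
            self.new_task(
                'chunk_%d' % i, ProcessChunk, chunk_id=str(i),
                slurminfo=sl.SlurmInfo(
                    runmode=sl.RUNMODE_HPC,
                    project='myproject',   # SLURM account (placeholder)
                    partition='general',   # partition (placeholder)
                    cores='1',
                    time='10:00',
                    jobname='chunk_%d' % i,
                    threads='1'))
            for i in range(10000)]

if __name__ == '__main__':
    sl.run_local(main_task_cls=Experiment)
```

Every `new_task` here becomes its own process on the scheduling node, which is exactly where the resource problem comes from.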
The second method I attempted was exactly that. I started the Luigi central scheduler on my home server (because otherwise I could not monitor the state of the work, the same limitation as in the workflow above) and started workers on the SLURM cluster that all shared the same configuration, so any of them could run any part of the experiment. The problem was that even with a 500 Mbps connection, Luigi would stop functioning past roughly 50 workers, and so would my internet connection to the server. So I began running jobs with only 50 workers, which drastically slowed my workflow. On top of that, each worker had to register each task with the central scheduler (another huge pain point), which could take hours with only 50 workers.
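Each SLURM job ran a script along these lines (the hostname and module name are placeholders); the central scheduler itself was just `luigid` running on my home server:

```python
# worker_node.py -- run by each SLURM job; every node uses the same
# configuration, so any worker can pick up any pending task.
import luigi
from mypipeline import Experiment  # placeholder module with the task graph

luigi.build(
    [Experiment()],
    workers=4,                              # worker processes on this node
    scheduler_host='myserver.example.com',  # home server running `luigid`
    scheduler_port=8082,                    # luigid's default port
)
```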
To mitigate this startup time, I partitioned the root task's subtrees by their parameters and submitted each partition to SLURM separately. Now the startup time is reasonably low, but I have lost the ability for any worker to run any job, which is still pretty important. I am also still limited to ~50 workers. When the subtrees completed, I ran one last job to finish the experiment.
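The partitioning itself can be as simple as a WrapperTask that selects a slice of the graph by parameter; a rough sketch, with a made-up leaf task standing in for the real work:

```python
import luigi

class Chunk(luigi.Task):
    """Made-up leaf task standing in for one of the 10,000 jobs."""
    chunk_id = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget('results/chunk_%d.done' % self.chunk_id)

    def run(self):
        with self.output().open('w') as f:
            f.write('done')

class Partition(luigi.WrapperTask):
    """One slice of the root task's subtrees, selected by parameter."""
    part = luigi.IntParameter()
    n_parts = luigi.IntParameter(default=20)

    def requires(self):
        # Each SLURM submission builds only its own slice, so each
        # worker group registers 10000 / n_parts tasks instead of 10000.
        return [Chunk(chunk_id=i)
                for i in range(self.part, 10000, self.n_parts)]
```

Each SLURM submission then builds one `Partition` value, with something like `luigi --module partition_flow Partition --part 3 --local-scheduler`.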
In conclusion, Luigi is great for small to medium workflows, but once you start hitting 1,000+ tasks and workers, the framework quickly fails to keep up. I hope my experience provides some insight into the framework.