How to only allow a specific machine to run a task in Luigi


Machine A has the ability to access a SQL database and Machine B has the ability to access Google Drive. How do I make sure that a task is run on the correct machine if UploadToDrive depends on DownloadSQLData somewhere down the line?

Currently Machine A runs DoSomethingElseWithData and Machine B runs UploadToDrive a few minutes later. This is fine up until the point where one day Machine A might not be working, at which point Machine B will attempt DownloadSQLData as an upstream dependency and fail.

import luigi

class DownloadSQLData(luigi.Task):

    # ...

    def run(self):
        # Only Machine A can do this
        # ...
        pass

class TransformData(luigi.Task):

    # ...

    def requires(self):
        return DownloadSQLData(date=self.date)

class UploadToDrive(luigi.Task):

    # ...

    def requires(self):
        return TransformData(date=self.date)

    def run(self):
        # Only Machine B can do this
        # ...
        pass

class DoSomethingElseWithData(luigi.Task):

    # ...

    def requires(self):
        return TransformData(date=self.date)

The SQL database from this example is, in reality, not a SQL database but an old system within our company. It does not fail gracefully when unauthorised users try to access it and we'd like to avoid any attempts from Machine B to do so.


Solution

  • Luigi does not itself assign tasks to particular machines or start them at a particular time; its central scheduler only coordinates workers so that the same task is not run twice at once. That being said, there are several ways to achieve what you want.

    Solution 1: Introduce a machine C that has access to both machines A and B. Using one of several SSH tools (https://wiki.python.org/moin/SecureShell), machine C could run tasks that retrieve the data from A, transform it on C, and then transfer the result to B for uploading.
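A rough sketch of what machine C might run for Solution 1, using only the standard library and the `ssh`/`scp` command-line tools. The host names, file paths, and the three script names (`download_sql_data.py`, `transform_data.py`, `upload_to_drive.py`) are all hypothetical placeholders, not something from the question. Because the steps run in order and stop at the first failure, machine B is only ever asked to upload and never touches the legacy system.

```python
import subprocess

# Hypothetical host names; replace with the real addresses of machines A and B.
MACHINE_A = "machine-a"
MACHINE_B = "machine-b"


def build_steps(date):
    """Return the ordered commands machine C runs for one date."""
    # Hypothetical file locations for the raw and transformed data.
    raw = f"/tmp/raw_{date}.csv"
    clean = f"/tmp/clean_{date}.csv"
    return [
        # 1. Machine A is the only host that queries the legacy system.
        ["ssh", MACHINE_A, f"python download_sql_data.py --date {date} --out {raw}"],
        # 2. Pull the raw extract from A to C.
        ["scp", f"{MACHINE_A}:{raw}", raw],
        # 3. Transform locally on C.
        ["python", "transform_data.py", "--in", raw, "--out", clean],
        # 4. Push the transformed file from C to B.
        ["scp", clean, f"{MACHINE_B}:{clean}"],
        # 5. Machine B is the only host that talks to Google Drive.
        ["ssh", MACHINE_B, f"python upload_to_drive.py --file {clean}"],
    ]


def run_pipeline(date, runner=subprocess.run):
    """Run each step in order; check=True makes a failing step abort the rest."""
    for step in build_steps(date):
        runner(step, check=True)
```

If machine A is down, the first `ssh` step fails, `subprocess.run(..., check=True)` raises `CalledProcessError`, and nothing is attempted on machine B.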

    Solution 2: This solution is most likely too much work and/or infeasible. Set up machines A, B, and C under a cluster workload manager (something like Slurm, https://www.schedmd.com/), with C as the head node, and declare A and B as particular types of resource (possibly SQL and GDrive). Then, from C, submit the Luigi tasks as Slurm jobs (https://github.com/pharmbio/sciluigi can help with this). Each Slurm job should specify the resources it needs. And that's it!
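To make Solution 2 a little more concrete, here is a minimal sketch of how machine C could build the `sbatch` submissions by hand. It assumes an administrator has tagged node A with a Slurm feature named `sql` and node B with one named `gdrive` (both names are made up here), that the tasks live in a module called `pipeline` (also hypothetical), and that task outputs sit on storage both nodes can see. It does not use sciluigi; it only shows the plain Slurm flags involved.

```python
import shlex

# Hypothetical Slurm feature names assigned to nodes A and B by an administrator.
SQL_FEATURE = "sql"
GDRIVE_FEATURE = "gdrive"


def sbatch_command(task, date, feature, module="pipeline", after_job=None):
    """Build an sbatch argv that runs one Luigi task on a node with `feature`."""
    # The Luigi command line that the Slurm job will execute.
    luigi_cmd = " ".join(
        shlex.quote(part)
        for part in ["luigi", "--module", module, task, "--date", date]
    )
    # --parsable makes sbatch print just the job id, which is easy to capture.
    cmd = ["sbatch", "--parsable", f"--constraint={feature}"]
    if after_job is not None:
        # Start only once the upstream job has finished successfully.
        cmd.append(f"--dependency=afterok:{after_job}")
    cmd.append(f"--wrap={luigi_cmd}")
    return cmd
```

On machine C, one might submit `sbatch_command("DownloadSQLData", date, SQL_FEATURE)` with `subprocess.run`, read the job id from its standard output, and pass that id as `after_job` when submitting `UploadToDrive` with `GDRIVE_FEATURE`. With `afterok`, the upload job does not start if the download job failed.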