Short Version:
Is there a task scheduler in Python that can do what gmake does? In particular, I need a task scheduler that recursively resolves dependencies. I looked into Luigi, but it seemed only resolve direct dependencies.
Long version:
I'm trying to build a workflow that processes a lot of data files in a pre-defined order, the later tasks may directly depend on the output of some ealier tasks, but in turn, the correctness of these outputs relies on even ealier tasks.
For example, let's consider a dependency map like follows:
A <- B <- C
When I request the result from task C, Luigi will automatically schedule B, and then since B depends on A, it will schedule A. So the final run order will be [A, B, C]. Each task will create an offical output file as a sign of successful execution. This is fine for the first run.
Now, suppose that I made a mistake in the input data for task A. Apparently, I need to rerun the whole chain again. However, simply deleting the output file from A won't work. Because Luigi sees the output of B and C, and concludes that the requirement for Task C has been fulfilled, and no runs are needed. I have to delete the output files from ALL the tasks that depend on A so they will be run again. In the simple case, I have to delete all the output files from A, B and C, so that Luigi will detect the change made to A.
This is a very unconvenient feature. If I have tens or hundreds of tasks that have rather complicated dependencies on each other, it will be really hard to tell which ones are affected when one of the tasks needs to be re-run. For a task scheduler and has the ability of resolving dependencies, I would expect Luigi to be able to act similar to GNU-Make, where dependencies are checked recursively, and the final target will be rebuilt when one of the deepest source files is changed.
I was wondering if someone can provide some suggestions on this issue. Am I missing some key features in Luigi? Are there other task schedulers that do act as gmake? I'm particularly interested in Python based packages, and would prefer those support Windows.
Thanks a lot!
It seems possible by overriding the complete method for your tasks. You'd have to apply this all the way down through the dependency graph.
def complete(self):
outputs = self.flatten(self.output())
if not all(map(lambda output: output.exists(), outputs)):
return False
for task in self.flatten(self.requires()):
if not task.complete():
for output in outputs:
if output.exists():
output.remove()
return False
return True