Tags: dask, dask-distributed, dask-delayed

How can I combine sequential as well as parallel execution of delayed function calls?


I am stuck in a strange place. I have a bunch of delayed function calls that I want to execute in a certain order. While executing in parallel is trivial:

res = client.compute([myfuncs])
res = client.gather(res)

I can't seem to find a way to execute them in sequence, in a non-blocking way.

Here's a minimal example:

import numpy as np
from time import sleep
from datetime import datetime

from dask import delayed
from dask.distributed import LocalCluster, Client


@delayed
def dosomething(name):
    res = {"name": name, "beg": datetime.now()}
    sleep(np.random.randint(10))
    res.update(rand=np.random.rand())
    res.update(end=datetime.now())
    return res


seq1 = [dosomething(name) for name in ["foo", "bar", "baz"]]
par1 = dosomething("whaat")
par2 = dosomething("ahem")
pipeline = [seq1, par1, par2]

Given the above example, I would like to run seq1, par1, and par2 in parallel, but execute the constituents of seq1 ("foo", "bar", and "baz") in sequence.


Solution

  • You could definitely cheat and add an optional dependency to your function as follows:

    @delayed
    def dosomething(name, *args):
        ...
    

    That way you can make tasks depend on one another, even though you don't use one task's result in the next run of the function:

    inputs = ["foo", "bar", "baz"]
    seq1 = [dosomething(inputs[0])]
    for bit in inputs[1:]:
        # pass the previous task as an (ignored) extra argument, so each
        # task only starts once its predecessor has finished
        seq1.append(dosomething(bit, seq1[-1]))
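
    For completeness, here is a minimal sketch of how the chained seq1 fits back into the original pipeline, assuming a LocalCluster/Client as in the question: the scheduler runs the three chained tasks one after another, while par1 and par2 run in parallel with them.

    from dask.distributed import LocalCluster, Client

    client = Client(LocalCluster())

    par1 = dosomething("whaat")
    par2 = dosomething("ahem")

    # compute everything at once: the seq1 chain is serialised by its
    # dependencies, while par1 and par2 run in parallel alongside it
    futures = client.compute(seq1 + [par1, par2])
    results = client.gather(futures)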
    

    Alternatively, you can read about the distributed scheduler's "futures" interface, which lets you monitor the progress of tasks in real time; a sketch follows below.
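
    Here is a minimal sketch of the same sequential chain using futures, assuming a running Client. client.submit returns a Future immediately (non-blocking), and passing the previous Future as an extra argument makes each task wait for its predecessor; task here is a hypothetical plain (non-delayed) version of dosomething.

    import numpy as np
    from time import sleep
    from datetime import datetime
    from dask.distributed import Client, as_completed

    client = Client()  # e.g. Client(LocalCluster())

    def task(name, *args):  # plain function: futures use submit, not @delayed
        res = {"name": name, "beg": datetime.now()}
        sleep(np.random.randint(10))
        res.update(rand=np.random.rand(), end=datetime.now())
        return res

    futures = [client.submit(task, "foo")]
    for name in ["bar", "baz"]:
        # each submission depends on the previous Future, so the tasks
        # run in sequence without blocking the submitting process
        futures.append(client.submit(task, name, futures[-1]))

    # watch tasks finish in real time
    for fut in as_completed(futures):
        print(fut.result()["name"], "done")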