
Dask: delayed vs futures and task graph generation


I have a few basic questions on Dask:

  1. Is it correct that I have to use Futures when I want to use Dask for distributed computations (i.e. on a cluster)?
  2. In that case, i.e. when working with Futures, are task graphs still the way to reason about computations? If yes, how do I create them?
  3. How can I generally, i.e. no matter if working with a future or with a delayed, get the dictionary associated with a task graph?

As an edit: My application is that I want to parallelize a for loop, either on my local machine or on a cluster (i.e. the same code should also work on a cluster).

As a second edit: I think I am also somewhat unclear regarding the relation between Futures and delayed computations.

Thx


Solution

1) More or less. If you're sending work over a network, you need some way of asking the machine doing the computing how the number-crunching is coming along, and a Future represents more or less exactly that. Strictly speaking, though, Futures aren't the only option on a cluster: a Delayed graph can also be executed on the distributed scheduler by creating a dask.distributed.Client and calling .compute() (see point 4 below).
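A minimal sketch of the Future workflow, assuming dask.distributed is installed; LocalCluster is just a stand-in here, and on a real cluster you'd hand Client the scheduler's address instead:

```python
from dask.distributed import Client, LocalCluster

def inc(x):
    return x + 1

if __name__ == "__main__":
    cluster = LocalCluster()          # stand-in for a real deployment
    client = Client(cluster)          # or Client("tcp://scheduler:8786")

    future = client.submit(inc, 10)   # work starts immediately
    print(future.result())            # block until the result arrives -> 11

    client.close()
    cluster.close()
```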

2) No. With Futures you're executing functions eagerly: computations are spun up as soon as you submit them, and you then wait for the results to come back (from another thread/process locally, or from a remote worker you've offloaded the job onto). The relevant abstraction here is a queue (a priority queue, specifically), not a task graph.
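To make that concrete for the for loop from the question, a hedged sketch (process is a hypothetical stand-in for the per-iteration work): each iteration is submitted eagerly, and as_completed hands results back in whatever order they finish:

```python
from dask.distributed import Client, as_completed

def process(i):
    return i * i              # hypothetical per-iteration work

if __name__ == "__main__":
    client = Client()         # local cluster by default; pass an address for a remote one
    futures = [client.submit(process, i) for i in range(10)]
    for fut in as_completed(futures):   # results arrive in completion order
        print(fut.result())
    client.close()
```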

3) For a Delayed instance you can do some_delayed.dask, and for an Array, some_array.dask; wrap the result in dict() if you want a plain dictionary. More generally, every dask collection implements the __dask_graph__() protocol method, so dict(x.__dask_graph__()) should work regardless of which API produced x. I haven't checked that this holds for every single API, though (I would assume so, but you know what they say about assuming).
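For instance, with a toy delayed function (add is just an illustration):

```python
import dask

@dask.delayed
def add(a, b):
    return a + b

total = add(add(1, 2), 3)     # builds a three-task graph, runs nothing

# .dask holds the graph; __dask_graph__() is the generic protocol hook
print(dict(total.__dask_graph__()))   # {key: task, ...}
print(total.compute())                # 6
```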

4) The simplest analogy would probably be: Delayed is essentially a lazy, generator-style wrapper over a function (a graph gets built up, but nothing runs until you ask for the result); Future is essentially an eager async/await-style wrapper over a function (the work starts the moment you submit it, and you wait on the result).
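To tie points 1 and 4 together, here is the lazy counterpart of the loop sketch above: a Delayed graph runs on a cluster just fine, and client.compute turns it into a Future when you want eagerness back:

```python
import dask
from dask.distributed import Client

@dask.delayed
def process(i):
    return i * i              # same hypothetical per-iteration work

if __name__ == "__main__":
    client = Client()                        # local or remote cluster
    lazy = [process(i) for i in range(10)]   # nothing has run yet
    total = dask.delayed(sum)(lazy)          # still just a task graph

    future = client.compute(total)           # graph -> Future, starts running
    print(future.result())                   # 285
    client.close()
```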