Search code examples
daskdask-distributeddask-delayeddask-dataframe

Is it better to `compute` for control flow or build a fully-`delayed` task graph?


I have an existing Pandas codebase and have just started trying to convert it to Dask. I am still trying to wrap my head around Dask dataframe, delayed, and distributed. From reading over the dask.delayed docs, it seems like the ideal case would be to build up a task/computation graph for the entire set of operations I want to do, including delayed functions for user messages, and then running all computations in one large chunk at the end. That way, the calling process wouldn't need to keep running while the Dask cluster performs the actual work.

The problem that I've been facing, though, is that there seem to be situations where this is not feasible, particular when it comes to Python control flow. For example:

df = dd.read_csv(...)
if df.isnull().any():
    # early exit
    raise ValueError()
df = some(df)
df = more(df)
df = calculations(df)
# and potentially more complex control flow

I don't really see how something like that can be done without calling df.isnull().any().compute().

I also don't know right now whether there's anything 'bad' (counter to best practices) about calling compute() or persist() in a script. When looking at a lot of the examples online, they seem to be based on an experimental/Jupyter-based environment, where load -> preparation -> persist() -> experimentation seems to be the standard approach. Since I have a relatively linear set of operations (load -> op1 -> op2 -> ... -> opn -> save), I thought that I should try to simply schedule tasks without doing any computation as quickly as possible and avoid compute/persist, which I now feel has led me into a bit of a dead end.

So to summarise I guess I have two questions I would like answered, the first being 'is it bad to use compute?', and the second being 'if yes, how can I avoid compute but still have good & readable control flow?'.


Solution

  • It is totally ok to call compute whenever you need a concrete value. Control flow is an excellent example of this.

    You might want to call .persist() first on the main trunk of your computation and then call .compute() for the control flow bits, just to make sure that you don't repeat the load -> op1 -> op2 -> ... parts of your computation.