dask - how to set a local distributed scheduler as the default scheduler for dask.dataframe?


I am trying to make dask.dataframe use a local distributed scheduler by default, but the Dask documentation does not make it clear how to do that. Does something like the code below suffice?

import dask  # needed for dask.config.set below
from dask import distributed
from dask import dataframe as dd

client = distributed.Client(processes=True)  # use multiprocessing
dask.config.set(scheduler=client)

dd.merge(df1, df2, on='some_col')


Solution

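  • Yes, it does. Creating a distributed Client of any sort registers it as the default scheduler for all subsequent Dask computations (the Client's set_as_default argument defaults to True), so the explicit dask.config.set(scheduler=client) call is not required.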