I was trying to make dask.dataframe use a local distributed scheduler by default, but it's not clear to me from reading the Dask documentation how to do that. Does something like the below suffice?
import dask
from dask import distributed
from dask import dataframe as dd

client = distributed.Client(processes=True)  # multiprocessing-backed local cluster
dask.config.set(scheduler=client)
dd.merge(df1, df2, on='some_col')
Yes, it does, and the `dask.config.set` line is not even necessary: if you create a distributed Client of any sort, it automatically becomes the default scheduler for all subsequent Dask computation.