Tags: jupyter-notebook, parallel-processing, cluster-computing, dask, distributed-computing

How to fetch data from one Dask cluster to another?


I created the first cluster in my Jupyter notebook with:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(name='clus1',n_workers=1,dashboard_address='localhost:8789')
client = Client(cluster)

Then I read my data using pandas and performed some preprocessing.

After that, I created a second cluster in a second Jupyter notebook:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(name='clus2',n_workers=1,dashboard_address='localhost:8790')
client = Client(cluster)

Now I want to fetch the data from one cluster into the other.

Is there any way to do this?


Solution

  • As noted in the comment by @mdurant, another option (if appropriate for the problem at hand) is to re-use the same cluster:

    from dask.distributed import Client, LocalCluster
    cluster = LocalCluster(name='clus1',n_workers=1,dashboard_address='localhost:8789')
    client = Client(cluster)
    client.write_scheduler_file('tmp_scheduler.dask')
    

    Then, in any other notebook, you can connect to the same cluster via the scheduler file:

    from dask.distributed import Client
    client = Client(scheduler_file='tmp_scheduler.dask')
    

    This obviates the need to transfer data between clusters, since all the data lives on the same cluster.
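Once both notebooks talk to the same scheduler, a published dataset is one way to hand data from one client to the other. A minimal sketch, assuming a single shared `LocalCluster` (the dataset name `shared` and the use of threads instead of worker processes are illustrative choices, not from the original post):

```python
from dask.distributed import Client, LocalCluster

# One shared cluster; threads keep this sketch self-contained,
# and dashboard_address=None disables the dashboard.
cluster = LocalCluster(name='clus1', n_workers=1,
                       processes=False, dashboard_address=None)

# First client (e.g. notebook 1) publishes the preprocessed data
# under a name that both notebooks agree on.
producer = Client(cluster)
producer.publish_dataset(shared=[1, 2, 3])

# Second client (e.g. notebook 2) connects to the *same* scheduler --
# via the scheduler file, or, as here, the scheduler address --
# and retrieves the data by name.
consumer = Client(cluster.scheduler_address)
data = consumer.get_dataset('shared')
print(data)
```

Published datasets live on the scheduler until unpublished, so they survive individual client disconnects, which makes them a good fit for passing results between independent notebooks.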