Tags: jupyter-notebook, parallel-processing, cluster-computing, dask, distributed-computing

How to fetch data from one Dask cluster to another?


I created the first cluster in my Jupyter notebook with:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(name='clus1',n_workers=1,dashboard_address='localhost:8789')
client = Client(cluster)

Then I read my data using pandas and performed some preprocessing.

After that, I created a second cluster in a second Jupyter notebook:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(name='clus2',n_workers=1,dashboard_address='localhost:8790')
client = Client(cluster)

Now I want to fetch the data from one cluster into the other.

Is there any way to do this?


Solution

  • As noted in the comment by @mdurant, another option (if appropriate for the problem at hand) is to re-use the same cluster:

    from dask.distributed import Client, LocalCluster
    cluster = LocalCluster(name='clus1',n_workers=1,dashboard_address='localhost:8789')
    client = Client(cluster)
    client.write_scheduler_file('tmp_scheduler.dask')
    

    Then, in any other notebook, you can connect to the same cluster via the scheduler file:

    from dask.distributed import Client
    client = Client(scheduler_file='tmp_scheduler.dask')
    

    This obviates the need to transfer data between clusters, since all the data lives on the same cluster.
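Once both notebooks talk to the same scheduler, a published dataset is one way to hand data from one client to the other. A minimal sketch, assuming a single shared `LocalCluster` (the dataset name `shared` and the use of threads instead of worker processes are illustrative choices, not from the original post):

```python
from dask.distributed import Client, LocalCluster

# One shared cluster; threads keep this sketch self-contained,
# and dashboard_address=None disables the dashboard.
cluster = LocalCluster(name='clus1', n_workers=1,
                       processes=False, dashboard_address=None)

# First client (e.g. notebook 1) publishes the preprocessed data
# under a name that both notebooks agree on.
producer = Client(cluster)
producer.publish_dataset(shared=[1, 2, 3])

# Second client (e.g. notebook 2) connects to the *same* scheduler --
# via the scheduler file, or, as here, the scheduler address --
# and retrieves the data by name.
consumer = Client(cluster.scheduler_address)
data = consumer.get_dataset('shared')
print(data)
```

Published datasets live on the scheduler until unpublished, so they survive individual client disconnects, which makes them a good fit for passing results between independent notebooks.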