To better understand Dask, I decided to set up a small Dask cluster: two servers with 32 GB of RAM each, plus a Mac. All are on the same LAN, and all run identical versions of Python 3.5 with Dask installed in a virtual environment. I installed sshfs on both servers to share the data between workers. I was able to start dask-scheduler on 192.168.2.149 and four dask-workers on 192.168.2.26.
What I need help with is a conceptual understanding of the topology, so I can fully benefit from Dask's distributed architecture. I run my experiments on my Mac, which is part of the LAN. I have a 20 GB CSV I need to load into pandas, so I run my Python code locally. In my code, I set up a Dask client that uses the dask-scheduler:
client = Client('192.168.2.149:8786')
then I try to load the large csv like this:
df = dd.read_csv("exp3_raw_data.csv", sep="\t")
The CSV exists only on my Mac, so the dask-workers know nothing about it. If I move the CSV to the directory shared via sshfs, then how would my Mac reference that CSV?
Any help is appreciated.
If I move the csv to the directory shared via sshfs, then how would my mac reference that csv?
You will have to use a path that is uniformly available to your client and to all Dask workers. Dask will not move your files around for you; it expects them to be accessible at the same location from every machine.
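For example, one way to make the file uniformly addressable is to mount the same sshfs export at the same absolute path on every machine. The directory layout and username below are hypothetical, just to illustrate the pattern:

```shell
# Hypothetical layout: the file lives on 192.168.2.149 under /srv/dask-data.
# Mount it at the SAME absolute path on the client and on the worker host.

# On the Mac (client):
mkdir -p /mnt/dask-data
sshfs user@192.168.2.149:/srv/dask-data /mnt/dask-data

# On 192.168.2.26 (where the four dask-workers run):
mkdir -p /mnt/dask-data
sshfs user@192.168.2.149:/srv/dask-data /mnt/dask-data
```

With that in place, `dd.read_csv("/mnt/dask-data/exp3_raw_data.csv", sep="\t")` refers to the same file no matter which machine resolves the path.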
It's more common to run Dask alongside a network file system (NFS, sshfs, etc.) that all workers can see, mounted at the same path on every machine.