I want to run dask in a distributed environment (HPC cluster style).
After preparing the array, I call the .persist()
method, which should distribute the array across the cluster.
However, I would like to know, dynamically, where each block is physically located (i.e. on which node). I haven't found a method for this... have I missed something obvious?
Have you had a look at client.who_has() and/or client.has_what()?
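A minimal sketch of the two calls on a local cluster (variable names are my own); who_has maps each task key to the workers holding its result, and has_what is the inverse view:

from dask.distributed import Client
import dask.array as da

client = Client()  # local cluster, just for illustration

x = da.random.random(100000, chunks=25000).persist()

# who_has: task key -> worker address(es) holding the result
print(client.who_has(x))

# has_what: the inverse view, worker address -> keys held on that worker
print(client.has_what())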
Personally, the data locality page of the docs helped me.
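One related thing covered there: you can influence locality yourself by restricting where Dask places data via the workers= keyword. A minimal sketch, picking a real worker address from the scheduler rather than hard-coding one:

from dask.distributed import Client
import dask.array as da

client = Client()
a = da.random.random(100000, chunks=25000)

# pick one of the currently connected workers
worker = next(iter(client.scheduler_info()["workers"]))

# restrict the persisted chunks to that worker
f_a = client.persist(a, workers=worker)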
And this page outlines the differences between compute and persist.
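The short version, as a sketch: compute pulls the result back into the local process as a concrete object, while persist leaves it as a dask collection whose chunks stay in the workers' memory.

from dask.distributed import Client
import dask.array as da

client = Client()
a = da.random.random(100000, chunks=25000)

total = a.sum().compute()  # compute: a plain Python float, materialised locally
a_p = a.persist()          # persist: still a dask.array, but backed by cluster memory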
It may also be possible with publish_dataset(), but I don't have experience with that function.
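Untested on my side, but roughly how publish_dataset is meant to be used according to the docs: it registers the data under a name on the scheduler, so any client connected to it can retrieve the same persisted data later.

from dask.distributed import Client
import dask.array as da

client = Client()
a = da.random.random(100000, chunks=25000).persist()

client.publish_dataset(my_array=a)        # "my_array" is an arbitrary name I chose
a_again = client.get_dataset("my_array")  # any client on this scheduler can fetch it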
You can easily check in which worker's memory the object lives:
from dask.distributed import Client
import dask.array as da

c_ = Client()                 # connect to (or start) a cluster
a = da.random.random(100000)  # lazy dask array, nothing computed yet
f_a = a.persist()             # compute and keep the result in worker memory
c_.who_has(f_a)               # map each key to the worker(s) holding it
| Key | Copies | Workers |
|---|---|---|
| ('random_sample-c56488914f65fdea0c70600b46d3cb24', 0) | 1 | tcp://127.0.0.1:53074 |
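Since who_has returns an ordinary mapping of task key to worker addresses, you can also inspect it programmatically. A sketch reusing c_ from above, with the array split into several blocks so that each block gets its own key:

a = da.random.random(100000, chunks=25000)  # 4 blocks instead of 1
f_a = a.persist()

# one entry per block: key -> tuple of worker addresses holding a copy
for key, workers in c_.who_has(f_a).items():
    print(key, "->", workers)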