Know the "physical" locations of the blocks of a dask array after `.persist()`

I want to run dask in a distributed environment (HPC cluster style).

After preparing the array, I run the .persist() method which should hopefully distribute the array across the cluster.

However, I would like to know, dynamically, where each block is physically located (i. e. in which node). I haven't found the method... have I missed something obvious?

Solution

Have you had a look at: client.who_has() and/or client.has_what()?

Personally, the data locality side helped me. And this side outlines the differences between compute and persist again.

Maybe it is also possible with publish_dataset(), but I do not have experience with the function.

You can easily check at which worker's memory the object is, with:

from dask.distributed import Client
import dask.array as da

c_ = Client()
a = da.random.random(100000)
f_a = a.persist()
c_.who_has(f_a)

Key	Copies	Workers
('random_sample-c56488914f65fdea0c70600b46d3cb24', 0)	1	tcp://127.0.0.1:53074