Search code examples
daskdask-distributed

Know the "physical" locations of the blocks of a dask array after `.persist()`


I want to run dask in a distributed environment (HPC cluster style).

After preparing the array, I run the .persist() method which should hopefully distribute the array across the cluster.

However, I would like to know, dynamically, where each block is physically located (i. e. in which node). I haven't found the method... have I missed something obvious?


Solution

  • Have you had a look at: client.who_has() and/or client.has_what()?

    Personally, the data locality side helped me. And this side outlines the differences between compute and persist again.

    Maybe it is also possible with publish_dataset(), but I do not have experience with the function.

    You can easily check at which worker's memory the object is, with:

    from dask.distributed import Client
    import dask.array as da
    
    c_ = Client()
    a = da.random.random(100000)
    f_a = a.persist()
    c_.who_has(f_a)
    
    Key Copies Workers
    ('random_sample-c56488914f65fdea0c70600b46d3cb24', 0) 1 tcp://127.0.0.1:53074