Search code examples
pythonarraysdaskchunks

Iterating through dask array chunks


I am trying to manually iterate through the chunks of a dask array, one by one, and apply my computation. I understand that a benefit of dask is that it can to do the iteration for me, but my computation is failing (for reasons that I don't think are related to dask) and I want to iterate through manually for the purpose of debugging. How would I do that?

I am imagining something like:

import dask.array as da
data = da.random.randint(0, 30, size=(1_000, 100, 100), chunks=(-1, 10, 10))

for chunk in data.iterchunks():
    # chunk would contain some information about which chunk I have access to, 
    # and I could somehow get the data contained in that chunk
    chunk_data = get_chunk(chunk)
    my_function(chunk_data)

Where the chunk that I get back has some information about which chunk I am in, and there would also be get the data for that chunk.


Solution

  • Access the data within each chunk using the arr.blocks property. The BlockView object has an array-like interface, but accessing an element in the BlockView array returns the selected chunk(s) in the original array:

    In [11]: data
    Out[11]: dask.array<randint, shape=(1000, 100, 100), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>
    
    In [12]: data.blocks
    Out[12]: <dask.array.core.BlockView at 0x1730b2da0>
    
    In [13]: data.blocks.shape
    Out[13]: (1, 10, 10)
    
    In [14]: data.blocks[0, 0, 0]
    Out[14]: dask.array<blocks, shape=(1000, 10, 10), dtype=int64, chunksize=(1000, 10, 10), chunktype=numpy.ndarray>
    
    In [15]: data.blocks[0, 0, 0].compute()
    Out[15]:
    array([[[14,  5, 24, ..., 25, 20,  6],
            [17, 12,  2, ..., 27, 13, 18],
            [13, 25,  2, ...,  7,  5, 22],
            ...,
            [12, 22, 26, ..., 15,  4, 11],
            [ 0, 26, 28, ..., 22, 14,  4],
            [ 9, 21, 14, ..., 15, 18, 21]],
    
           ...,
    
           [[ 3,  2, 20, ..., 27,  0, 12],
            [21, 17,  7, ..., 23,  3, 23],
            [24, 13,  0, ..., 26,  1,  0],
            ...,
            [ 5, 25,  6, ..., 22,  6, 16],
            [16, 25, 21, ..., 22, 14, 15],
            [ 8, 20, 17, ..., 29, 13,  1]]])
    

    So in your case, you could loop through all blocks with the following:

    In [34]: for inds in itertools.product(*map(range, data.blocks.shape)):
        ...:     chunk = data.blocks[inds]
        ...:     my_function(chunk)
    

    This will be slow, but it does I think what you're looking for.