
How to rescale dask array using dask.map_blocks and scipy's zoom function and save result as zarr or hdf5?


Question: I have a large Dask array representing a tensor, and I want to rescale it using the zoom function from the SciPy package. After rescaling, I'd like to save the resulting Dask array to disk using either dask.array.to_zarr or dask.array.to_hdf5. Below, I'm providing a simple example for better understanding.

Example: Suppose I have a Dask array data representing a 2D matrix with the following code:

    import dask.array as da
    import numpy as np
    from scipy.ndimage import zoom

    data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

    data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16)

    data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')

However, I am getting this error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    Cell In[467], line 9
          4 disp_ds = da.random.randint(0, 100, (100,100), chunks=(20,20))
          5 disp_org = da.map_blocks(lambda x: zoom(x,2), disp_ds, dtype = np.uint16)
    ----> 9 disp_org.to_hdf5('myfile.hdf5', '/up_sampled')

    File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:1811, in Array.to_hdf5(self, filename, datapath, **kwargs)
       1797 def to_hdf5(self, filename, datapath, **kwargs):
       1798     """Store array in HDF5 file
       1799
       1800     >>> x.to_hdf5('myfile.hdf5', '/x')  # doctest: +SKIP
       (...)
       1809     h5py.File.create_dataset
       1810     """
    -> 1811     return to_hdf5(filename, datapath, self, **kwargs)

    File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:5387, in to_hdf5(filename, chunks, *args, **kwargs)
       5376 with h5py.File(filename, mode="a") as f:
       5377     dsets = [
       5378         f.require_dataset(
       5379             dp,
       (...)
       5385         for dp, x in data.items()
       5386     ]
    ...
        267     # All dimensions from target_shape should either have been popped
        268     # to match the selection shape, or be 1.
        269     raise TypeError("Can't broadcast %s -> %s" % (source_shape, self.array_shape))  # array shape

    TypeError: Can't broadcast (40, 40) -> (20, 20)

Given the example above, I understand that the zoom function changes the size of each chunk, but I couldn't find a way to solve this issue in an optimized way.
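
For example, comparing the chunk metadata Dask records with the shape a block actually has after zoom makes the mismatch visible (a quick diagnostic sketch, reusing data_upsampled from the example above):

    # Dask still assumes each output block keeps the input shape (20, 20)...
    print(data_upsampled.chunks)
    # ((20, 20, 20, 20, 20), (20, 20, 20, 20, 20))

    # ...but computing a single block shows zoom actually returns (40, 40),
    # which is exactly the broadcast failure reported in the traceback.
    print(data_upsampled.blocks[0, 0].compute().shape)
    # (40, 40)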

I appreciate any help or suggestions on how to perform this rescaling and save operation efficiently with Dask. Thank you!


Solution

  • You just need to tell Dask the resulting chunk shape inside the map_blocks call; by default, map_blocks assumes each output block has the same shape as the input block:

    data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
    

    Full working code:

    import dask.array as da
    import numpy as np
    import h5py
    from scipy.ndimage import zoom

    data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

    # chunks=(40, 40) tells Dask that zoom doubles each 20x20 block,
    # so the stored chunk metadata matches the data actually produced.
    data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
    data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
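
    The question also asks about Zarr: the same chunks hint carries over, and the corrected array can be written with to_zarr (a minimal sketch, assuming the zarr package is installed; the store path 'myfile.zarr' and the component name are placeholders):

    # Writing the same rescaled array to a Zarr store instead of HDF5;
    # to_zarr uses the (now correct) chunk layout when creating the dataset.
    data_upsampled.to_zarr('myfile.zarr', component='up_sampled')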