
How to rescale dask array using dask.map_blocks and scipy's zoom function and save result as zarr or hdf5?


Question: I have a large Dask array representing a tensor, and I want to rescale it using the zoom function from the SciPy package. After rescaling, I'd like to save the resulting Dask array to disk using either dask.array.to_zarr or dask.array.to_hdf5. Below, I'm providing a simple example for better understanding.

Example: Suppose I have a Dask array data representing a 2D matrix with the following code:

    import dask.array as da
    import numpy as np
    from scipy.ndimage import zoom

    data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

    data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16)

    data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')

However, I am getting this error:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    Cell In[467], line 9
          4 disp_ds = da.random.randint(0, 100, (100,100), chunks=(20,20))
          5 disp_org = da.map_blocks(lambda x: zoom(x,2), disp_ds, dtype = np.uint16)
    ----> 9 disp_org.to_hdf5('myfile.hdf5', '/up_sampled')

    File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:1811, in Array.to_hdf5(self, filename, datapath, **kwargs)
       1797 def to_hdf5(self, filename, datapath, **kwargs):
       1798     """Store array in HDF5 file
       1799
       1800     >>> x.to_hdf5('myfile.hdf5', '/x')  # doctest: +SKIP
       (...)
       1809     h5py.File.create_dataset
       1810     """
    -> 1811     return to_hdf5(filename, datapath, self, **kwargs)

    File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:5387, in to_hdf5(filename, chunks, *args, **kwargs)
       5376 with h5py.File(filename, mode="a") as f:
       5377     dsets = [
       5378         f.require_dataset(
       5379             dp,
       (...)
       5385         for dp, x in data.items()
       5386     ]
    ...
        267     # All dimensions from target_shape should either have been popped
        268     # to match the selection shape, or be 1.
        269     raise TypeError("Can't broadcast %s -> %s" % (source_shape, self.array_shape))  # array shape

    TypeError: Can't broadcast (40, 40) -> (20, 20)

Given the example above, I understand that the zoom function changes the size of each chunk, but I couldn't find a way to solve this issue in an optimized way.
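
For example, comparing the chunk metadata Dask records with the shape a block actually has after zoom makes the mismatch visible (a quick diagnostic sketch, reusing data_upsampled from the example above):

    # Dask still assumes each output block keeps the input shape (20, 20)...
    print(data_upsampled.chunks)
    # ((20, 20, 20, 20, 20), (20, 20, 20, 20, 20))

    # ...but computing a single block shows zoom actually returns (40, 40),
    # which is exactly the broadcast failure reported in the traceback.
    print(data_upsampled.blocks[0, 0].compute().shape)
    # (40, 40)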

I appreciate any help or suggestions on how to perform this rescaling and save operation efficiently with Dask. Thank you!


Solution

  • You just need to tell Dask the resulting chunk shape inside the map_blocks call; by default, map_blocks assumes each output block has the same shape as the input block:

    data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
    

    Full working code:

    import dask.array as da
    import numpy as np
    import h5py
    from scipy.ndimage import zoom

    data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

    # chunks=(40, 40) tells Dask that zoom doubles each 20x20 block,
    # so the stored chunk metadata matches the data actually produced.
    data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
    data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
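
    The question also asks about Zarr: the same chunks hint carries over, and the corrected array can be written with to_zarr (a minimal sketch, assuming the zarr package is installed; the store path 'myfile.zarr' and the component name are placeholders):

    # Writing the same rescaled array to a Zarr store instead of HDF5;
    # to_zarr uses the (now correct) chunk layout when creating the dataset.
    data_upsampled.to_zarr('myfile.zarr', component='up_sampled')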