Question: I have a large Dask array representing a tensor, and I want to rescale it using the zoom function from scipy.ndimage. After rescaling, I'd like to save the resulting Dask array to disk using either dask.array.to_zarr or dask.array.to_hdf5. A simple example is provided below.
Example: Suppose I have a Dask array data representing a 2D matrix, created with the following code:
import numpy as np
import dask.array as da
from scipy.ndimage import zoom

data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16)
data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
However, I am getting this error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[467], line 9
      4 disp_ds = da.random.randint(0, 100, (100,100), chunks=(20,20))
      5 disp_org = da.map_blocks(lambda x: zoom(x,2), disp_ds, dtype = np.uint16)
----> 9 disp_org.to_hdf5('myfile.hdf5', '/up_sampled')

File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:1811, in Array.to_hdf5(self, filename, datapath, **kwargs)
   1797 def to_hdf5(self, filename, datapath, **kwargs):
   1798     """Store array in HDF5 file
   1799
   1800     >>> x.to_hdf5('myfile.hdf5', '/x')  # doctest: +SKIP
   (...)
   1809     h5py.File.create_dataset
   1810     """
-> 1811     return to_hdf5(filename, datapath, self, **kwargs)

File \AppData\Local\anaconda3\envs\napari\lib\site-packages\dask\array\core.py:5387, in to_hdf5(filename, chunks, *args, **kwargs)
   5376 with h5py.File(filename, mode="a") as f:
   5377     dsets = [
   5378         f.require_dataset(
   5379             dp,
   (...)
   5385         for dp, x in data.items()
   5386     ]
...
    267 # All dimensions from target_shape should either have been popped
    268 # to match the selection shape, or be 1.
    269 raise TypeError("Can't broadcast %s -> %s" % (source_shape, self.array_shape))  # array shape
TypeError: Can't broadcast (40, 40) -> (20, 20)
From the example above, I understand that the zoom function changes the chunk size, but I couldn't find an efficient way to solve this issue.
I appreciate any help or suggestions on how to perform this rescaling and save operation efficiently with Dask. Thank you!
You just need to tell Dask the resulting chunk shape inside the map_blocks call. By default, map_blocks assumes each output block has the same shape as the corresponding input block, but zoom(x, 2) turns every (20, 20) chunk into a (40, 40) chunk, so Dask's metadata no longer matches the actual data and the HDF5 write fails. Pass the new chunk shape explicitly:
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))
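You can see the mismatch by comparing the chunk metadata with and without the chunks argument (a quick check, assuming data is defined as in the question):
# Without chunks=, Dask still believes each block is (20, 20):
da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16).chunks
# ((20, 20, 20, 20, 20), (20, 20, 20, 20, 20))

# With chunks=(40, 40), the metadata matches what zoom actually returns:
da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40)).chunks
# ((40, 40, 40, 40, 40), (40, 40, 40, 40, 40))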
Full working code:
import dask.array as da
import numpy as np
import h5py
from scipy.ndimage import zoom

data = da.random.randint(0, 100, (100, 100), chunks=(20, 20))

# Each (20, 20) input block becomes a (40, 40) output block, so declare that via chunks=
data_upsampled = da.map_blocks(lambda x: zoom(x, 2), data, dtype=np.uint16, chunks=(40, 40))

data_upsampled.to_hdf5('myfile.hdf5', '/up_sampled')
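Since the question also mentions dask.array.to_zarr, the same array can be written to a Zarr store instead of HDF5. This is just the alternative sink, assuming the zarr package is installed and 'myfile.zarr' is a path of your choosing:
# Alternative: write the rescaled array to a Zarr store (requires the zarr package)
data_upsampled.to_zarr('myfile.zarr')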