
h5py mean and std value per chunk


I am stuck and I need help, please! I am working with quite a large dataset of images. The shape is dset=(1563,2048,396), i.e. (pixel, pixel, number of images). I chunked it with chunks=(1,1,396) to have one chunk for each image.

My problem: I need to get the mean and std values for each image, i.e. for each chunk, and I can't find a solution.

What I tried:

1.

import h5py
import numpy as np

with h5py.File('test2.h5', 'r') as hf:
    # Access the dataset
    ds_hf = hf['Measurement 1']['data']
    #print(ds_hf)

    means = []
    stds = []
    # Iterate over the chunks
    for chunk_idx in np.ndindex(ds_hf.chunks):
        chunk = ds_hf[chunk_idx]
        # Process the chunk
        chunk_mean = np.mean(chunk)
        chunk_std = np.std(chunk)
        means.append(chunk_mean)
        stds.append(chunk_std)

# (the with block closes the HDF5 file automatically)

Here my problem is that chunk is only a single value, not the whole chunk: np.ndindex(ds_hf.chunks) iterates over element indices within one chunk's shape, so ds_hf[chunk_idx] reads single elements, and my mean and std values are not correct.
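For reference, a sketch of what iterating over whole chunks would look like, assuming a recent h5py (3.x) where Dataset.iter_chunks() yields one tuple of slices per stored chunk:

with h5py.File('test2.h5', 'r') as hf:
    ds_hf = hf['Measurement 1']['data']
    means, stds = [], []
    for chunk_slices in ds_hf.iter_chunks():
        chunk = ds_hf[chunk_slices]  # reads the full chunk, not one element
        means.append(np.mean(chunk))
        stds.append(np.std(chunk))

Note that even this would not give per-image statistics here: with chunks=(1,1,396), each chunk holds one pixel position across all 396 images, so the statistics would be per pixel, not per image.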

2.

with h5py.File('test2.h5', 'r') as hf:
    ds_hf = hf['Measurement 1']['data'] # returns HDF5 dataset object
    print(ds_hf.shape)

    for i in range(len(ds_hf)): #range(len(ds_hf.shape))
        image = ds_hf[i] # this returns numpy array for image i

    mean = np.mean(image, axis=0)
    std = np.std(image, axis=0)

Here I have the problem that it seems to calculate along the wrong dimensions, because image has the shape (2048,396) and I need (1563,2048), but I cannot change range(len(ds_hf)) without getting different errors.
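The indexing explains this, as the short sketch below shows: ds_hf[i] is shorthand for ds_hf[i,:,:], a slice along the first axis, which is why image comes back as (2048,396). One whole image lies along the last axis instead:

image = ds_hf[:,:,i]  # shape (1563, 2048): one full image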

3.

with h5py.File('test2.h5', 'r') as hf:
    ds_hf = hf['Measurement 1']['data'] # returns HDF5 dataset object
    print(ds_hf.shape)

    means = []
    for i in range(396):
        mean = np.mean(ds_hf[:,:,i])
        means.append(mean)

Now this seems to work, but it takes way too long.


Solution

  • As noted in my comments, chunk size and shape have a significant impact on I/O performance. (As they should -- that's the whole point of chunked storage.) Get it right, and I/O is significantly faster. Get it wrong, and I/O is significantly slower. When using chunked storage, HDF5/h5py reads the entire chunk if it needs any slice of it. With chunks=(1,1,396), every image slice [:,:,i] touches all 1563*2048 chunks, so you have to read every chunk every time you load an image. That is very inefficient.

    Setting the appropriate chunk size takes experience, AND an understanding of how the data will be accessed. Per the h5py docs, the optimum chunk size is between 10 KiB and 1 MiB, larger for larger datasets. Start with chunks=True if you don't know the optimum value; h5py will then pick a chunk shape based on the dataset shape and dtype, as the short sketch below shows.
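    For instance (a quick sketch, not part of the original timing runs; 'auto.h5' is a throwaway file name), you can let h5py choose and then inspect the result:

    import h5py

    with h5py.File('auto.h5', 'w') as hf:
        ds = hf.create_dataset('Measurement_1/data',
                               shape=(1563, 2048, 396), dtype=int, chunks=True)
        print(ds.chunks)  # the chunk shape h5py picked automatically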

    In your case, it's clear the shape should be (x,y,1) (since you are reading one image at a time). Once you make this change, your code should be faster.

    I wrote a simple example to demonstrate the behavior. Timing data for 4 chunk options is provided after the code. (Note: I had to use a2=25 because I am running on a virtual machine with limited storage. Change to match your data.)

    Also, I changed the group name from "Measurement 1" to "Measurement_1" because spaces in names can cause headaches with some HDF5 APIs. Better to be safe than sorry.

    import time
    import h5py
    import numpy as np

    a0, a1, a2 = 1563, 2048, 25 #396
    start = time.time()
    
    with h5py.File('test2.h5', 'w') as hf:
        ds_hf = hf.create_dataset('Measurement_1/data',
                   shape=(a0,a1,a2), dtype=int, chunks=(1,a1,a2))
        for i in range(a2):
            image = np.random.randint(0,100, (a0,a1))
            ds_hf[:,:,i] = image
    print(f'\nDone creating file. time = {(time.time()-start):.2f}')
    
    start = time.time()
    with h5py.File('test2.h5', 'r') as hf:
        ds_hf = hf['Measurement_1']['data'] # returns HDF5 dataset object
        means = []
        for i in range(ds_hf.shape[2]):
            mean = np.mean(ds_hf[:,:,i])
            means.append(mean)
    
    print(f'Done reading file. time = {(time.time()-start):.2f}')
    

    Timing results

    No chunks:
    Create: 101.9 sec
    Read: 4.0 sec

    Chunks=(1563,1024,1):
    Create: 2.4 sec
    Read: 0.5 sec

    Chunks=(1563,1,25):
    Create: 111.9 sec
    Read: 19.0 sec

    Chunks=(1,1024,25):
    Create: 113.0 sec
    Read: 10.4 sec
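
    For completeness, the question asked for std as well as mean. With image-aligned chunks, both statistics come from the same read, so the reading loop extends naturally (a sketch reusing the file and imports from the example above):

    with h5py.File('test2.h5', 'r') as hf:
        ds_hf = hf['Measurement_1']['data']
        means, stds = [], []
        for i in range(ds_hf.shape[2]):
            image = ds_hf[:,:,i]  # a single chunk read when chunks=(a0,a1,1)
            means.append(np.mean(image))
            stds.append(np.std(image))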