
HDF5 files and plotting using chunks


I'm new to HDF5 files and I don't understand how to access chunks in a dataset. I have quite a big dataset of shape (1536, 2048, 11, 18, 2), chunked into (768, 1024, 1, 1, 1), so each chunk covers a quarter of an image. I want to plot the dataset, showing the mean value of each (whole) image, using matplotlib.

Question: how do I access separate chunks, and how do I work with them? (How does h5py use them?)

This is my code:

import h5py
import numpy as np

bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))

# The with block closes the file on exit, so no explicit close() is needed
with h5py.File('test.h5', 'w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))

I have this to get access to the dataset, but I don't know how to access the chunks.

with h5py.File('test.h5', 'r') as hf:
    for dset in hf['Measurement 1'].keys():
        print(dset)
        ds_hf = hf['Measurement 1']['data']  # returns an HDF5 dataset object (no data read yet)
        print(ds_hf)
        print(ds_hf.shape, ds_hf.dtype)
        data_f = hf['Measurement 1']['data'][:]  # adding [:] reads the whole dataset into a NumPy array

I need the program to open each chunk, get the mean value and close it again before opening the next one, so my RAM doesn't get full.


Solution

  • Here is some sample code showing how to work with chunks in HDF5. It is written in a general way, so you can modify it to suit your requirements:

    import h5py
    import numpy as np
    
    # Generate random data
    bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))
    
    # Create the HDF5 file and dataset
    with h5py.File('test.h5', 'w') as f:
        grp = f.create_group('Measurement 1')
        grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))
    
    # Open the HDF5 file
    with h5py.File('test.h5', 'r') as hf:
        # Access the dataset
        ds_hf = hf['Measurement 1']['data']
        print(ds_hf)
        print(ds_hf.shape, ds_hf.dtype)
    
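        # Iterate over the chunks: iter_chunks() (h5py >= 3.0) yields one
        # tuple of slices per stored chunk
        for chunk_slices in ds_hf.iter_chunks():
            # Only this chunk is read from disk into memory
            chunk = ds_hf[chunk_slices]
            # Process the chunk
            chunk_mean = np.mean(chunk)
            print(f"Chunk {chunk_slices}: Mean value = {chunk_mean}")
    
    # No hf.close() is needed: the with block closes the file on exit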
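
Since the question asks for one mean per whole image rather than per chunk, a minimal sketch along the following lines may help. It assumes that the first two axes of the dataset hold the image and that the remaining axes index the images. The function names `image_means` and `plot_means`, and the small demo file, are made up for illustration and are not part of h5py. Only one image is read from disk at a time, so memory use should stay at roughly one image plus the small array of means.

```python
import os
import tempfile

import h5py
import numpy as np


def image_means(path, dataset_path):
    """Return the mean of every 2-D image, reading one image at a time."""
    with h5py.File(path, "r") as hf:
        ds = hf[dataset_path]
        # Assumption: axes 0 and 1 are the image, all later axes index the images
        means = np.empty(ds.shape[2:], dtype=np.float64)
        for idx in np.ndindex(*ds.shape[2:]):
            # Slicing the dataset reads only this image from disk
            image = ds[(slice(None), slice(None)) + idx]
            means[idx] = image.mean()
    return means


def plot_means(means):
    """Show one heat map per entry of the last axis (assumes a 3-D array of means)."""
    import matplotlib.pyplot as plt  # imported here so image_means works without it

    fig, axes = plt.subplots(1, means.shape[-1], squeeze=False)
    for k, ax in enumerate(axes[0]):
        im = ax.imshow(means[..., k])
        ax.set_title(f"last index = {k}")
        fig.colorbar(im, ax=ax)
    plt.show()


# Small demo file with the same layout as in the question, but tiny
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
demo = np.random.randint(0, 100, (16, 32, 3, 4, 2))
with h5py.File(demo_path, "w") as f:
    grp = f.create_group("Measurement 1")
    grp.create_dataset("data", data=demo, chunks=(8, 16, 1, 1, 1))

means = image_means(demo_path, "Measurement 1/data")
print(means.shape)
```

For the real file, the call would be `image_means('test.h5', 'Measurement 1/data')`, followed by `plot_means(means)` to display the result. Reading whole images still touches complete chunks, because each image here is made up of entire chunks, so no chunk needs to be read twice.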