
h5py: how to stack multiple HDF5 datasets to a single np.array


I have a 50GB h5 file which only has datasets and holds circa 7M x 256-d NumPy arrays as values.

I want to read a slice of it, since the whole file can't be loaded into memory, but I am struggling to do this.

So, I have:

    import h5py

    f=h5py.File("./somefile.h5","r",libver='latest')
    keys_=list(f.keys()) # works - I get all the keys, check the length, and it adds up.
    print(f[:5]) # fails, saying it needs a string for the dataset name
    print(f[()][:5]) # fails, same as above
    print(f['.'][:5]) # fails

This is driving me mad!

To reiterate, there are no groups, just datasets - how do I get slices, say for example 1M slices?


Solution

  • After re-reading your question, I realize you want to STACK datasets into a single array (i.e., not SLICE a dataset). I have updated my answer to show how to do this.

    First, you don't need groups to access datasets. You simply specify the dataset name in the H5 object path.
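
    For example, assuming a dataset named 'vectors' at the root level ('vectors' is just a placeholder; use one of the names returned by f.keys()), you can read the first 5 rows like this:

    import h5py

    f = h5py.File("./somefile.h5","r")
    print(f['vectors'][:5])  # 'vectors' is a placeholder dataset name
    f.close()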

    Also, your code doesn't call f.close(). This can cause problems if you accidentally exit the program with the file open. It's better to use Python's file context manager (with/as). That way the file is closed when execution leaves the with/as block.

    How to stack datasets as np.array slices:
    This procedure is similar to the original answer. However, it creates an empty NumPy array large enough to hold the data before it loops over the datasets. This process assumes all datasets have the same shape (and dtype). Since I don't know your dataset shape, I made it a variable, and the last index of the array is the dataset index. Also, there are tests to be sure each dataset matches the shape and dtype of the target array.

    Code to stack datasets below:

    import h5py
    import numpy as np

    with h5py.File("./somefile.h5","r") as f:
        ds_names = list(f.keys())
        n_slices = 5
        ds_shape = f[ds_names[0]].shape
        # target array shape: dataset shape plus a trailing axis for the dataset index
        arr_shape = ds_shape + (n_slices,)
        ds_arr = np.zeros(arr_shape, dtype=f[ds_names[0]].dtype)
        print(f'array; Type: {ds_arr.dtype}, Shape: {ds_arr.shape}')
        print(f'{ds_arr.shape[:-1]}')

        for cnt, ds in enumerate(ds_names[:n_slices]):
            # only copy objects that are datasets and match the target shape/dtype
            if isinstance(f[ds], h5py.Dataset) and \
               f[ds].shape == ds_arr.shape[:-1] and f[ds].dtype == ds_arr.dtype:
                print(f'For dataset {ds}; Type: {f[ds].dtype}, Shape: {f[ds].shape}')
                ds_arr[..., cnt] = f[ds][()]
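
    As a design note: if all the datasets really do match in shape and dtype, a more compact (though less defensive) alternative is np.stack, which creates the trailing axis for you. A minimal sketch, assuming the same file and dataset layout as above:

    import h5py
    import numpy as np

    with h5py.File("./somefile.h5","r") as f:
        ds_names = list(f.keys())[:5]  # first 5 datasets, as above
        # np.stack reads each dataset fully into memory, then joins them along a new last axis
        ds_arr = np.stack([f[ds][()] for ds in ds_names], axis=-1)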
    

    Original answer to slice datasets
    Note, for those that want to slice a dataset: slicing syntax is identical to NumPy syntax. So, if you have multiple axes (dimensions), you will need to specify the slice size for each dimension.
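
    For example, if one of your datasets is a 2-d array (say roughly 7M rows x 256 columns, as your question suggests), a 1M-row slice would look like the sketch below; 'vectors' is again a placeholder dataset name:

    import h5py

    with h5py.File("./somefile.h5","r") as f:
        # reads rows 0..999_999 and all columns into memory; the rest stays on disk
        chunk = f['vectors'][0:1_000_000, :]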

    Code below will access your file, loop over all root level objects (keys), print dataset info, and attempt to read a slice. This assumes the slice syntax is appropriate for your dataset. I added logic to test that each object is a dataset.

    import h5py

    with h5py.File("./somefile.h5","r") as f:
        for ds in f:
            if isinstance(f[ds], h5py.Dataset):
                print(f'For dataset {ds}; Type: {f[ds].dtype}, Shape: {f[ds].shape}')
                print(f[ds][:5])  # gets a small slice, assuming the slice syntax fits the dataset's shape
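
    And since the question mentions 1M slices of a file too big for memory, here is a minimal sketch of reading one dataset in fixed-size chunks (the dataset name 'vectors' and the chunk size are assumptions):

    import h5py

    chunk_rows = 1_000_000  # assumed chunk size; tune to your available memory

    with h5py.File("./somefile.h5","r") as f:
        ds = f['vectors']  # placeholder dataset name
        for start in range(0, ds.shape[0], chunk_rows):
            chunk = ds[start:start + chunk_rows]  # h5py reads only this slice from disk
            # ... process chunk here ...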