Tags: python, hdf5, h5py

Create new HDF5 dataset from existing one retaining the shape


I am new to HDF5 and I am trying to create new datasets from an existing one, writing each variable of the existing file to its own file. I use the following code:

import h5py

f = h5py.File(filename, 'r')
parts = [part for part in f.keys() if 'var' in part]
stats = f['stats'][()].decode()
cfg = Inifile(stats)
fields = cfg.get('data', 'fields', '').split(',')

fnew = {}
for field in fields:
    fnew[field] = h5py.File(filename+'_'+field, "w")
    cfg.set('data', 'fields', field)
    newstats = cfg.tostr()
    fnew[field].create_dataset('stats', data=newstats)

for part in parts:
    for i, field in enumerate(fields):
        fnew[field].create_dataset(part, data=f[part][:,i,:])

The objects in the old dataset are three dimensional, say [NX,NV,NY], whereas objects in the new dataset are two dimensional, [NX,NY]. However, I want them to be three dimensional, [NX,1,NY], so that they are compatible with the rest of the code. How can I do that with the HDF5/h5py libraries?


Solution

  • There's a lot going on in the code you posted. Is your question simply about reshaping the data in the last few lines:

    for part in parts:
        for i, field in enumerate(fields):
            fnew[field].create_dataset(part, data=f[part][:,i,:])
    

    If so, here is the short answer:

    for part in parts:
        (a0, a1, a2) = f[part].shape
        for i, field in enumerate(fields):
            fnew[field].create_dataset(part, data=f[part][:,i,:].reshape(a0,1,a2))
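
    As a side note (my own addition, not something the original answer relied on): slicing with `i:i+1` instead of `i` keeps the middle axis, so the reshape becomes unnecessary. A minimal NumPy sketch:

```python
import numpy as np

# Stand-in for f[part][()]; slicing with i:i+1 keeps the middle axis,
# so the result is already [NX,1,NY] and no reshape() is needed.
arr = np.arange(24).reshape(4, 2, 3)
i = 1
kept = arr[:, i:i+1, :]                    # shape (4, 1, 3)
reshaped = arr[:, i, :].reshape(4, 1, 3)   # same values, same shape
print(kept.shape)                          # (4, 1, 3)
print(np.array_equal(kept, reshaped))      # True
```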
    

    Here are key concepts in the procedure:

    • h5py datasets "behave like" NumPy arrays, so you can read slices of data, get .shape attribute, etc.
    • Use h5_dataset.shape to get the shape of the data.
    • Read a slice of data from the dataset with [:,index,:]
    • Use .reshape() to reshape each slice to desired [NX,1,NY]
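
    Here is a tiny self-contained sketch of those concepts (the dataset name and sizes are made up for the demo; the `core` driver with `backing_store=False` keeps the file in memory so nothing is written to disk):

```python
import numpy as np
import h5py

# In-memory HDF5 file: driver='core' with backing_store=False keeps
# everything in RAM, so this demo leaves no file behind.
with h5py.File('demo.h5', 'w', driver='core', backing_store=False) as f:
    nx, nv, ny = 5, 3, 4
    f.create_dataset('var01', data=np.random.random((nx, nv, ny)))
    ds = f['var01']
    print(ds.shape)              # (5, 3, 4) -- .shape, like a NumPy array
    sl = ds[:, 1, :]             # read one slice -> NumPy array, shape (5, 4)
    sl3 = sl.reshape(nx, 1, ny)  # back to 3-D -> shape (5, 1, 4)
    print(sl.shape, sl3.shape)   # (5, 4) (5, 1, 4)
```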

    I'm curious: why are you creating duplicate copies of the data? There is no need to do that. Simply read slices of the dataset and reshape them for downstream calculations. It is easy to do once you know how to manipulate NumPy arrays (and h5py dataset objects).
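
    To make the "no duplicate copies" point concrete, here is one possible sketch. The helper name `iter_field_slices` and the demo `fields` list are my own invention, not part of the original code:

```python
import os
import tempfile

import numpy as np
import h5py

def iter_field_slices(filename, part, fields):
    """Yield (field, [NX,1,NY] array) pairs, reading slices on demand
    instead of duplicating the data into per-field files."""
    with h5py.File(filename, 'r') as f:
        a0, _, a2 = f[part].shape
        for i, field in enumerate(fields):
            yield field, f[part][:, i, :].reshape(a0, 1, a2)

# Demo with a throwaway file:
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('var01', data=np.zeros((4, 2, 3)))
for field, arr in iter_field_slices(path, 'var01', ['u', 'v']):
    print(field, arr.shape)   # u (4, 1, 3), then v (4, 1, 3)
```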

    Note: You forgot to include f.close(), which can leave your file in an undetermined state. I prefer Python's `with ... as:` context manager for opening files. It closes the file when the block completes AND if/when an exception (error) occurs inside the with block.

    When working with HDF5 files, it's important to understand the data schema before writing code. I created an example that shows the general process. It reads data of shape [NX,NV,NY], then copies as 'NV' datasets with shape [NX,1,NY]. Once you understand this concept, it can be adapted to any schema.

    The example below starts by creating a simple file with one dataset of shape [NX,NV,NY] (the first with/as block). Then, in the second with/as block, slices of data are read from the dataset in the first file and copied to a new file (they could also be new datasets in the first file). Each slice is written as an individual dataset of shape [NX,1,NY] using reshape().

    Sample code below:

    import numpy as np
    import h5py

    # Create a test file
    filename = 'SO_69101523.h5'
    with h5py.File(filename,'w') as f1:
        nx, nv, ny = 100, 10, 100
        arr = np.random.random(nx*nv*ny).reshape(nx,nv,ny)
        f1.create_dataset('var01',data=arr)
    
    newfilename='SO_69101523_new.h5'
    # open existing file as f1 and 
    # new file as f2     
    with h5py.File(filename,'r') as f1, \
         h5py.File(newfilename,'w') as f2:
        part = 'var01'
        ds1 = f1[part] 
        print(ds1.shape)  #shows (100, 10, 100) from above
        a0, a2 = ds1.shape[0], ds1.shape[2]
        for a1 in range(ds1.shape[1]):
            ds_name = f'{part}_{a1:03}'
            f2.create_dataset(ds_name,data=ds1[:,a1,:].reshape(a0,1,a2))
            print(f2[ds_name].shape)