
Creating a dataset from multiple hdf5 groups



I read each group into a NumPy array with:

np.array(hdf.get('all my groups'))

I then added code to create a single dataset from the groups:

with h5py.File('/train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)

The error message is:

ValueError: operands could not be broadcast together with shapes (534456,4) (534456,14)

Each group has the same number of rows; only the number of columns differs. I want to combine the 5 separate groups into one dataset.


Solution

  • This answer addresses the OP's request in comments to my first answer ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than in the first answer. As a result, I used a different approach to define the dataset names and the columns to be copied. Differences:

    • The first solution iterates over the dataset names from the "keys()" (copying each dataset completely, appending to a dataset in the new file). The size of the new dataset is calculated by summing sizes of all datasets.
    • The second solution uses 2 lists to define 1) dataset names (ds_list) and 2) associated columns to copy from each dataset (col_list is a list of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns using col_list.
    • How you decide to do this depends on your data.
    • Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems.
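    The dtype and shape tests mentioned in the note could look something like the sketch below. The helper name check_compatible is hypothetical (it is not from the first answer), and it assumes every dataset is 2-D and must match the first one in dtype and row count. The demo uses an in-memory file, so nothing is written to disk.

```python
import h5py
import numpy as np


def check_compatible(h5file, ds_names):
    """Return (dtype, n_rows) shared by the named 2-D datasets.

    Hypothetical helper: raises ValueError on the first dataset that
    is not 2-D or that differs from the first dataset in dtype or rows.
    """
    ref = h5file[ds_names[0]]
    if len(ref.shape) != 2:
        raise ValueError(f"{ds_names[0]}: expected 2-D data, got shape {ref.shape}")
    for name in ds_names[1:]:
        ds = h5file[name]
        if len(ds.shape) != 2:
            raise ValueError(f"{name}: expected 2-D data, got shape {ds.shape}")
        if ds.dtype != ref.dtype:
            raise ValueError(f"{name}: dtype {ds.dtype} differs from {ref.dtype}")
        if ds.shape[0] != ref.shape[0]:
            raise ValueError(f"{name}: {ds.shape[0]} rows, expected {ref.shape[0]}")
    return ref.dtype, ref.shape[0]


# Demo on an in-memory file (driver='core' with backing_store=False)
with h5py.File('check_demo.h5', 'w', driver='core', backing_store=False) as h5f:
    h5f.create_dataset('ds_1', data=np.zeros((20, 6)))
    h5f.create_dataset('ds_2', data=np.zeros((20, 2)))
    demo_dtype, demo_rows = check_compatible(h5f, ['ds_1', 'ds_2'])
```

    The returned dtype and row count can then be used for the dtype= and shape= arguments when creating the combined dataset.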

    Code for the second solution:

    import h5py
    import numpy as np

    # Data for file1
    arr1 = np.random.random(120).reshape(20,6)
    arr2 = np.random.random(120).reshape(20,6)
    arr3 = np.random.random(120).reshape(20,6)
    arr4 = np.random.random(120).reshape(20,6)
    
    # Create file1 with 4 datasets
    with h5py.File('file1.h5','w') as h5f :
        h5f.create_dataset('ds_1',data=arr1)
        h5f.create_dataset('ds_2',data=arr2)
        h5f.create_dataset('ds_3',data=arr3)
        h5f.create_dataset('ds_4',data=arr4)
     
    # Open file1 for reading and file2 for writing
    with h5py.File('file1.h5','r') as h5f1 , \
         h5py.File('file2.h5','w') as h5f2 :
    
        # Get dtype and row count from the first dataset in file1 (should test compatibility)
        for i, ds in enumerate(h5f1.keys()) :
            if i == 0:
                ds_0_dtype = h5f1[ds].dtype
                n_rows = h5f1[ds].shape[0]
                break

        # Datasets to read and the columns to copy from each one
        ds_list = ['ds_1','ds_2','ds_3','ds_4']
        col_list = [ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
        n_cols = sum( [ len(c) for c in col_list])

        # Create new empty dataset with appropriate dtype and size
        # Use maxshape parameter to make resizable in the future
        h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))

        # Loop over datasets in file1, read data into xfer_arr, and write to file2
        first = 0
        for ds, cols in zip(ds_list, col_list) :
            xfer_arr = h5f1[ds][:,cols]
            last = first + xfer_arr.shape[1]
            h5f2['combined'][:, first:last] = xfer_arr[:]
            first = last
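
    If the arrays fit in memory, a NumPy-only alternative is to join them side by side before writing. The `+` operator in the question adds arrays element by element, which is why shapes (534456,4) and (534456,14) cannot be broadcast; np.hstack concatenates along the column axis instead. This is a minimal sketch: the array names follow the question, but the shapes and random data are made up for illustration.

```python
import numpy as np

# Stand-in arrays: same number of rows, different numbers of columns
rng = np.random.default_rng(0)
one_T = rng.random((10, 4))
two_T = rng.random((10, 14))
three_T = rng.random((10, 6))

# Join all columns side by side: 4 + 14 + 6 = 24 columns
combined = np.hstack([one_T, two_T, three_T])
print(combined.shape)  # (10, 24)

# Join selected columns only, using the same fancy indexing as col_list
subset = np.hstack([one_T, two_T[:, [0, 1]], three_T[:, [3, 5]]])
print(subset.shape)  # (10, 8)
```

    The resulting array can be passed as data= to create_dataset, as in the question. For data too large to hold in memory at once, the column-by-column copy shown above is the safer route.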