
Appending to h5 files


I have an h5 file that contains a dataset like this:

col1       col2       col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2

I have another h5 file with the same columns:

col1       col2       col3
 6           1          9
 8           2          7

I would like to concatenate the two to produce the following h5 file:

col1       col2       col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2
 6           1          9
 8           2          7

What is the most efficient way to do this if the files are huge or there are many such merges?


Solution

  • I'm not familiar with pandas, so I can't help there. This can be done with h5py or PyTables. As @hpaulj mentioned, the process reads the dataset into a NumPy array, then writes it to an HDF5 dataset with h5py. The exact process depends on the dataset's maxshape attribute, which controls whether the dataset can be resized.
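
    To decide which of the two processes applies to an existing file, it helps to inspect maxshape before writing anything. The following is a minimal sketch, not part of the original answer: is_resizable is a hypothetical helper name, and the demo uses an in-memory file (h5py's core driver) so nothing is written to disk.

```python
import h5py
import numpy as np


def is_resizable(dset, axis=0):
    """Return True if `dset` can grow along `axis` (hypothetical helper)."""
    # Only chunked datasets can be resized; contiguous ones report chunks=None.
    if dset.chunks is None:
        return False
    limit = dset.maxshape[axis]
    # None means unlimited; otherwise the axis can grow up to `limit`.
    return limit is None or limit > dset.shape[axis]


# In-memory file (core driver, no backing store), used only for this demo.
with h5py.File("maxshape_demo.h5", "w", driver="core", backing_store=False) as h5f:
    fixed = h5f.create_dataset("fixed", data=np.zeros((5, 3)))
    growable = h5f.create_dataset("growable", data=np.zeros((5, 3)),
                                  maxshape=(None, 3))
    fixed_ok = is_resizable(fixed)
    growable_ok = is_resizable(growable)
    print(fixed.maxshape, fixed_ok)
    print(growable.maxshape, growable_ok)
```

    A dataset created without maxshape reports its current shape as its maxshape, so it has to be copied into a new dataset; one created with None on an axis can be extended in place.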

    I created examples to show both methods (fixed-size or resizable dataset). The first method creates a new file, file3, that combines the values from file1 and file2. The second method appends the values from file2 to file1e (which is resizable). Note: the code to create the files used in the examples is at the end.

    I have a longer answer on Stack Overflow that shows all the ways to copy data.
    See this answer: "How can I combine multiple .h5 file?"

    Method 1: Combine datasets into a new file
    Required when the datasets were not created with the maxshape= parameter.

    import h5py

    with h5py.File('file1.h5','r') as h5f1,  \
         h5py.File('file2.h5','r') as h5f2,  \
         h5py.File('file3.h5','w') as h5f3 :

        print(h5f1['ds_1'].shape, h5f1['ds_1'].maxshape)
        print(h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

        # Row counts of the two sources and of the combined dataset
        arr1_a0 = h5f1['ds_1'].shape[0]
        arr2_a0 = h5f2['ds_2'].shape[0]
        arr3_a0 = arr1_a0 + arr2_a0
        # Create the destination at its final size; maxshape=(None,3) keeps it resizable
        h5f3.create_dataset('ds_3', dtype=h5f1['ds_1'].dtype,
                            shape=(arr3_a0,3), maxshape=(None,3))

        # Read ds_1 into a NumPy array and write it to the first rows
        xfer_arr1 = h5f1['ds_1'][:]
        h5f3['ds_3'][0:arr1_a0, :] = xfer_arr1

        # Read ds_2 into a NumPy array and write it to the remaining rows
        xfer_arr2 = h5f2['ds_2'][:]
        h5f3['ds_3'][arr1_a0:arr3_a0, :] = xfer_arr2

        print(h5f3['ds_3'].shape, h5f3['ds_3'].maxshape)
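
    Since the question also asks about many merges, Method 1 can be generalized to any number of source files. This is only a sketch under stated assumptions: combine_into_new_file is a hypothetical helper, every source dataset is assumed to be 2-D with the same column count and dtype, and the demo writes its files to a temporary folder.

```python
import os
import tempfile

import h5py
import numpy as np


def combine_into_new_file(sources, out_path, out_name="combined"):
    """Stack 2-D datasets row-wise into one new dataset (hypothetical helper).

    sources: list of (file path, dataset name) pairs. Assumes every dataset
    has the same number of columns and the same dtype.
    """
    # Pass 1: read only the metadata (shape, dtype) of each source.
    row_counts = []
    for path, name in sources:
        with h5py.File(path, "r") as src:
            row_counts.append(src[name].shape[0])
            ncols, dtype = src[name].shape[1], src[name].dtype
    total_rows = sum(row_counts)

    with h5py.File(out_path, "w") as out:
        # Allocate the destination once at its final size.
        dset = out.create_dataset(out_name, shape=(total_rows, ncols),
                                  dtype=dtype, maxshape=(None, ncols))
        # Pass 2: copy each source into its own row range.
        start = 0
        for (path, name), nrows in zip(sources, row_counts):
            with h5py.File(path, "r") as src:
                dset[start:start + nrows, :] = src[name][:]
            start += nrows
    return total_rows


# Demo with the two small arrays from the question, written to a temp folder.
arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
arr2 = np.array([[6, 1, 9], [8, 2, 7]])

with tempfile.TemporaryDirectory() as tmp:
    f1, f2, f3 = (os.path.join(tmp, n)
                  for n in ("file1.h5", "file2.h5", "file3.h5"))
    with h5py.File(f1, "w") as h5f:
        h5f.create_dataset("ds_1", data=arr1)
    with h5py.File(f2, "w") as h5f:
        h5f.create_dataset("ds_2", data=arr2)

    total = combine_into_new_file([(f1, "ds_1"), (f2, "ds_2")], f3, "ds_3")
    with h5py.File(f3, "r") as h5f:
        merged = h5f["ds_3"][:]

print(total, merged.shape)
```

    Sizing the destination in a first pass means the output dataset is created once and never resized, at the cost of opening each source file twice.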
    

    Method 2: Append the file2 dataset to the file1 dataset
    The dataset in file1e must be created with the maxshape= parameter.

    import h5py

    with h5py.File('file1e.h5','r+') as h5f1, \
         h5py.File('file2.h5','r') as h5f2 :

        print(h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)
        print(h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

        # Current row count, rows to add, and the row count after the append
        arr1_a0 = h5f1['ds_1e'].shape[0]
        arr2_a0 = h5f2['ds_2'].shape[0]
        arr3_a0 = arr1_a0 + arr2_a0

        # Grow ds_1e along axis 0 to make room for the new rows
        h5f1['ds_1e'].resize(arr3_a0,axis=0)

        # Read ds_2 into a NumPy array and write it into the new rows
        xfer_arr2 = h5f2['ds_2'][:]
        h5f1['ds_1e'][arr1_a0:arr3_a0, :] = xfer_arr2

        print(h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)
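
    When the source dataset is too large to hold in memory, the same append can be done a block of rows at a time. The sketch below is an illustration rather than a tuned implementation: append_in_blocks and its block_rows parameter are hypothetical names, and the demo uses in-memory files with a block size of 1 only to exercise the loop.

```python
import h5py
import numpy as np


def append_in_blocks(dst, src, block_rows=100_000):
    """Append all rows of `src` to the resizable dataset `dst`, block by block.

    Hypothetical helper; `dst` must have been created with maxshape=(None, ...).
    """
    old_rows = dst.shape[0]
    new_rows = src.shape[0]
    # Resize once up front, then fill the new rows in slices.
    dst.resize(old_rows + new_rows, axis=0)
    for start in range(0, new_rows, block_rows):
        stop = min(start + block_rows, new_rows)
        # Only rows start..stop of the source are in memory at any time.
        dst[old_rows + start:old_rows + stop, :] = src[start:stop, :]
    return dst.shape[0]


arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
arr2 = np.array([[6, 1, 9], [8, 2, 7]])

# In-memory files (core driver, no backing store), used only for this demo.
with h5py.File("dst_demo.h5", "w", driver="core", backing_store=False) as h5f1, \
     h5py.File("src_demo.h5", "w", driver="core", backing_store=False) as h5f2:
    ds_1e = h5f1.create_dataset("ds_1e", data=arr1, maxshape=(None, 3))
    ds_2 = h5f2.create_dataset("ds_2", data=arr2)
    total_rows = append_in_blocks(ds_1e, ds_2, block_rows=1)
    result = ds_1e[:]

print(total_rows)
print(result)
```

    A suitable block size depends on available memory and on the dataset's chunk layout; the default used here is an arbitrary placeholder.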
    

    Code to create the example files used above:

    import h5py
    import numpy as np
    
    arr1 = np.array([[ 1, 3, 5 ],
                     [ 5, 4, 9 ],
                     [ 6, 8, 0 ],
                     [ 7, 2, 5 ],
                     [ 2, 1, 2 ]] )
    
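    # file1: fixed-size dataset (no maxshape=), so it cannot be resized
    with h5py.File('file1.h5','w') as h5f:
        h5f.create_dataset('ds_1',data=arr1)
        print(h5f['ds_1'].maxshape)

    # file1e: same data, but resizable along axis 0 (maxshape=(None,3))
    with h5py.File('file1e.h5','w') as h5f:
        h5f.create_dataset('ds_1e',data=arr1, shape=(5,3), maxshape=(None,3))
        print(h5f['ds_1e'].maxshape)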
                     
    arr2 = np.array([[ 6, 1, 9 ],
                     [ 8, 2, 7 ]] )
                     
    with h5py.File('file2.h5','w') as h5f:
        h5f.create_dataset('ds_2',data=arr2)