I have an h5 file which contains a dataset like this:
col1 col2 col3
1 3 5
5 4 9
6 8 0
7 2 5
2 1 2
I have another h5 file with the same columns:
col1 col2 col3
6 1 9
8 2 7
and I would like to concatenate these two to have the following h5 file:
col1 col2 col3
1 3 5
5 4 9
6 8 0
7 2 5
2 1 2
6 1 9
8 2 7
What is the most efficient way to do this if files are huge or we have many of these merges?
I'm not familiar with pandas, so I can't help there. This can be done with h5py or PyTables. As @hpaulj mentioned, the process reads the dataset into a NumPy array, then writes it to an HDF5 dataset with h5py. The exact process depends on the dataset's maxshape attribute, which controls whether the dataset can be resized.
I created examples to show both methods (fixed-size or resizable dataset). The first method creates a new file, file3, that combines the values from file1 and file2. The second method appends the values from file2 to the dataset in file1e, which is resizable. Note: the code to create the files used in the examples is at the end.
I have a longer answer on SO that shows all the ways to copy data.
See this Answer: How can I combine multiple .h5 file?
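To decide which of the two methods below applies, you can inspect maxshape before doing anything else. This is only a minimal sketch: the helper is_resizable and the file name maxshape_demo.h5 are made up for illustration, and it assumes you want to grow along axis 0 (rows).

```python
import os
import tempfile

import h5py
import numpy as np


def is_resizable(dset, axis=0):
    """Return True if the dataset can still grow along the given axis."""
    # maxshape[axis] is None for an unlimited axis, otherwise an integer limit.
    limit = dset.maxshape[axis]
    return limit is None or limit > dset.shape[axis]


tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "maxshape_demo.h5")  # hypothetical file name

with h5py.File(demo_path, "w") as h5f:
    # Created without maxshape=: the dataset has a fixed size.
    fixed = h5f.create_dataset("fixed", data=np.zeros((5, 3), dtype=int))
    # Created with maxshape=(None, 3): rows can be added later.
    growable = h5f.create_dataset("growable", data=np.zeros((5, 3), dtype=int),
                                  maxshape=(None, 3))
    fixed_ok = is_resizable(fixed)
    growable_ok = is_resizable(growable)
    print(fixed.maxshape, fixed_ok)
    print(growable.maxshape, growable_ok)
```

If the check is False, use Method 1 (write a new file); if it is True, Method 2 (resize and append) is available.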
Method 1: Combine datasets into a new file
Required when the datasets were not created with the maxshape= parameter.
import h5py

with h5py.File('file1.h5','r') as h5f1, \
     h5py.File('file2.h5','r') as h5f2, \
     h5py.File('file3.h5','w') as h5f3 :

    print(h5f1['ds_1'].shape, h5f1['ds_1'].maxshape)
    print(h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

    # Number of rows in each source dataset and in the combined dataset
    arr1_a0 = h5f1['ds_1'].shape[0]
    arr2_a0 = h5f2['ds_2'].shape[0]
    arr3_a0 = arr1_a0 + arr2_a0

    # Create the output dataset at its final size (resizable, for later appends)
    h5f3.create_dataset('ds_3', dtype=h5f1['ds_1'].dtype,
                        shape=(arr3_a0,3), maxshape=(None,3))

    # Read each source dataset into a NumPy array, then write it to its row range
    xfer_arr1 = h5f1['ds_1'][:]
    h5f3['ds_3'][0:arr1_a0, :] = xfer_arr1

    xfer_arr2 = h5f2['ds_2'][:]
    h5f3['ds_3'][arr1_a0:arr3_a0, :] = xfer_arr2

    print(h5f3['ds_3'].shape, h5f3['ds_3'].maxshape)
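For huge files, reading a whole dataset into memory at once may not be possible. One option is to copy the rows in blocks. The following is a rough sketch, not a tuned implementation: the function concat_in_blocks and its block_rows parameter are names I made up, and it assumes all source datasets are 2-D with the same number of columns and the same dtype. The demo writes its own small temporary files so it runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def concat_in_blocks(sources, dst_path, dst_name, block_rows=100_000):
    """Concatenate 2-D datasets row-wise into a new file.

    sources: list of (file_path, dataset_name) pairs (assumed to have
    matching column counts and dtypes).
    """
    # First pass: read only metadata so the output can be sized up front.
    row_counts = []
    for path, name in sources:
        with h5py.File(path, "r") as h5f:
            dset = h5f[name]
            row_counts.append(dset.shape[0])
            n_cols, dtype = dset.shape[1], dset.dtype

    with h5py.File(dst_path, "w") as h5out:
        out = h5out.create_dataset(dst_name, shape=(sum(row_counts), n_cols),
                                   dtype=dtype, maxshape=(None, n_cols))
        offset = 0
        # Second pass: copy each source in slices of at most block_rows rows.
        for path, name in sources:
            with h5py.File(path, "r") as h5f:
                dset = h5f[name]
                for start in range(0, dset.shape[0], block_rows):
                    stop = min(start + block_rows, dset.shape[0])
                    n = stop - start
                    out[offset:offset + n, :] = dset[start:stop, :]
                    offset += n


# Demo with the data from the question, written to temporary files.
tmpdir = tempfile.mkdtemp()
f1 = os.path.join(tmpdir, "file1.h5")
f2 = os.path.join(tmpdir, "file2.h5")
f3 = os.path.join(tmpdir, "file3.h5")
with h5py.File(f1, "w") as h5f:
    h5f.create_dataset("ds_1", data=np.array(
        [[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]]))
with h5py.File(f2, "w") as h5f:
    h5f.create_dataset("ds_2", data=np.array([[6, 1, 9], [8, 2, 7]]))

# A tiny block size is used here only to exercise the block loop.
concat_in_blocks([(f1, "ds_1"), (f2, "ds_2")], f3, "ds_3", block_rows=2)

with h5py.File(f3, "r") as h5f:
    result = h5f["ds_3"][:]
print(result.shape)
```

Only one block of rows is held in memory at a time, so peak memory is governed by block_rows rather than by the size of the source datasets.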
Method 2: Append the file2 dataset to the file1e dataset
The dataset in file1e must be created with the maxshape= parameter.
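import h5py

with h5py.File('file1e.h5','r+') as h5f1, \
     h5py.File('file2.h5','r') as h5f2 :

    print(h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)
    print(h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

    # Current row count, rows to add, and the row count after appending
    arr1_a0 = h5f1['ds_1e'].shape[0]
    arr2_a0 = h5f2['ds_2'].shape[0]
    arr3_a0 = arr1_a0 + arr2_a0

    # Grow the resizable dataset along axis 0 to make room for the new rows
    h5f1['ds_1e'].resize(arr3_a0,axis=0)

    # Read the file2 dataset into a NumPy array and write it into the new rows
    xfer_arr2 = h5f2['ds_2'][:]
    h5f1['ds_1e'][arr1_a0:arr3_a0, :] = xfer_arr2

    print(h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)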
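If you have many of these merges, the resize-and-write step can be wrapped in a small helper and called in a loop. This is a sketch under assumptions: append_rows is a hypothetical helper name, the extra_*.h5 files and their dataset name "ds" are invented for the demo, and the target dataset is assumed to have been created with maxshape=(None, 3).

```python
import os
import tempfile

import h5py
import numpy as np


def append_rows(dst_dset, new_rows):
    """Grow dst_dset along axis 0 and write new_rows into the added rows."""
    old_rows = dst_dset.shape[0]
    dst_dset.resize(old_rows + new_rows.shape[0], axis=0)
    dst_dset[old_rows:, :] = new_rows


tmpdir = tempfile.mkdtemp()

# Resizable target dataset (created with maxshape=, as Method 2 requires).
target = os.path.join(tmpdir, "file1e.h5")
with h5py.File(target, "w") as h5f:
    h5f.create_dataset("ds_1e", data=np.array([[1, 3, 5], [5, 4, 9]]),
                       maxshape=(None, 3))

# Hypothetical extra files to merge in, each holding a dataset named "ds".
extra_paths = []
for i, rows in enumerate([[[6, 1, 9], [8, 2, 7]], [[0, 0, 1]]]):
    path = os.path.join(tmpdir, f"extra_{i}.h5")
    with h5py.File(path, "w") as h5f:
        h5f.create_dataset("ds", data=np.array(rows))
    extra_paths.append(path)

# Open the target once and append each source file in turn.
with h5py.File(target, "r+") as h5dst:
    for path in extra_paths:
        with h5py.File(path, "r") as h5src:
            append_rows(h5dst["ds_1e"], h5src["ds"][:])
    merged = h5dst["ds_1e"][:]

print(merged.shape)
```

Here the dataset is resized once per source file. If the total number of incoming rows is known in advance, resizing once to the final size and then writing each slice is another option.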
Code to create the example files used above:
import h5py
import numpy as np

arr1 = np.array([[ 1, 3, 5 ],
                 [ 5, 4, 9 ],
                 [ 6, 8, 0 ],
                 [ 7, 2, 5 ],
                 [ 2, 1, 2 ]] )

# Fixed-size dataset (no maxshape=), used by Method 1
with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1',data=arr1)
    print(h5f['ds_1'].maxshape)

# Resizable dataset (unlimited rows), used by Method 2
with h5py.File('file1e.h5','w') as h5f:
    h5f.create_dataset('ds_1e',data=arr1, shape=(5,3), maxshape=(None,3))
    print(h5f['ds_1e'].maxshape)

arr2 = np.array([[ 6, 1, 9 ],
                 [ 8, 2, 7 ]] )

with h5py.File('file2.h5','w') as h5f:
    h5f.create_dataset('ds_2',data=arr2)