Tags: python, hdf5, h5py

Combining hdf5 files


I have a number of HDF5 files, each of which has a single dataset. The datasets are too large to hold in RAM. I would like to combine these files into a single file containing all of the datasets separately (i.e. not to concatenate the datasets into a single dataset).

One way to do this is to create an HDF5 file and then copy the datasets one by one. This would be slow and complicated, because it would need to be a buffered copy.

Is there a simpler way to do this? It seems like there should be, since it is essentially just creating a container file.

I am using python/h5py.


Solution

  • One solution is to use the h5py interface to the low-level H5Ocopy function of the HDF5 API, in particular the h5py.h5o.copy function:

    In [1]: import h5py as h5
    
    In [2]: hf1 = h5.File("f1.h5")
    
    In [3]: hf2 = h5.File("f2.h5")
    
    In [4]: hf1.create_dataset("val", data=35)
    Out[4]: <HDF5 dataset "val": shape (), type "<i8">
    
    In [5]: hf1.create_group("g1")
    Out[5]: <HDF5 group "/g1" (0 members)>
    
    In [6]: hf1.get("g1").create_dataset("val2", data="Thing")
    Out[6]: <HDF5 dataset "val2": shape (), type "|O8">
    
    In [7]: hf1.flush()
    
    In [8]: h5.h5o.copy(hf1.id, "g1", hf2.id, "newg1")
    
    In [9]: h5.h5o.copy(hf1.id, "val", hf2.id, "newval")
    
    In [10]: hf2.values()
    Out[10]: [<HDF5 group "/newg1" (1 members)>, <HDF5 dataset "newval": shape (), type "<i8">]
    
    In [11]: hf2.get("newval").value
    Out[11]: 35
    
    In [12]: hf2.get("newg1").values()
    Out[12]: [<HDF5 dataset "val2": shape (), type "|O8">]
    
    In [13]: hf2.get("newg1").get("val2").value
    Out[13]: 'Thing'
    

    The above was generated with h5py version 2.0.1-2+b1 and iPython version 0.13.1-2+deb7u1 atop Python version 2.7.3-4+deb7u1 from a more-or-less vanilla install of Debian Wheezy. The files f1.h5 and f2.h5 did not exist prior to executing the above. Note that, per salotz, for Python 3 the dataset/group names need to be bytes (e.g., b"val"), not str.

    The hf1.flush() in command [7] is crucial: the low-level interface apparently always draws from the version of the .h5 file stored on disk, not the one cached in memory. Copying datasets to or from groups that are not at the root of a File can be achieved by supplying the ID of that group, e.g. hf1.get("g1").id.
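For instance, a minimal sketch of copying into a non-root group by passing group IDs instead of file IDs (the file names src.h5/dst.h5 and the dataset/group names here are illustrative assumptions, and names are passed as bytes for Python 3):

```python
import h5py

# Build a small source file and an empty destination group (illustrative names).
with h5py.File("src.h5", "w") as src, h5py.File("dst.h5", "w") as dst:
    grp = src.create_group("g1")
    grp.create_dataset("inner", data=7)
    dst.create_group("target")
    src.flush()  # ensure the on-disk file is up to date before the low-level copy
    # Copy src:/g1/inner to dst:/target/inner by supplying the group IDs
    # rather than the file IDs; names are bytes under Python 3.
    h5py.h5o.copy(src["g1"].id, b"inner", dst["target"].id, b"inner")

with h5py.File("dst.h5", "r") as dst:
    print(dst["target"]["inner"][()])  # 7
```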

    Note that h5py.h5o.copy will fail with an exception (no clobber) if an object of the indicated name already exists in the destination location.
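Putting this together for the original use case, the sketch below copies one dataset from each input file into a single combined file; because h5py.h5o.copy runs inside the HDF5 library, no dataset contents need to fit in RAM on the Python side. The file names (part1.h5, part2.h5, combined.h5) and the dataset names are illustrative assumptions, as is the rename to data_0, data_1 to keep the copied datasets distinct:

```python
import h5py
import numpy as np

inputs = ["part1.h5", "part2.h5"]   # hypothetical input files
output = "combined.h5"              # hypothetical output file

# Create small example inputs (stand-ins for the large real files).
for i, name in enumerate(inputs):
    with h5py.File(name, "w") as f:
        f.create_dataset("data", data=np.arange(5) + i * 10)

# Combine: copy each file's dataset into the output under a unique name.
with h5py.File(output, "w") as out:
    for i, name in enumerate(inputs):
        with h5py.File(name, "r") as src:
            # Object names must be bytes under Python 3, not str.
            h5py.h5o.copy(src.id, b"data", out.id, f"data_{i}".encode())

with h5py.File(output, "r") as out:
    print(sorted(out.keys()))  # ['data_0', 'data_1']
```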