
How to create an HDF5 dataset with early allocation and no fill using h5py


I am trying to create a 78 TB HDF5 dataset by filling it in a 2D block-partitioned manner. This is very slow when the block I'm writing spans rows that have never been written to, because HDF5 allocates the disk space and fills the missing entries with zeros.

Instead, I would like h5py to allocate the disk space for my dataset as soon as it's created, and never fill it. This is possible with the C API according to Table 16 in the HDF5 Dataset documentation, but how can I do this with h5py, preferably with the high-level interface?


Solution

  • As Quincey suggested, you can use the low-level h5py API to create the dataset with the FILL_TIME_NEVER property, then convert it back to a high-level Dataset object:

    import h5py

    # Assumes fout is an open, writable h5py.File and that numRows, numCols,
    # rowchunk, and colchunk are already defined.
    # Create the "rows" dataset with the low-level API so zero-filling can be
    # disabled, then wrap it in a high-level Dataset object.
    spaceid = h5py.h5s.create_simple((numRows, numCols))
    plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
    plist.set_chunk((rowchunk, colchunk))
    # The low-level API expects the dataset name as bytes on Python 3.
    datasetid = h5py.h5d.create(fout.id, b"rows", h5py.h5t.NATIVE_DOUBLE, spaceid, plist)
    rows = h5py.Dataset(datasetid)
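
The question also asks for early allocation, which the snippet above does not set. A minimal sketch of how the two properties might be combined is below, assuming a serial HDF5 build and a tiny dataset for illustration. The helper name `create_unfilled_dataset`, the file name `demo.h5`, and the shapes are hypothetical; `set_alloc_time` with `ALLOC_TIME_EARLY` is the h5py counterpart of the C call `H5Pset_alloc_time`.

```python
import os
import tempfile

import h5py
import numpy as np


def create_unfilled_dataset(h5file, name, shape, chunks):
    """Create a chunked float64 dataset with early allocation and no fill.

    Hypothetical helper: wraps the low-level calls and returns a
    high-level h5py.Dataset.
    """
    space = h5py.h5s.create_simple(shape)
    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    dcpl.set_chunk(chunks)
    # Ask HDF5 to allocate all storage when the dataset is created ...
    dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
    # ... and never write the fill value into that storage.
    dcpl.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
    # The low-level API takes the dataset name as bytes.
    dsid = h5py.h5d.create(
        h5file.id, name.encode("utf-8"), h5py.h5t.NATIVE_DOUBLE, space, dcpl
    )
    return h5py.Dataset(dsid)


# Small stand-in sizes; the real dataset would be far larger.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as fout:
    rows = create_unfilled_dataset(fout, "rows", (8, 6), (4, 3))

    # Write one 2D block, as in a block-partitioned fill.
    block = np.arange(12, dtype=np.float64).reshape(4, 3)
    rows[4:8, 3:6] = block

    # Read the creation properties back to confirm they were applied.
    created = fout["rows"].id.get_create_plist()
    fill_time = created.get_fill_time()
    alloc_time = created.get_alloc_time()
    readback = rows[4:8, 3:6]
```

With fill disabled, regions that have never been written hold undefined values, so only blocks that were explicitly written should be read back.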