Suppose I want to check that a particular H5 file is the one I think it is, and hasn't had some dataset altered while I wasn't looking. I've already turned on the Fletcher-32 filter. I'm wondering if there's some way to access the checksum stored in the H5 file.
To be clear, I don't want to recalculate the checksum; I'm assuming that the data is consistent with the checksum, and I'm not expecting anything nefarious. I just want a quick way to peek in and make a list of the checksums, then peek in later to make sure my list hasn't somehow gotten out of sync with the data. Ideally, I'd like to do this through the h5py interface, but the C interface would at least give me somewhere to start.
My use case is basically this: I have a database of my H5 files, and I want to be sure that none of the datasets have changed without the database knowing about it. I don’t care if — say — an attribute has been changed or added, which means file sizes, modification times, and MD5 sums are of no use. For example, I might realize that some scaling was off by a factor of 2, go in and change those bits in one dataset without changing the dataset's shape or even the number of bytes in the file — but then fail to update the database for one reason or another. I need to be able to detect such a change. And since Fletcher-32 is already being computed by HDF5 with every change to our data, it would be very convenient.
Basically, I'm just asking for the highest-level API calls that can achieve this.
I've found one place in the HDF5 source code here where it reads the stored checksum — evidently the last 4 bytes of the buffer.
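For reference, here's a pure-Python sketch of what I believe that routine (H5_checksum_fletcher32 in H5checksum.c) computes. This is just my reading of the source, not a vetted implementation, but it's handy for spot-checking that the trailing 4 bytes of a raw chunk really are the checksum of everything before them.
def fletcher32(data):
    # Sketch of HDF5's H5_checksum_fletcher32: a Fletcher-32 variant that
    # consumes the data as big-endian 16-bit words, folding carries as it goes.
    sum1 = sum2 = 0
    pairs = len(data) // 2
    pos = 0
    while pairs:
        block = min(pairs, 360)  # HDF5 uses this block size to limit overflow
        pairs -= block
        for _ in range(block):
            sum1 += (data[pos] << 8) | data[pos + 1]
            sum2 += sum1
            pos += 2
        sum1 = (sum1 & 0xffff) + (sum1 >> 16)
        sum2 = (sum2 & 0xffff) + (sum2 >> 16)
    if len(data) % 2:  # odd number of bytes: pad the last word with zero
        sum1 += data[-1] << 8
        sum2 += sum1
        sum1 = (sum1 & 0xffff) + (sum1 >> 16)
        sum2 = (sum2 & 0xffff) + (sum2 >> 16)
    # One more fold to clear any remaining carry
    sum1 = (sum1 & 0xffff) + (sum1 >> 16)
    sum2 = (sum2 & 0xffff) + (sum2 >> 16)
    return (sum2 << 16) | sum1
# For a raw chunk read as in the code below, I'd expect (assuming the
# checksum is stored little-endian):
#   int.from_bytes(raw_chunk[-4:], 'little') == fletcher32(raw_chunk[:-4])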
Using this fact, it looks like there is an answer, as of HDF5 1.10.2 and h5py 2.10. But it's still not nearly as fast as I'd like — presumably because it's reading all the bytes in every chunk, possibly exacerbated by the need to be constantly allocating new buffers for all those reads.
Essentially, we want to bypass any filters (compression, etc.), read the last 4 bytes of the raw data chunk, and interpret them as an unsigned 32-bit integer. The read_direct_chunk method in h5py was added in version 2.10 and corresponds to the HDF5 function H5Dread_chunk.
Here's some simple example code, assuming test.h5 has a 2-dimensional dataset named data.
import numpy as np
import h5py
with h5py.File('test.h5', 'r') as f:
    ds = f['data']
    # Number of chunks along each dimension (the last chunk may be partial)
    n_chunks_0 = int(np.ceil(ds.shape[0] / ds.chunks[0]))
    n_chunks_1 = int(np.ceil(ds.shape[1] / ds.chunks[1]))
    checksums = np.empty((n_chunks_0, n_chunks_1), dtype=np.uint32)
    for i in range(n_chunks_0):
        for j in range(n_chunks_1):
            # read_direct_chunk takes the chunk's element offset (not its
            # index) and returns the raw, still-filtered bytes of the chunk
            filter_mask, raw_data_bytes = ds.id.read_direct_chunk(
                (i * ds.chunks[0], j * ds.chunks[1]))
            # The Fletcher-32 checksum is the last 4 bytes of the raw chunk
            checksums[i, j] = np.frombuffer(raw_data_bytes[-4:], dtype=np.uint32)[0]
Note that there may be some issues with endianness that I'm not considering.
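If that ever becomes a problem, one fix would be to pin the byte order explicitly rather than inheriting the host's native order from np.uint32. I believe HDF5 encodes the stored checksum little-endian, though I haven't verified this on a big-endian machine; the last line of the loop above would become:
checksums[i, j] = np.frombuffer(raw_data_bytes[-4:], dtype='<u4')[0]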
Anyway, the question remains: Is there any nice API for getting just those last 4 bytes, rather than the whole chunk?
There is a very new interface that lets me do exactly what I want: the (H5D)get_num_chunks and (H5D)get_chunk_info functions, introduced in HDF5 1.10.5 and coming to h5py in version 3.0.
Here's a simple example showing how to use these functions to get the checksums for every chunk in the data dataset of test.h5. Note that we need both h5py capabilities and seek/read capabilities, which is why I used this weird way of opening the file.
import numpy as np
import h5py
with open('test.h5', 'rb') as stream:
    with h5py.File(stream, 'r') as f:
        ds = f['data']
        assert ds.fletcher32, 'Dataset does not have Fletcher-32 checksums'
        checksums = np.zeros((ds.id.get_num_chunks(),), dtype=np.uint32)
        for i in range(checksums.size):
            chunk_info = ds.id.get_chunk_info(i)
            # The checksum occupies the last 4 bytes of the stored chunk
            offset = chunk_info.byte_offset + chunk_info.size - 4
            stream.seek(offset, 0)
            checksums[i] = np.frombuffer(stream.read(4), dtype=np.uint32)[0]
This code works with the current master branch of h5py, and its results agree with those from the code I added to my question above. But unlike that code, this version reads only 4 bytes per chunk (beyond the chunk info itself), and is thus probably about as efficient as it can be.
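As a final sanity check, the two approaches can be cross-checked in a single pass. This sketch maps each linearly indexed chunk back to its element offset via chunk_info's chunk_offset field, so it doesn't depend on any particular chunk ordering; the assumptions are the same as above (a chunked, Fletcher-32-filtered dataset named data in test.h5).
import numpy as np
import h5py
with open('test.h5', 'rb') as stream:
    with h5py.File(stream, 'r') as f:
        ds = f['data']
        for i in range(ds.id.get_num_chunks()):
            info = ds.id.get_chunk_info(i)
            # Whole-chunk read, as in the code from the question
            _, raw = ds.id.read_direct_chunk(info.chunk_offset)
            from_chunk = np.frombuffer(raw[-4:], dtype=np.uint32)[0]
            # 4-byte read, as in the answer's code above
            stream.seek(info.byte_offset + info.size - 4)
            from_stream = np.frombuffer(stream.read(4), dtype=np.uint32)[0]
            assert from_chunk == from_stream, info.chunk_offset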