I have an image generator that reads batches of 3D tensors from an HDF5 file (via h5py). It uses the Python multiprocessing library, since it inherits from the Keras Sequence class.
I'd like to understand whether I'm doing this correctly, and whether I can improve it.
I have a __getitem__ method that is invoked by multiple processes for N iterations. Each time this method is called, I open the HDF5 file, read a batch of data for a given set of indices, and immediately close the file (via the context manager).
def get_dataset_items(self, dataset: str, indices: np.ndarray) -> np.ndarray:
    """Get items from an h5py dataset.

    Arguments
    ---------
    dataset : str
        The dataset key.
    indices : np.ndarray
        The current batch indices.

    Returns
    -------
    np.ndarray
        A batch of elements.
    """
    with h5.File(self.src, 'r') as file:
        return file[dataset][indices]
This approach seems to work, but I'm really not sure. I've read that reading a file from multiple processes can lead to strange behaviour and corrupted data.
I see that h5py offers an MPI interface and an SWMR (single-writer/multiple-reader) mode. Could I benefit from either of these features?
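For reference, this is what I understand SWMR usage to look like from the h5py documentation. The file name and dataset key below are placeholders; I haven't verified that this fits my read-only, multi-process case.

import h5py

# writer side: the file must be created with libver='latest'
# and then switched into SWMR mode
# f = h5py.File('data.h5', 'w', libver='latest')
# ... create datasets ...
# f.swmr_mode = True

# reader side: each process opens the file independently in SWMR mode
with h5py.File('data.h5', 'r', libver='latest', swmr=True) as f:
    batch = f['my_dataset'][0:32]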
This is not a definitive answer, but I ran into problems with compressed data today and found your question while looking for this fix: if you give h5py a Python file object instead of a filename, you can bypass some of the problems and read compressed data via multiprocessing.
# opening the file with Python's open() and passing the file object
# to h5py overcomes some complex problem
with h5py.File(open(self.src, "rb"), "r") as hfile:
    # grab the data from hfile
    groups = list(hfile['/'])  # etc.
So far as I can tell, HDF5 tries to "optimise" disk I/O for compressed (chunked) data: if several processes are reading the same blocks, you might not want to decompress them separately in each process, and this is where things get into a mess. With a Python file object, we can hope the library no longer knows that the processes are looking at the same data and stops trying to help.
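As a rough illustration of how this works across processes, here is a sketch with multiprocessing.Pool, where each worker opens its own Python file object and hands it to h5py. The path, dataset key, batch size and worker count are placeholders.

import h5py
import numpy as np
from multiprocessing import Pool

SRC = "data.h5"          # placeholder path
DATASET = "my_dataset"   # placeholder dataset key

def read_batch(indices):
    # each worker opens its own python file object, then gives it to h5py
    with open(SRC, "rb") as raw:
        with h5py.File(raw, "r") as hfile:
            return hfile[DATASET][indices]

if __name__ == "__main__":
    batches = [np.arange(i * 32, (i + 1) * 32) for i in range(4)]
    with Pool(4) as pool:
        results = pool.map(read_batch, batches)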