I am searching for a thread-safe alternative to HDF5 for reading in a multiprocessing environment and stumbled across zarr, which, according to benchmarks, is basically a drop-in replacement for h5py in a Python environment.
I tried it and all looks good so far, but I cannot wrap my head around the number of files zarr outputs.
If I write to an HDF5 file with h5py, only a single file results, whereas zarr outputs what appears to be a random number of files within a subfolder. Roughly what I am doing (a minimal sketch; the file names `data.h5` and `data.zarr` are just placeholders):
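```python
import numpy as np
import h5py
import zarr

data = np.random.random((100, 100))

# h5py: everything lands in one single file
with h5py.File("data.h5", "w") as f:
    f.create_dataset("data", data=data)

# zarr: a folder appears instead, containing multiple files
zarr.save("data.zarr", data)
```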
Would someone explain to me why that is and what the exact number of created files depends on?
Thanks in advance!
Zarr generally maps keys (in particular, chunk indices) to values (binary blobs representing each chunk's data). If you are using the DirectoryStore, every key-value pair is written to disk as its own file. The number of files you see therefore depends on how many chunks your arrays are divided into and which of those chunks contain non-trivial content (chunks consisting entirely of the fill value, e.g. all zeros, are simply not written), plus a few small metadata files such as `.zarray`.
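A minimal sketch to illustrate (the array shape, chunk sizes, and the `example.zarr` path are just placeholder choices; the chunk file names shown assume the zarr v2 DirectoryStore layout):

```python
import os
import numpy as np
import zarr

# A 100x100 array split into 25x25 chunks -> a 4x4 grid,
# i.e. at most 16 chunk files.
z = zarr.open(
    "example.zarr",   # DirectoryStore: a directory on disk
    mode="w",
    shape=(100, 100),
    chunks=(25, 25),
    dtype="f8",
)

# Only the upper-left quarter is written, touching 2x2 = 4 chunks.
# The other 12 chunks stay at the fill value and produce no files.
z[:50, :50] = np.random.random((50, 50))

# Expect the ".zarray" metadata file plus 4 chunk files
# ("0.0", "0.1", "1.0", "1.1" in the v2 layout).
print(sorted(os.listdir("example.zarr")))
```

Writing to the full array instead (`z[:] = ...`) would fill every chunk and produce all 16 chunk files, which is why the file count looks unpredictable: it tracks which chunks hold actual data, not the array's declared size.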