python parallel-processing zarr

How many files does zarr generate?


I am searching for a thread-safe alternative to HDF5 to read from in a multiprocessing environment and stumbled across zarr, which according to benchmarks is basically a drop-in replacement for h5py in a Python environment.

I tried it and all looks good so far, but I cannot wrap my head around the number of files zarr outputs.

If I write to an HDF5 file with h5py, I get a single file, whereas zarr seems to output an arbitrary number of files within a subfolder.

Would someone explain to me why that is and what the exact number of created files depends on?

Thanks in advance.


Solution

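  • Zarr stores an array as a key-value mapping: each key identifies a chunk (by its chunk index) and each value is the binary blob holding that chunk's data. If you are using the DirectoryStore, every key becomes a separate file on disk, which is why you see many files instead of one. The number of files depends on how many chunks your arrays have and which of those chunks contain non-trivial content (such as non-zero values); chunks that were never written are not stored at all. A few small metadata files sit alongside the chunk files.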
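
To make the dependence on chunking concrete, here is a minimal sketch in plain Python (it does not import zarr). The helper names `max_chunk_files` and `chunk_keys` are hypothetical, and the key naming assumes the zarr v2 layout with the default "." separator between chunk indices. It gives only an upper bound on the chunk files, since chunks that were never written do not appear on disk.

```python
import math
from itertools import product


def max_chunk_files(shape, chunks):
    """Upper bound on chunk files: one per cell of the chunk grid.

    Hypothetical helper, not part of zarr. Each dimension needs
    ceil(size / chunk_size) chunks; the total is their product.
    """
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunk_keys(shape, chunks, sep="."):
    """List the chunk keys for the grid, e.g. "0.0", "0.1", ...

    Assumes zarr v2 style naming with "." between indices.
    """
    grid = [range(math.ceil(s / c)) for s, c in zip(shape, chunks)]
    return [sep.join(map(str, idx)) for idx in product(*grid)]


# A 10000 x 10000 array split into 1000 x 1000 chunks: 10 * 10 chunks.
n = max_chunk_files((10000, 10000), (1000, 1000))

# A 5 x 4 array with 2 x 2 chunks: 3 * 2 chunks (the last row is partial).
keys = chunk_keys((5, 4), (2, 2))
```

So the file count is not random: it is bounded by the size of the chunk grid plus the metadata files, and choosing larger chunks reduces it.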