
What is the recommended compression for HDF5 for fast read/write performance (in Python/pandas)?


I have read several times that turning on compression in HDF5 can lead to better read/write performance.

I wonder what the ideal settings are to achieve good read/write performance with:

 data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)

I'm already using format='fixed', as it's faster than the table format. I have plenty of CPU power and don't care much about disk space.

I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.


Solution

  • There are several compression filters that you could use. Since HDF5 version 1.8.11 you can easily register third-party compression filters.
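    One way to use such a third-party filter from Python is the hdf5plugin package (an assumption on my part; it is a separate pip package, not part of h5py), which registers several extra filters, including Blosc, when imported. A minimal sketch:

        import h5py
        import hdf5plugin  # registers Blosc & friends as HDF5 filters on import
        import numpy as np

        data = np.random.rand(2500, 9000)

        with h5py.File("plugin_demo.h5", "w") as f:
            # Compress with Blosc+LZ4 through the registered third-party filter.
            f.create_dataset(
                "data",
                data=data,
                **hdf5plugin.Blosc(cname="lz4", clevel=5,
                                   shuffle=hdf5plugin.Blosc.SHUFFLE),
            )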

    Regarding performance:

    It probably depends on your access pattern: you want to choose chunk dimensions that align well with that pattern, otherwise performance will suffer a lot. For example, if you know that you usually access one column and all rows, you should define the chunk shape accordingly, e.g. (2500, 1) for the shape mentioned in the question. See here, here and here for more information.
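    To illustrate, here is a minimal h5py sketch (assuming a 2500 x 9000 float64 array, as in the question; the file and dataset names are arbitrary) that stores one column per chunk, so a single-column read touches exactly one chunk:

        import h5py
        import numpy as np

        data = np.random.rand(2500, 9000)

        with h5py.File("chunked.h5", "w") as f:
            # One chunk per column: good when reading single columns across all rows.
            f.create_dataset("values", data=data, chunks=(2500, 1),
                             compression="gzip", compression_opts=4)

        with h5py.File("chunked.h5", "r") as f:
            col = f["values"][:, 42]  # reads exactly one chunk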

    However, AFAIK pandas usually ends up loading the entire HDF5 file into memory, unless you use the table format and read it with an iterator (see here) or do the partial I/O yourself (see here), and thus doesn't benefit much from a well-chosen chunk size.
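    For reference, a sketch of such an iterator-based read with pandas (note that this requires format='table'; the file name and chunk size here are arbitrary, and a narrower frame is used because very wide tables are problematic in the table format):

        import numpy as np
        import pandas as pd

        df = pd.DataFrame(np.random.rand(2500, 90)).rename(columns=str)

        # Only the table format supports partial reads / iteration.
        df.to_hdf("iter_demo.h5", key="df", format="table",
                  complib="blosc", complevel=9)

        # Stream 500-row chunks instead of materialising the whole frame.
        for chunk in pd.read_hdf("iter_demo.h5", key="df", chunksize=500):
            print(chunk.shape)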

    Nevertheless, you might still benefit from compression, because loading compressed data into memory and decompressing it on the CPU is often faster than loading the uncompressed data from disk.
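    A quick way to check this on your own data is to time reads of an uncompressed versus a compressed copy of the same frame (a rough sketch; note that the random values used here compress far worse than most real data, so results on your actual DataFrames will differ):

        import os
        import time

        import numpy as np
        import pandas as pd

        df = pd.DataFrame(np.random.rand(2500, 9000))

        df.to_hdf("raw.h5", key="df", format="fixed")  # uncompressed
        df.to_hdf("comp.h5", key="df", format="fixed",
                  complib="blosc", complevel=9)        # compressed

        for fname in ("raw.h5", "comp.h5"):
            t0 = time.perf_counter()
            pd.read_hdf(fname, key="df")
            dt = time.perf_counter() - t0
            size_mb = os.path.getsize(fname) / 1e6
            print(f"{fname}: {size_mb:.1f} MB, read in {dt:.3f} s")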

    Regarding your original question:

    I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression filters:

    • BloscLZ: internal default compressor, heavily based on FastLZ.
    • LZ4: a compact, very popular and fast compressor.
    • LZ4HC: a tweaked version of LZ4, produces better compression ratios at the expense of speed.
    • Snappy: a popular compressor used in many places.
    • Zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.

    These have different strengths, and the best approach is to benchmark them on your data and see which works best; a sketch of such a benchmark follows below.
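    A simple benchmarking sketch along those lines, looping over the Blosc codecs that pandas/PyTables expose via complib (the file names and complevel=9 are arbitrary choices here):

        import time

        import numpy as np
        import pandas as pd

        df = pd.DataFrame(np.random.rand(2500, 9000))

        for complib in ("blosc:blosclz", "blosc:lz4", "blosc:lz4hc",
                        "blosc:snappy", "blosc:zlib"):
            fname = "bench_" + complib.replace(":", "_") + ".h5"

            t0 = time.perf_counter()
            df.to_hdf(fname, key="df", format="fixed",
                      complib=complib, complevel=9)
            t_write = time.perf_counter() - t0

            t0 = time.perf_counter()
            pd.read_hdf(fname, key="df")
            t_read = time.perf_counter() - t0

            print(f"{complib:>14}: write {t_write:.2f} s, read {t_read:.2f} s")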