python pandas compression fastparquet

compression option in fastparquet is not consistent


According to the fastparquet project page, fastparquet supports various compression methods:

Optional (compression algorithms; gzip is always available):

snappy (aka python-snappy)
lzo
brotli
lz4
zstandard

In particular, zstandard is a modern algorithm that provides high compression ratios as well as impressively fast compression and decompression. This is what I want to use with fastparquet.

But the documentation of fastparquet.write says:

compression to apply to each column, e.g. GZIP or SNAPPY or a dict like {"col1": "SNAPPY", "col2": None} to specify per column compression types. In both cases, the compressor settings would be the underlying compressor defaults. To pass arguments to the underlying compressor, each dict entry should itself be a dictionary:

{
    "col1": {
        "type": "LZ4",
        "args": {
            "compression_level": 6,
            "content_checksum": True
        }
    },
    "col2": {
        "type": "SNAPPY",
        "args": None
    },
    "_default": {
        "type": "GZIP",
        "args": None
    }
}

Nothing is mentioned about zstandard. What is worse, if I write

fastparquet.write('outfile.parq', df, compression='LZ4')

it raises an error saying

Compression 'LZ4' not available. Options: ['GZIP', 'UNCOMPRESSED']

So does fastparquet only support 'GZIP'? This is quite a discrepancy from the project page! Am I missing some packages? How can I use fastparquet with all of the compression algorithms listed on the project page?


Solution

  • Yes, you may be missing some packages. Your system must have the Python LZ4 and/or zstandard bindings installed first. See the source code for more details.

    • For LZ4: if import lz4.block gives a ModuleNotFoundError, then go ahead and install with pip install lz4.

    • Similarly for zstandard: pip install zstandard

    • And for brotli: pip install brotlipy

    • And lzo: pip install python-lzo

    • And snappy: pip install python-snappy
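
To see at a glance which of the optional packages above are missing, a small check along these lines may help. It is only a sketch: the mapping from codec label to importable module name is my assumption, based on what each pip package listed above installs, and it does not look at fastparquet's own internals, which may differ between versions.

```python
from importlib.util import find_spec

# Assumed mapping: codec label -> top-level module provided by the pip
# package named in the list above. The labels are for display only.
OPTIONAL_CODEC_MODULES = {
    "SNAPPY": "snappy",     # pip install python-snappy
    "LZO": "lzo",           # pip install python-lzo
    "BROTLI": "brotli",     # pip install brotlipy
    "LZ4": "lz4",           # pip install lz4
    "ZSTD": "zstandard",    # pip install zstandard
}


def check_modules(modules):
    """Return {label: bool}, True when the module can be located for import."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


if __name__ == "__main__":
    for label, found in check_modules(OPTIONAL_CODEC_MODULES).items():
        print(f"{label:8s} {'installed' if found else 'missing'}")
```

After installing whatever is reported missing, restart the Python session and call fastparquet.write again. If a codec name is still rejected, the Options list in the error message shows the names that fastparquet actually accepts in your environment.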