According to the fastparquet project page, fastparquet
supports various compression methods:
Optional (compression algorithms; gzip is always available):
snappy (aka python-snappy), lzo, brotli, lz4, zstandard
zstandard in particular
is a modern algorithm that provides high compression ratios as well as impressively fast compression and decompression. This is what I want to use in fastparquet.
But the documentation of fastparquet.write says:
compression to apply to each column, e.g. GZIP or SNAPPY or a dict like {"col1": "SNAPPY", "col2": None} to specify per column compression types. In both cases, the compressor settings would be the underlying compressor defaults. To pass arguments to the underlying compressor, each dict entry should itself be a dictionary:
{ "col1": { "type": "LZ4", "args": { "compression_level": 6, "content_checksum": True } }, "col2": { "type": "SNAPPY", "args": None }, "_default": { "type": "GZIP", "args": None } }
Nothing is mentioned about zstandard. What is worse, if I write
fastparquet.write('outfile.parq', df, compression='LZ4')
it raises an error saying
Compression 'LZ4' not available. Options: ['GZIP', 'UNCOMPRESSED']
So fastparquet
only supports 'GZIP'? That is quite a discrepancy from the project page! Am I missing some packages? How can I use fastparquet with all the compression algorithms listed on the project page?
Yes, you may be missing some packages. Your system must have the Python LZ4 and/or zstandard bindings installed first. See the source code for more details.
For LZ4: if import lz4.block gives a ModuleNotFoundError, then go ahead and install it with pip install lz4.
Similarly for zstandard: pip install zstandard
And for brotli: pip install brotlipy
And lzo: pip install python-lzo
And snappy: pip install python-snappy
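
To check all of these in one go, something like the following sketch may help. It only tests whether the Python bindings named above can be found in the current environment; it does not ask fastparquet itself which codecs it has registered. The helper name missing_codecs and the CODEC_MODULES table are made up for this example, and the upper-case codec names other than 'LZ4', 'GZIP' and 'SNAPPY' are assumed to follow the same convention.

```python
import importlib.util

# Codec name -> (top-level import name, pip package to install).
# The pip package names are the ones listed in this answer. The codec keys
# "LZO", "BROTLI" and "ZSTD" are an assumption about fastparquet's naming.
CODEC_MODULES = {
    "SNAPPY": ("snappy", "python-snappy"),
    "LZO": ("lzo", "python-lzo"),
    "BROTLI": ("brotli", "brotlipy"),
    "LZ4": ("lz4", "lz4"),
    "ZSTD": ("zstandard", "zstandard"),
}


def missing_codecs(modules=CODEC_MODULES):
    """Return {codec: pip_package} for every codec whose module is not found."""
    missing = {}
    for codec, (module, package) in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing[codec] = package
    return missing


if __name__ == "__main__":
    for codec, package in missing_codecs().items():
        print(f"{codec}: bindings not found, try: pip install {package}")
```

Run it with the same interpreter that runs fastparquet; a package installed into a different environment will still show up as missing here.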