hive parquet trino

parquet file codec conversion


I have a Parquet file compressed with the BROTLI codec. Trino does not support BROTLI, so I need to convert the file to a supported codec such as GZIP or SNAPPY. The conversion doesn't seem straightforward, or at least I could not find a Python library that does it. Please share your ideas or strategies for this codec conversion.


Solution

  • You should be able to do this with pyarrow. It can read Brotli-compressed Parquet files.

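    import pyarrow.parquet as pq

    # Read the Brotli-compressed file into an in-memory Arrow table.
    table = pq.read_table("input.parquet")  # placeholder path
    # Write it back out; use a different path to keep the original file.
    pq.write_table(table, "output.parquet")  # placeholder path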
    

    This saves the file with snappy compression by default. To choose a different codec, pass the compression keyword argument to write_table, for example compression="gzip".