How to see the compression used to create a parquet file with pyarrow?


If I have a parquet file I can do

import pyarrow.parquet as pq
pqfile = pq.ParquetFile("pathtofile.parquet")
pqfile.metadata

but exploring the pqfile object with dir(), I can't find anything that indicates the compression of the file. How can I get that info?


Solution

  • @0x26res has a good point in the comments that converting the metadata to a dict will be easier than using dir.

    Compression is stored at the column level, not the file level. A parquet file consists of a number of row groups, and each row group has columns, each of which records its own compression. So you would want something like...

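    import pyarrow as pa
    import pyarrow.parquet as pq
    # Write a small single-column table using the default compression
    table = pa.Table.from_pydict({'x': list(range(100000))})
    pq.write_table(table, '/tmp/foo.parquet')
    # Read the codec of the first column in the first row group
    pq.ParquetFile('/tmp/foo.parquet').metadata.row_group(0).column(0).compression
    # 'SNAPPY'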