python parquet dask pyarrow fastparquet

Using Parquet file statistics without reading the files


To my understanding, Parquet files store min/max statistics for columns. My question is: how can I read those statistics using Python without reading the entire file?

If it helps, I also have _common_metadata and _metadata files.


My specific problem is getting the max date for each stock exchange partition in this file system (each year partition contains multiple Parquet files that have a date column):

C:.
│   _common_metadata
│   _metadata
├───source=NASDAQ
│   ├───year=2017
│   └───year=2018
├───source=London_Stock_Exchange
│   ├───year=2014
│   └───year=2015
├───source=Japan_Exchange_Group
│   ├───year=2017
│   └───year=2018
└───source=Euronext
    ├───year=2017
    └───year=2018

Solution

  • You can extract them on a per-RowGroup basis in pyarrow:

    import pyarrow.parquet as pq
    
    pq_file = pq.ParquetFile(…)
    # Get metadata for the i-th RowGroup
    rg_meta = pq_file.metadata.row_group(i)
    # Get the "max" statistic for the k-th column
    # (statistics can be None if the writer did not record them)
    max_of_col = rg_meta.column(k).statistics.max