To my understanding, Parquet files store min/max statistics for each column. My question is: how can I read those statistics using Python without reading the entire file?
If it helps, I also have _common_metadata and _metadata files.
My specific problem is getting the max date for each stock exchange partition in this file system (each year partition contains multiple Parquet files that have a date column):
C:.
│   _common_metadata
│   _metadata
├───source=NASDAQ
│   ├───year=2017
│   └───year=2018
├───source=London_Stock_Exchange
│   ├───year=2014
│   ├───year=2015
├───source=Japan_Exchange_Group
│   ├───year=2017
│   └───year=2018
└───source=Euronext
    ├───year=2017
    └───year=2018
You can extract them on a per-RowGroup basis in pyarrow:
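import pyarrow.parquet as pq

# Opening the file parses only the footer metadata, not the column data
pq_file = pq.ParquetFile(…)
# Get metadata for the i-th RowGroup
rg_meta = pq_file.metadata.row_group(i)
# Get the "max" statistic for the k-th column
# (statistics is None if the writer did not record any)
max_of_col = rg_meta.column(k).statistics.max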
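To get one maximum per exchange, the same per-RowGroup lookup can be repeated over every file in a partition. The following is only a rough sketch under some assumptions: the helper names `column_max` and `max_per_source` are made up for illustration, the data files are assumed to end in `.parquet`, and the column is looked up by name via `path_in_schema` rather than by index. The `read_metadata` argument is any callable that returns the footer metadata for a path.

```python
from pathlib import Path


def column_max(metadata, column_name):
    """Return the largest "max" statistic of column_name over all row groups.

    `metadata` is a file-level metadata object such as the one exposed by
    pq.ParquetFile(...).metadata. Returns None if no statistics are found.
    """
    best = None
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            if column.path_in_schema != column_name:
                continue
            stats = column.statistics
            # Statistics are optional, so a writer may have omitted them
            if stats is None or not stats.has_min_max:
                continue
            if best is None or stats.max > best:
                best = stats.max
    return best


def max_per_source(root, column_name, read_metadata):
    """Map each source=... partition under root to its max of column_name.

    Assumes the layout root/source=<name>/year=<yyyy>/<file>.parquet
    (the .parquet suffix is an assumption, adjust the pattern if needed).
    """
    result = {}
    for path in sorted(Path(root).glob("source=*/year=*/*.parquet")):
        # path.parts[-3] is the "source=<name>" directory
        source = path.parts[-3].split("=", 1)[1]
        file_max = column_max(read_metadata(str(path)), column_name)
        if file_max is None:
            continue
        if source not in result or file_max > result[source]:
            result[source] = file_max
    return result
```

With pyarrow installed, this could be called as `max_per_source(root, "date", pq.read_metadata)`, where `"date"` stands in for the actual column name. Only the file footers are read, so the cost scales with the number of files rather than their size. Files whose writer recorded no statistics are silently skipped, so the result is only as complete as the statistics themselves.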