Tags: python, dask, parquet, fastparquet

Skip metadata for large binary fields in fastparquet


If a dataset has a column containing large binary data (e.g. an image or a sound waveform per row), then computing min/max statistics for that column becomes costly in both compute and storage, despite being useless: querying such values by range obviously makes no sense.

This causes large, highly partitioned Parquet datasets to have metadata that explodes in size. Is there a way to tell fastparquet not to compute statistics for some columns, or does the Parquet format mandate that these statistics exist for every column?
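To make the problem concrete, here is a minimal sketch (the column names and file path are illustrative, and reading the footer statistics back via ParquetFile.statistics assumes a reasonably recent fastparquet):

    import pandas as pd
    import fastparquet

    # Illustrative dataset: a small id column plus large binary blobs.
    df = pd.DataFrame({
        "id": range(3),
        "blob": [b"\x00" * 1_000_000 for _ in range(3)],  # ~1 MB per row
    })

    # By default fastparquet computes min/max for every column, so the
    # footer metadata ends up holding megabyte-sized min/max values
    # for the "blob" column.
    fastparquet.write("example.parq", df)

    pf = fastparquet.ParquetFile("example.parq")
    print(pf.statistics)  # statistics are present for "blob" too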


Solution

  • This is implemented in a stale PR, which could either be merged at some point (it currently breaks compatibility with py2) or have its relevant parts extracted. The PR adds a stats= argument to the writer, which can be set to a list of columns that should have their min/max computed, or to True/False for all or none; see the sketch below.
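A minimal sketch of how that argument could be used, assuming the PR's stats= keyword on fastparquet.write lands with the semantics described above (the exact spelling may change before it is merged):

    import pandas as pd
    import fastparquet

    df = pd.DataFrame({
        "id": range(3),
        "blob": [b"\x00" * 1_000_000 for _ in range(3)],
    })

    # Compute min/max only for "id"; skip statistics for "blob".
    fastparquet.write("example.parq", df, stats=["id"])

    # stats=True computes statistics for all columns; stats=False for none.
    fastparquet.write("example_nostats.parq", df, stats=False)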