If a dataset has a column containing large binary data (e.g. an image or audio waveform), computing min/max statistics for that column becomes costly in both compute and storage, even though the statistics are useless (range queries on such values make no sense).
This causes large, highly partitioned Parquet datasets to have metadata that explodes in size. Is there a way to tell fastparquet not to compute statistics for some columns, or does the Parquet format mandate that these statistics exist for every column?
The Parquet format does not require statistics; they are optional fields in the column-chunk metadata. Skipping them is implemented in a stale PR, which could either be merged at some point (it breaks compatibility with py2) or have its relevant parts extracted. The PR provides a stats= argument to the writer, which can be used to pick which columns have their min/max computed, or True/False for all/none.
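A rough sketch of what usage might look like if that PR (or something like it) lands; the stats= keyword and the accepted values (a list of column names, or True/False) follow the PR description and are not guaranteed to match any released fastparquet API:

```python
# Illustrative only: assumes a fastparquet build that accepts the PR's stats= keyword.
import pandas as pd
import fastparquet

df = pd.DataFrame({
    "id": range(1000),
    "blob": [b"\x00" * 1024] * 1000,   # large binary payload per row
})

# Compute min/max only for "id"; skip statistics for the binary column.
fastparquet.write("data.parq", df, stats=["id"])

# Or disable statistics for every column:
fastparquet.write("data_nostats.parq", df, stats=False)
```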