Tags: metadata, parquet, python-3.8, pyarrow, apache-arrow

pyarrow pq.ParquetFile and related functions throw OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit error


As part of an analysis pipeline I am using around 60 000 parquet files, each containing a single row of data, which must be concatenated. Each file can contain a different set of columns, and I need to unify them before concatenating them with Ray's distributed Dataset. When reading the parquet files (which were created by Pandas) with pyarrow, I get the error OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit.
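For context, here is a rough sketch of the read step, assuming Ray 2.x and a placeholder data/ directory holding the files (the column-unification step is elided):

```python
import ray

ray.init()

# Read every parquet file under a placeholder "data/" directory into one
# distributed Dataset; each file contributes a single row.
ds = ray.data.read_parquet("data/")
print(ds.schema())
```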

Since Ray relies on pyarrow 6.0.1 to read/write the files, I have tried to read each file individually with pyarrow versions 9.0.0, 8.0.0, and 6.0.1, using pyarrow.parquet.ParquetFile(), pyarrow.parquet.read_metadata(), and pyarrow.parquet.ParquetDataset(). Doing so, I have identified the one file that is causing the error. It is the biggest in my dataset (496 MB, compared to all the others being <200 MB). I am unsure what this exception means, but I wonder whether it could be related to the size expansion of the file when it is loaded into memory. Could this error be avoided by changing a pyarrow setting to give it more memory for reading? Could the file be read in batches to avoid the error?
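For reference, this is roughly how I looped over the files to find the offending one, reading only the footer metadata of each (the data/ path is a placeholder):

```python
import glob
import pyarrow.parquet as pq

bad_files = []
for path in glob.glob("data/*.parquet"):  # placeholder location of the files
    try:
        pq.read_metadata(path)            # only parses the Thrift footer
    except OSError as exc:
        print(f"{path}: {exc}")
        bad_files.append(path)

print(f"{len(bad_files)} file(s) failed to read")
```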

I am using a cluster of 64 CPU cores with 249 GB of RAM. My Python 3.8.10 environment lives in a Singularity container and uses Ray 2.0.0 and pyarrow 6.0.1, as well as many other Python packages, including machine learning libraries.


Solution

  • This is an indication that the metadata (not the data) is very large or corrupt. You can try setting large values for the thrift_* arguments of read_table to see if that helps (see the sketch below).
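A minimal sketch, assuming a newer pyarrow release (the thrift_string_size_limit and thrift_container_size_limit keyword arguments are not available in 6.0.1) and a placeholder path for the large file:

```python
import pyarrow.parquet as pq

path = "data/big_file.parquet"  # placeholder for the 496 MB file

# Raise the Thrift deserialization limits well above their defaults.
table = pq.read_table(
    path,
    thrift_string_size_limit=1 << 30,       # ~1 GiB
    thrift_container_size_limit=1 << 30,
)

# Or stream the file in batches so the whole table never sits in memory at once.
pf = pq.ParquetFile(
    path,
    thrift_string_size_limit=1 << 30,
    thrift_container_size_limit=1 << 30,
)
for batch in pf.iter_batches(batch_size=65_536):
    ...  # process each pyarrow.RecordBatch
```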