Tags: metadata, parquet, python-3.8, pyarrow, apache-arrow

pyarrow pq.ParquetFile and related functions throw OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit error


As part of an analysis pipeline I am using around 60 000 parquet files, each containing a single row of data, which must be concatenated. Each file can contain a different set of columns, and I need to unify them before concatenating them with Ray's distributed Dataset. When reading the parquet files (which were created by Pandas) with pyarrow, I get the error OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit.
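For context, here is a rough sketch of the read step, assuming Ray 2.x and a placeholder data/ directory holding the files (the column-unification step is elided):

```python
import ray

ray.init()

# Read every parquet file under a placeholder "data/" directory into one
# distributed Dataset; each file contributes a single row.
ds = ray.data.read_parquet("data/")
print(ds.schema())
```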

Since Ray relies on pyarrow 6.0.1 to read/write the files, I have tried to read each file individually with pyarrow versions 9.0.0, 8.0.0, and 6.0.1, using pyarrow.parquet.ParquetFile(), pyarrow.parquet.read_metadata(), and pyarrow.parquet.ParquetDataset(). Doing so, I have identified the one file that is causing the error. It is the biggest in my dataset (496 MB, compared to all the others being <200 MB). I am unsure what this exception means, but I wonder whether it could be related to the size expansion of the file when it is loaded into memory. Could this error be avoided by changing a pyarrow setting to give it more memory for reading? Could the file be read in batches to avoid the error?
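For reference, this is roughly how I looped over the files to find the offending one, reading only the footer metadata of each (the data/ path is a placeholder):

```python
import glob
import pyarrow.parquet as pq

bad_files = []
for path in glob.glob("data/*.parquet"):  # placeholder location of the files
    try:
        pq.read_metadata(path)            # only parses the Thrift footer
    except OSError as exc:
        print(f"{path}: {exc}")
        bad_files.append(path)

print(f"{len(bad_files)} file(s) failed to read")
```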

I am using a cluster of 64 CPU cores with 249 GB of RAM. My Python 3.8.10 environment lives in a Singularity container and uses Ray 2.0.0 and pyarrow 6.0.1, as well as many other Python packages, including machine learning libraries.


Solution

  • This is an indication that the metadata (not the data) is very large or corrupt. You can try setting large values for the thrift_* arguments of read_table to see if that helps (see the sketch below).
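A minimal sketch, assuming a newer pyarrow release (the thrift_string_size_limit and thrift_container_size_limit keyword arguments are not available in 6.0.1) and a placeholder path for the large file:

```python
import pyarrow.parquet as pq

path = "data/big_file.parquet"  # placeholder for the 496 MB file

# Raise the Thrift deserialization limits well above their defaults.
table = pq.read_table(
    path,
    thrift_string_size_limit=1 << 30,       # ~1 GiB
    thrift_container_size_limit=1 << 30,
)

# Or stream the file in batches so the whole table never sits in memory at once.
pf = pq.ParquetFile(
    path,
    thrift_string_size_limit=1 << 30,
    thrift_container_size_limit=1 << 30,
)
for batch in pf.iter_batches(batch_size=65_536):
    ...  # process each pyarrow.RecordBatch
```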