
How to open Parquet (binary) files in Python without getting a RAM error?


I converted some CSV data to Parquet and was able to reduce the storage volume from 2.5 GB to 450 MB. I use the following code to open the parquet file:

import pandas as pd

df = pd.read_parquet("PATH/file9.parquet", engine='auto')

My problem is that I get the following error when I try to open the parquet file:

pyarrow.lib.ArrowIOError: Arrow error: Out of memory: malloc of size 2941974336 failed

I know that it's possible to open big CSV files by reading them in chunks, as follows:

for chunk in pd.read_csv("PATH/file9.csv", chunksize=100_000):  # e.g. 100k rows per chunk
    ...  # process each chunk here

Opening smaller parquet files with the read_parquet line above worked fine, but I couldn't find any solution for opening big parquet files. Can anyone recommend another data format that is as compact as Parquet and can be opened without problems, or a way to chunk the parquet file?


Solution

  • The underlying read call does not support any sort of chunking parameter, so unfortunately, no: you can't read a Parquet file piecewise, at least not with that library (a possible workaround with pyarrow itself is sketched below).

    If you don't need all of the columns, though, you can pass the columns=[...] keyword argument so that only those columns are read, as shown below.
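    For example, a minimal sketch of the columns approach (the column names here are hypothetical placeholders, not from the original file):

    import pandas as pd

    # Only materialize the columns you actually need; this can cut
    # memory use substantially compared to reading the whole file.
    df = pd.read_parquet("PATH/file9.parquet", columns=["col_a", "col_b"])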
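    That said, if row-wise chunking is essential, pyarrow itself (rather than pandas) can iterate over a Parquet file in batches via ParquetFile.iter_batches, available in newer pyarrow releases (3.0 and up). This goes beyond the pandas-only answer above; a minimal sketch:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("PATH/file9.parquet")
    # Read the file in batches instead of loading it all at once.
    # batch_size (rows per batch) is an arbitrary example value.
    for batch in pf.iter_batches(batch_size=100_000):
        chunk = batch.to_pandas()  # each batch becomes a small DataFrame
        ...  # process each chunk here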