python, pandas, apache-spark, pyspark, parquet

Pandas cannot read parquet files created in PySpark


I am writing a parquet file from a Spark DataFrame the following way:

df.write.parquet("path/myfile.parquet", mode="overwrite", compression="gzip")

This creates a folder with multiple files in it.
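
For reference, the contents of that directory look something like this (the part-file names contain generated IDs and vary per run):

path/myfile.parquet/
    _SUCCESS
    part-00000-<generated-id>.gz.parquet
    part-00001-<generated-id>.gz.parquet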

When I try to read this into pandas, I get the following errors, depending on which parser I use:

import pandas as pd
df = pd.read_parquet("path/myfile.parquet", engine="pyarrow")

PyArrow:

File "pyarrow\error.pxi", line 83, in pyarrow.lib.check_status

ArrowIOError: Invalid parquet file. Corrupt footer.

fastparquet:

File "C:\Program Files\Anaconda3\lib\site-packages\fastparquet\util.py", line 38, in default_open return open(f, mode)

PermissionError: [Errno 13] Permission denied: 'path/myfile.parquet'

I am using the following versions:

  • Spark 2.4.0
  • Pandas 0.23.4
  • pyarrow 0.10.0
  • fastparquet 0.2.1

I tried gzip as well as snappy compression; neither works. I of course made sure that the file is in a location where Python has permission to read/write.

It would already help if somebody were able to reproduce this error.


Solution

  • The root of the problem is that Spark writes a folder of part files rather than a single parquet file, which pandas could not read directly. Since this still seems to be an issue even with newer pandas versions, I wrote some functions to work around it as part of a larger pyspark helpers library:

    import datetime
    import os
    
    import pandas as pd
    
    
    def read_parquet_folder_as_pandas(path, verbosity=1):
      """Read all part files of a folder-style parquet "file" into one DataFrame."""
      # Only pick up the parquet part files; this skips metadata files such as _SUCCESS.
      files = [f for f in os.listdir(path) if f.endswith(".parquet")]
    
      if verbosity > 0:
        print("{} parquet files found. Beginning reading...".format(len(files)), end="")
        start = datetime.datetime.now()
    
      # Read each part file separately, then concatenate into a single DataFrame.
      df_list = [pd.read_parquet(os.path.join(path, f)) for f in files]
      df = pd.concat(df_list, ignore_index=True)
    
      if verbosity > 0:
        end = datetime.datetime.now()
        print(" Finished. Took {}".format(end - start))
      return df
    
    
    def read_parquet_as_pandas(path, verbosity=1):
      """Workaround for pandas not being able to read folder-style parquet files."""
      if os.path.isdir(path):
        if verbosity > 1:
          print("Parquet file is actually folder.")
        return read_parquet_folder_as_pandas(path, verbosity)
      else:
        return pd.read_parquet(path)
    

    This assumes that the relevant files in the parquet "file", which is actually a folder, end with ".parquet". This works for parquet files exported by Databricks and might work with others as well (untested; happy about feedback in the comments).

    The function read_parquet_as_pandas() can be used if it is not known beforehand whether the path points to a folder-style parquet file or a single file.
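
    As a quick usage sketch (the path is just the one from the question):

    # Works whether the path is a single file or a Spark output folder:
    df = read_parquet_as_pandas("path/myfile.parquet")

    # Or, if you already know it is a folder:
    df = read_parquet_folder_as_pandas("path/myfile.parquet", verbosity=1)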