
How to read files written by Spark with pandas?


When Spark writes DataFrame data to Parquet, it creates a directory containing several separate Parquet part files. Code for saving:

term_freq_df.write \
            .mode("overwrite") \
            .option("header", "true") \
            .parquet("dir/to/save/to")

I need to read data from this directory with pandas:

term_freq_df = pd.read_parquet("dir/to/save/to") 

The error:

IsADirectoryError: [Errno 21] Is a directory: 

Is there a simple way to resolve this so that both code samples can use the same file path?


Solution

  • As you noted, when saving, Spark creates multiple Parquet files in a directory. To read them with pandas, you can read each file separately and then concatenate the results:

    import glob
    import os
    import pandas as pd

    path = "dir/to/save/to"
    # Collect the individual part files Spark wrote into the directory
    parquet_files = glob.glob(os.path.join(path, "*.parquet"))
    # Read each part file and stack the pieces into a single DataFrame;
    # ignore_index avoids duplicate index values across the parts
    df = pd.concat((pd.read_parquet(f) for f in parquet_files), ignore_index=True)
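
  • Alternatively, recent pandas versions can often read the whole directory in a single call when the pyarrow engine is available, since pyarrow treats a directory of part files as one dataset. A minimal sketch, assuming pyarrow is installed (a directory read like this is typically what fails with older pandas or the fastparquet engine):

    import pandas as pd

    # pyarrow reads all part files under the directory as one dataset,
    # so the Spark write side and the pandas read side share the same path
    df = pd.read_parquet("dir/to/save/to", engine="pyarrow")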