When Spark writes DataFrame data to parquet, it creates a directory containing several separate parquet part files. Code for saving:
(term_freq_df.write
    .mode("overwrite")
    .option("header", "true")  # note: "header" is a CSV option and has no effect for parquet
    .parquet("dir/to/save/to"))
I need to read data from this directory with pandas:
term_freq_df = pd.read_parquet("dir/to/save/to")
The error:
IsADirectoryError: [Errno 21] Is a directory:
What is the simplest way to resolve this so that both code samples can use the same path?
As you noted, Spark saves multiple parquet files into a directory. To read them with pandas, you can read each part file separately and concatenate the results:
import glob
import os
import pandas as pd
path = "dir/to/save/to"
# Collect every part file Spark wrote; sorting makes the row order deterministic.
parquet_files = sorted(glob.glob(os.path.join(path, "*.parquet")))
# ignore_index=True avoids duplicate index values across the part files.
df = pd.concat((pd.read_parquet(f) for f in parquet_files), ignore_index=True)