How can I read all the parquet files in a folder (written by Spark) into a pandas DataFrame using Python 3.x, preferably without pyarrow due to version conflicts? The folder contains parquet files matching the pattern part-*.parquet, plus a _SUCCESS file.
You can use s3fs to list the files and dask to read them, like so:
import s3fs
import dask.dataframe as dd

s3 = s3fs.S3FileSystem()

def get_files(input_folder):
    # List the folder contents, skipping Spark's _SUCCESS marker file
    files = s3.ls(input_folder)
    return ['s3://' + str(file) for file in files
            if not str(file).endswith('_SUCCESS')]

def read_files(input_folder):
    # Lazily reads all part files into a single dask DataFrame.
    # If fastparquet is installed, you can pass engine='fastparquet'
    # so dask does not require pyarrow.
    files = get_files(input_folder)
    return dd.read_parquet(files)

# .compute() materialises the dask DataFrame as a pandas DataFrame
df = read_files(input_folder).compute()
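
If you want to avoid dask as well, fastparquet can read the part files directly into pandas. A minimal sketch, assuming the files sit on a local disk and fastparquet is installed (the path below is hypothetical):

import glob
import pandas as pd

# Collect the Spark part files; the glob pattern skips _SUCCESS on its own
files = sorted(glob.glob('/path/to/folder/part-*.parquet'))  # hypothetical path
# engine='fastparquet' keeps pyarrow out of the dependency chain
df = pd.concat(
    (pd.read_parquet(f, engine='fastparquet') for f in files),
    ignore_index=True,
)

Concatenating with ignore_index=True gives the combined frame a fresh RangeIndex instead of repeating each part file's own index.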