Tags: python-3.x, pandas, parquet

How to read all Parquet files from a folder in S3 into pandas


How can I read all the Parquet files in a folder (written by Spark) into a pandas DataFrame using Python 3.x? Preferably without pyarrow, due to version conflicts.

The folder contains Parquet files matching the pattern part-*.parquet, along with a _SUCCESS marker file.


Solution

  • You can use s3fs to list the files and dask to read them, like so:

    import s3fs
    import dask.dataframe as dd
    
    s3 = s3fs.S3FileSystem()
    
    def get_files(input_folder):
        # List everything under the prefix, skipping Spark's _SUCCESS marker
        files = s3.ls(input_folder)
        return ['s3://' + str(file) for file in files if not str(file).endswith('_SUCCESS')]
    
    def read_files(input_folder):
        files = get_files(input_folder)
        # dd.read_parquet is lazy; .compute() materializes the result as an
        # in-memory pandas DataFrame
        return dd.read_parquet(files).compute()
    
    # Replace the placeholder path with your own bucket and prefix
    df = read_files('s3://my-bucket/path/to/folder')
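
  • Alternatively, dask's read_parquet accepts glob patterns directly, so the explicit listing step can often be skipped. A minimal sketch, assuming s3fs is installed and a dask version that still supports the fastparquet engine (the bucket path is a placeholder):

    import dask.dataframe as dd
    
    # The glob matches only the part-*.parquet files, so _SUCCESS is
    # never read; engine='fastparquet' avoids pyarrow entirely.
    df = dd.read_parquet('s3://my-bucket/path/to/folder/part-*.parquet',
                         engine='fastparquet').compute()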