
Efficiently loading list of parquet files with python pandas


I am trying to load a large number of parquet files in python pandas and noticed a significant performance difference between two approaches. Specifically,

pd.read_parquet("/path/to/directory/")

is more than twice as fast as something like:

import glob
import pandas as pd

filelist = glob.glob("/path/to/directory/*")
pd.concat([pd.read_parquet(i) for i in filelist])

The reasons I want to use the 2nd approach include pre-filtering the parquet files to be loaded, or loading from multiple directories (that contain parquet files with the same format, etc.).
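For illustration, a minimal sketch of that use case under the 2nd approach (the directory names and the filename filter here are hypothetical):

import glob

import pandas as pd

# Collect parquet files from several directories, keeping only a
# subset of them based on the file name (hypothetical filter)
filelist = [
    f
    for d in ("/path/to/dir_a/", "/path/to/dir_b/")
    for f in glob.glob(d + "*.parquet")
    if "2023" in f
]
df = pd.concat([pd.read_parquet(f) for f in filelist], ignore_index=True)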

Any tips / guidance appreciated - basically looking to understand how to make the 2nd approach as performant as the first (and/or understanding what kind of magic might be making the 1st approach faster).


Solution

  • the function pyarrow.parquet.read_table (which pd.read_parquet calls under the hood with the pyarrow engine):

    • uses an IO thread pool in C++ to load files in parallel.
    • concatenates the different files into one table in Arrow, which is faster than doing it in pandas (pandas isn't very good at concatenating); see the sketch below.

    It's not very well documented, but pd.read_parquet can also accept a list of file names, so you get all of that speed-up:

    import glob
    import pandas as pd

    filelist = glob.glob("/path/to/directory/*")
    pd.read_parquet(filelist)
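
    For reference, a rough sketch of the equivalent at the pyarrow level (assuming the pyarrow engine and data that fits in memory); pq.read_table also accepts a list of paths, reads the files in parallel, and concatenates them as Arrow tables before a single conversion to pandas:

    import glob

    import pyarrow.parquet as pq

    filelist = glob.glob("/path/to/directory/*")

    # Files are read through pyarrow's IO thread pool and concatenated
    # as Arrow tables, not as pandas DataFrames
    table = pq.read_table(filelist)
    df = table.to_pandas()

    Since the list can mix paths from different directories, this also covers the multiple-directory use case from the question.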