I am trying to load a large number of parquet files in Python with pandas, and noticed a notable performance difference between two approaches. Specifically,
pd.read_parquet("/path/to/directory/")
is more than twice as fast as something like:
filelist = glob.glob("/path/to/directory/*")
pd.concat([pd.read_parquet(i) for i in filelist])
My reasons for wanting to use the 2nd approach include pre-filtering the parquet files to be loaded, or loading from multiple directories (that contain parquet files with the same format, etc.).
Any tips / guidance appreciated - basically I'm looking to understand how to make the 2nd approach as performant as the first (and/or what kind of magic might be making the 1st approach faster).
The answer lies in the function pyarrow.parquet.read_table, which pd.read_parquet calls under the hood with the default pyarrow engine. It's not very well documented, but pd.read_parquet can also accept a list of file names, so you get all of the same speed-up:
import glob
import pandas as pd

filelist = glob.glob("/path/to/directory/*")
pd.read_parquet(filelist)