How to read multiple .parquet files from multiple directories into a single pandas DataFrame?


I need to read parquet files from multiple directories.

For example:

 Dir---
          |
           ----dir1---
                      |
                       .parquet
                       .parquet
          |
           ----dir2---
                      |
                       .parquet
                       .parquet
                       .parquet

Is there a way to read these files into a single pandas DataFrame?

Note: all of the Parquet files were generated using PySpark.


Solution

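  • Use read_parquet in a list comprehension and concat the results, collecting the files with glob. The ** pattern descends into nested directories only when recursive=True is passed (Python 3.5+); without it, ** behaves like a single *:

    import pandas as pd
    import glob
    
    # recursive=True makes ** match any directory depth under Dir
    files = glob.glob('Dir/**/*.parquet', recursive=True)
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    df = pd.concat([pd.read_parquet(fp) for fp in files], ignore_index=True)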
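
A variation on the same idea, offered as a rough sketch rather than a definitive recipe: it uses pathlib instead of glob, skips anything that is not a regular file (Spark often writes its output as a directory whose name itself ends in .parquet), and records which folder each row came from. The helper names find_parquet_files and load_parquet_tree and the source_dir column are made up for illustration; they are not part of pandas.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return every *.parquet file under root, at any depth, in sorted order."""
    # rglob descends into all subdirectories. The pattern skips Spark's
    # _SUCCESS marker, and is_file() skips directories named like "x.parquet".
    return sorted(p for p in Path(root).rglob("*.parquet") if p.is_file())


def load_parquet_tree(root):
    """Read all Parquet files under root into one DataFrame (hypothetical helper)."""
    # Imported here so the file-discovery helper works without pandas installed.
    import pandas as pd

    files = find_parquet_files(root)
    if not files:
        raise FileNotFoundError(f"no .parquet files found under {root!r}")
    # assign() adds a hypothetical 'source_dir' column naming each file's folder.
    frames = [pd.read_parquet(fp).assign(source_dir=fp.parent.name) for fp in files]
    # ignore_index=True renumbers rows instead of repeating each file's index.
    return pd.concat(frames, ignore_index=True)
```

With the layout from the question, df = load_parquet_tree('Dir') should give one frame whose source_dir column holds dir1 or dir2 for each row. Sorting the paths keeps the row order reproducible between runs, which glob alone does not guarantee.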