
How to concatenate/append multiple parquet files in PySpark with the same schema


I have multiple parquet files named file00.parquet, file01.parquet, file02.parquet, and so on. All of them follow the same schema as file00.parquet. How do I append the files one below the other, starting from file00 onwards in that same order, using PySpark?


Solution

  • Since, as you mentioned, all the parquet files are in the same directory and share the same schema, you can read them all at once:

    file_0_path = "/root/to/data/file00.parquet"
    file_1_path = "/root/to/data/file01.parquet"
    ....
    
    # pointing the reader at the directory loads every parquet file in it
    df = spark.read.parquet("/root/to/data/")
    

    If you want to save them as a single parquet file, you can:

    df.repartition(1).write.save(save_path, format='parquet')
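
    Note that a plain directory read does not guarantee which file's rows come first, and `repartition(1)` involves a shuffle, so neither preserves the file00-onwards order the question asks for. One way to keep that order is to read the files explicitly, union them in sequence with `unionByName`, and merge partitions with `coalesce(1)` (which avoids a shuffle) instead of `repartition(1)`. A runnable sketch, using a temporary directory in place of the real `/root/to/data/` path:

    ```python
    import os
    import shutil
    import tempfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("concat-parquet").getOrCreate()

    # Stand-in for /root/to/data/ -- the real directory already holds
    # file00.parquet, file01.parquet, ... with identical schemas.
    data_dir = tempfile.mkdtemp()
    spark.createDataFrame([(0, "a"), (1, "b")], ["id", "val"]) \
        .write.parquet(os.path.join(data_dir, "file00.parquet"))
    spark.createDataFrame([(2, "c"), (3, "d")], ["id", "val"]) \
        .write.parquet(os.path.join(data_dir, "file01.parquet"))

    # Collect the file paths in name order (file00, file01, ...)
    paths = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(".parquet")
    )

    # Read the first file, then stack the rest below it in order
    df = spark.read.parquet(paths[0])
    for path in paths[1:]:
        df = df.unionByName(spark.read.parquet(path))

    # coalesce(1) merges partitions without shuffling, so row order survives
    out_dir = os.path.join(data_dir, "combined")
    df.coalesce(1).write.parquet(out_dir)

    ids = [row.id for row in spark.read.parquet(out_dir).collect()]
    print(ids)

    shutil.rmtree(data_dir)
    spark.stop()
    ```

    With the two sample files above, `ids` comes back as the rows of file00 followed by the rows of file01. The trade-off: `coalesce(1)` funnels all the work through a single task, so for large datasets prefer keeping multiple output files and sorting on an explicit ordering column instead.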