I have multiple parquet files named file00.parquet, file01.parquet, file02.parquet, and so on. All the files follow the same schema as file00.parquet.
How do I append the files one below the other, starting from file00 onwards, in that same order using PySpark?
Assuming all the parquet files are in the same directory and share the same schema, you can read them all at once by pointing the reader at the directory:
# Files laid out like:
#   /root/to/data/file00.parquet
#   /root/to/data/file01.parquet
#   ...
df = spark.read.parquet("/root/to/data/")
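One caveat: reading a whole directory does not guarantee that rows come back in file-name order. If the order matters, one option is to build an explicitly sorted list of paths first and pass it to `spark.read.parquet`, which accepts multiple paths. A minimal sketch of the sorting step, using only the standard library (the temporary directory and file names here just mimic the question's layout for illustration):

```python
import glob
import os
import tempfile

# Create a throwaway directory with files named like the question's
# (file00.parquet, file01.parquet, ...) to demonstrate ordering.
tmp = tempfile.mkdtemp()
for name in ["file02.parquet", "file00.parquet", "file01.parquet"]:
    open(os.path.join(tmp, name), "w").close()

# Lexicographic sort puts file00 first, then file01, and so on,
# because the zero-padded numbering sorts naturally as strings.
paths = sorted(glob.glob(os.path.join(tmp, "file*.parquet")))
print([os.path.basename(p) for p in paths])

# spark.read.parquet accepts multiple paths, so the ordered list can be
# passed directly (assuming a SparkSession named `spark` exists):
# df = spark.read.parquet(*paths)
```

Even with an ordered path list, Spark DataFrames have no inherent row order, so rely on this only for the order in which files are ingested, not for downstream ordering guarantees.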
If you want to save them as a single parquet file, you can repartition to one partition before writing:
df.repartition(1).write.save(save_path, format='parquet')