Tags: azure, pyspark, azure-blob-storage, azure-databricks

Merge multiple csv files to one csv file in Azure Blob Storage using pyspark


I am using the code below to save CSV files back to Blob Storage. Because it runs in a loop, it creates multiple files, and I would now like to merge them into one single CSV file. I have tried dbutils.fs.cp/mv, but it did not help.

while start_date <= end_date:
    df = spark.read.format("com.databricks.spark.csv").options(header="true", inferschema="true").load(inputFilePath)
    df.coalesce(1).write.mode("append").option("header","true").format("com.databricks.spark.csv").save(TargetPath)

A similar question has been asked before, but it was answered using a pandas DataFrame, and I am looking for a solution with a Spark DataFrame: "Copy data from multiple csv files into one csv file".


Solution

  • My suggestion would be to use the while loop to build a list of CSV file paths and then read them all at once with Spark's CSV reader, so everything lands in a single DataFrame. For example:

    files = []
    while start_date <= end_date:
        files.append(inputFilePath)
        # inputFilePath and start_date are assumed to advance here,
        # exactly as in the original loop
    
    # Read all collected files at once into a single DataFrame
    df = spark.read.options(header="true", inferSchema="true").csv(files)
    
    # coalesce(1) writes the result as a single part file under TargetPath
    df.coalesce(1).write.mode("append").option("header", "true").format("com.databricks.spark.csv").save(TargetPath)
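If the goal is one CSV file with a predictable name (rather than a part-*.csv file inside the TargetPath folder), the dbutils.fs utilities mentioned in the question can copy the part file out once the write above has finished. This is a minimal sketch, assuming TargetPath is a folder on DBFS or mounted Blob Storage; finalPath is a hypothetical destination path such as "/mnt/container/output/merged.csv":

    # Find the single part file written by coalesce(1) under TargetPath
    part_file = [f.path for f in dbutils.fs.ls(TargetPath)
                 if f.name.startswith("part-") and f.name.endswith(".csv")][0]
    
    # Copy it to the desired file name, then (optionally) remove the Spark output folder
    dbutils.fs.cp(part_file, finalPath)
    dbutils.fs.rm(TargetPath, True)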