Tags: azure, pyspark, azure-blob-storage, azure-databricks

Merge multiple csv files to one csv file in Azure Blob Storage using pyspark


I am using the code below to save CSV files back to Blob Storage. Because it runs in a loop, it creates multiple files, and I would now like to merge them into one single CSV file. I have tried dbutils.fs.cp/mv, but it did not help.

while start_date <= end_date:
    df = spark.read.format("com.databricks.spark.csv").options(header="true", inferschema="true").load(inputFilePath)
    df.coalesce(1).write.mode("append").option("header","true").format("com.databricks.spark.csv").save(TargetPath)

A similar question has been asked before, but it was answered using a pandas DataFrame, and I am looking for a solution with a Spark DataFrame: "Copy data from multiple csv files into one csv file".


Solution

  • My suggestion would be to use the while loop to build a list of CSV file paths and then read them all at once with Spark's CSV reader, so everything lands in a single DataFrame. For example:

    files = []
    while start_date <= end_date:
        files.append(inputFilePath)
        # inputFilePath and start_date are assumed to advance here,
        # exactly as in the original loop
    
    # Read all collected files at once into a single DataFrame
    df = spark.read.options(header="true", inferSchema="true").csv(files)
    
    # coalesce(1) writes the result as a single part file under TargetPath
    df.coalesce(1).write.mode("append").option("header", "true").format("com.databricks.spark.csv").save(TargetPath)
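If the goal is one CSV file with a predictable name (rather than a part-*.csv file inside the TargetPath folder), the dbutils.fs utilities mentioned in the question can copy the part file out once the write above has finished. This is a minimal sketch, assuming TargetPath is a folder on DBFS or mounted Blob Storage; finalPath is a hypothetical destination path such as "/mnt/container/output/merged.csv":

    # Find the single part file written by coalesce(1) under TargetPath
    part_file = [f.path for f in dbutils.fs.ls(TargetPath)
                 if f.name.startswith("part-") and f.name.endswith(".csv")][0]
    
    # Copy it to the desired file name, then (optionally) remove the Spark output folder
    dbutils.fs.cp(part_file, finalPath)
    dbutils.fs.rm(TargetPath, True)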