pyspark

How to save pyspark data frame in a single csv file


This is a continuation of the "how to save dataframe into csv pyspark" thread.

I'm trying to save my PySpark DataFrame df as a single CSV file in PySpark 3.0.1. So I wrote:

df.coalesce(1).write.csv('mypath/df.csv')

But after executing this, I see a folder named df.csv in mypath that contains the following four files:

1. _committed_...
2. _started_...
3. _SUCCESS
4. part-00000-....csv

Can you suggest how I can save all the data in a single file df.csv?


Solution

  • You can use .coalesce(1) to write the output as a single CSV partition, then rename the resulting file and move it to the desired folder.

    Here is a function that does that:

    df: Your DataFrame
    fileName: Name you want for the csv file
    filePath: Folder where you want to save it

    def export_csv(df, fileName, filePath):
      # Write to a temporary directory first; Spark always writes a folder,
      # not a single file.
      filePathDestTemp = filePath + ".dir/"

      df\
        .coalesce(1)\
        .write\
        .csv(filePathDestTemp)  # use .csv to save as csv

      # Find the single part file and copy it to the final destination.
      listFiles = dbutils.fs.ls(filePathDestTemp)
      for subFiles in listFiles:
        if subFiles.name[-4:] == ".csv":
          dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName + '.csv')

      # Remove the temporary directory along with its marker files
      # (_SUCCESS, _committed_..., _started_...).
      dbutils.fs.rm(filePathDestTemp, recurse=True)
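Note that dbutils is only available on Databricks. Outside Databricks, the same move-and-clean-up step can be done with Python's standard library once Spark has written the temporary folder. The sketch below assumes a local filesystem path; promote_part_file is a hypothetical helper name, not part of any library:

```python
import glob
import os
import shutil

def promote_part_file(temp_dir, dest_path):
    """Move the single part-*.csv that Spark wrote into temp_dir
    to dest_path, then delete temp_dir and its marker files."""
    # Spark names the data file part-00000-<uuid>.csv; locate it.
    matches = glob.glob(os.path.join(temp_dir, "part-*.csv"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one part file in {temp_dir}, found {len(matches)}"
        )
    shutil.move(matches[0], dest_path)
    # Remove the temporary directory and its _SUCCESS/_committed_/_started_ files.
    shutil.rmtree(temp_dir)
```

After df.coalesce(1).write.csv(temp_dir) finishes, a single call such as promote_part_file(temp_dir, "mypath/df.csv") leaves just the one CSV file behind.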