apache-spark · pyspark · amazon-emr

pyspark coalesce overwrite as one file with fixed name


We have a requirement to automate a pipeline.

My requirement is to generate/overwrite a file with a fixed name using PySpark.

However, my current command is:

final_df.coalesce(1).write.option("header", "true").csv("s3://finalop/" , mode="overwrite")

This keeps the directory (finalop) the same, but the file inside it is created with a different name every time I overwrite it.

Now, the next job reading it is not in PySpark, so we can't automate the pipeline. We are exploring ways to make it read a directory instead.

But is there a way in PySpark to generate a file with a fixed name, something like:

final_df.coalesce(1).write.option("header", "true").csv("s3://finalop/final.csv" , mode="overwrite")

Solution

  • Spark will always create a folder with the files inside (one file per partition). Even with coalesce(1), it will create at least two objects: the data file (.csv) and the _SUCCESS marker file. If you want your file on S3 under the specific name final.csv, you need to execute some S3 commands afterwards (for example in Python with boto3, or via the AWS CLI).

    The problem with S3 is that you cannot simply rename your file; you have to recreate it (copy it to the new name and delete the old one), because the system is key/value based and doesn't allow key renaming.
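    The copy-then-delete step described above can be sketched in Python with boto3. This is a minimal sketch, assuming the bucket and target name from the question (finalop, final.csv); the function and helper names are hypothetical:

    ```python
    def find_part_key(keys):
        """Return the first key that looks like a Spark CSV part file."""
        for key in keys:
            name = key.rsplit("/", 1)[-1]
            if name.startswith("part-") and name.endswith(".csv"):
                return key
        return None

    def rename_to_fixed(bucket="finalop", prefix="", target="final.csv"):
        """Copy the Spark part file to a fixed key, then delete the original.

        S3 has no rename operation, so "rename" is copy + delete.
        """
        import boto3  # imported here so the helper above works without boto3

        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [obj["Key"] for obj in resp.get("Contents", [])]
        part = find_part_key(keys)
        if part is None:
            raise FileNotFoundError("no part-*.csv found under the prefix")
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": part},
            Key=target,
        )
        s3.delete_object(Bucket=bucket, Key=part)
    ```

    You would call rename_to_fixed() right after the coalesce(1).write step, so the downstream (non-PySpark) job can always read s3://finalop/final.csv.
    
    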