Tags: apache-spark, pyspark, databricks, parquet

Write out a Spark DataFrame as a single Parquet file per partition in Databricks


I have a DataFrame that looks something like this:

    Filename  col1  col2
    file1     1     1
    file1     1     1
    file2     2     2
    file2     2     2
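
For reference, an equivalent DataFrame can be rebuilt from the sample above (a minimal sketch; spark is the active SparkSession, which Databricks notebooks provide by default):

    # Recreate the sample data from the question
    df = spark.createDataFrame(
        [("file1", 1, 1), ("file1", 1, 1), ("file2", 2, 2), ("file2", 2, 2)],
        ["Filename", "col1", "col2"],
    )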

I need to save this as Parquet, partitioned by file name. When I use df.write.partitionBy("Filename").mode("overwrite").parquet(file_out_location), it creates two folders (one per partition value), Filename=file1 and Filename=file2, each containing many part files.

How can I save it as a single file within each partition directory, e.g. Filename=file1.parquet and Filename=file2.parquet?
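
For illustration, the layout that partitionBy() currently produces looks roughly like this (the part-file names here are invented; real ones carry task IDs and hashes):

    file_out_location/
    ├── Filename=file1/
    │   ├── part-00000-...parquet
    │   └── part-00007-...parquet
    └── Filename=file2/
        ├── part-00003-...parquet
        └── part-00011-...parquet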


Solution

  • This would work:

    from pyspark.sql import functions as F

    # One in-memory partition per distinct Filename value
    row = df.selectExpr("cast(count(DISTINCT Filename) as int) as cnt").head()

    (df
        .repartition(row["cnt"], F.col("Filename"))
        .write
        .mode("overwrite")
        .partitionBy("Filename")
        .parquet("output-folder-path/"))
    

    Essentially, you need to repartition the in-memory DataFrame on the same column(s) you intend to use in partitionBy(). Without the explicit row["cnt"], repartition() falls back to spark.sql.shuffle.partitions partitions (200 by default); a variant that skips the separate count job is sketched at the end of this answer.

    Because each distinct Filename value is hashed into a single in-memory partition before the write, the above produces exactly one part file per partition directory.

    Without repartition: [screenshot: each Filename=... folder holds many part files]

    With repartition: [screenshot: each Filename=... folder holds a single part file]
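
    To confirm the result in a Databricks notebook, you can list one partition directory (a sketch; dbutils is available in Databricks, and output-folder-path/ is the path used above):

    # Should show exactly one part-*.parquet file
    display(dbutils.fs.ls("output-folder-path/Filename=file1/"))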
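
    As a variant (my own sketch, not part of the original answer): repartitioning by the column alone, without the count job, should also yield one file per directory, because each key still hashes into a single in-memory partition; the trade-off is launching spark.sql.shuffle.partitions tasks, most of them empty:

    (df
        .repartition(F.col("Filename"))
        .write
        .mode("overwrite")
        .partitionBy("Filename")
        .parquet("output-folder-path/"))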