Tags: apache-spark, pyspark, databricks, parquet

Write out a Spark DataFrame as a single Parquet file per partition in Databricks


I have a DataFrame that looks something like this:

    Filename  col1  col2
    file1     1     1
    file1     1     1
    file2     2     2
    file2     2     2
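
For reference, an equivalent DataFrame can be rebuilt from the sample above (a minimal sketch; spark is the active SparkSession, which Databricks notebooks provide by default):

    # Recreate the sample data from the question
    df = spark.createDataFrame(
        [("file1", 1, 1), ("file1", 1, 1), ("file2", 2, 2), ("file2", 2, 2)],
        ["Filename", "col1", "col2"],
    )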

I need to save this as Parquet, partitioned by file name. When I use df.write.partitionBy("Filename").mode("overwrite").parquet(file_out_location), it creates two folders (one per partition value), Filename=file1 and Filename=file2, each containing many part files.

How can I save it as a single file within each partition directory, e.g. Filename=file1.parquet and Filename=file2.parquet?
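
For illustration, the layout that partitionBy() currently produces looks roughly like this (the part-file names here are invented; real ones carry task IDs and hashes):

    file_out_location/
    ├── Filename=file1/
    │   ├── part-00000-...parquet
    │   └── part-00007-...parquet
    └── Filename=file2/
        ├── part-00003-...parquet
        └── part-00011-...parquet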


Solution

  • This would work:

    from pyspark.sql import functions as F

    # One in-memory partition per distinct Filename value
    row = df.selectExpr("cast(count(DISTINCT Filename) as int) as cnt").head()

    (df
        .repartition(row["cnt"], F.col("Filename"))
        .write
        .mode("overwrite")
        .partitionBy("Filename")
        .parquet("output-folder-path/"))
    

    Essentially, you need to repartition the in-memory DataFrame on the same column(s) you intend to use in partitionBy(). Without the explicit row["cnt"], repartition() falls back to spark.sql.shuffle.partitions partitions (200 by default); a variant that skips the separate count job is sketched at the end of this answer.

    Because each distinct Filename value is hashed into a single in-memory partition before the write, the above produces exactly one part file per partition directory.

    Without repartition: [screenshot: each Filename=... folder holds many part files]

    With repartition: [screenshot: each Filename=... folder holds a single part file]
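
    To confirm the result in a Databricks notebook, you can list one partition directory (a sketch; dbutils is available in Databricks, and output-folder-path/ is the path used above):

    # Should show exactly one part-*.parquet file
    display(dbutils.fs.ls("output-folder-path/Filename=file1/"))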
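
    As a variant (my own sketch, not part of the original answer): repartitioning by the column alone, without the count job, should also yield one file per directory, because each key still hashes into a single in-memory partition; the trade-off is launching spark.sql.shuffle.partitions tasks, most of them empty:

    (df
        .repartition(F.col("Filename"))
        .write
        .mode("overwrite")
        .partitionBy("Filename")
        .parquet("output-folder-path/"))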