azure · pyspark · databricks · parquet · partitioning

pyspark partitioning creates an extra empty file for every partition


I am facing a problem in Azure Databricks. In my notebook I am executing a simple write command with partitioning:

df.write.format('parquet').partitionBy("startYear").save(output_path,header=True)

And I see something like this: [screenshot of the output directory showing an additional empty file for every partition]

Can someone explain why Spark is creating these additional empty files for every partition, and how to disable it?

I tried different write modes, different partitioning, and different Spark versions.


Solution

  • I reproduced the above and got the same results when using Blob Storage.

    [screenshot: Blob Storage output showing the same additional empty file for each partition]

    Can someone explain why Spark is creating these additional empty files for every partition, and how to disable it?

    Spark itself does not create these files. Because Blob Storage has a flat namespace with no real directories, it creates zero-byte placeholder blobs like the ones above when Parquet files are written with partitions.

    You cannot avoid these files when using Blob Storage. You can avoid them by using ADLS (Azure Data Lake Storage), which has a hierarchical namespace with real directories; a minimal write example follows at the end of this answer.

    These are my results with ADLS:

    [screenshot: with ADLS, only the partition folders and Parquet files are created, with no extra empty files]
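
    Here is a minimal sketch of the same partitioned write pointed at an ADLS Gen2 container instead of Blob Storage. The storage account, container, and folder names are hypothetical placeholders, and authentication (a mount point, OAuth, or a SAS token) must already be configured in your workspace. The header option from the original command is dropped because it only applies to text formats such as CSV and is ignored by the Parquet writer.

    # Hypothetical ADLS Gen2 path -- replace the container, storage account, and folder with your own
    adls_output_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/output/data"

    # Same partitioned Parquet write as in the question, targeting ADLS instead of Blob Storage
    (df.write
        .format("parquet")
        .mode("overwrite")          # any write mode works; the extra files come from the storage layer, not the mode
        .partitionBy("startYear")
        .save(adls_output_path))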