Tags: pyspark, databricks, azure-databricks

What is the best way to delete/overwrite data in a partition of a Delta table stored in an Azure Blob container using PySpark on Databricks?


I run a Databricks notebook via Data Factory to process some data. The data then gets stored in a Blob container as a Delta table. The number of records being transformed is in the hundreds of millions. Hence, the source data is divided into partitions beforehand, and Data Factory then invokes the data transformation Databricks notebook in parallel from a loop activity.

Each notebook processes a set of partitions. For example, if the data was split into 12 partitions, we create 3 partition groups, each containing 4 partitions. Data Factory then invokes 3 instances of the Databricks notebook in the for loop for parallel execution.
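For illustration, the grouping step above could be sketched as follows (a minimal sketch; `group_partitions` is a hypothetical helper, not part of the actual pipeline):

```python
def group_partitions(partition_ids, group_count):
    """Split partition ids into `group_count` contiguous groups of roughly
    equal size, one group per parallel Databricks notebook invocation."""
    size = -(-len(partition_ids) // group_count)  # ceiling division
    return [partition_ids[i:i + size] for i in range(0, len(partition_ids), size)]

# 12 partitions split into 3 groups of 4, matching the example above
groups = group_partitions(list(range(12)), 3)
```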

I am implementing logic to handle the case where only some partitions were successfully transformed. We want to skip the successfully completed partitions in a group and transform only the remaining ones.
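The skip logic could look something like this sketch, assuming the completed partitions are tracked somewhere such as a checkpoint table or control file (`remaining_partitions` is a hypothetical helper, not part of the actual pipeline):

```python
def remaining_partitions(assigned, completed):
    """Return the partitions from this group that still need transforming,
    preserving the original order of the assigned list."""
    done = set(completed)
    return [p for p in assigned if p not in done]

# e.g. partitions 2 and 4 already succeeded in a previous run of this group
todo = remaining_partitions([1, 2, 3, 4], completed=[2, 4])
```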

I found the following way to implement it:

# Session-wide setting: affects every subsequent overwrite in this Spark session
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").format("delta").partitionBy("partition_id").save(folder_name)

The problem is that we have other use cases where we use 'overwrite' mode to replace the data completely, rather than replacing only particular partitions. I am thinking of toggling the partitionOverwriteMode value between 'dynamic' and 'static'. Is there a way to use the 'dynamic' setting only while executing the above line of code, instead of setting it on the Spark config?
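One way to scope the setting without leaving the session config changed is a small save-and-restore context manager. This is a sketch of my own (`spark_conf_override` is not a built-in), relying only on the standard `spark.conf` `get`/`set`/`unset` API:

```python
from contextlib import contextmanager

@contextmanager
def spark_conf_override(conf, key, value):
    """Temporarily set a Spark conf key, restoring the previous value
    (or unsetting the key entirely) when the block exits."""
    previous = conf.get(key, None)
    conf.set(key, value)
    try:
        yield
    finally:
        if previous is None:
            conf.unset(key)
        else:
            conf.set(key, previous)

# Usage (sketch):
# with spark_conf_override(spark.conf, "spark.sql.sources.partitionOverwriteMode", "dynamic"):
#     data.write.mode("overwrite").format("delta").partitionBy("partition_id").save(folder_name)
```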


Solution

  • You can read in the docs:

    You can also enable this by setting the DataFrameWriter option partitionOverwriteMode to dynamic

    Therefore, to apply it at the query level:

    data.write
        .mode("overwrite").format("delta").partitionBy("partition_id")
        .option("partitionOverwriteMode", "dynamic")
        .save(folder_name)
    

    By the way, this approach works with many other options too (although not all of them).