Tags: azure, databricks, azure-databricks, delta-lake, azure-data-lake-gen2

How to Prevent Duplicate Entries from Entering a Delta Lake on Azure Storage


I have a DataFrame stored in Delta format in ADLS. When I append new updated rows to that Delta lake, duplicates appear. Is there any way to delete the old existing record in Delta and add the new updated record instead?

The schema of the DataFrame stored in Delta has a unique column, which we can use to check whether a record is updated or new.


Solution

  • This is a task for the MERGE command: you define the merge condition (your unique column) and then the actions. In SQL it could look as follows (column is your unique column, and updates could be your DataFrame registered as a temporary view; a sketch of that registration follows the SQL):

    MERGE INTO destination
    USING updates
    ON destination.column = updates.column
    WHEN MATCHED THEN
      UPDATE SET *
    WHEN NOT MATCHED
      THEN INSERT *
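
    To run this SQL from PySpark, the updates DataFrame first has to be exposed as a temporary view. A minimal sketch, assuming updatesDF holds the new and changed rows and the Delta data is registered as a table named destination (a path-backed table can be referenced as delta.`/data/destination/` instead):

    # Expose the DataFrame of new/changed rows under the name used in the MERGE
    updatesDF.createOrReplaceTempView("updates")

    spark.sql("""
        MERGE INTO destination
        USING updates
        ON destination.column = updates.column
        WHEN MATCHED THEN
          UPDATE SET *
        WHEN NOT MATCHED
          THEN INSERT *
    """)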
    

    In Python it could look as follows:

    from delta.tables import DeltaTable

    # Load the existing Delta table from its ADLS path
    deltaTable = DeltaTable.forPath(spark, "/data/destination/")

    # Upsert: update rows whose unique key already exists,
    # insert rows whose key is new
    deltaTable.alias("dest").merge(
        updatesDF.alias("updates"),
        "dest.column = updates.column") \
      .whenMatchedUpdateAll() \
      .whenNotMatchedInsertAll() \
      .execute()
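
    One caveat: MERGE raises an error if several source rows match the same target row, so if a single batch can carry multiple versions of the same key, deduplicate it first. A sketch under the assumption that the unique key is named column and a hypothetical last_updated timestamp marks the latest version of each record:

    from pyspark.sql import Window, functions as F

    # Keep only the most recent row per unique key before merging;
    # "last_updated" is an assumed timestamp column in updatesDF
    w = Window.partitionBy("column").orderBy(F.col("last_updated").desc())

    updatesDF = (updatesDF
        .withColumn("_rn", F.row_number().over(w))
        .filter("_rn = 1")
        .drop("_rn"))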