apache-spark-sql azure-databricks delta-lake

databricks delta table merge is inserting records despite keys are matching with "WHEN NOT MATCHED THEN INSERT" clause

I am performing merge operation on databricks delta table as below -

spark.sql(""" MERGE INTO <delta table name> deltatbl USING <temp view> source
           ON   deltatbl.col1   =    source.col1
           AND  deltatbl.col2   =    source.col2
           WHEN NOT MATCHED THEN INSERT
           (col1,col2) VALUES(source.Col1,source.Col2)  """)

Above query is inserting duplicates records despite matching unique keys. How I can achieve an output where only non matching records are inserted. all columns are part of key.

Solution

From what I can see, one problem could be that what you are inserting has duplicates in it. WHEN NOT MATCHED THEN INSERT does not deduplicate records before inserting.

In order to not load duplicates into the table you must deduplicate before running the merge.

You can either do this through the python API:

spark.table('{your_view_name_here}').dropDuplicates()

Or through the SQL API:

SELECT DISTINCT(*)
FROM {your_view_name_here}

without any more information that is the best guess I can provide.