Tags: pyspark, apache-spark-sql, aws-glue, delta-lake

Delta Lake Deletes in AWS Glue


We are trying to delete data from a Delta Lake table using an AWS Glue job. Please suggest why the merge condition is not working for the delete.

This works fine if my delete_condition is like

changes.flag = True

However, it does not perform any deletes if the delete_condition is

source.date_field > date_sub(current_date(),7)

Also, it works fine if I use direct deletes in place of Merge

delta_source.delete("date_field > date_sub(current_date(), 7)")

and the merge part of the code is:

from delta.tables import DeltaTable

delta_source = DeltaTable.forPath(spark, delta_path)

(
    delta_source.alias("source")
    .merge(latest_change_for_each_key.alias("changes"), insert_command)
    .whenMatchedDelete(condition=delete_condition)  # fires only for matched rows
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

Here delta_source is the Delta Lake table and latest_change_for_each_key is the DataFrame of incremental records.
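For concreteness, this is what the two condition variables above might contain; the key name primary_key is illustrative, while the date expression is the one from the question:

insert_command = "source.primary_key = changes.primary_key"  # match condition for the merge
delete_condition = "source.date_field > date_sub(current_date(), 7)"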


Solution

  • I found the solution.

    As per the Delta docs:

    The merge clause first evaluates the match condition (say source.primary_key = changes.primary_key, using the aliases above), and the whenMatchedDelete action runs only for the rows where that match condition is satisfied.

    Since in my case there can be rows with no matching keys between source and target, the delete clause never fires for them. So rather than using merge-delete, we should use direct deletes on the Delta source, as sketched below.

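    A minimal sketch of the resulting flow, reusing the variables from the question (delta_path, insert_command as the match condition, latest_change_for_each_key); the time-based delete runs as a standalone operation, so it does not depend on any key match between source and changes:

    from delta.tables import DeltaTable

    delta_source = DeltaTable.forPath(spark, delta_path)

    # Standalone delete: removes rows by the date condition alone,
    # regardless of whether they have a matching key in the changes.
    delta_source.delete("date_field > date_sub(current_date(), 7)")

    # The merge now only handles upserts for the incremental records.
    (
        delta_source.alias("source")
        .merge(latest_change_for_each_key.alias("changes"), insert_command)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )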